Similar Articles
20 similar articles found (search time: 31 ms)
1.
MOTIVATION: An important goal in analyzing microarray data is to determine which genes are differentially expressed across two kinds of tissue samples or samples obtained under two experimental conditions. Various parametric tests, such as the two-sample t-test, have been used, but their parametric assumptions may be too strong, or their large-sample justifications may not hold, in practice. As alternatives, a class of three nonparametric statistical methods has been proposed: the empirical Bayes method of Efron et al. (2001), the significance analysis of microarrays (SAM) method of Tusher et al. (2001) and the mixture model method (MMM) of Pan et al. (2001). All three methods depend on constructing a test statistic and a so-called null statistic such that the null statistic's distribution can be used to approximate the null distribution of the test statistic. However, relatively little effort has been directed toward assessing the performance or the underlying assumptions of these methods in constructing such test and null statistics. RESULTS: We point out a problem with a current method of constructing the test and null statistics, which may lead to greatly inflated Type I error rates (i.e. false positives). We also propose two modifications that overcome the problem. In the context of MMM, the improved performance of the modified methods is demonstrated using simulated data. In addition, our numerical results provide evidence to support the utility and effectiveness of MMM.
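As a rough illustration of the test-statistic/null-statistic idea discussed above, the sketch below computes a Welch-type two-sample t-statistic and approximates its null distribution by permuting group labels. This is a generic permutation scheme, not the specific construction used by MMM or SAM; all function names are mine.

```python
import random
from statistics import mean, stdev

def t_stat(a, b):
    """Welch-type two-sample t-statistic (unequal group variances)."""
    return (mean(a) - mean(b)) / (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5

def permutation_null(a, b, n_perm=1000, seed=0):
    """Approximate the null distribution of t_stat by shuffling group labels."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        null.append(t_stat(pooled[:len(a)], pooled[len(a):]))
    return null

def perm_pvalue(a, b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    obs = abs(t_stat(a, b))
    null = permutation_null(a, b, n_perm, seed)
    return sum(abs(z) >= obs for z in null) / len(null)
```

Per gene, the observed statistic is compared against the permutation null; the papers above differ mainly in how that null statistic is constructed.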

2.
DNA microarray experiments have generated large amounts of gene expression measurements across different conditions. One crucial step in the analysis of these data is to detect differentially expressed genes. Some parametric methods, including the two-sample t-test (T-test) and variations of it, have been used. Alternatively, a class of non-parametric algorithms has been proposed, such as the Wilcoxon rank sum test (WRST), significance analysis of microarrays (SAM) of Tusher et al. (2001), and the empirical Bayesian (EB) method of Efron et al. (2001). Most popular methods are based on the t-statistic. Because of the limitations of the statistics they use to describe differences between groups, these methods can be inefficient in some situations, especially when the data follow multimodal distributions. For example, some genes may display different expression patterns in the same cell type, say, tumor or normal, forming subtypes. Most available methods are likely to miss these genes. We developed SDEGRE, a new non-parametric method that detects differentially expressed genes by combining relative entropy with kernel density estimation and can detect all types of differences between two groups of samples. Significance is assessed by resampling-based permutation. We illustrate our method on two data sets from Golub et al. (1999) and Alon et al. (1999). Comparing the results with those of the T-test, the WRST and SAM, we identified novel differentially expressed genes whose biological significance is supported by previous studies but which were not detected by the other three methods. The results also show that the genes selected by SDEGRE have a better capability to distinguish the two cell types.
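A minimal sketch of the core computation behind a method like SDEGRE: estimate each group's density with a Gaussian kernel and numerically integrate the relative entropy (Kullback-Leibler divergence) between the two estimates. The bandwidth rule and function names here are my assumptions, not the paper's exact recipe; in the actual method, significance would then be assessed by permutation.

```python
import math

def gaussian_kde_1d(sample, bandwidth=None):
    """Return a pure-Python 1-D Gaussian kernel density estimate."""
    n = len(sample)
    if bandwidth is None:
        m = sum(sample) / n
        sd = (sum((x - m) ** 2 for x in sample) / (n - 1)) ** 0.5
        bandwidth = 1.06 * sd * n ** (-0.2)  # Silverman-type rule of thumb
        if bandwidth <= 0:
            bandwidth = 1.0  # guard against zero-variance samples
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    def pdf(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm
    return pdf

def relative_entropy(a, b, grid):
    """Approximate KL(p_a || p_b) by a Riemann sum over an equally spaced grid."""
    pa, pb = gaussian_kde_1d(a), gaussian_kde_1d(b)
    eps = 1e-12  # keeps the log finite where either density underflows
    step = grid[1] - grid[0]
    return step * sum(pa(x) * math.log((pa(x) + eps) / (pb(x) + eps)) for x in grid)
```

Because the divergence compares whole densities rather than means, a bimodal expression pattern in one group registers even when the group means coincide.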

3.
Detection of functional modules from protein interaction networks

4.
A Bayesian model-based clustering approach is proposed for identifying differentially expressed genes in meta-analysis. A Bayesian hierarchical model is used as a scientific tool for combining information from different studies, and a mixture prior is used to separate differentially expressed genes from non-differentially expressed genes. Posterior estimation of the parameters and missing observations is carried out using a simple Markov chain Monte Carlo method. From the estimated mixture model, useful significance measures such as the Bayesian false discovery rate (FDR), the local FDR (Efron et al., 2001), and the integration-driven discovery rate (IDR; Choi et al., 2003) can be computed easily. The model-based approach is also compared with commonly used permutation methods, and it is shown to be superior when under-expressed genes greatly outnumber over-expressed genes or vice versa. The proposed method is applied to four publicly available prostate cancer gene expression data sets and to simulated data sets.
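The Bayesian FDR mentioned above has a simple form once the mixture model supplies, for each gene, a posterior probability of being non-differentially expressed: the FDR of a called gene list is the average posterior null probability over that list. A sketch (function name mine):

```python
def bayesian_fdr(post_null, threshold):
    """Bayesian FDR of the gene list called at `threshold`.

    post_null[i] = posterior probability that gene i is NOT differentially
    expressed, taken from the fitted mixture model. Genes with posterior
    null probability <= threshold are 'called', and the FDR of that list
    is the mean posterior null probability among the called genes."""
    called = [p for p in post_null if p <= threshold]
    if not called:
        return 0.0
    return sum(called) / len(called)
```

Unlike a permutation FDR, this estimate comes directly from the fitted model, which is what makes it cheap to compute after MCMC.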

5.
The US National Cancer Institute has recently sponsored the formation of a Cohort Consortium (http://2002.cancer.gov/scpgenes.htm) to facilitate the pooling of data on very large numbers of people, concerning the effects of genes and environment on cancer incidence. One likely goal of these efforts will be to generate a large population-based case-control series for which a number of candidate genes will be investigated using SNP haplotype as well as genotype analysis. The goal of this paper is to outline the issues involved in choosing a method of estimating haplotype-specific risk estimates for such data that is technically appropriate and yet attractive to epidemiologists who are already comfortable with odds ratios and logistic regression. Our interest is to develop and evaluate extensions of methods, based on haplotype imputation, that have been recently described (Schaid et al., Am J Hum Genet, 2002, and Zaykin et al., Hum Hered, 2002) as providing score tests of the null hypothesis of no effect of SNP haplotypes upon risk, which may be used for more complex tasks, such as providing confidence intervals, and tests of equivalence of haplotype-specific risks in two or more separate populations. In order to do so we (1) develop a cohort approach towards odds ratio analysis by expanding the E-M algorithm to provide maximum likelihood estimates of haplotype-specific odds ratios as well as genotype frequencies; (2) show how to correct the cohort approach, to give essentially unbiased estimates for population-based or nested case-control studies, by incorporating the probability of selection as a case or control into the likelihood, based on a simplified model of case and control selection; and (3) finally, in an example data set (CYP17 and breast cancer, from the Multiethnic Cohort Study), we compare likelihood-based confidence interval estimates from the two methods with each other, and with the use of the single-imputation approach of Zaykin et al. applied under both null and alternative hypotheses. We conclude that so long as haplotypes are well predicted by SNP genotypes (we use the Rh2 criteria of Stram et al. [1]) the differences between the three methods are very small and in particular that the single imputation method may be expected to work extremely well.
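The E-M algorithm for haplotype frequencies mentioned in step (1) can be illustrated in the simplest two-SNP setting, where only double heterozygotes are phase-ambiguous. This is a textbook-style sketch under that assumption, not the authors' cohort/odds-ratio extension; names and the data layout are mine.

```python
def em_haplotypes(counts, n_iter=100):
    """EM estimate of two-SNP haplotype frequencies from unphased genotypes.

    counts[(g1, g2)] = number of individuals carrying g1 copies of allele 'a'
    at SNP1 and g2 copies of allele 'b' at SNP2 (each 0, 1 or 2).
    Only double heterozygotes (1, 1) have ambiguous phase."""
    H = ['ab', 'aB', 'Ab', 'AB']
    base = dict.fromkeys(H, 0.0)  # haplotype counts fixed by the genotypes
    n11 = 0
    for (g1, g2), n in counts.items():
        if (g1, g2) == (1, 1):
            n11 += n
            continue
        s1 = ['a'] * g1 + ['A'] * (2 - g1)
        s2 = ['b'] * g2 + ['B'] * (2 - g2)
        base[s1[0] + s2[0]] += n
        base[s1[1] + s2[1]] += n
    total = 2.0 * sum(counts.values())
    f = dict.fromkeys(H, 0.25)  # start from equal frequencies
    for _ in range(n_iter):
        # E-step: split double heterozygotes between the two possible phasings
        num = f['ab'] * f['AB']
        den = num + f['aB'] * f['Ab']
        p = num / den if den > 0 else 0.5
        e = dict(base)
        e['ab'] += n11 * p
        e['AB'] += n11 * p
        e['aB'] += n11 * (1 - p)
        e['Ab'] += n11 * (1 - p)
        # M-step: renormalize expected haplotype counts to frequencies
        f = {h: e[h] / total for h in H}
    return f
```

The paper's extension attaches odds-ratio parameters to the same expected-count machinery, so the E-step also weights haplotypes by their modeled disease risk.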

6.
Mixture modelling of gene expression data from microarray experiments
MOTIVATION: Hierarchical clustering is one of the major analytical tools for gene expression data from microarray experiments. A major problem in the interpretation of the output from these procedures is assessing the reliability of the clustering results. We address this issue by developing a mixture model-based approach for the analysis of microarray data. Within this framework, we present novel algorithms for clustering genes and samples. One of the byproducts of our method is a probabilistic measure for the number of true clusters in the data. RESULTS: The proposed methods are illustrated by application to microarray datasets from two cancer studies; one in which malignant melanoma is profiled (Bittner et al., Nature, 406, 536-540, 2000), and the other in which prostate cancer is profiled (Dhanasekaran et al., 2001, submitted).
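The mixture model-based clustering idea can be illustrated with a toy EM fit of a two-component 1-D Gaussian mixture; the posterior membership probabilities it returns are what drive the probabilistic clustering. This is a generic sketch, not the paper's algorithm; the initialisation and variance floor are my choices.

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, n_iter=50):
    """Toy EM fit of a two-component 1-D Gaussian mixture.

    Returns (weight of component 0, means, sds, posterior memberships)."""
    xs = sorted(xs)
    half = len(xs) // 2
    # crude initialisation: split the sorted data in half
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sd = [1.0, 1.0]
    w = 0.5
    r = [0.5] * len(xs)
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 0
        r = []
        for x in xs:
            p0 = w * normal_pdf(x, mu[0], sd[0])
            p1 = (1 - w) * normal_pdf(x, mu[1], sd[1])
            tot = p0 + p1
            r.append(p0 / tot if tot > 0 else 0.5)
        # M-step: update weight, means and standard deviations
        n0 = sum(r)
        n1 = len(xs) - n0
        w = n0 / len(xs)
        mu = [sum(ri * x for ri, x in zip(r, xs)) / n0,
              sum((1 - ri) * x for ri, x in zip(r, xs)) / n1]
        sd = [max(1e-3, (sum(ri * (x - mu[0]) ** 2 for ri, x in zip(r, xs)) / n0) ** 0.5),
              max(1e-3, (sum((1 - ri) * (x - mu[1]) ** 2 for ri, x in zip(r, xs)) / n1) ** 0.5)]
    return w, mu, sd, r
```

Comparing fitted mixtures with different numbers of components (e.g. via a likelihood criterion) is what yields the probabilistic measure for the number of true clusters.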

7.
Dahm PF, Olmsted AW, Greenbaum IF. Biometrics 2002, 58(4):1028-1031
Summary. Böhm et al. (1995, Human Genetics 95, 249-256) introduced a statistical model (named FSM, for fragile site model) specifically designed for the identification of fragile sites from chromosomal breakage data. In response to claims to the contrary (Hou et al., 1999, Human Genetics 104, 350-355; Hou et al., 2001, Biometrics 57, 435-440), we show how the FSM model is correctly modified for application under the assumption that the probability of random breakage is proportional to chromosomal band length, and how the purportedly alternative procedures proposed by Hou, Chang, and Tai (1999, 2001) are variations of the correctly modified FSM algorithm. With the exception of the test statistic employed, the procedure described by Hou et al. (1999) is shown to be functionally identical to the correctly modified FSM, and the application of an incorrectly modified FSM is shown to invalidate all of the comparisons of FSM to the alternatives proposed by Hou et al. (1999, 2001). Lastly, we discuss the statistical implications of the methodological variations proposed by Hou et al. (2001) and emphasize the logical and statistical necessity of basing fragile site identifications on data from single individuals.

8.
The transmission disequilibrium test (TDT) has been utilized to test the linkage and association between a genetic trait locus and a marker. Spielman et al. (1993) introduced the TDT to test linkage between a qualitative trait and a marker in the presence of association. In the presence of linkage, the TDT can be applied to test for association for fine mapping (Martin et al., 1997; Spielman and Ewens, 1996). In recent years, extensive research has been carried out on the TDT between a quantitative trait and a marker locus (Allison, 1997; Fan et al., 2002; George et al., 1999; Rabinowitz, 1997; Xiong et al., 1998; Zhu and Elston, 2000, 2001). The original TDT for both qualitative and quantitative traits requires unrelated offspring of heterozygous parents for analysis, and much research has been carried out to extend it to different settings. For nuclear families with multiple offspring, one approach is to treat each child independently in the analysis. Obviously, this may not be a valid method, since offspring of one family are related to each other. Another approach is to select one offspring randomly from each family for analysis; however, with this method much information may be lost. Martin et al. (1997, 2000) constructed useful statistical tests to analyse such data for qualitative traits. In this paper, we propose to use mixed models to analyse sample data of nuclear families with multiple offspring for quantitative traits, according to the models of Amos (1994). The method uses data from all offspring by taking into account their trait mean and variance-covariance structures, which contain all the effects of the major gene locus, polygenic loci and environment. A test statistic based on mixed models is shown to be more powerful than the test statistic proposed by George et al. (1999) under moderate disequilibrium for nuclear families. Moreover, it has higher power than the TDT statistic constructed by randomly choosing a single offspring from each nuclear family.
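For reference, the original TDT of Spielman et al. (1993) compares transmissions b and non-transmissions c of a marker allele from heterozygous parents using a McNemar-type chi-square statistic; the mixed-model extension above replaces this count-based statistic with a likelihood-based one. A small sketch (the chi-square with 1 df p-value is obtained via the complementary error function):

```python
import math

def tdt_statistic(b, c):
    """TDT of Spielman et al. (1993): b transmissions vs. c non-transmissions
    of the marker allele from heterozygous parents (McNemar-type statistic).
    Under the null of no linkage, (b - c)^2 / (b + c) is approximately
    chi-square distributed with 1 degree of freedom."""
    return (b - c) ** 2 / (b + c)

def tdt_pvalue(b, c):
    """P-value from the chi-square (1 df) survival function:
    P(X > x) = erfc(sqrt(x / 2)) for X = Z^2 with Z standard normal."""
    return math.erfc(math.sqrt(tdt_statistic(b, c) / 2))
```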

9.
A comparison of methods for screening differentially expressed genes from microarray data
单文娟, 童春发, 施季森. 《遗传》(Hereditas, Beijing) 2008, 30(12):1640-1646
Abstract: Using simulated data and real microarray data, we compared eight methods for screening differentially expressed genes, in order to assess how well each method performs on microarray data. The simulation analysis showed that all eight methods identify and detect uniformly distributed differentially expressed genes well. In terms of algorithms, SAM and the Wilcoxon rank-sum test performed best; in terms of data distributions, normally distributed genes were identified well, while genes following chi-square or exponential distributions were identified poorly. Analysis of a poplar cDNA microarray showed that SAM, Samroc and the regression-model method gave similar results, whereas the Wilcoxon rank-sum test differed considerably from them.

10.
MOTIVATION: We introduce simple graphical classification and prediction tools for tumor status using gene-expression profiles. They are based on two dimension-estimation techniques: sliced average variance estimation (SAVE) and sliced inverse regression (SIR). Both SAVE and SIR are used to infer the dimension of the classification problem and to obtain linear combinations of genes that contain sufficient information to predict class membership, such as tumor type. Plots of the estimated directions, as well as numerical thresholds estimated from the plots, are used to predict tumor classes in cDNA microarrays, and the performance of the class predictors is assessed by cross-validation. A microarray simulation study is carried out to compare the power and predictive accuracy of the two methods. RESULTS: The methods are applied to cDNA microarray data on BRCA1 and BRCA2 mutation carriers as well as sporadic tumors from Hedenfalk et al. (2001). All samples are correctly classified.
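For a binary outcome (e.g. two tumor classes), SIR's single estimated direction coincides, up to scaling, with Sigma^{-1}(mu_1 - mu_0), Fisher's discriminant direction. A pure-Python sketch for two predictors; the general method slices a continuous response and uses an eigendecomposition, and function names here are mine.

```python
def _mean(v):
    return sum(v) / len(v)

def sir_direction_2d(X, y):
    """SIR direction for a binary response y and two predictors:
    Sigma^{-1} (mean_1 - mean_0), with Sigma the marginal 2x2 covariance.
    X is a list of (x1, x2) pairs; y is a list of 0/1 labels."""
    X0 = [x for x, yi in zip(X, y) if yi == 0]
    X1 = [x for x, yi in zip(X, y) if yi == 1]
    m0 = [_mean([p[0] for p in X0]), _mean([p[1] for p in X0])]
    m1 = [_mean([p[0] for p in X1]), _mean([p[1] for p in X1])]
    gm = [_mean([p[0] for p in X]), _mean([p[1] for p in X])]
    # marginal covariance matrix of the predictors
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in X:
        d = [p[0] - gm[0], p[1] - gm[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j] / len(X)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]
```

Projecting samples onto this direction gives the one-dimensional plots from which classification thresholds are read off.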

11.
12.
Prostate cancer is one of the most common malignancies. The development and progression of prostate cancer are driven by a series of genetic and epigenetic events, including gene amplification that activates oncogenes and chromosomal deletion that inactivates tumor suppressor genes. Whereas gene amplification occurs in human prostate cancer, gene deletion is more common, and a large number of chromosomal regions have been identified as frequently deleted in prostate cancer, suggesting that tumor suppressor inactivation is more common than oncogene activation in prostatic carcinogenesis (Knuutila et al., 1998, 1999; Dong, 2001). Among the most frequently deleted chromosomal regions in prostate cancer, target genes such as NKX3-1 from 8p21, PTEN from 10q23 and ATBF1 from 16q22 have been identified by different approaches (He et al., 1997; Li et al., 1997; Sun et al., 2005), and deletion of these genes in mouse prostates has been demonstrated to induce and/or promote prostatic carcinogenesis. For example, knockout of Nkx3-1 in mice induces hyperplasia and dysplasia (Bhatia-Gaur et al., 1999; Abdulkadir et al., 2002) and promotes prostatic tumorigenesis (Abate-Shen et al., 2003), while knockout of Pten alone causes prostatic neoplasia (Wang et al., 2003). Therefore, gene deletion plays a causal role in prostatic carcinogenesis (Dong, 2001).

13.
Sun L, Kim YJ, Sun J. Biometrics 2004, 60(3):637-643
Doubly censored failure time data arise when the survival time of interest is the elapsed time between two related events and observations on occurrences of both events could be censored. Regression analysis of doubly censored data has recently attracted considerable attention, and several methods have been proposed for it (Kim et al., 1993, Biometrics 49, 13-22; Sun et al., 1999, Biometrics 55, 909-914; Pan, 2001, Biometrics 57, 1245-1250). However, all of these methods are based on the proportional hazards model, and it is well known that the proportional hazards model may not always fit failure time data well. This article investigates regression analysis of such data using the additive hazards model, and an estimating equation approach is proposed for inference about the regression parameters of interest. The proposed method can be implemented easily, and the properties of the proposed estimates of the regression parameters are established. The method is applied to a set of doubly censored data from an AIDS cohort study.

14.
The widely used “Maxent” software for modeling species distributions from presence‐only data (Phillips et al., Ecological Modelling, 190, 2006, 231) tends to produce models with high‐predictive performance but low‐ecological interpretability, and implications of Maxent's statistical approach to variable transformation, model fitting, and model selection remain underappreciated. In particular, Maxent's approach to model selection through lasso regularization has been shown to give less parsimonious distribution models—that is, models which are more complex but not necessarily predictively better—than subset selection. In this paper, we introduce the MIAmaxent R package, which provides a statistical approach to modeling species distributions similar to Maxent's, but with subset selection instead of lasso regularization. The simpler models typically produced by subset selection are ecologically more interpretable, and making distribution models more grounded in ecological theory is a fundamental motivation for using MIAmaxent. To that end, the package executes variable transformation based on expected occurrence–environment relationships and contains tools for exploring data and interrogating models in light of knowledge of the modeled system. Additionally, MIAmaxent implements two different kinds of model fitting: maximum entropy fitting for presence‐only data and logistic regression (GLM) for presence–absence data. Unlike Maxent, MIAmaxent decouples variable transformation, model fitting, and model selection, which facilitates methodological comparisons and gives the modeler greater flexibility when choosing a statistical approach to a given distribution modeling problem.

15.
Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and small number of samples n (p ≫ n), microarray data analysis poses big challenges for statistical analysis. An obvious problem arising from 'large p, small n' is over-fitting: just by chance, we are likely to find some non-differentially expressed genes that can classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been applied successfully in microarray data analysis. The SAM statistic proposed by Tusher et al. and the 'nearest shrunken centroid' proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple and intuitive, and have proved useful in empirical studies. Recently, Wu proposed penalized t/F-statistics with shrinkage, formally derived from L1-penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discuss the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data. We formally derive the ad hoc shrunken centroid used by Tibshirani et al. from L1-penalized regression models, and we show that penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.
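The connection between the ad hoc shrunken centroid and the L1 penalty runs through soft-thresholding: minimizing a least-squares criterion with an L1 penalty shrinks each coefficient toward zero by the penalty amount and truncates it to exactly zero inside the penalty band. A minimal sketch (function name mine):

```python
def soft_threshold(t, lam):
    """Soft-thresholding operator induced by an L1 penalty of size lam:
    shrinks t toward zero by lam and sets it to exactly zero when |t| <= lam.
    Applied componentwise to per-gene statistics or class centroids, this
    reproduces the kind of 'shrunken' quantities discussed above."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0
```

The exact-zero region is what performs gene selection: genes whose statistic falls inside the band drop out of the classifier entirely.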

16.
17.
Summary. Ye, Lin, and Taylor (2008, Biometrics 64, 1238-1246) proposed a joint model for longitudinal measurements and time-to-event data in which the longitudinal measurements are modeled with a semiparametric mixed model to allow for the complex patterns in longitudinal biomarker data. They proposed a two-stage regression calibration approach that is simpler to implement than a joint modeling approach. In the first stage of their approach, the mixed model is fit without regard to the time-to-event data. In the second stage, the posterior expectations of an individual's random effects from the mixed model are included as covariates in a Cox model. Although Ye et al. (2008) acknowledged that their regression calibration approach may cause a bias due to the problems of informative dropout and measurement error, they argued that the bias is small relative to alternative methods. In this article, we show that this bias may be substantial. We show how to alleviate much of this bias with an alternative regression calibration approach that can be applied to both discrete and continuous time-to-event data. Through simulations, the proposed approach is shown to have substantially less bias than the regression calibration approach proposed by Ye et al. (2008). In agreement with the methodology proposed by Ye et al. (2008), an advantage of our proposed approach over joint modeling is that it can be implemented with standard statistical software and does not require complex estimation techniques.

18.
MOTIVATION: High-density DNA microarrays measure the activities of several thousand genes simultaneously, and gene expression profiles have recently been used for cancer classification. This new approach promises to give better therapeutic decisions for cancer patients by diagnosing cancer types with improved accuracy. The Support Vector Machine (SVM) is one of the classification methods successfully applied to cancer diagnosis problems. However, its optimal extension to more than two classes is not obvious, which may limit its application to multiple tumor types. We briefly introduce the multicategory SVM (MSVM), a recently proposed extension of the binary SVM, and apply it to multiclass cancer diagnosis problems. RESULTS: Its applicability is demonstrated on the leukemia data (Golub et al., 1999) and the small round blue cell tumors of childhood data (Khan et al., 2001). The comparable classification accuracy shown in these applications, together with its flexibility, renders the MSVM a viable alternative to other classification methods. SUPPLEMENTARY INFORMATION: http://www.stat.ohio-state.edu/~yklee/msvm.htm

19.
Horizontal gene transfer in microbial genome evolution
Horizontal gene transfer is the collective name for processes that permit the exchange of DNA among organisms of different species. Only recently has it been recognized as making a significant contribution to microbial genome evolution. Traditionally, it was thought that microorganisms evolved clonally, passing genes from mother to daughter cells with little or no exchange of DNA among diverse species. Studies of microbial genomes, however, have shown that genomes contain genes closely related to those of a number of different prokaryotes, sometimes to phylogenetically very distantly related ones (Doolittle et al., 1990, J. Mol. Evol. 31, 383-388; Karlin et al., 1997, J. Bacteriol. 179, 3899-3913; Karlin et al., 1998, Annu. Rev. Genet. 32, 185-225; Lawrence and Ochman, 1998, Proc. Natl. Acad. Sci. USA 95, 9413-9417; Rivera et al., 1998, Proc. Natl. Acad. Sci. USA 95, 6239-6244; Campbell, 2000, Theor. Popul. Biol. 57, 71-77; Doolittle, 2000, Sci. Am. 282, 90-95; Ochman and Jones, 2000, EMBO J. 19, 6637-6643; Boucher et al., 2001, Curr. Opin. Microbiol. 4, 285-289; Wang et al., 2001, Mol. Biol. Evol. 18, 792-800). Whereas prokaryotic and eukaryotic evolution was once reconstructed from a single 16S ribosomal RNA (rRNA) gene, the analysis of complete genomes is beginning to yield a different picture of microbial evolution, one rife with the lateral movement of genes across vast phylogenetic distances (Lane et al., 1988, Methods Enzymol. 167, 138-144; Lake and Rivera, 1996, Proc. Natl. Acad. Sci. USA 91, 2880-2881; Lake et al., 1999, Science 283, 2027-2028).

20.

Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司), 京ICP备09084417号