首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 343 毫秒
1.
Microarray experiments can generate enormous amounts of data, but large datasets are usually inherently complex, and the relevant information they contain can be difficult to extract. For the practicing biologist, we provide an overview of what we believe to be the most important issues that need to be addressed when dealing with microarray data. In a microarray experiment we are simply trying to identify which genes are the most "interesting" in terms of our experimental question, and these will usually be those that are either overexpressed or underexpressed (upregulated or downregulated) under the experimental conditions. Analysis of the data to find these genes involves first preprocessing of the raw data for quality control, including filtering of the data (e.g., detection of outlying values) followed by standardization of the data (i.e., making the data uniformly comparable throughout the dataset). This is followed by the formal quantitative analysis of the data, which will involve either statistical hypothesis testing or multivariate pattern recognition. Statistical hypothesis testing is the usual approach to "class comparison," where several experimental groups are being directly compared. The best approach to this problem is to use analysis of variance, although issues related to multiple hypothesis testing and probability estimation still need to be evaluated. Pattern recognition can involve "class prediction," for which a range of supervised multivariate techniques are available, or "class discovery," for which an even broader range of unsupervised multivariate techniques have been developed. Each technique has its own limitations, which need to be kept in mind when making a choice from among them. To put these ideas in context, we provide a detailed examination of two specific examples of the analysis of microarray data, both from parasitology, covering many of the most important points raised.  相似文献   

2.
MOTIVATION: Currently most of the methods for identifying differentially expressed genes fall into the category of so called single-gene-analysis, performing hypothesis testing on a gene-by-gene basis. In a single-gene-analysis approach, estimating the variability of each gene is required to determine whether a gene is differentially expressed or not. Poor accuracy of variability estimation makes it difficult to identify genes with small fold-changes unless a very large number of replicate experiments are performed. RESULTS: We propose a method that can avoid the difficult task of estimating variability for each gene, while reliably identifying a group of differentially expressed genes with low false discovery rates, even when the fold-changes are very small. In this article, a new characterization of differentially expressed genes is established based on a theorem about the distribution of ranks of genes sorted by (log) ratios within each array. This characterization of differentially expressed genes based on rank is an example of all-gene-analysis instead of single gene analysis. We apply the method to a cDNA microarray dataset and many low fold-changed genes (as low as 1.3 fold-changes) are reliably identified without carrying out hypothesis testing on a gene-by-gene basis. The false discovery rate is estimated in two different ways reflecting the variability from all the genes without the complications related to multiple hypothesis testing. We also provide some comparisons between our approach and single-gene-analysis based methods. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

3.
Finding edging genes from microarray data   总被引:1,自引:0,他引:1  
MOTIVATION: A set of genes and their gene expression levels are used to classify disease and normal tissues. Due to the massive number of genes in microarray, there are a large number of edges to divide different classes of genes in microarray space. The edging genes (EGs) can be co-regulated genes, they can also be on the same pathway or deregulated by the same non-coding genes, such as siRNA or miRNA. Every gene in EGs is vital for identifying a tissue's class. The changing in one EG's gene expression may cause a tissue alteration from normal to disease and vice versa. Finding EGs is of biological importance. In this work, we propose an algorithm to effectively find these EGs. RESULT: We tested our algorithm with five microarray datasets. The results are compared with the border-based algorithm which was used to find gene groups and subsequently divide different classes of tissues. Our algorithm finds a significantly larger amount of EGs than does the border-based algorithm. As our algorithm prunes irrelevant patterns at earlier stages, time and space complexities are much less prevalent than in the border-based algorithm. AVAILABILITY: The algorithm proposed is implemented in C++ on Linux platform. The EGs in five microarray datasets are calculated. The preprocessed datasets and the discovered EGs are available at http://www3.it.deakin.edu.au/~phoebe/microarray.html.  相似文献   

4.
Significance of gene ranking for classification of microarray samples   总被引:1,自引:0,他引:1  
Many methods for classification and gene selection with microarray data have been developed. These methods usually give a ranking of genes. Evaluating the statistical significance of the gene ranking is important for understanding the results and for further biological investigations, but this question has not been well addressed for machine learning methods in existing works. Here, we address this problem by formulating it in the framework of hypothesis testing and propose a solution based on resampling. The proposed r-test methods convert gene ranking results into position p-values to evaluate the significance of genes. The methods are tested on three real microarray data sets and three simulation data sets with support vector machines as the method of classification and gene selection. The obtained position p-values help to determine the number of genes to be selected and enable scientists to analyze selection results by sophisticated multivariate methods under the same statistical inference paradigm as for simple hypothesis testing methods.  相似文献   

5.
Screening for differential gene expression in microarray studies leads to difficult large-scale multiple testing problems. The local false discovery rate is a statistical concept for quantifying uncertainty in multiple testing. We introduce a novel estimator for the local false discovery rate that is based on an algorithm which splits all genes into two groups, representing induced and noninduced genes, respectively. Starting from the full set of genes, we successively exclude genes until the gene-wise p-values of the remaining genes look like a typical sample from a uniform distribution. In comparison to other methods, our algorithm performs compatibly in detecting the shape of the local false discovery rate and has a smaller bias with respect to estimating the overall percentage of noninduced genes. Our algorithm is implemented in the Bioconductor compatible R package TWILIGHT version 1.0.1, which is available from http://compdiag.molgen.mpg.de/software or from the Bioconductor project at http://www.bioconductor.org.  相似文献   

6.
A key step in the analysis of microarray data is the selection of genes that are differentially expressed. Ideally, such experiments should be properly replicated in order to infer both technical and biological variability, and the data should be subjected to rigorous hypothesis tests to identify the differentially expressed genes. However, in microarray experiments involving the analysis of very large numbers of biological samples, replication is not always practical. Therefore, there is a need for a method to select differentially expressed genes in a rational way from insufficiently replicated data. In this paper, we describe a simple method that uses bootstrapping to generate an error model from a replicated pilot study that can be used to identify differentially expressed genes in subsequent large-scale studies on the same platform, but in which there may be no replicated arrays. The method builds a stratified error model that includes array-to-array variability, feature-to-feature variability and the dependence of error on signal intensity. We apply this model to the characterization of the host response in a model of bacterial infection of human intestinal epithelial cells. We demonstrate the effectiveness of error model based microarray experiments and propose this as a general strategy for a microarray-based screening of large collections of biological samples.  相似文献   

7.

Background  

The most popular methods for significance analysis on microarray data are well suited to find genes differentially expressed across predefined categories. However, identification of features that correlate with continuous dependent variables is more difficult using these methods, and long lists of significant genes returned are not easily probed for co-regulations and dependencies. Dimension reduction methods are much used in the microarray literature for classification or for obtaining low-dimensional representations of data sets. These methods have an additional interpretation strength that is often not fully exploited when expression data are analysed. In addition, significance analysis may be performed directly on the model parameters to find genes that are important for any number of categorical or continuous responses. We introduce a general scheme for analysis of expression data that combines significance testing with the interpretative advantages of the dimension reduction methods. This approach is applicable both for explorative analysis and for classification and regression problems.  相似文献   

8.
Microarray technology is currently one of the most widely-used technologies in biology. Many studies focus on inferring the function of an unknown gene from its co-expressed genes. Here, we are able to show that there are two types of positional artifacts in microarray data introducing spurious correlations between genes. First, we find that genes that are close on the microarray chips tend to have higher correlations between their expression profiles. We call this the 'chip artifact'. Our calculations suggest that the carry-over during the printing process is one of the major sources of this type of artifact, which is later confirmed by our experiments. Based on our experiments, the measured intensity of a microarray spot contains 0.1% (for fully-hybridized spots) to 93% (for un-hybridized ones) of noise resulting from this artifact. Secondly, we, for the first time, show that genes that are close on the microtiter plates in microarray experiments also tend to have higher correlations. We call this the 'plate artifact'. Both types of artifacts exist with different severity in all cDNA microarray experiments that we analyzed. Therefore, we develop an automated web tool-COP (COrrelations by Positional artifacts) to detect these artifacts in microarray experiments. COP has been integrated with the microarray data normalization tool, ExpressYourself, which is available at http://bioinfo.mbb.yale.edu/ExpressYourself/. Together, the two can eliminate most of the common noises in microarray data.  相似文献   

9.
As much of the focus of genetics and molecular biology has shifted toward the systems level, it has become increasingly important to accurately extract biologically relevant signal from thousands of related measurements. The common property among these high-dimensional biological studies is that the measured features have a rich and largely unknown underlying structure. One example of much recent interest is identifying differentially expressed genes in comparative microarray experiments. We propose a new approach aimed at optimally performing many hypothesis tests in a high-dimensional study. This approach estimates the optimal discovery procedure (ODP), which has recently been introduced and theoretically shown to optimally perform multiple significance tests. Whereas existing procedures essentially use data from only one feature at a time, the ODP approach uses the relevant information from the entire data set when testing each feature. In particular, we propose a generally applicable estimate of the ODP for identifying differentially expressed genes in microarray experiments. This microarray method consistently shows favorable performance over five highly used existing methods. For example, in testing for differential expression between two breast cancer tumor types, the ODP provides increases from 72% to 185% in the number of genes called significant at a false discovery rate of 3%. Our proposed microarray method is freely available to academic users in the open-source, point-and-click EDGE software package.  相似文献   

10.
Microarray technology is rapidly emerging for genome-wide screening of differentially expressed genes between clinical subtypes or different conditions of human diseases. Traditional statistical testing approaches, such as the two-sample t-test or Wilcoxon test, are frequently used for evaluating statistical significance of informative expressions but require adjustment for large-scale multiplicity. Due to its simplicity, Bonferroni adjustment has been widely used to circumvent this problem. It is well known, however, that the standard Bonferroni test is often very conservative. In the present paper, we compare three multiple testing procedures in the microarray context: the original Bonferroni method, a Bonferroni-type improved single-step method and a step-down method. The latter two methods are based on nonparametric resampling, by which the null distribution can be derived with the dependency structure among gene expressions preserved and the family-wise error rate accurately controlled at the desired level. We also present a sample size calculation method for designing microarray studies. Through simulations and data analyses, we find that the proposed methods for testing and sample size calculation are computationally fast and control error and power precisely.  相似文献   

11.
Developmental mechanisms play an important role in determining the costs, limits, and evolutionary consequences of phenotypic plasticity. One issue central to these claims is the hypothesis of developmental decoupling, where alternate morphs result from evolutionarily independent developmental pathways. We address this assumption through a microarray study that tests whether differences in gene expression between alternate morphs are as divergent as those between sexes, a classic example of developmental decoupling. We then examine whether genes with morph‐biased expression are less conserved than genes with shared expression between morphs, as predicted if developmental decoupling relaxes pleiotropic constraints on divergence. We focus on the developing horns and brains of two species of horned beetles with impressive sexual‐ and morph‐dimorphism in the expression of horns and fighting behavior. We find that patterns of gene expression were as divergent between morphs as they were between sexes. However, overall patterns of gene expression were also highly correlated across morphs and sexes. Morph‐biased genes were more evolutionarily divergent, suggesting a role of relaxed pleiotropic constraints or relaxed selection. Together these results suggest that alternate morphs are to some extent developmentally decoupled, and that this decoupling has significant evolutionary consequences. However, alternative morphs may not be as developmentally decoupled as sometimes assumed and such hypotheses of development should be revisited and refined.  相似文献   

12.

Background  

Gene Ontology (GO) terms are often used to assess the results of microarray experiments. The most common way to do this is to perform Fisher's exact tests to find GO terms which are over-represented amongst the genes declared to be differentially expressed in the analysis of the microarray experiment. However, due to the high degree of dependence between GO terms, statistical testing is conservative, and interpretation is difficult.  相似文献   

13.
14.
MOTIVATION: In analyses of microarray data with a design of different biological conditions, ranking genes by their differential 'importance' is often desired so that biologists can focus research on a small subset of genes that are most likely related to the experiment conditions. Permutation methods are often recommended and used, in place of their parametric counterparts, due to the small sample sizes of microarray experiments and possible non-normality of the data. The recommendations, however, are based on classical knowledge in the hypothesis test setting. RESULTS: We explore the relationship between hypothesis testing and gene ranking. We indicate that the permutation method does not provide a metric for the distance between two underlying distributions. In our simulation studies permutation methods tend to be equally or less accurate than parametric methods in ranking genes. This is partially due to the discreteness of the permutation distributions, as well as the non-metric property. In data analysis the variability in ranking genes can be assessed by bootstrap. It turns out that the variability is much lower for permutation than parametric methods, which agrees with the known robustness of permutation methods to individual outliers in the data.  相似文献   

15.
16.
17.
Discrimination of disease patients based on gene expression data is a crucial problem in clinical area. An important issue to solve this problem is to find a discriminative subset of genes from thousands of genes on a microarray or DNA chip. Aiming at finding informative genes for disease classification on microarray, we present a gene selection method based on the forward variable (gene) selection method (FSM) and show, using typical public microarray datasets, that our method can extract a small set of genes being crucial for discriminating different classes with a very high accuracy almost closed to perfect classification.  相似文献   

18.
Statistical analysis of microarray data: a Bayesian approach   总被引:2,自引:0,他引:2  
The potential of microarray data is enormous. It allows us to monitor the expression of thousands of genes simultaneously. A common task with microarray is to determine which genes are differentially expressed between two samples obtained under two different conditions. Recently, several statistical methods have been proposed to perform such a task when there are replicate samples under each condition. Two major problems arise with microarray data. The first one is that the number of replicates is very small (usually 2-10), leading to noisy point estimates. As a consequence, traditional statistics that are based on the means and standard deviations, e.g. t-statistic, are not suitable. The second problem is that the number of genes is usually very large (approximately 10,000), and one is faced with an extreme multiple testing problem. Most multiple testing adjustments are relatively conservative, especially when the number of replicates is small. In this paper we present an empirical Bayes analysis that handles both problems very well. Using different parametrizations, we develop four statistics that can be used to test hypotheses about the means and/or variances of the gene expression levels in both one- and two-sample problems. The methods are illustrated using experimental data with prior knowledge. In addition, we present the result of a simulation comparing our methods to well-known statistics and multiple testing adjustments.  相似文献   

19.
MOTIVATION: The increasing availability of gene expression microarray technology has resulted in the publication of thousands of microarray gene expression datasets investigating various biological conditions. This vast repository is still underutilized due to the lack of methods for fast, accurate exploration of the entire compendium. RESULTS: We have collected Saccharomyces cerevisiae gene expression microarray data containing roughly 2400 experimental conditions. We analyzed the functional coverage of this collection and we designed a context-sensitive search algorithm for rapid exploration of the compendium. A researcher using our system provides a small set of query genes to establish a biological search context; based on this query, we weight each dataset's relevance to the context, and within these weighted datasets we identify additional genes that are co-expressed with the query set. Our method exhibits an average increase in accuracy of 273% compared to previous mega-clustering approaches when recapitulating known biology. Further, we find that our search paradigm identifies novel biological predictions that can be verified through further experimentation. Our methodology provides the ability for biological researchers to explore the totality of existing microarray data in a manner useful for drawing conclusions and formulating hypotheses, which we believe is invaluable for the research community. AVAILABILITY: Our query-driven search engine, called SPELL, is available at http://function.princeton.edu/SPELL. SUPPLEMENTARY INFORMATION: Several additional data files, figures and discussions are available at http://function.princeton.edu/SPELL/supplement.  相似文献   

20.
Methods are presented for detecting differential expression using statistical hypothesis testing methods including analysis of variance (ANOVA). Practicalities of experimental design, power, and sample size are discussed. Methods for multiple testing correction and their application are described. Instructions for running typical analyses are given in the R programming environment. R code and the sample data set used to generate the examples are available at http://microarray.cpmc.columbia.edu/pavlidis/pub/aovmethods/.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号