Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method "Remove Unwanted Variation, 2-step" (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as ComBat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.

2.
High-throughput technologies, such as gene expression arrays and protein mass spectrometry, allow one to simultaneously evaluate thousands of potential biomarkers that could distinguish different tissue types. Of particular interest here is distinguishing between cancerous and normal organ tissues. We consider statistical methods to rank genes (or proteins) with regard to differential expression between tissues. Various statistical measures are considered, and we argue that two measures related to the Receiver Operating Characteristic Curve are particularly suitable for this purpose. We also propose that sampling variability in the gene rankings be quantified, and suggest using the "selection probability function," the probability distribution of rankings for each gene. This is estimated via the bootstrap. A real dataset, derived from gene expression arrays of 23 normal and 30 ovarian cancer tissues, is analyzed. Simulation studies are also used to assess the relative performance of different statistical gene ranking measures and our quantification of sampling variability. Our approach leads naturally to a procedure for sample-size calculations, appropriate for exploratory studies that seek to identify differentially expressed genes.
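To make the bootstrap idea in this abstract concrete, here is a minimal Python sketch. It is not the authors' code: the function names (`auc`, `rank_genes`, `selection_probability`), the use of max(AUC, 1 - AUC) as the ranking score, and the reduction of the full rank distribution to the probability of landing in the top `top_k` are all assumptions made for illustration.

```python
import random

def auc(normal, cancer):
    """Empirical AUC: P(cancer value > normal value), ties counted as 1/2."""
    wins = 0.0
    for c in cancer:
        for n in normal:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer) * len(normal))

def rank_genes(normal, cancer):
    """Order gene indices from most to least discriminating.

    `normal` and `cancer` are lists of samples; each sample is a list
    of expression values, one per gene.
    """
    n_genes = len(normal[0])
    scores = []
    for g in range(n_genes):
        a = auc([s[g] for s in normal], [s[g] for s in cancer])
        # Assumption: over- and under-expression are treated alike.
        scores.append(max(a, 1.0 - a))
    return sorted(range(n_genes), key=lambda g: -scores[g])

def selection_probability(normal, cancer, top_k, n_boot=200, seed=0):
    """Bootstrap estimate of P(gene is ranked within the top `top_k`)."""
    rng = random.Random(seed)
    counts = [0] * len(normal[0])
    for _ in range(n_boot):
        # Resample tissues with replacement, separately within each class.
        b_normal = [rng.choice(normal) for _ in normal]
        b_cancer = [rng.choice(cancer) for _ in cancer]
        for g in rank_genes(b_normal, b_cancer)[:top_k]:
            counts[g] += 1
    return [c / n_boot for c in counts]
```

A gene whose estimated probability stays near 1 across resamples is ranked stably; a gene that enters the top list only occasionally owes its rank largely to sampling variability.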

3.
Use of internal reference gene(s) is necessary for adequate quantification of target gene expression by RT-PCR. Herein, we elaborated a strategy of control gene selection based on microarray data and illustrated it by analyzing endomyocardial biopsies with acute cardiac rejection and infection. Using order statistics and the binomial distribution, we evaluated the probability of finding low-varying genes by chance. For analysis, the microarray data were divided into two sample subsets. Among the first 10% of genes with the lowest standard deviations, we found 14 genes common to both subsets. After normalization using two selected genes, high correlation was observed between expression of target genes evaluated by microarray and RT-PCR, and in an independent dataset by RT-PCR (r = 0.9, p < 0.001). In conclusion, we showed a simple and reliable strategy for the selection and validation of control genes for RT-PCR from microarray data that can be easily applied to different experimental designs and tissues.
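The selection step described here (take the lowest-variance genes in each of two sample subsets and keep those common to both) can be sketched roughly as follows. This is an illustrative reading of the abstract, not the published procedure; the function names and the `fraction` parameter are hypothetical, and the chance-probability calculation from order statistics is not reproduced.

```python
import statistics

def lowest_sd_genes(samples, fraction=0.10):
    """Return the set of gene indices with the lowest standard deviations.

    `samples` is a list of arrays; each array is a list of expression
    values, one per gene. `fraction` is the share of genes to keep.
    """
    n_genes = len(samples[0])
    sds = [statistics.stdev([s[g] for s in samples]) for g in range(n_genes)]
    k = max(1, int(n_genes * fraction))
    order = sorted(range(n_genes), key=lambda g: sds[g])
    return set(order[:k])

def stable_genes(subset_a, subset_b, fraction=0.10):
    """Genes that are low-varying in both sample subsets (candidate controls)."""
    common = lowest_sd_genes(subset_a, fraction) & lowest_sd_genes(subset_b, fraction)
    return sorted(common)
```

Requiring a gene to appear in both subsets is what guards against genes that look stable in one subset purely by chance.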

4.
Using a measure of how differentially expressed a gene is in two biochemically/phenotypically different conditions, we can rank all genes in a microarray dataset. We have shown that the falling-off of this measure (normalized maximum likelihood in a classification model such as logistic regression) as a function of the rank is typically a power-law function. Power-law functions in other similar ranked plots are known as Zipf's law, observed in many natural and social phenomena. The presence of this power-law function means there is no intrinsic cutoff point between the "important" genes and "irrelevant" genes. We have shown that similar power-law functions are also present in permuted datasets, and provide an explanation from the well-known chi-squared distribution of likelihood ratios. We discuss the implication of this Zipf's law for gene selection in microarray data analysis, as well as other characterizations of the ranked likelihood plots such as the rate of fall-off of the likelihood.

5.
Comparison of gene expression for two groups of individuals forms an important subclass of microarray experiments. We study multivariate procedures, in particular the use of Hotelling's T² for discrimination between the groups, with a special emphasis on methods based on only a few genes. We apply the methods to data from an experiment with a group of atopic dermatitis patients compared with a control group. We also compare our methodology to other recently proposed methods on publicly available datasets. It is found that (i) use of several genes gives a much improved discrimination of the groups as compared to one gene only, (ii) the genes that play the most important role in the multivariate analysis are not necessarily those that rank first in univariate comparisons of the groups, (iii) Linear Discriminant Analysis carried out with sets of 2-5 genes selected according to their Hotelling T² gives results comparable to state-of-the-art methods using many more genes, a feature of our method which might be crucial in clinical applications. Finding groups of genes that together give optimal multivariate discrimination (given the size of the group) can identify crucial pathways and networks of genes responsible for a disease. The computer code that we developed to make computations is available as an R package.
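For readers unfamiliar with the statistic, the following Python sketch computes the two-sample Hotelling T-squared for a pair of genes using the pooled covariance matrix. It is a textbook formulation written for illustration and is unrelated to the authors' R package; the function name `hotelling_t2_pair` is hypothetical, and the 2x2 matrix inverse is written out by hand to avoid external dependencies.

```python
def hotelling_t2_pair(group1, group2):
    """Two-sample Hotelling T-squared for two genes.

    `group1` and `group2` are lists of (x, y) pairs: the expression of
    gene X and gene Y in each individual of the group.
    """
    n1, n2 = len(group1), len(group2)
    m1 = (sum(p[0] for p in group1) / n1, sum(p[1] for p in group1) / n1)
    m2 = (sum(p[0] for p in group2) / n2, sum(p[1] for p in group2) / n2)

    # Pooled within-group covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = syy = sxy = 0.0
    for group, m in ((group1, m1), (group2, m2)):
        for x, y in group:
            sxx += (x - m[0]) ** 2
            syy += (y - m[1]) ** 2
            sxy += (x - m[0]) * (y - m[1])
    dof = n1 + n2 - 2
    sxx, syy, sxy = sxx / dof, syy / dof, sxy / dof

    # Quadratic form d' S^{-1} d using the closed-form 2x2 inverse.
    det = sxx * syy - sxy ** 2
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return (n1 * n2) / (n1 + n2) * quad
```

Because the off-diagonal term `sxy` enters the quadratic form, two genes that are individually unremarkable can still score highly as a pair, which is consistent with finding (ii) in the abstract. Note that a singular covariance matrix (`det == 0`) would raise a `ZeroDivisionError` in this sketch.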

6.
Huang HL, Lee CC, Ho SY. Bio Systems, 2007, 90(1): 78-86
It is essential to select a minimal number of relevant genes from microarray data while maximizing classification accuracy for the development of inexpensive diagnostic tests. However, it is intractable to simultaneously optimize gene selection and classification accuracy, which is a large parameter optimization problem. We propose an efficient evolutionary approach to gene selection from microarray data which can be combined with the optimal design of various multiclass classifiers. The proposed method (named GeneSelect) consists of three closely cooperating parts: an efficient encoding scheme of candidate solutions, a generalized fitness function, and an intelligent genetic algorithm (IGA). An existing hybrid approach based on a genetic algorithm and maximum likelihood classification (GA/MLHD) was proposed to select a small number of relevant genes for accurate classification of samples. To evaluate the performance of GeneSelect, the gene selection is combined with the same maximum likelihood classification (named IGA/MLHD) for convenient comparisons. IGA/MLHD is applied to 11 cancer-related human gene expression datasets. The simulation results show that IGA/MLHD is superior to GA/MLHD in terms of the number of selected genes, classification accuracy, and robustness of selected genes and accuracy.

7.
Testing for differentially expressed genes with microarray data (total citations: 1; self-citations: 1; citations by others: 0)
This paper compares the type I error and power of the one- and two-sample t-tests, and the one- and two-sample permutation tests for detecting differences in gene expression between two microarray samples with replicates using Monte Carlo simulations. When data are generated from a normal distribution, type I errors and powers of the one-sample parametric t-test and one-sample permutation test are very close, as are those of the two-sample t-test and two-sample permutation test, provided that the number of replicates is adequate. When data are generated from a t-distribution, the permutation tests outperform the corresponding parametric tests if the number of replicates is at least five. For data from a two-color dye swap experiment, the one-sample test appears to perform better than the two-sample test since expression measurements for control and treatment samples from the same spot are correlated. For data from independent samples, such as the one-channel array or two-channel array experiment using reference design, the two-sample t-tests appear more powerful than the one-sample t-tests.
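The two-sample comparison discussed in this abstract can be illustrated with a small Python sketch that computes a pooled-variance t statistic and a Monte Carlo permutation p-value for a single gene. This is a generic formulation, not the simulation code used in the paper; the function names, the default of 2000 permutations, and the add-one p-value correction are assumptions chosen for the example.

```python
import random
import statistics

def t_statistic(a, b):
    """Pooled-variance two-sample t statistic for replicate values a and b."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value based on |t|."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the group labels and recompute the statistic.
        rng.shuffle(pooled)
        t_perm = abs(t_statistic(pooled[:len(a)], pooled[len(a):]))
        if t_perm >= observed - 1e-12:  # small tolerance for rounding
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

With five replicates per group there are only 252 distinct label assignments, so the smallest attainable two-sided permutation p-value is about 0.008. This granularity is one reason the number of replicates matters so much for permutation tests.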

8.
Finding edging genes from microarray data (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: A set of genes and their gene expression levels are used to classify disease and normal tissues. Due to the massive number of genes in a microarray, there are a large number of edges to divide different classes of genes in microarray space. The edging genes (EGs) can be co-regulated genes; they can also be on the same pathway or deregulated by the same non-coding genes, such as siRNA or miRNA. Every gene in EGs is vital for identifying a tissue's class. A change in one EG's gene expression may cause a tissue alteration from normal to disease and vice versa. Finding EGs is of biological importance. In this work, we propose an algorithm to effectively find these EGs. RESULT: We tested our algorithm with five microarray datasets. The results are compared with the border-based algorithm which was used to find gene groups and subsequently divide different classes of tissues. Our algorithm finds a significantly larger number of EGs than does the border-based algorithm. As our algorithm prunes irrelevant patterns at earlier stages, its time and space complexities are much lower than those of the border-based algorithm. AVAILABILITY: The algorithm proposed is implemented in C++ on the Linux platform. The EGs in five microarray datasets are calculated. The preprocessed datasets and the discovered EGs are available at http://www3.it.deakin.edu.au/~phoebe/microarray.html.

9.
10.
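Based on the periodicity and peak characteristics of periodically expressed genes, we propose a method that partitions microarray time-series expression data into several gene expression cycles and evaluates the peak characteristics within each cycle in order to identify periodically expressed genes. The method effectively reduces the noise interference of microarray experiments. Three widely used time-series expression datasets and a reliable set of periodically expressed genes were selected to test the method, and its performance was compared with that of three typical methods for identifying periodically expressed genes. The method can effectively identify periodically expressed genes from a variety of microarray time-series expression data.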

11.
We demonstrate that the process of identifying differentially expressed genes in microarray studies with small sample sizes can be substantially improved by extracting information from a large number of datasets accumulated in public databases. The improvement comes from more reliable estimates of gene-specific variances based on other datasets. For a two-group comparison with two arrays in each group, for example, the result of our method was comparable to that of a t-test analysis with five samples in each group or to that of a regularized t-test analysis with three samples in each group. Our results are further improved by weighting the results of our approach with the regularized t-test results in a hybrid method.

12.
The effects of a drug on the genomic scale can be assessed in a three-color cDNA microarray with the three color intensities represented through the so-called hexaMplot. In our recent study, we have shown that the Hough Transform (HT) applied to the hexaMplot can be used to detect groups of coexpressed genes in the normal-disease-drug samples. However, the standard HT is not well suited for the purpose because 1) the assayed genes need first to be hard-partitioned into equally and differentially expressed genes, with HT ignoring possible information in the former group; 2) the hexaMplot coordinates are negatively correlated and there is no direct way of expressing this in the standard HT; and 3) it is not clear how to quantify the association of coexpressed genes with the line along which they cluster. We address these deficiencies by formulating a dedicated probabilistic model-based HT. The approach is demonstrated by assessing effects of the drug Rg1 on homocysteine-treated human umbilical vein endothelial cells. Compared with our previous study, we robustly detect stronger natural groupings of coexpressed genes. Moreover, the gene groups show coherent biological functions with high significance, as detected by the Gene Ontology analysis.

13.
Analysis of microarray experiments is complicated by the huge amount of data involved. Searching for groups of co-expressed genes is akin to searching for protein families in a database as, in both cases, small subsets of genes with similar features are to be found within vast quantities of data. CLANS was originally developed to find protein families in large sets of amino acid sequences where the amount of data involved made phylogenetic approaches overly cumbersome. We present a number of improvements that greatly extend the previous version of CLANS and show its application to microarray data as well as its ability to incorporate additional information to facilitate interactive analysis. AVAILABILITY: The program is available for download from: http://bioinfoserver.rsbs.anu.edu.au/downloads/clans/

14.
MOTIVATION: Gene expression experiments provide a fast and systematic way to identify disease markers relevant to clinical care. In this study, we address the problem of robust identification of differentially expressed genes from microarray data. Differentially expressed genes, or discriminator genes, are genes with significantly different expression in two user-defined groups of microarray experiments. We compare three model-free approaches: (1) nonparametric t-test, (2) Wilcoxon (or Mann-Whitney) rank sum test, and (3) a heuristic method based on high Pearson correlation to a perfectly differentiating gene ('ideal discriminator method'). We systematically assess the performance of each method based on simulated and biological data under varying noise levels and p-value cutoffs. RESULTS: All methods exhibit very low false positive rates and identify a large fraction of the differentially expressed genes in simulated data sets with noise level similar to that of actual data. Overall, the rank sum test appears most conservative, which may be advantageous when the computationally identified genes need to be tested biologically. However, if a more inclusive list of markers is desired, a higher p-value cutoff or the nonparametric t-test may be appropriate. When applied to data from lung tumor and lymphoma data sets, the methods identify biologically relevant differentially expressed genes that allow clear separation of groups in question. Thus the methods described and evaluated here provide a convenient and robust way to identify differentially expressed genes for further biological and clinical analysis.
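Two of the three approaches named in this abstract are simple enough to sketch in a few lines of Python. The code below is an illustrative interpretation only: it assumes the "ideal discriminator" is a 0/1 profile that is 0 in one group and 1 in the other, and it returns the raw rank-sum statistic rather than a p-value. The function names are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ideal_discriminator_score(expression, labels):
    """Correlation of a gene with an assumed ideal 0/1 group profile.

    `labels` holds 0 or 1 for each sample; scores near +1 or -1 mean
    the gene separates the two groups almost perfectly.
    """
    return pearson(expression, labels)

def rank_sum(expression, labels):
    """Wilcoxon rank-sum statistic for the samples labelled 1 (midranks for ties)."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    ranks = [0.0] * len(expression)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and expression[order[j + 1]] == expression[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return sum(r for r, label in zip(ranks, labels) if label == 1)
```

The correlation score uses the magnitudes of the expression values, whereas the rank-sum statistic uses only their ordering, which makes it less sensitive to outliers. A gene with constant expression has zero variance and would raise a `ZeroDivisionError` in `pearson`; such genes would need to be filtered out first.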

15.
A robust bioinformatics capability is widely acknowledged as central to realizing the promises of toxicogenomics. Successful application of toxicogenomic approaches, such as DNA microarray, inextricably relies on appropriate data management, the ability to extract knowledge from massive amounts of data and the availability of functional information for data interpretation. At the FDA's National Center for Toxicological Research (NCTR), we are developing public microarray data management and analysis software called ArrayTrack. ArrayTrack supports the Minimum Information About a Microarray Experiment (MIAME) standard for storing both microarray data and experiment parameters associated with a toxicogenomics study. A quality control mechanism is implemented to assure the fidelity of entered expression data. ArrayTrack also provides a rich collection of functional information about genes, proteins and pathways drawn from various public biological databases for facilitating data interpretation. In addition, several data analysis and visualization tools are available with ArrayTrack, and more tools will be available in the next released version. Importantly, gene expression data, functional information and analysis methods are fully integrated so that the data analysis and interpretation process is simplified and enhanced. ArrayTrack is publicly available online and the prospective user can also request a local installation version by contacting the authors.

16.
ArrayExpress is a new public database of microarray gene expression data at the EBI, which is a generic gene expression database designed to hold data from all microarray platforms. ArrayExpress uses the annotation standard Minimum Information About a Microarray Experiment (MIAME) and the associated XML data exchange format Microarray Gene Expression Markup Language (MAGE-ML) and it is designed to store well annotated data in a structured way. The ArrayExpress infrastructure consists of the database itself, data submissions in MAGE-ML format or via an online submission tool MIAMExpress, an online database query interface, and the Expression Profiler online analysis tool. ArrayExpress accepts three types of submission (arrays, experiments and protocols), each of which is assigned an accession number. Help on data submission and annotation is provided by the curation team. The database can be queried on parameters such as author, laboratory, organism, experiment or array types. With an increasing number of organisations adopting the MAGE-ML standard, the volume of submissions to ArrayExpress is increasing rapidly. The database can be accessed at http://www.ebi.ac.uk/arrayexpress.

17.
MOTIVATION: Recent studies have shown that microarray gene expression data are useful for phenotype classification of many diseases. A major problem in this classification is that the number of features (genes) greatly exceeds the number of instances (tissue samples). It has been shown that selecting a small set of informative genes can lead to improved classification accuracy. Many approaches have been proposed for this gene selection problem. Most of the previous gene ranking methods typically select 50-200 top-ranked genes and these genes are often highly correlated. Our goal is to select a small set of non-redundant marker genes that are most relevant for the classification task. RESULTS: To achieve this goal, we developed a novel hybrid approach that combines gene ranking and clustering analysis. In this approach, we first applied feature filtering algorithms to select a set of top-ranked genes, and then applied hierarchical clustering on these genes to generate a dendrogram. Finally, the dendrogram was analyzed by a sweep-line algorithm and marker genes were selected by collapsing dense clusters. An empirical study using three public datasets shows that our approach is capable of selecting relatively few marker genes while offering the same or better leave-one-out cross-validation accuracy compared with approaches that use top-ranked genes directly for classification. AVAILABILITY: The HykGene software is freely available at http://www.cs.dartmouth.edu/~wyh/software.htm CONTACT: wyh@cs.dartmouth.edu SUPPLEMENTARY INFORMATION: Supplementary material is available from http://www.cs.dartmouth.edu/~wyh/hykgene/supplement/index.htm.

18.
Fuzzy J-Means and VNS methods for clustering genes from microarray data (total citations: 4; self-citations: 0; citations by others: 4)
MOTIVATION: In the interpretation of gene expression data from a group of microarray experiments that include samples from either different patients or conditions, special consideration must be given to the pleiotropic and epistatic roles of genes, as observed in the variation of gene coexpression patterns. Crisp clustering methods assign each gene to one cluster, thereby omitting information about the multiple roles of genes. RESULTS: Here, we present the application of a local search heuristic, Fuzzy J-Means, embedded into the variable neighborhood search metaheuristic for the clustering of microarray gene expression data. We show that for all the datasets studied this algorithm outperforms the standard Fuzzy C-Means heuristic. Different methods for the utilization of cluster membership information in determining gene coregulation are presented. The clustering and data analyses were performed on simulated datasets as well as experimental cDNA microarray data for breast cancer and human blood from the Stanford Microarray Database. AVAILABILITY: The source code of the clustering software (C programming language) is freely available from Nabil.Belacel@nrc-cnrc.gc.ca

19.
Although two-color fluorescent DNA microarrays are now standard equipment in many molecular biology laboratories, methods for identifying differentially expressed genes in microarray data are still evolving. Here, we report a refined test for differentially expressed genes which does not rely on gene expression ratios but directly compares a series of repeated measurements of the two dye intensities for each gene. This test uses a statistical model to describe multiplicative and additive errors influencing an array experiment, where model parameters are estimated from observed intensities for all genes using the method of maximum likelihood. A generalized likelihood ratio test is performed for each gene to determine whether, under the model, these intensities are significantly different. We use this method to identify significant differences in gene expression among yeast cells growing in galactose-stimulating versus non-stimulating conditions and compare our results with current approaches for identifying differentially expressed genes. The effect of sample size on parameter optimization is also explored, as is the use of the error model to compare the within- and between-slide intensity variation intrinsic to an array experiment.

20.

Background  

Gene microarray technology provides the ability to study the regulation of thousands of genes simultaneously, but its potential is limited without an estimate of the statistical significance of the observed changes in gene expression. Due to the large number of genes being tested and the comparatively small number of array replicates (e.g., N = 3), standard statistical methods such as the Student's t-test fail to produce reliable results. Two other statistical approaches commonly used to improve significance estimates are a penalized t-test and a Z-test using intensity-dependent variance estimates.
