

Human cancers are complex ecosystems composed of cells with distinct molecular signatures. Such intratumoral heterogeneity poses a major challenge to cancer diagnosis and treatment. Recent advances in single-cell techniques such as scRNA-seq have brought unprecedented insights into cellular heterogeneity. The resulting computational challenge is to cluster high-dimensional, noisy datasets that contain substantially fewer cells than genes.


In this paper, we introduce conCluster, a consensus clustering framework for cancer subtype identification from single-cell RNA-seq data. Using an ensemble strategy, conCluster fuses multiple basic partitions into consensus clusters.
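
The idea of fusing several basic partitions into one consensus can be illustrated with a co-association (evidence accumulation) sketch. This is not conCluster's published algorithm, which the abstract does not detail; the function names, the 0.5 threshold, and the toy partitions are all assumptions made for illustration.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of basic partitions in which each pair of cells shares a cluster."""
    n = len(partitions[0])
    co = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1.0 / len(partitions)
    return co


def consensus_clusters(partitions, threshold=0.5):
    """Link cells that co-cluster in more than `threshold` of the partitions.

    The connected components of the resulting graph are returned as
    consensus labels (hypothetical fusion rule, chosen for simplicity).
    """
    co = co_association(partitions)
    n = len(co)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three made-up basic partitions of six cells; label values are arbitrary per run.
basic_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [2, 2, 2, 5, 5, 5],
]
consensus = consensus_clusters(basic_partitions)
```

Because the fusion works on pairwise co-membership rather than on raw labels, the basic partitions do not need to use matching label values.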


Applied to real cancer scRNA-seq datasets, conCluster detects cancer subtypes more accurately than widely used scRNA-seq clustering methods. We further conducted co-expression network analysis for the identified melanoma subtypes.


Our analysis demonstrates that these subtypes exhibit distinct gene co-expression networks and significant gene sets with different functional enrichment.

RNA-seq technologies are quickly revolutionizing genomic studies, and statistical methods for RNA-seq data are under continuous development. A timely review and comparison of the most recently proposed statistical methods will provide a useful guide for choosing among them for data analysis. Particular interest surrounds the ability to detect differential expression (DE) in genes. Here we compare four recently proposed statistical methods, edgeR, DESeq, baySeq, and a method with a two-stage Poisson model (TSPM), through a variety of simulations based on different distribution models or real data. We compare the ability of these methods to detect DE genes in terms of the significance ranking of genes and false discovery rate control. All methods compared are implemented in freely available software. We also discuss the availability and functions of the currently available versions of these software packages.

Tan YD, Fornage M, Fu YX. Genomics, 2006, 88(6): 846-854
Microarray technology provides a powerful tool for profiling the expression of thousands of genes simultaneously, which makes it possible to explore the molecular and metabolic etiology of the development of a complex disease under study. However, classical statistical methods and technologies are not applicable to microarray data. Therefore, it is necessary and motivating to develop powerful methods for large-scale statistical analyses. In this paper, we describe a novel method, called Ranking Analysis of Microarray Data (RAM). RAM is a large-scale two-sample t-test method. It compares a set of ranked T statistics with a set of ranked Z values (ranked estimated null scores) obtained by a "random splitting" approach rather than a "permutation" approach, and it uses a two-simulation strategy to estimate the proportion of genes identified by chance, i.e., the false discovery rate (FDR). The results obtained from simulated and observed microarray data show that RAM is more efficient than Significance Analysis of Microarrays in identifying differentially expressed genes and estimating the FDR under undesirable conditions such as a large fudge factor, a small sample size, or a mixture distribution of noise.

MOTIVATION: To evaluate microarray data, clustering is widely used to group biological samples or genes. However, problems arise when comparing heterologous databases. As the clustering algorithm searches for similarities between experiments, it will most likely first separate the data sets, masking relationships that exist between samples from different databases. RESULTS: We developed a program, Venn Mapper, to calculate the statistical significance of the number of co-occurring differentially expressed genes in any two experiments. For proof of principle, we analysed a heterologous data set of 170 microarrays including breast and prostate cancer microarray analyses. In an unsupervised analysis, significant overlap was found between metastasized prostate cancer and both metastasized breast cancer and BRCA mutated breast cancer. A comparison between single microarray data and the averaged breast and prostate data sets was also evaluated. This analysis suggests that genes more highly expressed in stromal cells are also implicated in metastatic prostate cancer and BRCA mutated breast cancer. The Venn Mapper program identifies overlaps between samples from heterologous data sets and directly extracts the genes responsible for the overlap. From this information novel biological hypotheses may be addressed. AVAILABILITY: Venn Mapper is freely available on http://www.erasmusmc.nl/gatcplatform. SUPPLEMENTARY INFORMATION: http://www.erasmusmc.nl/gatcplatform/vennmapper.html.
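
One common way to score the significance of an overlap between two lists of differentially expressed genes is an upper-tail hypergeometric probability. The sketch below uses that test as a stand-in; the abstract does not say which statistic Venn Mapper computes, so the function name, arguments, and example counts are assumptions.

```python
from math import comb


def overlap_p_value(n_genes, n_a, n_b, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_genes   : genes measured in both experiments (hypothetical input)
    n_a, n_b  : sizes of the two differentially expressed gene lists
    n_overlap : genes appearing in both lists
    """
    total = comb(n_genes, n_b)
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_genes - n_a, n_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 1000 shared genes, lists of 50 and 40 genes, 10 in common.
p_overlap = overlap_p_value(1000, 50, 40, 10)
```

With lists of 50 and 40 genes out of 1000, about two shared genes are expected by chance, so an overlap of 10 yields a small tail probability.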

To detect changes in gene expression data from microarrays, a fixed threshold for fold difference is widely used. However, a threshold value that is appropriate for highly expressed genes is not necessarily suitable for lowly expressed genes. In this study, aiming to detect truly differentially expressed genes across a wide expression range, we propose an adaptive threshold method (AT). The adaptive thresholds, which take different values for different expression levels, are calculated from two measurements under the same condition. The sensitivity, specificity and false discovery rate (FDR) of AT were investigated by simulations. The sensitivity and specificity under various noise conditions were greater than 89.7% and 99.32%, respectively. The FDR was smaller than 0.27. These results demonstrate the reliability of the method.
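
The notion of an expression-dependent cutoff learned from two same-condition measurements can be sketched as follows. The binning scheme, the 0.95 quantile, and all names and numbers are assumptions for illustration; the abstract does not specify how AT computes its thresholds.

```python
import math


def adaptive_thresholds(rep1, rep2, n_bins=3, quantile=0.95):
    """Per-bin cutoffs on |log2 ratio| from two same-condition measurements.

    Returns (upper bound on mean log2 intensity, cutoff) pairs, ordered from
    the lowest to the highest expression bin.
    """
    points = sorted(
        ((math.log2(a) + math.log2(b)) / 2, abs(math.log2(a / b)))
        for a, b in zip(rep1, rep2)
    )
    size = math.ceil(len(points) / n_bins)
    bins = []
    for start in range(0, len(points), size):
        chunk = points[start:start + size]
        ratios = sorted(r for _, r in chunk)
        idx = min(len(ratios) - 1, int(quantile * len(ratios)))
        bins.append((chunk[-1][0], ratios[idx]))
    return bins


def is_changed(x, y, bins):
    """Compare a between-condition |log2 ratio| with the cutoff of its bin."""
    level = (math.log2(x) + math.log2(y)) / 2
    ratio = abs(math.log2(x / y))
    for upper, cutoff in bins:
        if level <= upper:
            return ratio > cutoff
    return ratio > bins[-1][1]  # above all bins: use the highest bin's cutoff


# Made-up replicate intensities: noise is larger at low expression.
same_condition_a = [10, 12, 9, 100, 110, 95, 1000, 1020, 990]
same_condition_b = [14, 9, 12, 105, 100, 99, 1005, 1000, 1000]
bins = adaptive_thresholds(same_condition_a, same_condition_b)
low_call = is_changed(10, 13, bins)
high_call = is_changed(1000, 1300, bins)
```

In this toy input the same 1.3-fold change is flagged at the high expression level but not at the low one, because the replicate noise, and hence the cutoff, is larger for weakly expressed genes.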

A Bayesian model-based clustering approach is proposed for identifying differentially expressed genes in meta-analysis. A Bayesian hierarchical model is used as a scientific tool for combining information from different studies, and a mixture prior is used to separate differentially expressed genes from non-differentially expressed genes. Posterior estimation of the parameters and missing observations is done using a simple Markov chain Monte Carlo method. From the estimated mixture model, useful measures of the significance of a test, such as the Bayesian false discovery rate (FDR), the local FDR (Efron et al., 2001), and the integration-driven discovery rate (IDR; Choi et al., 2003), can be easily computed. The model-based approach is also compared with commonly used permutation methods, and it is shown that the model-based approach is superior to the permutation methods when there are excessive under-expressed genes compared to over-expressed genes or vice versa. The proposed method is applied to four publicly available prostate cancer gene expression data sets and simulated data sets.
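
Once a mixture model supplies a posterior probability of differential expression for each gene, the Bayesian FDR of a selected gene list is usually taken as the average posterior probability of being null among the selected genes. The sketch below assumes those posterior probabilities are already available; the values and the cutoff are made up, and the function is illustrative rather than the paper's implementation.

```python
def bayesian_fdr(posterior_de, cutoff):
    """Average posterior null probability among genes with posterior >= cutoff."""
    selected = [p for p in posterior_de if p >= cutoff]
    if not selected:
        return 0.0  # nothing selected, so no false discoveries are claimed
    return sum(1.0 - p for p in selected) / len(selected)


# Hypothetical posterior probabilities of differential expression for six genes.
posteriors = [0.99, 0.95, 0.90, 0.60, 0.20, 0.05]
fdr_strict = bayesian_fdr(posteriors, cutoff=0.9)
fdr_loose = bayesian_fdr(posteriors, cutoff=0.5)
```

Lowering the cutoff admits genes with weaker evidence, so the estimated FDR of the longer list is higher.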



The biomedical community is rapidly developing new methods of data analysis for microarray experiments, with the goal of establishing new standards to objectively process the massive datasets produced from functional genomic experiments. Each microarray experiment measures thousands of genes simultaneously, producing an unprecedented amount of biological information across increasingly numerous experiments; however, in general, only a very small percentage of the genes present on any given array are identified as differentially regulated. The challenge then is to process this information objectively and efficiently in order to obtain knowledge of the biological system under study and to compare information gained across multiple experiments. In this context, systematic and objective mathematical approaches, which are simple to apply across a large number of experimental designs, become fundamental to correctly handle the mass of data and to understand the true complexity of the biological systems under study.

MOTIVATION: Gene expression experiments provide a fast and systematic way to identify disease markers relevant to clinical care. In this study, we address the problem of robust identification of differentially expressed genes from microarray data. Differentially expressed genes, or discriminator genes, are genes with significantly different expression in two user-defined groups of microarray experiments. We compare three model-free approaches: (1) a nonparametric t-test, (2) the Wilcoxon (or Mann-Whitney) rank sum test, and (3) a heuristic method based on high Pearson correlation to a perfectly differentiating gene ('ideal discriminator method'). We systematically assess the performance of each method on simulated and biological data under varying noise levels and p-value cutoffs. RESULTS: All methods exhibit very low false positive rates and identify a large fraction of the differentially expressed genes in simulated data sets with noise levels similar to those of actual data. Overall, the rank sum test appears most conservative, which may be advantageous when the computationally identified genes need to be tested biologically. However, if a more inclusive list of markers is desired, a higher p-value cutoff or the nonparametric t-test may be appropriate. When applied to lung tumor and lymphoma data sets, the methods identify biologically relevant differentially expressed genes that allow clear separation of the groups in question. Thus the methods described and evaluated here provide a convenient and robust way to identify differentially expressed genes for further biological and clinical analysis.
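
Two of the three approaches named above can be sketched in a few lines: the rank sum statistic and the correlation with an ideal discriminator. The encoding of the ideal gene as 0 in one group and 1 in the other, the function names, and the toy expression values are assumptions; the nonparametric t-test is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ideal_discriminator_score(expression, groups):
    """Correlation with a hypothetical gene that is 0 in group 0 and 1 in group 1."""
    return pearson(expression, [float(g) for g in groups])


def rank_sum_u(expression, groups):
    """Mann-Whitney U statistic for group 1, using mid-ranks for ties."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    ranks = [0.0] * len(expression)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and expression[order[j + 1]] == expression[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank of the tie block
        i = j + 1
    n1 = sum(groups)
    r1 = sum(r for r, g in zip(ranks, groups) if g == 1)
    return r1 - n1 * (n1 + 1) / 2


# Made-up data: four samples per group, one separating gene and one flat gene.
groups = [0, 0, 0, 0, 1, 1, 1, 1]
separating_gene = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
flat_gene = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
u_separating = rank_sum_u(separating_gene, groups)
u_flat = rank_sum_u(flat_gene, groups)
score_separating = ideal_discriminator_score(separating_gene, groups)
score_flat = ideal_discriminator_score(flat_gene, groups)
```

A U statistic at its maximum (the product of the two group sizes) means every group 1 sample exceeds every group 0 sample, while a value near half that product indicates no separation.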

An exciting biological advancement over the past few years is the use of microarray technologies to measure simultaneously the expression levels of thousands of genes. The bottleneck now is how to extract useful information from the resulting large amounts of data. An important and common task in analyzing microarray data is to identify genes with altered expression under two experimental conditions. We propose a nonparametric statistical approach, called the mixture model method (MMM), to handle the problem when there are a small number of replicates under each experimental condition. Specifically, we propose estimating the distributions of a t-type test statistic and its null statistic using finite normal mixture models. A comparison of these two distributions by means of a likelihood ratio test, or simply using the tail distribution of the null statistic, can identify genes with significantly changed expression. Several methods are proposed to effectively control the false positives. The methodology is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle ear infection.



Clusters of co-expressed genes are known in prokaryotes (operons) and were recently described in several eukaryotic organisms, including human. According to some studies, these clusters consist of housekeeping genes, whereas other studies suggest that these clustered genes exhibit similar tissue specificity. Here we further explore the relationship between co-expression and chromosomal co-localization in the human genome by analyzing the expression status of the genes along the best-annotated chromosomes 20, 21 and 22.

The spontaneously hypertensive rat (SHR) is a model of human essential hypertension. Increased blood pressure in SHR is associated with other risk factors for cardiovascular disease, including insulin resistance and dyslipidemia. DNA microarray studies identified over 200 differentially expressed genes and ESTs between SHR and normotensive control rats. These clones represent candidate genes that may underlie previously detected QTLs in SHR. This study made use of the publication of two whole-genome maps to identify positional QTL candidates. Radiation hybrid (RH) mapping was used to determine the chromosomal locations of 70 rat genes and ESTs from this dataset. Most of the locations are novel, but in five cases we identified a definitive map location for genes previously mapped by somatic cell hybrids and/or linkage analysis. Genes for which the mouse genome map location was already determined mapped to syntenic segments in the rat genome map, except for two rat genes whose map locations confirmed previous findings. Where synteny comparisons could be made only with human, 74% of the genes mapped in this study lay in a conserved syntenic segment. Chromosomal localisation of these mouse and human orthologs to syntenic segments produces a high level of confidence in the data presented in this study. The data provide new map locations for rat genes and will aid efforts to advance the rat genome map. The data may also be used to prioritize candidate QTL genes in SHR and other rat strains on the basis of their map location.

Cluster Identification Tool (CIT) is a microarray analysis program that identifies differentially expressed genes. Following division of experimental samples based on a parameter of interest, CIT uses a statistical discrimination metric and permutation analysis to identify clusters of genes or individual genes that best differentiate between the experimental groups. CIT integrates with the freely available CLUSTER and TREEVIEW programs to form a more complete microarray analysis package.

Single-cell RNA sequencing (scRNA-seq) technologies measure gene expression at single-cell resolution and provide a tool for exploring cell heterogeneity and cell types. Because of the low number of mRNA copies extracted per cell, scRNA-seq data exhibit a large number of dropouts, which hinders downstream analysis. We propose a statistical method, SDImpute (Single-cell RNA-seq Dropout Imputation), to implement block imputation for dropout events in scRNA-seq data. SDImpute automatically identifies dropout events based on gene expression levels and the variation of gene expression across similar cells and similar genes, and it implements block imputation for dropouts by utilizing gene expression unaffected by dropouts from similar cells. In the experiments, results on simulated and real datasets suggest that SDImpute is an effective tool to recover the data and preserve the heterogeneity of gene expression across cells. Compared with state-of-the-art imputation methods, SDImpute improves the accuracy of downstream analysis, including clustering, visualization, and differential expression analysis.

MOTIVATION: Microarray technology has emerged as a powerful tool in life science. One major application of microarray technology is to identify differentially expressed genes under various conditions. Currently, the statistical methods to analyze microarray data are generally unsatisfactory, mainly due to the lack of understanding of the distribution and error structure of microarray data. RESULTS: We develop a generalized likelihood ratio (GLR) test based on the two-component model proposed by Rocke and Durbin to identify differentially expressed genes from microarray data. Simulation studies show that the GLR test is more powerful than commonly used methods, like the fold-change method and the two-sample t-test. When applied to microarray data, the GLR test identifies more differentially expressed genes than the t-test, has a lower false discovery rate and shows more consistency over independently repeated experiments. AVAILABILITY: The approach is implemented in software called GLR, which is freely available for downloading at http://www.cc.utah.edu/~jw27c60
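
For orientation, the two-component error model of Rocke and Durbin is commonly written as below, together with the generic form of a likelihood ratio statistic for comparing a gene's expression under two conditions. The symbols are assumptions chosen here ($y$ measured intensity, $\alpha$ background, $\mu$ true expression, $\theta$ the model parameters), and the exact statistic used by the GLR software may differ in detail.

```latex
% Two-component model: multiplicative error dominates at high expression,
% additive error dominates near background.
y = \alpha + \mu e^{\eta} + \varepsilon,
\qquad \eta \sim N(0, \sigma_\eta^2),
\qquad \varepsilon \sim N(0, \sigma_\varepsilon^2)

% Generic likelihood ratio statistic for H_0: \mu_1 = \mu_2
\Lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid y)}
               {\sup_{\theta \in \Theta} L(\theta \mid y)},
\qquad \text{reject } H_0 \text{ when } -2 \log \Lambda \text{ is large}
```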



Time-course microarray experiments are increasingly used to characterize dynamic biological processes. In these experiments, the goal is to identify genes that are differentially expressed over time between different biological conditions. These differentially expressed genes can reveal the changes in biological processes caused by a change in condition, which is essential for understanding differences in dynamics.
