We present a new class discovery method for microarray gene expression data. Based on a collection of gene expression profiles from different tissue samples, the method searches for binary class distinctions in the set of samples that show clear separation in the expression levels of specific subsets of genes. Several mutually independent class distinctions may be found, which is difficult to obtain from most commonly used clustering algorithms. Each class distinction can be biologically interpreted in terms of its supporting genes. The mathematical characterization of the favored class distinctions is based on statistical concepts. By analyzing three data sets from cancer gene expression studies, we demonstrate that our method is able to detect biologically relevant structures, for example cancer subtypes, in an unsupervised fashion.

We present a new computational technique (a software implementation, data sets, and supplementary information are available at http://www.enm.bris.ac.uk/lpd/) that enables the probabilistic analysis of cDNA microarray data, and we demonstrate its effectiveness in identifying features of biomedical importance. A hierarchical Bayesian model, called Latent Process Decomposition (LPD), is introduced in which each sample in the data set is represented as a combinatorial mixture over a finite set of latent processes, which are expected to correspond to biological processes. Parameters in the model are estimated using efficient variational methods. This type of probabilistic model is most appropriate for the interpretation of measurement data generated by cDNA microarray technology. For determining informative substructure in such data sets, the proposed model has several important advantages over the standard use of dendrograms. First, it can objectively assess the optimal number of sample clusters. Second, it can represent samples and gene expression levels using a common set of latent variables (dendrograms cluster samples and gene expression values separately, which amounts to two distinct reduced-space representations). Third, in contrast to standard cluster models, observations are not assigned to a single cluster; thus, for example, gene expression levels are modeled via combinations of the latent processes identified by the algorithm. We show that this new method compares favorably with alternative cluster analysis methods. To illustrate its potential, we apply the proposed technique to several microarray data sets for cancer. For these data sets it successfully decomposes the data into known subtypes and indicates possible further taxonomic subdivision, in addition to highlighting, in a wholly unsupervised manner, the importance of certain genes which are known to be medically significant. To illustrate its wider applicability, we also demonstrate its performance on a microarray data set for yeast.

Logistic Multiple Regression, Principal Component Regression and Classification and Regression Tree Analysis (CART), commonly used in ecological modelling using GIS, are compared with a relatively new statistical technique, Multivariate Adaptive Regression Splines (MARS), to test their accuracy, reliability, implementation within GIS and ease of use. All were applied to the same two data sets, covering a wide range of conditions common in predictive modelling, namely geographical range, scale, nature of the predictors and sampling method. We ran two series of analyses to verify whether model validation by an independent data set was required or whether cross‐validation on a learning data set sufficed. Results show that validation by independent data sets is needed. Model accuracy was evaluated using the area under the Receiver Operating Characteristic curve (AUC). This measure was used because it summarizes performance across all possible thresholds, and is independent of balance between classes. MARS and Regression Tree Analysis achieved the best prediction success, although the CART model was difficult to use for cartographic purposes due to the high model complexity.
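
The AUC measure mentioned in this abstract can be illustrated with a minimal sketch. It assumes binary presence/absence labels and uses the rank-sum identity (the fraction of positive/negative pairs that the model orders correctly, with ties counted as one half). The function name `auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (rank-sum) identity.

    labels: 1 for presence, 0 for absence (hypothetical toy encoding).
    scores: model outputs, where a higher score means "more likely present".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count positive/negative pairs ordered correctly; a tie counts one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical observations and predicted suitabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auc(labels, scores))
```

Because the value depends only on how positives rank against negatives, it needs no classification threshold and does not change when the ratio of presences to absences changes, which matches the two properties the abstract cites.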

MOTIVATION: With the increasing availability of cancer microarray data sets, there is a growing need for integrative computational methods that evaluate multiple independent microarray data sets investigating a common theme or disorder. Meta-analysis techniques are designed to overcome the low sample size typical of microarray experiments and to yield more valid and informative results than each experiment separately. RESULTS: We propose a new meta-analysis technique that aims at finding a set of classifying genes whose expression levels may be used to answer the classification question at hand. Specifically, we apply our method to two independent lung cancer microarray data sets and identify a joint core subset of genes which putatively play an important role in tumor genesis of the lung. The robustness of the identified joint core set is demonstrated on a third unseen lung cancer data set, where it leads to successful classification using very few top-ranked genes. Identifying such a set of genes is of significant importance when searching for biologically meaningful biomarkers. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Abstract. In European phytosociology, national classifications of corresponding vegetation types show considerable differences even between neighbouring countries. Therefore, the European Vegetation Survey project urgently needs numerical classification methods for large data sets that are able to produce compatible classifications using data sets from different countries. We tested the ability of two methods, TWINSPAN and COCKTAIL, to produce similar classifications of wet meadows (Calthion, incl. Filipendulenion) for Germany (7909 relevés) and the Czech Republic (1287 relevés) in this respect. In TWINSPAN, the indicator ordination option was used for classification of two national data sets, and the extracted assignment criteria (indicator species) were applied crosswise from one to the other national data set. Although the data sets presumably contained similar community types, TWINSPAN revealed almost no correspondence between the groups derived from the proper classification of the national data set and the groups defined by the assignment criteria taken from the other national data set. The reason is probably the difference in structure between the national data sets, which is a typical, but hardly avoidable, feature of any pair of phytosociological data sets. As a result, the first axis of the correspondence analysis, and consequently the first TWINSPAN division, are associated with different environmental gradients; the difference in the first division is transferred and multiplied further down the hierarchy. COCKTAIL is a method which produces relevé groups on the basis of statistically formed species groups. The user determines the starting points for the formation of species groups, and groups already found in one data set can be tested for existence in the other data set. The correspondence between the national classifications produced by COCKTAIL was fairly good. For some relevé groups, the lack of correspondence to groups in the other national data set could be explained by the absence of the corresponding vegetation types in one of the countries, rather than by methodological problems.

Basic microarray analysis: grouping and feature reduction (cited by: 10; self-citations: 0; citations by others: 10)
DNA microarray technologies are useful for addressing a broad range of biological problems - including the measurement of mRNA expression levels in target cells. These studies typically produce large data sets that contain measurements on thousands of genes under hundreds of conditions. There is a critical need to summarize this data and to pick out the important details. The most common activities, therefore, are to group together microarray data and to reduce the number of features. Both of these activities can be done using only the raw microarray data (unsupervised methods) or using external information that provides labels for the microarray data (supervised methods). We briefly review supervised and unsupervised methods for grouping and reducing data in the context of a publicly available suite of tools called CLEAVER, and illustrate their application on a representative data set collected to study lymphoma.

Genes with common functions often exhibit correlated expression levels, which can be used to identify sets of interacting genes from microarray data. Microarrays typically measure expression across genomic space, creating a massive matrix of co-expression that must be mined to extract only the most relevant gene interactions. We describe a graph theoretical approach to extracting co-expressed sets of genes, based on the computation of cliques. Unlike the results of traditional clustering algorithms, cliques are not disjoint and allow genes to be assigned to multiple sets of interacting partners, consistent with biological reality. A graph is created by thresholding the correlation matrix to include only the correlations most likely to signify functional relationships. Cliques computed from the graph correspond to sets of genes for which significant edges are present between all members of the set, representing potential members of common or interacting pathways. Clique membership can be used to infer function about poorly annotated genes, based on the known functions of better-annotated genes with which they share clique membership (i.e., “guilt-by-association”). We illustrate our method by applying it to microarray data collected from the spleens of mice exposed to low-dose ionizing radiation. Differential analysis is used to identify sets of genes whose interactions are impacted by radiation exposure. The correlation graph is also queried independently of clique to extract edges that are impacted by radiation. We present several examples of multiple gene interactions that are altered by radiation exposure and thus represent potential molecular pathways that mediate the radiation response.
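
The two steps this abstract describes (thresholding a correlation matrix into a graph, then computing cliques) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: it uses absolute Pearson correlation, an arbitrary threshold of 0.75, and the basic Bron-Kerbosch enumeration without pivoting, which would be too slow for genome-scale graphs. The gene names and expression profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_graph(profiles, threshold):
    """Connect two genes when |r| meets the threshold (assumed rule)."""
    graph = {gene: set() for gene in profiles}
    for g, h in combinations(profiles, 2):
        if abs(pearson(profiles[g], profiles[h])) >= threshold:
            graph[g].add(h)
            graph[h].add(g)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended any further
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)


# Hypothetical expression profiles over five samples.
profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [1, 3, 2, 5, 4],
    "geneD": [1, 4, 2, 5, 3],
}
graph = correlation_graph(profiles, threshold=0.75)
cliques = maximal_cliques(graph)
print(cliques)
```

In this toy data `geneC` belongs to both cliques, which illustrates the non-disjoint membership that the abstract contrasts with traditional clustering.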



Genome-wide expression profiling is a widely used approach for characterizing heterogeneous populations of cells, tissues, biopsies, or other biological specimen. The exploratory analysis of such data typically relies on generic unsupervised methods, e.g. principal component analysis (PCA) or hierarchical clustering. However, generic methods fail to exploit prior knowledge about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that combines PCA with nonparametric GO enrichment analysis, in order to systematically search for sets of genes that are both strongly correlated and closely functionally related. These gene sets are then used to automatically generate expression signatures with functional labels, which collectively aim to provide a readily interpretable representation of biologically relevant similarities and differences. The robustness of the results obtained can be assessed by bootstrapping.


I first applied GO-PCA to datasets containing diverse hematopoietic cell types from human and mouse, respectively. In both cases, GO-PCA generated a small number of signatures that represented the majority of lineages present, and whose labels reflected their respective biological characteristics. I then applied GO-PCA to human glioblastoma (GBM) data, and recovered signatures associated with four out of five previously defined GBM subtypes. My results demonstrate that GO-PCA is a powerful and versatile exploratory method that reduces an expression matrix containing thousands of genes to a much smaller set of interpretable signatures. In this way, GO-PCA aims to facilitate hypothesis generation, design of further analyses, and functional comparisons across datasets.

Individual differences scaling is a multidimensional scaling method for finding a common ordination for several data sets. An individual ordination for each data set can then be derived from the common ordination by adjusting the axis lengths so as to maximize the correlations between observed proximities and individual ordination distances. The importance of the various axes for each data set and the mutual similarities and goodness of fit for the individual data sets are described by weight plots. As an example, 46 soft-water lakes in eastern Finland are ordinated on two dimensions according to 3 chemical data sets (water in summer and autumn, sediment) and 4 biological sets (major phytoplankton groups, phytoplankton, surface sediment diatom and cladoceran assemblages). The method seems to be an effective means of obtaining a common ordination for the data sets. The major taxonomic groups gave the ordination which differed most clearly from the ordinations of the other data sets. Phytoplankton was most poorly ordinated in all the analyses. The other data sets were fairly coherent. When only biological data sets were ordinated, the diatoms and cladocerans showed rather different patterns. It seems that the cladocerans are best correlated with water chemistry, both according to weights in the joint analysis, and according to correlation between the axes from the biological data sets and the chemical variables. Abbreviations: CCA = canonical correspondence analysis; IDS = individual differences scaling; MDS = multidimensional scaling; PCA = principal components analysis.

We generated numerous simulated gene-frequency surfaces subjected to 200 generations of isolation by distance with, in some cases, added migration or selection. From these surfaces we assembled six data sets comprising from 12 to 15 independent allele-frequency surfaces, to simulate biologically plausible population samples. The purpose of the study was to investigate whether spatial autocorrelation analysis will correctly infer the microevolutionary processes involved in each data set. The correspondence between the simulated processes and the inferences made concerning them is close for five of the six data sets. Errors in inference occurred when the effect of migration was weak, due to low gene frequency differential or low migration strength; when selection was weak and against a background with a complex pattern; and when a random process (isolation by distance) was the only one acting. Spatial correlograms proved more sensitive to detecting trends than inspection of gene-frequency surfaces by the human eye. Joint interpretation of the correlograms and their clusters proved most reliable in leading to the correct inference. The inspection and clustering of surfaces were useful for determining directional components. Because this method relies on common patterns across loci, as many gene frequencies as feasible should be used. We recommend spatial autocorrelation analysis for the detection of microevolutionary processes in natural populations.
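
A spatial correlogram of the kind discussed in this abstract can be sketched as below. The abstract does not name its autocorrelation coefficient, so the use of Moran's I here is an assumption, as are the one-dimensional transect layout, the binary distance-class weights, and the toy allele frequencies.

```python
def morans_i(values, weights):
    """Moran's I for one variable under a given spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * num / den


def correlogram(values, positions, max_lag):
    """Moran's I per distance class; weight 1 when |pos_i - pos_j| == lag."""
    result = {}
    for lag in range(1, max_lag + 1):
        weights = [[1.0 if abs(pi - pj) == lag else 0.0 for pj in positions]
                   for pi in positions]
        result[lag] = morans_i(values, weights)
    return result


# Hypothetical allele frequencies for six populations along a transect.
frequencies = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
positions = [0, 1, 2, 3, 4, 5]
clinal = correlogram(frequencies, positions, max_lag=3)
for lag, value in clinal.items():
    print(lag, round(value, 3))
```

For this smooth gradient the coefficient is positive at the shortest distance class and falls below zero at larger ones, the monotonic decline usually read as a cline. In practice one such correlogram would be computed per allele-frequency surface and the set compared across loci, as the abstract recommends.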

Shotgun metagenomic sequencing does not depend on gene-targeted primers or PCR amplification; thus, it is not affected by primer bias or chimeras. However, searching for rRNA genes in large shotgun Illumina data sets is computationally expensive, and no approach exists for unsupervised community analysis of small-subunit (SSU) rRNA gene fragments retrieved from shotgun data. We present a pipeline, SSUsearch, that achieves faster identification of small-subunit rRNA gene fragments and enables unsupervised community analysis with shotgun data. It also includes classification and copy number correction, and the output can be used by traditional amplicon analysis platforms. Shotgun metagenome data processed with this pipeline yielded higher diversity estimates than amplicon data but retained the grouping of samples in ordination analyses. We applied this pipeline to soil samples with paired shotgun and amplicon data and confirmed bias against Verrucomicrobia in a commonly used V6-V8 primer set, as well as discovering likely bias against Actinobacteria and for Verrucomicrobia in a commonly used V4 primer set. This pipeline can utilize all variable regions in SSU rRNA and also can be applied to large-subunit (LSU) rRNA genes for confirmation of community structure. The pipeline can scale to handle large amounts of soil metagenomic data (5 Gb memory and 5 central processing unit hours to process 38 Gb [1 lane] of trimmed Illumina HiSeq2500 data) and is freely available at https://github.com/dib-lab/SSUsearch under a BSD license.

Meta-analysis of gene expression has enabled numerous insights into biological systems, but current methods have several limitations. We developed a method to perform a meta-analysis using the elastic net, a powerful and versatile approach for classification and regression. To demonstrate the utility of our method, we conducted a meta-analysis of lung cancer gene expression based on publicly available data. Using 629 samples from five data sets, we trained a multinomial classifier to distinguish between four lung cancer subtypes. Our meta-analysis-derived classifier included 58 genes and achieved 91% accuracy on leave-one-study-out cross-validation and on three independent data sets. Our method makes meta-analysis of gene expression more systematic and expands the range of questions that a meta-analysis can be used to address. As the amount of publicly available gene expression data continues to grow, our method will be an effective tool to help distill these data into knowledge.

MOTIVATION: Hierarchical clustering is a common approach to study protein and gene expression data. This unsupervised technique is used to find clusters of genes or proteins which are expressed in a coordinated manner across a set of conditions. Because of both the biological and technical variability, experimental repetitions are generally performed. In this work, we propose an approach to evaluate the stability of clusters derived from hierarchical clustering by taking repeated measurements into account. RESULTS: The method is based on the bootstrap technique that is used to obtain pseudo-hierarchies of genes from resampled datasets. Based on a fast dynamic programming algorithm, we compare the original hierarchy to the pseudo-hierarchies and assess the stability of the original gene clusters. Then a shuffling procedure can be used to assess the significance of the cluster stabilities. Our approach is illustrated on simulated data and on two microarray datasets. Compared to the standard hierarchical clustering methodology, it makes it possible to distinguish dubious clusters from stable ones, and thus avoids misleading interpretations. AVAILABILITY: The programs were developed in the C and R languages.

Competitive gene set tests are commonly used in molecular pathway analysis to test for enrichment of a particular gene annotation category amongst the differential expression results from a microarray experiment. Existing gene set tests that rely on gene permutation are shown here to be extremely sensitive to inter-gene correlation. Several data sets are analyzed to show that inter-gene correlation is non-ignorable even for experiments on homogeneous cell populations using genetically identical model organisms. A new gene set test procedure (CAMERA) is proposed based on the idea of estimating the inter-gene correlation from the data, and using it to adjust the gene set test statistic. An efficient procedure is developed for estimating the inter-gene correlation and characterizing its precision. CAMERA is shown to control the type I error rate correctly regardless of inter-gene correlations, yet retains excellent power for detecting genuine differential expression. Analysis of breast cancer data shows that CAMERA recovers known relationships between tumor subtypes in very convincing terms. CAMERA can be used to analyze specified sets or as a pathway analysis tool using a database of molecular signatures.

Mass peak alignment (ion-wise alignment) has recently become a popular method for unsupervised data analysis in untargeted metabolic profiling. Here we present MSClust, a software tool for the analysis of GC-MS and LC-MS datasets derived from untargeted profiling. MSClust performs data reduction using unsupervised clustering and extraction of putative metabolite mass spectra from ion-wise chromatographic alignment data. The algorithm is based on the subtractive fuzzy clustering method, which allows unsupervised determination of the number of metabolites in a data set and can deal with uncertain memberships of mass peaks in overlapping mass spectra. This approach is based purely on the actual information present in the data and does not require any prior metabolite knowledge. MSClust can be applied to both GC-MS and LC-MS alignment data sets. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11306-011-0368-2) contains supplementary material, which is available to authorized users.

The inference of phylogenetic hypotheses from landmark data has been questioned during the last two decades. Besides theoretical concerns, one of the limitations pointed out for the use of landmark data in phylogenetics is its (supposed) lack of information relevant to the inference of phylogenetic relationships. However, empirical analyses are scarce; there exists no previous study that systematically evaluates the phylogenetic performance of landmark data in a series of data sets. In the present study, we analysed 41 published data sets in order to assess the correspondence between the phylogenetic trees derived from landmark data and those obtained with alternative and independent sources of evidence, and determined the main factors that might affect this inference. The data sets presented a variable number of terminals (5–200) and configurations (1–14), belonging to different taxonomic groups. The results showed that for most of the data sets analysed, the trees derived from landmark data presented a low correspondence with the reference phylogenies. The results were similar irrespective of the phylogenetic method considered. Complementary analyses strongly suggested that the limited amount of evidence included in each data set (one or a few landmark configurations) is the main cause for that low correspondence: the phylogenetic analysis of eight data sets that presented three or more configurations clearly showed that the inclusion of several landmark configurations improves the results. In addition, the analyses indicated that the inclusion of landmark data from different configurations is more important than the inclusion of more landmarks from the same configuration. Based on the results presented here, we consider that the poor results previously obtained in phylogenetic analyses based on landmark data were not caused by methodological limitations, but rather due to the limited amount of evidence included in the data sets.

Outlier detection and environmental association analysis are common methods to search for loci or genomic regions exhibiting signals of adaptation to environmental factors. However, a validation of outlier loci and corresponding allele distribution models through functional molecular biology or transplant/common garden experiments is rarely carried out. Here, we employ another method for validation, namely testing outlier loci in specifically designed, independent data sets. Previously, an outlier locus associated with three different habitat types had been detected in Arabis alpina. For the independent validation data set, we sampled 30 populations occurring in these three habitat types across five biogeographic regions of the Swiss Alps. The allele distribution model found in the original study could not be validated in the independent test data set: The outlier locus was no longer indicative of habitat‐mediated selection. We propose several potential causes of this failure of validation, of which unaccounted genetic structure and technical issues in the original data set used to detect the outlier locus were most probable. Thus, our study shows that validating outlier loci and allele distribution models in independent data sets is a helpful tool in ecological genomics which, in the case of positive validation, adds confidence to outlier loci and their association with environmental factors or, in the case of failure of validation, helps to explain inconsistencies.

Foll M, Gaggiotti O. Genetics 2006, 174(2): 875-891
The study of population genetic structure is a fundamental problem in population biology because it helps us obtain a deeper understanding of the evolutionary process. One of the issues most assiduously studied in this context is the assessment of the relative importance of environmental factors (geographic distance, language, temperature, altitude, etc.) on the genetic structure of populations. The most widely used method to address this question is the multivariate Mantel test, a nonparametric method that calculates a correlation coefficient between a dependent matrix of pairwise population genetic distances and one or more independent matrices of environmental differences. Here we present a hierarchical Bayesian method that estimates F(ST) values for each local population and relates them to environmental factors using a generalized linear model. The method is demonstrated by applying it to two data sets, a data set for a population of the argan tree and a human data set comprising 51 populations distributed worldwide. We also carry out a simulation study to investigate the performance of the method and find that it can correctly identify the factors that play a role in the structuring of genetic diversity under a wide range of scenarios.
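
For orientation, the Mantel test that this abstract names as the most widely used baseline can be sketched as below. This is the simple two-matrix version with a one-sided permutation p-value, not the multivariate variant and not the authors' hierarchical Bayesian method. The helper names, the coordinates used to build the toy distance matrices, the permutation count, and the seed are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def distance_matrix(coords):
    """Pairwise absolute differences of one-dimensional coordinates."""
    return [[abs(a - b) for b in coords] for a in coords]


def upper_triangle(matrix, order):
    """Upper-triangle entries after relabelling populations by `order`."""
    n = len(matrix)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel(genetic, environmental, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(genetic)
    order = list(range(n))
    env_vec = upper_triangle(environmental, order)
    observed = pearson(upper_triangle(genetic, order), env_vec)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(order)  # permute rows and columns together
        if pearson(upper_triangle(genetic, order), env_vec) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Hypothetical data: five populations along one environmental axis.
genetic = distance_matrix([0, 1, 2, 4, 7])
environmental = distance_matrix([0.0, 1.2, 1.9, 4.5, 6.8])
r, p = mantel(genetic, environmental)
print(round(r, 3), p)
```

Rows and columns of the genetic matrix are permuted together because the entries of a distance matrix are not independent observations, so the significance of the correlation cannot be read from an ordinary table of r values.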
