Similar literature
 20 similar articles found
1.
Principal component analysis for clustering gene expression data   (total citations: 15; self-citations: 0; citations by others: 15)
MOTIVATION: There is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to analyze gene expression data. Using different data analysis techniques and different clustering algorithms to analyze the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PCs) in capturing cluster structure. Specifically, using both real and synthetic gene expression data sets, we compared the quality of clusters obtained from the original data to the quality of clusters obtained after projecting onto subsets of the principal component axes. RESULTS: Our empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve, and often degrades, cluster quality. In particular, the first few PCs (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PCs has different impact on different algorithms and different similarity metrics. Overall, we would not recommend PCA before clustering except in special circumstances.
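The claim that the first PCs need not capture cluster structure can be illustrated with a toy case. The sketch below is not taken from the paper: the two-dimensional data, the helper names `first_pc` and `mean_gap`, and the closed-form 2x2 eigenvector angle are all assumptions chosen for illustration. Two clusters differ only along y, while most of the variance lies along x, so the first PC points along x and the projected cluster means coincide.

```python
import math

def first_pc(points):
    """Return a unit vector along the first principal component of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form major-axis angle of a symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def mean_gap(cluster_a, cluster_b, axis):
    """Distance between the two cluster means after projection onto axis."""
    def mean_projection(cluster):
        return sum(x * axis[0] + y * axis[1] for x, y in cluster) / len(cluster)
    return abs(mean_projection(cluster_a) - mean_projection(cluster_b))

# Hypothetical data: wide spread along x, cluster separation only along y.
xs = [-10.0, -5.0, 0.0, 5.0, 10.0]
cluster_a = [(x, 1.0) for x in xs]
cluster_b = [(x, -1.0) for x in xs]

pc1 = first_pc(cluster_a + cluster_b)
pc2 = (-pc1[1], pc1[0])  # orthogonal direction, i.e. the second PC in 2-D
gap_on_pc1 = mean_gap(cluster_a, cluster_b, pc1)
gap_on_pc2 = mean_gap(cluster_a, cluster_b, pc2)
print("PC1:", pc1, "gap on PC1:", gap_on_pc1, "gap on PC2:", gap_on_pc2)
```

In this constructed case, keeping only the first PC discards the one direction that separates the clusters, which is the kind of situation the abstract warns about.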

2.
Model-based clustering and data transformations for gene expression data.   (total citations: 20; self-citations: 0; citations by others: 20)
MOTIVATION: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications. RESULTS: We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption on different transformations of real data. We also assessed the degree to which these real gene expression data sets fit multivariate Gaussian distributions both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits. AVAILABILITY: MCLUST is available at http://www.stat.washington.edu/fraley/mclust. The software for the diagonal model is under development. CONTACT: kayee@cs.washington.edu. SUPPLEMENTARY INFORMATION: http://www.cs.washington.edu/homes/kayee/model.

3.
Single-molecule force spectroscopy has become a versatile tool for investigating the (un)folding of proteins and other polymeric molecules. Like other single-molecule techniques, single-molecule force spectroscopy requires recording and analysis of large data sets to extract statistically meaningful conclusions. Here, we present a data analysis tool that provides efficient filtering of heterogeneous data sets, brings spectra into register based on a reference-free alignment algorithm, and determines automatically the location of unfolding barriers. Furthermore, it groups spectra according to the number of unfolding events, subclassifies the spectra using cross correlation-based sorting, and extracts unfolding pathways by applying principal component analysis and clustering methods to the extracted peak positions. Our approach has been tested on a data set obtained through mechanical unfolding of bacteriorhodopsin (bR), which contained a significant number of spectra that did not show the well-known bR fingerprint. In addition, we have tested the performance of the data analysis tool on unfolding data of the soluble multidomain (Ig27)(8) protein.

4.
MOTIVATION: Liquid chromatography coupled to mass spectrometry (LC-MS) and combined with tandem mass spectrometry (LC-MS/MS) have become a prominent tool for the analysis of complex proteomic samples. An important step in a typical workflow is the combination of results from multiple LC-MS experiments to improve confidence in the obtained measurements or to compare results from different samples. To do so, a suitable mapping or alignment between the data sets needs to be estimated. The alignment has to correct for variations in mass and elution time which are present in all mass spectrometry experiments. RESULTS: We propose a novel algorithm to align LC-MS samples and to match corresponding ion species across samples. Our algorithm matches landmark signals between two data sets using a geometric technique based on pose clustering. Variations in mass and retention time are corrected by an affine dewarping function estimated from matched landmarks. We use the pairwise dewarping in an algorithm for aligning multiple samples. We show that our pose clustering approach is fast and reliable as compared to previous approaches. It is robust in the presence of noise and able to accurately align samples with only a few common ion species. In addition, we can easily handle different kinds of LC-MS data and adapt our algorithm to new mass spectrometry technologies. AVAILABILITY: This algorithm is implemented as part of the OpenMS software library for shotgun proteomics and available under the Lesser GNU Public License (LGPL) at www.openms.de.
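The affine dewarping step can be sketched as a one-dimensional fit. This is only an illustration under stated assumptions: the abstract does not say how the affine function is estimated, so ordinary least squares is assumed here, the pose clustering step that produces the matched landmarks is not shown, and the function names and retention-time values are hypothetical. It is not the OpenMS implementation.

```python
def fit_affine(pairs):
    """Least-squares fit of y = a * x + b to matched (x, y) landmark pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two matched landmarks")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("landmarks must not all share the same x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def dewarp(value, a, b):
    """Map a coordinate from the first sample into the frame of the second."""
    return a * value + b

# Hypothetical matched landmarks: (retention time in sample A, in sample B).
landmarks = [(100.0, 107.0), (200.0, 209.0), (300.0, 311.0), (400.0, 413.0)]
scale, offset = fit_affine(landmarks)
print("scale:", scale, "offset:", offset)
print("250.0 maps to", dewarp(250.0, scale, offset))
```

The same fit could be applied separately to the mass dimension; whether the original method treats the two dimensions jointly or separately is not stated in the abstract.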

5.
MOTIVATION: Over the last decade, a large variety of clustering algorithms have been developed to detect coregulatory relationships among genes from microarray gene expression data. Model-based clustering approaches have emerged as statistically well-grounded methods, but the properties of these algorithms when applied to large-scale data sets are not always well understood. An in-depth analysis can reveal important insights about the performance of the algorithm, the expected quality of the output clusters, and the possibilities for extracting more relevant information out of a particular data set. RESULTS: We have extended an existing algorithm for model-based clustering of genes to simultaneously cluster genes and conditions, and used three large compendia of gene expression data for Saccharomyces cerevisiae to analyze its properties. The algorithm uses a Bayesian approach and a Gibbs sampling procedure to iteratively update the cluster assignment of each gene and condition. For large-scale data sets, the posterior distribution is strongly peaked on a limited number of equiprobable clusterings. A GO annotation analysis shows that these local maxima are all biologically equally significant, and that simultaneously clustering genes and conditions performs better than only clustering genes and assuming independent conditions. A collection of distinct equivalent clusterings can be summarized as a weighted graph on the set of genes, from which we extract fuzzy, overlapping clusters using a graph spectral method. The cores of these fuzzy clusters contain tight sets of strongly coexpressed genes, while the overlaps exhibit relations between genes showing only partial coexpression. AVAILABILITY: GaneSh, a Java package for coclustering, is available under the terms of the GNU General Public License from our website at http://bioinformatics.psb.ugent.be/software

6.
One of the key developmental processes during photomorphogenesis is the differentiation of prolamellar bodies of proplastids into thylakoid membranes containing the photosynthetic pigment-protein complexes of chloroplasts. To study the regulatory events controlling pigment-protein complex assembly, including the biosynthesis of metabolic precursors and pigment end products, etiolated Arabidopsis thaliana seedlings were irradiated with continuous red light (Rc), which led to rapid greening, or continuous far-red light (FRc), which did not result in visible greening, and subjected to analysis by oligonucleotide microarrays and targeted metabolite profiling. An analysis using BioPathAt, a bioinformatic tool that allows the visualization of post-genomic data sets directly on biochemical pathway maps, indicated that in Rc-treated seedlings mRNA expression and metabolite patterns were tightly correlated (e.g., Calvin cycle, biosynthesis of chlorophylls, carotenoids, isoprenoid quinones, thylakoid lipids, sterols, and amino acids). K-means clustering revealed that gene expression patterns across various biochemical pathways were very similar in Rc- and FRc-treated seedlings (despite the visible phenotypic differences), whereas a principal component analysis of metabolite pools allowed a clear distinction between both treatments (in accordance with the visible phenotype). Our results illustrate the general importance of integrative approaches to correlate post-genomic data sets with phenotypic outcomes.

7.
MOTIVATION: Current Self-Organizing Map (SOM) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. RESULTS: The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major clusters and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, so it is labelled as an uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in the data (cancerous and normal). AVAILABILITY: JAVA software of dynamic SOM tree algorithm is available upon request for academic use. SUPPLEMENTARY INFORMATION: A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf

8.
Kernel principal component analysis (KPCA) has been applied to data clustering and graphic cut in the last couple of years. This paper discusses the application of KPCA to microarray data clustering. A new algorithm based on KPCA and fuzzy C-means is proposed. Experiments with microarray data show that the proposed algorithm is in general superior to traditional algorithms.

9.
Ensemble clustering methods have become increasingly important to ease the task of choosing the most appropriate cluster algorithm for a particular data analysis problem. The consensus clustering (CC) algorithm is a recognized ensemble clustering method that uses an artificial intelligence technique to optimize a fitness function. We formally prove the existence of a subspace of the search space for CC, which contains all solutions of maximal fitness, and suggest two greedy algorithms to search this subspace. We evaluate the algorithms on two gene expression data sets and one synthetic data set, and compare the results with those of other ensemble clustering approaches.

10.
Access to real-time process information is desirable for consistent and efficient operation of bioprocesses. Near-infrared spectroscopy (NIRS) is known to have potential for providing real-time information on the quantitative levels of important bioprocess variables. However, given the fact that a typical NIR spectrum encompasses information regarding almost all the constituents of the sample matrix, there are few case studies that have investigated the spectral details for applications in bioprocess quality assessment or qualitative bioprocess monitoring. Such information would be invaluable in providing operator-level assistance on the progress of a bioprocess in industrial-scale productions. We investigated this aspect and report the results of our investigation. Near-infrared spectral information derived from scanning unprocessed culture fluid (broth) samples from a complex antibiotic production process was assessed for a data set that incorporated bioprocess variations. Principal component analysis was applied to the spectral data and the loadings and scores of the principal components studied. Changes in the spectral information that corresponded to variations in the bioprocess could be deciphered. Despite the complexity of the matrix, near-infrared spectra of the culture broth are shown to have valuable information that can be deconvoluted with the help of factor analysis techniques such as principal component analysis (PCA). Although complex to interpret, the loadings and score plots are shown to offer potential in process diagnosis that could be of value in the rapid assessment of process quality, and in data assessment prior to quantitative model development.

11.

Background  

The versatility of DNA copy number amplifications for profiling and categorization of various tissue samples has been widely acknowledged in the biomedical literature. For instance, this type of measurement technique provides possibilities for exploring sets of cancerous tissues to identify novel subtypes. The previously utilized statistical approaches to various kinds of analyses include traditional algorithmic techniques for clustering and dimension reduction, such as independent and principal component analyses, hierarchical clustering, as well as model-based clustering using maximum likelihood estimation for latent class models.

12.
In this investigation, the fermentation step of a standard mammalian cell-based industrial bioprocess for the production of a therapeutic protein was studied, with particular emphasis on the evolution of cell viability. This parameter constitutes one of the critical variables for bioprocess monitoring since it can affect downstream operations and the quality of the final product. In addition, when the cells experience an unpredictable drop in viability, the assessment of this variable through classic off-line methods may not provide information sufficiently in advance to take corrective actions. In this context, the Process Analytical Technology (PAT) framework aims to develop novel strategies for more efficient monitoring of critical variables, in order to improve the bioprocess performance. Thus, in this work, a set of chemometric tools was integrated to establish a PAT strategy to monitor cell viability, based on fluorescence multiway data obtained from fermentation samples of a particular bioprocess, in two different scales of operation. The spectral information, together with data regarding process variables, was integrated through chemometric exploratory tools to characterize the bioprocess and establish novel criteria for the monitoring of cell viability. These findings motivated the development of a multivariate classification model, aiming to obtain predictive tools for the monitoring of future lots of the same bioprocess. The model could be satisfactorily fitted, showing a non-error rate of prediction of 100%.

13.
MOTIVATION: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches have the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data. RESULTS: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time that GCC has been applied to class discovery for microarray data. Given a prespecified maximum number of classes (denoted as K(max) in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning. AVAILABILITY: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu.
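Two building blocks commonly used in consensus clustering can be sketched briefly. The code below is an illustration, not the GCC algorithm: it computes the standard Rand index (the abstract's Modified Rand Index is not defined there, so it is not reproduced) and a co-association matrix that records how often two samples share a cluster across partitions. Using a co-association matrix as the graph weights is an assumption, and the function names are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two partitions agree (standard Rand index)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

def co_association(partitions):
    """Matrix whose (i, j) entry is the share of partitions grouping i with j."""
    n_samples = len(partitions[0])
    n_partitions = len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / n_partitions
         for j in range(n_samples)]
        for i in range(n_samples)
    ]

# Hypothetical partitions of four samples produced by two clustering runs.
runs = [[0, 0, 1, 1], [0, 0, 0, 1]]
matrix = co_association(runs)
print("Rand index between runs:", rand_index(runs[0], runs[1]))
print("co-association of samples 1 and 2:", matrix[1][2])
```

Because the Rand index compares pair memberships rather than label values, two partitions that differ only in how clusters are numbered score 1.0.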

14.
15.
Particle classification is an important component of multivariate statistical analysis methods that has been used extensively to extract information from electron micrographs of single particles. Here we describe a new Bayesian Gibbs sampling algorithm for the classification of such images. This algorithm, which is applied after dimension reduction by correspondence analysis or by principal components analysis, dynamically learns the parameters of the multivariate Gaussian distributions that characterize each class. These distributions describe tilted ellipsoidal clusters that adaptively adjust shape to capture differences in the variances of factors and the correlations of factors within classes. A novel Bayesian procedure to objectively select factors for inclusion in the classification models is a component of this procedure. A comparison of this algorithm with hierarchical ascendant classification of simulated data sets shows improved classification over a broad range of signal-to-noise ratios.

16.
Multi-class clustering and prediction in the analysis of microarray data   (total citations: 1; self-citations: 0; citations by others: 1)
DNA microarray technology provides tools for studying the expression profiles of a large number of distinct genes simultaneously. This technology has been applied to sample clustering and sample prediction. Because of the large number of genes measured, many of the genes in the original data set are irrelevant to the analysis. Selection of discriminatory genes is critical to the accuracy of clustering and prediction. This paper considers a statistical significance testing approach to selecting discriminatory gene sets for multi-class clustering and prediction of experimental samples. A toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV with a total of 55 samples) is used to illustrate a general framework of the approach. Among four selected gene sets, a gene set omega(I) formed by the intersection of the F-test and the set of the union of one-versus-all t-tests performs the best in terms of clustering as well as prediction. Hierarchical and two modified partition (k-means) methods all show that the set omega(I) is able to group the 55 samples into seven clusters reasonably well, in which the As and AsV samples are considered as one cluster (the same group) as are the Cd and Cu samples. With respect to prediction, the overall accuracy for the gene set omega(I) using the nearest neighbors algorithm to predict 55 samples into one of the nine treatments is 85%.
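The construction of omega(I) described above amounts to a set operation once the significance tests have been run. The following sketch assumes the tests have already produced sets of significant genes; the gene identifiers, the treatment keys, and the function name `select_omega_i` are hypothetical, and the F-test and t-tests themselves are not shown.

```python
def select_omega_i(f_test_genes, one_vs_all_genes):
    """Intersect the F-test gene set with the union of one-versus-all t-test sets."""
    union_of_t_tests = set().union(*one_vs_all_genes.values())
    return set(f_test_genes) & union_of_t_tests

# Hypothetical inputs: genes called significant by each test.
f_test_genes = {"g1", "g2", "g3"}
one_vs_all_genes = {
    "As": {"g1"},
    "Cd": {"g3", "g4"},
}
omega_i = select_omega_i(f_test_genes, one_vs_all_genes)
print(sorted(omega_i))
```

A gene enters the set only if it is significant overall and also distinguishes at least one treatment from the rest.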

17.
MOTIVATION: Clustering microarray gene expression data is a powerful tool for elucidating co-regulatory relationships among genes. Many different clustering techniques have been successfully applied and the results are promising. However, substantial fluctuation contained in microarray data, lack of knowledge on the number of clusters and complex regulatory mechanisms underlying biological systems make the clustering problems tremendously challenging. RESULTS: We devised an improved model-based Bayesian approach to cluster microarray gene expression data. Cluster assignment is carried out by an iterative weighted Chinese restaurant seating scheme such that the optimal number of clusters can be determined simultaneously with cluster assignment. The predictive updating technique was applied to improve the efficiency of the Gibbs sampler. An additional step is added during reassignment to allow genes that display complex correlation relationships such as time-shifted and/or inverted to be clustered together. Analysis done on a real dataset showed that as much as 30% of significant genes clustered in the same group display complex relationships with the consensus pattern of the cluster. Other notable features include automatic handling of missing data, quantitative measures of cluster strength and assignment confidence. Synthetic and real microarray gene expression datasets were analyzed to demonstrate its performance. AVAILABILITY: A computer program named Chinese restaurant cluster (CRC) has been developed based on this algorithm. The program can be downloaded at http://www.sph.umich.edu/csg/qin/CRC/.
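The seating scheme can be illustrated with the plain Chinese restaurant process, in which the number of clusters is not fixed in advance. This is a simplified sketch and not the CRC program: the abstract's scheme is weighted and iterative, presumably using the expression data, whereas the code below samples from the unweighted prior only. The concentration parameter `alpha` and the function name are assumptions.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Seat items one at a time using the (unweighted) Chinese restaurant process."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for _ in range(n_items):
        # Join existing table k with weight size_k, or open a new table with weight alpha.
        weights = table_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[choice] += 1
        assignments.append(choice)
    return assignments, table_sizes

assignments, table_sizes = crp_assignments(50, alpha=1.0, seed=1)
print("number of clusters:", len(table_sizes))
print("cluster sizes:", table_sizes)
```

Larger tables attract new items more strongly, while `alpha` controls how readily new clusters open; in a data-driven version each weight would also be multiplied by how well the item fits the cluster.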

18.
19.
We present a new computational technique (a software implementation, data sets, and supplementary information are available at http://www.enm.bris.ac.uk/lpd/) which enables the probabilistic analysis of cDNA microarray data and we demonstrate its effectiveness in identifying features of biomedical importance. A hierarchical Bayesian model, called Latent Process Decomposition (LPD), is introduced in which each sample in the data set is represented as a combinatorial mixture over a finite set of latent processes, which are expected to correspond to biological processes. Parameters in the model are estimated using efficient variational methods. This type of probabilistic model is most appropriate for the interpretation of measurement data generated by cDNA microarray technology. For determining informative substructure in such data sets, the proposed model has several important advantages over the standard use of dendrograms. First, the ability to objectively assess the optimal number of sample clusters. Second, the ability to represent samples and gene expression levels using a common set of latent variables (dendrograms cluster samples and gene expression values separately which amounts to two distinct reduced space representations). Third, in contrast to standard cluster models, observations are not assigned to a single cluster and, thus, for example, gene expression levels are modeled via combinations of the latent processes identified by the algorithm. We show this new method compares favorably with alternative cluster analysis methods. To illustrate its potential, we apply the proposed technique to several microarray data sets for cancer. For these data sets it successfully decomposes the data into known subtypes and indicates possible further taxonomic subdivision in addition to highlighting, in a wholly unsupervised manner, the importance of certain genes which are known to be medically significant. To illustrate its wider applicability, we also demonstrate its performance on a microarray data set for yeast.

20.
This study was performed in order to evaluate a new LED‐based 2D‐fluorescence spectrometer for in‐line bioprocess monitoring of Chinese hamster ovary (CHO) cell culture processes. The new spectrometer used selected excitation wavelengths of 280, 365, and 455 nm to collect spectral data from six 10‐L fed‐batch processes. The technique provides data on various fluorescent compounds from the cultivation medium as well as from cell metabolism. In addition, scattered light offers information about the cultivation status. Multivariate data analysis tools were applied to analyze the large data sets of the collected fluorescence spectra. First, principal component analysis was used to obtain an overview of all spectral data from all six CHO cultivations. Partial least square regression models were developed to correlate 2D‐fluorescence spectral data with selected critical process variables as offline reference values. A separate independent fed‐batch process was used for model validation and prediction. An almost continuous in‐line bioprocess monitoring was realized because 2D‐fluorescence spectra were collected every 10 min during the whole cultivation. The new 2D‐fluorescence device demonstrates significant potential for accurate prediction of the total cell count, viable cell count, and the cell viability. The results strongly indicated that the technique is particularly capable of distinguishing between different cell statuses inside the bioreactor. In addition, spectral data provided information about the lactate metabolism shift and cellular respiration during the cultivation process. Overall, the 2D‐fluorescence device is a highly sensitive tool for process analytical technology applications in mammalian cell cultures.
