首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
2.

Background  

The versatility of DNA copy number amplifications for profiling and categorization of various tissue samples has been widely acknowledged in the biomedical literature. For instance, this type of measurement techniques provides possibilities for exploring sets of cancerous tissues to identify novel subtypes. The previously utilized statistical approaches to various kinds of analyses include traditional algorithmic techniques for clustering and dimension reduction, such as independent and principal component analyses, hierarchical clustering, as well as model-based clustering using maximum likelihood estimation for latent class models.  相似文献   

3.
MOTIVATION: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. RESULTS: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step-by-step. This approach allows us to find as many clusters as a data set contains with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used, contains over 16 000 genes and this number is considerably reduced by feature selection. We propose a classification algorithm where each tissue sample is considered as the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm in combination with the feature selection algorithm perform slightly better than the published results for multi-class classifiers based on support vector machines for this data set. AVAILABILITY: Available on request from the authors.  相似文献   

4.
Standard clustering algorithms when applied to DNA microarray data often tend to produce erroneous clusters. A major contributor to this divergence is the feature characteristic of microarray data sets that the number of predictors (genes) in such data far exceeds the number of samples by many orders of magnitude, with only a small percentage of predictors being truly informative with regards to the clustering while the rest merely add noise. An additional complication is that the predictors exhibit an unknown complex correlational configuration embedded in a small subspace of the entire predictor space. Under these conditions, standard clustering algorithms fail to find the true clusters even when applied in tandem with some sort of gene filtering or dimension reduction to reduce the number of predictors. We propose, as an alternative, a novel method for unsupervised classification of DNA microarray data. The method, which is based on the idea of aggregating results obtained from an ensemble of randomly resampled data (where both samples and genes are resampled), introduces a way of tilting the procedure so that the ensemble includes minimal representation from less important areas of the gene predictor space. The method produces a measure of dissimilarity between each pair of samples that can be used in conjunction with (a) a method like Ward's procedure to generate a cluster analysis and (b) multidimensional scaling to generate useful visualizations of the data. We call the dissimilarity measures ABC dissimilarities since they are obtained by aggregating bundles of clusters. An extensive comparison of several clustering methods using actual DNA microarray data convincingly demonstrates that classification using ABC dissimilarities offers significantly superior performance.  相似文献   

5.
Multi-class clustering and prediction in the analysis of microarray data   总被引:1,自引:0,他引:1  
DNA microarray technology provides tools for studying the expression profiles of a large number of distinct genes simultaneously. This technology has been applied to sample clustering and sample prediction. Because of a large number of genes measured, many of the genes in the original data set are irrelevant to the analysis. Selection of discriminatory genes is critical to the accuracy of clustering and prediction. This paper considers statistical significance testing approach to selecting discriminatory gene sets for multi-class clustering and prediction of experimental samples. A toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV with a total of 55 samples) is used to illustrate a general framework of the approach. Among four selected gene sets, a gene set omega(I) formed by the intersection of the F-test and the set of the union of one-versus-all t-tests performs the best in terms of clustering as well as prediction. Hierarchical and two modified partition (k-means) methods all show that the set omega(I) is able to group the 55 samples into seven clusters reasonably well, in which the As and AsV samples are considered as one cluster (the same group) as are the Cd and Cu samples. With respect to prediction, the overall accuracy for the gene set omega(I) using the nearest neighbors algorithm to predict 55 samples into one of the nine treatments is 85%.  相似文献   

6.
Ab initio protein structure prediction methods first generate large sets of structural conformations as candidates (called decoys), and then select the most representative decoys through clustering techniques. Classical clustering methods are inefficient due to the pairwise distance calculation, and thus become infeasible when the number of decoys is large. In addition, the existing clustering approaches suffer from the arbitrariness in determining a distance threshold for proteins within a cluster: a small distance threshold leads to many small clusters, while a large distance threshold results in the merging of several independent clusters into one cluster. In this paper, we propose an efficient clustering method through fast estimating cluster centroids and efficient pruning rotation spaces. The number of clusters is automatically detected by information distance criteria. A package named ONION, which can be downloaded freely, is implemented accordingly. Experimental results on benchmark data sets suggest that ONION is 14 times faster than existing tools, and ONION obtains better selections for 31 targets, and worse selection for 19 targets compared to SPICKER’s selections. On an average PC, ONION can cluster 100,000 decoys in around 12 minutes.  相似文献   

7.
Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data- summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we may gain new insight into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated. In this paper, we propose 2 model selection procedures for model-based clustering. Model selection in model-based clustering has to date focused on the identification of data dimensions that are relevant for clustering. However, in more complex data structures, with multiple experimental factors, such an approach does not provide easily interpreted clustering outcomes. We propose a mixture model with multiple levels, , that provides sparse representations both "within" and "between" cluster profiles. We explore various flexible "within-cluster" parameterizations and discuss how efficient parameterizations can greatly enhance the objective interpretability of the generated clusters. Moreover, we allow for a sparse "between-cluster" representation with a different number of clusters at different levels of an experimental factor of interest. This enhances interpretability of clusters generated in multiple-factor contexts. Interpretable cluster profiles can assist in detecting biologically relevant groups of genes that may be missed with less efficient parameterizations. We use our multilevel mixture model to mine a proliferating cell line expression data set for annotational context and regulatory motifs. We also investigate the performance of the multilevel clustering approach on several simulated data sets.  相似文献   

8.
MOTIVATION: Current Self-Organizing Maps (SOMs) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. RESULTS: The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, therefore labelled as uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in data (cancerous and normal). AVAILABILITY: JAVA software of dynamic SOM tree algorithm is available upon request for academic use. SUPPLEMENTARY INFORMATION: A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf  相似文献   

9.
With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized as samples versus genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in successful diagnosis of particular cancer types. In this article, we have presented an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. In this regard, a real-coded encoding of the cluster centers is used and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach to combine the clustering information possessed by the non-dominated solutions through Support Vector Machine (SVM) classifier has been proposed. Final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms for three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed clustering method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are found to be promising and can possibly have important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.  相似文献   

10.
Bayesian mixture model based clustering of replicated microarray data   总被引:3,自引:0,他引:3  
MOTIVATION: Identifying patterns of co-expression in microarray data by cluster analysis has been a productive approach to uncovering molecular mechanisms underlying biological processes under investigation. Using experimental replicates can generally improve the precision of the cluster analysis by reducing the experimental variability of measurements. In such situations, Bayesian mixtures allow for an efficient use of information by precisely modeling between-replicates variability. RESULTS: We developed different variants of Bayesian mixture based clustering procedures for clustering gene expression data with experimental replicates. In this approach, the statistical distribution of microarray data is described by a Bayesian mixture model. Clusters of co-expressed genes are created from the posterior distribution of clusterings, which is estimated by a Gibbs sampler. We define infinite and finite Bayesian mixture models with different between-replicates variance structures and investigate their utility by analyzing synthetic and the real-world datasets. Results of our analyses demonstrate that (1) improvements in precision achieved by performing only two experimental replicates can be dramatic when the between-replicates variability is high, (2) precise modeling of intra-gene variability is important for accurate identification of co-expressed genes and (3) the infinite mixture model with the 'elliptical' between-replicates variance structure performed overall better than any other method tested. We also introduce a heuristic modification to the Gibbs sampler based on the 'reverse annealing' principle. This modification effectively overcomes the tendency of the Gibbs sampler to converge to different modes of the posterior distribution when started from different initial positions. Finally, we demonstrate that the Bayesian infinite mixture model with 'elliptical' variance structure is capable of identifying the underlying structure of the data without knowing the 'correct' number of clusters. AVAILABILITY: The MS Windows based program named Gaussian Infinite Mixture Modeling (GIMM) implementing the Gibbs sampler and corresponding C++ code are available at http://homepages.uc.edu/~medvedm/GIMM.htm SUPPLEMENTAL INFORMATION: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html  相似文献   

11.
MOTIVATION: Over the last decade, a large variety of clustering algorithms have been developed to detect coregulatory relationships among genes from microarray gene expression data. Model-based clustering approaches have emerged as statistically well-grounded methods, but the properties of these algorithms when applied to large-scale data sets are not always well understood. An in-depth analysis can reveal important insights about the performance of the algorithm, the expected quality of the output clusters, and the possibilities for extracting more relevant information out of a particular data set. RESULTS: We have extended an existing algorithm for model-based clustering of genes to simultaneously cluster genes and conditions, and used three large compendia of gene expression data for Saccharomyces cerevisiae to analyze its properties. The algorithm uses a Bayesian approach and a Gibbs sampling procedure to iteratively update the cluster assignment of each gene and condition. For large-scale data sets, the posterior distribution is strongly peaked on a limited number of equiprobable clusterings. A GO annotation analysis shows that these local maxima are all biologically equally significant, and that simultaneously clustering genes and conditions performs better than only clustering genes and assuming independent conditions. A collection of distinct equivalent clusterings can be summarized as a weighted graph on the set of genes, from which we extract fuzzy, overlapping clusters using a graph spectral method. The cores of these fuzzy clusters contain tight sets of strongly coexpressed genes, while the overlaps exhibit relations between genes showing only partial coexpression. AVAILABILITY: GaneSh, a Java package for coclustering, is available under the terms of the GNU General Public License from our website at http://bioinformatics.psb.ugent.be/software  相似文献   

12.
The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear discriminant analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable for undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called uncorrelated linear discriminant analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the generalized singular value decomposition method to handle undersampled data and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets  相似文献   

13.
MOTIVATION: Feature (gene) selection can dramatically improve the accuracy of gene expression profile based sample class prediction. Many statistical methods for feature (gene) selection such as stepwise optimization and Monte Carlo simulation have been developed for tissue sample classification. In contrast to class prediction, few statistical and computational methods for feature selection have been applied to clustering algorithms for pattern discovery. RESULTS: An integrated scheme and corresponding program SamCluster for automatic discovery of sample classes based on gene expression profile is presented in this report. The scheme incorporates the feature selection algorithms based on the calculation of CV (coefficient of variation) and t-test into hierarchical clustering and proceeds as follows. At first, the genes with their CV greater than the pre-specified threshold are selected for cluster analysis, which results in two putative sample classes. Then, significantly differentially expressed genes in the two putative sample classes with p-values < or = 0.01, 0.05, or 0.1 from t-test are selected for further cluster analysis. The above processes were iterated until the two stable sample classes were found. Finally, the consensus sample classes are constructed from the putative classes that are derived from the different CV thresholds, and the best putative sample classes that have the minimum distance between the consensus classes and the putative classes are identified. To evaluate the performance of the feature selection for cluster analysis, the proposed scheme was applied to four expression datasets COLON, LEUKEMIA72, LEUKEMIA38, and OVARIAN. The results show that there are only 5, 1, 0, and 0 samples that have been misclassified, respectively. We conclude that the proposed scheme, SamCluster, is an efficient method for discovery of sample classes using gene expression profile. AVAILABILITY: The related program SamCluster is available upon request or from the web page http://www.sph.uth.tmc.edu:8052/hgc/Downloads.asp.  相似文献   

14.
结合基因功能分类体系Gene Ontology筛选聚类特征基因   总被引:3,自引:0,他引:3  
使用两套基因表达谱数据,按各基因的表达值方差,选择表达变异基因对样本聚类,发现一般使用方差较大的前10%的基因作为特征基因,就可以较好地对疾病样本聚类。对不同的疾病,包含聚类信息的特征基因有不同的分布特点。在此基础上,结合基因功能分类体系(Gene Ontology,GO),进一步筛选聚类的特征基因。通过检验在Gene Ontology中的每个功能类中的表达变异基因是否非随机地聚集,寻找疾病相关功能类,再根据相关功能类中的表达变异基因进行聚类分析。实验结果显示:结合基因功能体系进一步筛选表达变异基因作为聚类特征基因,可以保持或提高聚类准确性,并使得聚类结果具有明确的生物学意义。另外,发现了一些可能和淋巴瘤和白血病相关的基因。  相似文献   

15.
MOTIVATION: Recent studies have shown that microarray gene expression data are useful for phenotype classification of many diseases. A major problem in this classification is that the number of features (genes) greatly exceeds the number of instances (tissue samples). It has been shown that selecting a small set of informative genes can lead to improved classification accuracy. Many approaches have been proposed for this gene selection problem. Most of the previous gene ranking methods typically select 50-200 top-ranked genes and these genes are often highly correlated. Our goal is to select a small set of non-redundant marker genes that are most relevant for the classification task. RESULTS: To achieve this goal, we developed a novel hybrid approach that combines gene ranking and clustering analysis. In this approach, we first applied feature filtering algorithms to select a set of top-ranked genes, and then applied hierarchical clustering on these genes to generate a dendrogram. Finally, the dendrogram was analyzed by a sweep-line algorithm and marker genes are selected by collapsing dense clusters. Empirical study using three public datasets shows that our approach is capable of selecting relatively few marker genes while offering the same or better leave-one-out cross-validation accuracy compared with approaches that use top-ranked genes directly for classification. AVAILABILITY: The HykGene software is freely available at http://www.cs.dartmouth.edu/~wyh/software.htm CONTACT: wyh@cs.dartmouth.edu SUPPLEMENTARY INFORMATION: Supplementary material is available from http://www.cs.dartmouth.edu/~wyh/hykgene/supplement/index.htm.  相似文献   

16.
17.
MOTIVATION: An important goal of microarray studies is to discover genes that are associated with clinical outcomes, such as disease status and patient survival. While a typical experiment surveys gene expressions on a global scale, there may be only a small number of genes that have significant influence on a clinical outcome. Moreover, expression data have cluster structures and the genes within a cluster have correlated expressions and coordinated functions, but the effects of individual genes in the same cluster may be different. Accordingly, we seek to build statistical models with the following properties. First, the model is sparse in the sense that only a subset of the parameter vector is non-zero. Second, the cluster structures of gene expressions are properly accounted for. RESULTS: For gene expression data without pathway information, we divide genes into clusters using commonly used methods, such as K-means or hierarchical approaches. The optimal number of clusters is determined using the Gap statistic. We propose a clustering threshold gradient descent regularization (CTGDR) method, for simultaneous cluster selection and within cluster gene selection. We apply this method to binary classification and censored survival analysis. Compared to the standard TGDR and other regularization methods, the CTGDR takes into account the cluster structure and carries out feature selection at both the cluster level and within-cluster gene level. We demonstrate the CTGDR on two studies of cancer classification and two studies correlating survival of lymphoma patients with microarray expressions. AVAILABILITY: R code is available upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

18.

Background  

The analysis of large-scale data sets via clustering techniques is utilized in a number of applications. Biclustering in particular has emerged as an important problem in the analysis of gene expression data since genes may only jointly respond over a subset of conditions. Biclustering algorithms also have important applications in sample classification where, for instance, tissue samples can be classified as cancerous or normal. Many of the methods for biclustering, and clustering algorithms in general, utilize simplified models or heuristic strategies for identifying the "best" grouping of elements according to some metric and cluster definition and thus result in suboptimal clusters.  相似文献   

19.
Hoff PD 《Biometrics》2005,61(4):1027-1036
This article develops a model-based approach to clustering multivariate binary data, in which the attributes that distinguish a cluster from the rest of the population may depend on the cluster being considered. The clustering approach is based on a multivariate Dirichlet process mixture model, which allows for the estimation of the number of clusters, the cluster memberships, and the cluster-specific parameters in a unified way. Such a clustering approach has applications in the analysis of genomic abnormality data, in which the development of different types of tumors may depend on the presence of certain abnormalities at subsets of locations along the genome. Additionally, such a mixture model provides a nonparametric estimation scheme for dependent sequences of binary data.  相似文献   

20.
MOTIVATION: It is a common practice in cancer microarray experiments that a normal tissue is collected from the same individual from whom the tumor tissue was taken. The indirect design is usually adopted for the experiment that uses a common reference RNA hybridized both to normal and tumor tissues. However, it is often the case that the test material is not large enough for the experimenter to extract enough RNA to conduct the microarray experiment. Hence, collecting n cases does not necessarily end up with a matched pair sample of size n. Instead we usually have a matched pair sample of size n1, and two independent samples of sizes n2 and n3, respectively, for 'reference versus normal tissue only' and 'reference versus tumor tissue only' hybridizations (n=n1 + n2 + n3). Standard statistical methods need to be modified and new statistical procedures are developed for analyzing this mixed dataset. RESULTS: We propose a new test statistic, t3, as a means of combining all the information in the mixed dataset for detecting differentially expressed (DE) genes between normal and tumor tissues. We employed the extended receiver operating characteristic approach to the mixed dataset. We devised a measure of disagreement between a RT-PCR experiment and a microarray experiment. Hotelling's T2 statistic is employed to detect a set of DE genes and its prediction rate is compared with the prediction rate of a univariate procedure. We observe that Hotelling's T2 statistic detects DE genes more efficiently than a univariate procedure and that further research is warranted on the formal test procedure using Hotelling's T2 statistic. CONTACT: bskim@yonsei.ac.kr.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号