Similar articles
Found 20 similar articles; search took 218 ms.
1.
This paper presents an attribute clustering method which is able to group genes based on their interdependence so as to mine meaningful patterns from the gene expression data. It can be used for gene grouping, selection, and classification. The partitioning of a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. The reduction of search dimension is especially important to data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are typically developed and optimized to scale to the number of tuples instead of the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case the likelihood of reporting patterns that arise by chance and are actually irrelevant becomes rather high. It is for the aforementioned reasons that gene grouping and selection are important preprocessing steps for many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. By applying our algorithm to gene expression data, meaningful clusters of genes are discovered. The grouping of genes based on attribute interdependence within each group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification.
To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find meaningful clusters of genes. By selecting a subset of genes which have high multiple-interdependence with others within clusters, significant classification information can be obtained. Thus, a small pool of selected genes can be used to build classifiers with a very high classification rate. From the pool, gene expressions of different categories can be identified.
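As a rough illustration of an information measure of interdependence between two attributes, the sketch below computes mutual information divided by joint entropy for two discretized expression vectors. The abstract does not state which measure the authors use, so both this normalization and the function name `interdependence` are assumptions made for illustration only.

```python
import math
from collections import Counter

def interdependence(x, y):
    """Mutual information I(X;Y) divided by joint entropy H(X,Y).

    x and y are equal-length sequences of discretized expression levels.
    The result is 0 for independent attributes and 1 for attributes that
    determine each other. This normalization is an assumed choice.
    """
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mutual_info = sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )
    joint_entropy = -sum((c / n) * math.log(c / n) for c in pxy.values())
    return mutual_info / joint_entropy if joint_entropy > 0 else 0.0
```

A clustering criterion could then favor groups whose members score highly against one another, for example by summing this value over all pairs in a group.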

2.
MOTIVATION: With the increasing number of gene expression databases, the need for more powerful analysis and visualization tools is growing. Many techniques have successfully been applied to unravel latent similarities among genes and/or experiments. Most of the current systems for microarray data analysis use statistical methods, hierarchical clustering, self-organizing maps, support vector machines, or k-means clustering to organize genes or experiments into 'meaningful' groups. Without prior explicit bias almost all of these clustering methods applied to gene expression data not only produce different results, but may also produce clusters with little or no biological relevance. Of these methods, agglomerative hierarchical clustering has been the most widely applied, although many limitations have been identified. RESULTS: Starting with a systematic comparison of the underlying theories behind clustering approaches, we have devised a technique that combines tree-structured vector quantization and partitive k-means clustering (BTSVQ). This hybrid technique has revealed clinically relevant clusters in three large publicly available data sets. In contrast to existing systems, our approach is less sensitive to data preprocessing and data normalization. In addition, the clustering results produced by the technique have strong similarities to those of self-organizing maps (SOMs). We discuss the advantages and the mathematical reasoning behind our approach.

3.
This article describes three multivariate projection methods and compares them for their ability to identify clusters of biological samples and genes using real-life data on gene expression levels of leukemia patients. It is shown that principal component analysis (PCA) has the disadvantage that the resulting principal factors are not very informative, while correspondence factor analysis (CFA) has difficulties interpreting distances between objects. Spectral map analysis (SMA) is introduced as an alternative approach to the analysis of microarray data. Weighted SMA outperforms PCA, and is at least as powerful as CFA, in finding clusters in the samples, as well as identifying genes related to these clusters. SMA addresses the problem of data analysis in microarray experiments in a more appropriate manner than CFA, and allows more flexible weighting to the genes and samples. Proper weighting is important, since it enables less reliable data to be down-weighted and more reliable information to be emphasized.

4.
Standard clustering algorithms, when applied to DNA microarray data, often tend to produce erroneous clusters. A major contributor to this problem is a characteristic feature of microarray data sets: the number of predictors (genes) exceeds the number of samples by many orders of magnitude, with only a small percentage of predictors being truly informative with regard to the clustering while the rest merely add noise. An additional complication is that the predictors exhibit an unknown complex correlational configuration embedded in a small subspace of the entire predictor space. Under these conditions, standard clustering algorithms fail to find the true clusters even when applied in tandem with some sort of gene filtering or dimension reduction to reduce the number of predictors. We propose, as an alternative, a novel method for unsupervised classification of DNA microarray data. The method, which is based on the idea of aggregating results obtained from an ensemble of randomly resampled data (where both samples and genes are resampled), introduces a way of tilting the procedure so that the ensemble includes minimal representation from less important areas of the gene predictor space. The method produces a measure of dissimilarity between each pair of samples that can be used in conjunction with (a) a method like Ward's procedure to generate a cluster analysis and (b) multidimensional scaling to generate useful visualizations of the data. We call the dissimilarity measures ABC dissimilarities since they are obtained by aggregating bundles of clusters. An extensive comparison of several clustering methods using actual DNA microarray data convincingly demonstrates that classification using ABC dissimilarities offers significantly superior performance.
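The aggregation step described in this abstract can be sketched as follows. The abstract does not give the formula, so the sketch assumes a common ensemble convention: the dissimilarity of two samples is one minus the fraction of resampled runs, among those in which both samples were drawn, that placed them in the same cluster. The resampling itself, the gene "tilting", and the base clusterer are omitted, and the name `aggregate_dissimilarity` is hypothetical.

```python
from itertools import combinations

def aggregate_dissimilarity(runs, samples):
    """Aggregate cluster labels from resampled runs into pairwise dissimilarities.

    runs: list of dicts mapping sample id -> cluster label, one dict per run,
          covering only the samples drawn in that run.
    Returns {(a, b): dissimilarity} for every pair drawn together at least once.
    """
    together = {pair: 0 for pair in combinations(samples, 2)}
    drawn = dict(together)
    for labels in runs:
        for a, b in together:
            if a in labels and b in labels:
                drawn[(a, b)] += 1
                together[(a, b)] += labels[a] == labels[b]
    return {p: 1 - together[p] / drawn[p] for p in together if drawn[p]}
```

The resulting pairwise values could then be passed to a hierarchical method such as Ward's procedure or to multidimensional scaling, as the abstract suggests.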

5.
One important problem in genomic research is to identify genomic features such as gene expression data or DNA single nucleotide polymorphisms (SNPs) that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups such as genes belonging to the same pathways or SNPs within genes. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that by dividing the variables into appropriate groups, we can obtain better identification of the group features that are related to the phenotypes. In addition, the prediction mean square errors are also smaller than those of the component-wise boosting procedure. We demonstrate the application of the methods to pathway-based analysis of microarray gene expression data of breast cancer. Results from analysis of a breast cancer microarray gene expression data set indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer-specific survival.

6.
Computational analysis methods for gene expression data gathered in microarray experiments can be used to identify the functions of previously unstudied genes. While obtaining the expression data is not a difficult task, interpreting and extracting the information from the datasets is challenging. In this study, a knowledge-based approach which identifies and saves important functional genes before filtering based on variability and fold change differences was utilized to study light regulation. Two clustering methods were used to cluster the filtered datasets, and clusters containing a key light regulatory gene were located. The genes common to both of these clusters were identified, and the genes in the common cluster were ranked based on their coexpression with the key gene. This process was repeated for 11 key genes in 3 treatment combinations. The initial filtering method reduced the dataset size from 22,814 probes to an average of 1134 genes, and the resulting common cluster lists contained an average of only 14 genes. These common cluster lists achieved higher gene enrichment scores than the two individual clustering methods. In addition, the filtering method increased the proportion of light responsive genes in the dataset from 1.8% to 15.2%, and the cluster lists increased this proportion to 18.4%. The relatively short length of these common cluster lists compared to gene groups generated through typical clustering methods or coexpression networks narrows the search for novel functional genes while increasing the likelihood that they are biologically relevant.
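The "common cluster" step can be sketched as a set intersection followed by a ranking. The abstract does not say how coexpression is scored, so Pearson correlation is assumed here; the function names and the shape of the `expression` mapping are likewise hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def rank_common_cluster(cluster_a, cluster_b, expression, key_gene):
    """Genes found in both clusters, most coexpressed with key_gene first.

    cluster_a, cluster_b: gene ids in the cluster that contains key_gene,
    one list per clustering method.
    expression: dict mapping gene id -> list of expression values.
    """
    common = (set(cluster_a) & set(cluster_b)) - {key_gene}
    key = expression[key_gene]
    return sorted(common, key=lambda g: pearson(expression[g], key), reverse=True)
```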

7.
MOTIVATION: Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. RESULTS: We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm is guaranteed to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search for the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. This statistically rigorous test has shown that our algorithm places more than 90% of genes into the correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.
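A minimal sketch of clustering by simulated annealing is given below, assuming a within-cluster sum-of-squares cost, single-gene reassignment moves, and a geometric cooling schedule. None of these choices is taken from the abstract, and the scheme for choosing the number of clusters is not reproduced. Geometric cooling is a practical shortcut; convergence guarantees for simulated annealing generally rely on much slower schedules.

```python
import math
import random

def within_cluster_cost(profiles, labels, k):
    """Sum of squared distances from each profile to its cluster centroid."""
    cost = 0.0
    for c in range(k):
        members = [p for p, l in zip(profiles, labels) if l == c]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        cost += sum((v - m) ** 2 for p in members for v, m in zip(p, centroid))
    return cost

def anneal(profiles, k, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Return (best_labels, best_cost) found by annealing over assignments."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in profiles]
    cost = within_cluster_cost(profiles, labels, k)
    best_labels, best_cost = labels[:], cost
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(profiles))
        old = labels[i]
        labels[i] = rng.randrange(k)  # propose moving one gene
        new_cost = within_cluster_cost(profiles, labels, k)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best_labels, best_cost = labels[:], cost
        else:
            labels[i] = old  # reject the move
        temp *= cooling
    return best_labels, best_cost
```

Recomputing the full cost on every move keeps the sketch short; a practical version would update the cost incrementally.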

8.
Gasch AP, Eisen MB. Genome Biology, 2002, 3(11): research0059.1-research0059.22

9.
The study of conserved gene clusters is important for understanding the forces behind genome organization and evolution, as well as the function of individual genes or gene groups. In this paper, we present a new model and algorithm for identifying conserved gene clusters from pairwise genome comparison. This generalizes a recent model called "gene teams." A gene team is a set of genes that appear homologously in two or more species, possibly in a different order, yet with the distance between adjacent genes in the team never exceeding a certain threshold on any chromosome. We remove the constraint in the original model that each gene must have a unique occurrence in each chromosome and thus allow the analysis of complex prokaryotic or eukaryotic genomes with extensive paralogs. Our algorithm analyzes a pair of chromosomes in O(mn) time and uses O(m+n) space, where m and n are the number of genes in the respective chromosomes. We demonstrate the utility of our methods by studying two bacterial genomes, E. coli K-12 and B. subtilis. Many of the teams identified by our algorithm correlate with documented E. coli operons, while several others match predicted operons, previously suggested by computational techniques. Our implementation and data are publicly available at euler.slu.edu/~goldwasser/homologyteams/.

10.
MOTIVATION: Clustering has been used as a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Many clustering methods have been proposed for clustering gene-expression data, including hierarchical clustering, k-means clustering and self-organizing maps (SOMs). However, the conventional methods are limited in their ability to identify clusters of different shapes because they use a fixed distance norm when calculating the distance between genes. The fixed distance norm imposes a fixed geometrical shape on the clusters regardless of the actual data distribution. Thus, different distance norms are required for handling the different shapes of clusters. RESULTS: We present the Gustafson-Kessel (GK) clustering method for microarray gene-expression data. To detect clusters of different shapes in a dataset, we use an adaptive distance norm that is calculated from a fuzzy covariance matrix (F) of each cluster, in which the eigenstructure of F is used as an indicator of the shape of the cluster. Moreover, the GK method is less prone to falling into local minima than k-means and SOM because it makes decisions through the use of membership degrees of a gene to clusters. The algorithmic procedure is accomplished by the alternating optimization technique, which iteratively improves a sequence of sets of clusters until no further improvement is possible. To test the performance of the GK method, we applied the GK method and well-known conventional methods to three recently published yeast datasets, and compared the performance of each method using the Saccharomyces Genome Database annotations. The clustering results of the GK method are significantly more relevant to the biological annotations than those of the other methods, demonstrating its effectiveness and potential for clustering gene-expression data.
AVAILABILITY: The software was developed in Java and can be executed on any platform that runs a Java Virtual Machine (JVM). It is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at http://dragon.kaist.ac.kr/gk.
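The adaptive distance norm can be sketched for a single cluster as follows, using the standard Gustafson-Kessel definition: the norm-inducing matrix is A = (rho * det F)^(1/n) * F^-1, where F is the fuzzy covariance matrix and n the data dimension. To stay dependency-free the sketch is restricted to two-dimensional data (n = 2). The function name `gk_distance_2d`, the fuzzifier default m = 2, and the volume parameter rho = 1 are illustrative assumptions, not details taken from the abstract or from the authors' Java software.

```python
def gk_distance_2d(points, memberships, m=2.0, rho=1.0):
    """Return (center, dist2) for one cluster of 2-D points.

    points: list of (x, y) tuples; memberships: degree of each point in the cluster.
    dist2(p) is the squared Gustafson-Kessel distance from p to the cluster center.
    Assumes the fuzzy covariance matrix is non-singular (det F > 0).
    """
    w = [u ** m for u in memberships]
    total = sum(w)
    cx = sum(wi * x for wi, (x, _) in zip(w, points)) / total
    cy = sum(wi * y for wi, (_, y) in zip(w, points)) / total
    # Fuzzy covariance matrix F = [[fxx, fxy], [fxy, fyy]].
    fxx = sum(wi * (x - cx) ** 2 for wi, (x, _) in zip(w, points)) / total
    fyy = sum(wi * (y - cy) ** 2 for wi, (_, y) in zip(w, points)) / total
    fxy = sum(wi * (x - cx) * (y - cy) for wi, (x, y) in zip(w, points)) / total
    det = fxx * fyy - fxy ** 2
    scale = (rho * det) ** 0.5 / det  # (rho * det F)^(1/n) / det F, with n = 2
    # Norm-inducing matrix A = (rho * det F)^(1/n) * F^-1.
    axx, axy, ayy = scale * fyy, -scale * fxy, scale * fxx

    def dist2(p):
        dx, dy = p[0] - cx, p[1] - cy
        return axx * dx * dx + 2 * axy * dx * dy + ayy * dy * dy

    return (cx, cy), dist2
```

For a cluster stretched along the x-axis, a point far out along x and a nearer point along y can come out equally distant under this norm, whereas a Euclidean norm would treat the first as farther away. This is how the norm adapts to cluster shape.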

11.
A fundamental problem in DNA microarray analysis is the lack of a common standard to compare the expression levels of different samples. Several normalization protocols have been proposed to overcome variables inherent in this technology. As yet, there are no satisfactory methods to exchange gene expression data among different research groups or to compare gene expression values under different stimulus–response profiles. We have tested a normalization procedure based on comparing gene expression levels to the signals generated from hybridizing genomic DNA (genomic normalization). This procedure was applied to DNA microarrays of Mycobacterium tuberculosis using RNA extracted from cultures growing to the logarithmic and stationary phases. The applied normalization procedure generated reproducible measurements of expression level for 98% of the putative mycobacterial ORFs, among which 5.2% were significantly changed when comparing the logarithmic to the stationary growth phase. Additionally, analysis of expression levels of a subset of genes by real-time PCR technology revealed an agreement in expression of 90% of the examined genes when genomic DNA normalization was applied, compared with 29–68% agreement when RNA normalization was used to measure the expression levels in the same set of RNA samples. Further examination of microarray expression levels displayed clusters of genes differentially expressed between the logarithmic, early stationary and late stationary growth phases. We conclude that genomic DNA standards offer advantages over conventional RNA normalization procedures and can be adapted for the investigation of microbial genomes.
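The idea of genomic normalization can be illustrated with a per-ORF ratio. The abstract does not give the exact computation, so the log2 ratio of the RNA-derived signal to the genomic DNA signal, the function name, and the placeholder ORF identifiers in the usage note are all assumptions.

```python
import math

def genomic_normalize(rna_signal, gdna_signal):
    """Log2 ratio of RNA-derived signal to genomic DNA signal for each ORF.

    Both arguments map ORF id -> spot intensity. ORFs with a missing or
    non-positive signal in either channel are skipped (an assumed policy).
    """
    return {
        orf: math.log2(rna_signal[orf] / gdna_signal[orf])
        for orf in rna_signal
        if gdna_signal.get(orf, 0) > 0 and rna_signal[orf] > 0
    }
```

For example, `genomic_normalize({"orfA": 800.0, "orfB": 100.0}, {"orfA": 200.0, "orfB": 400.0})` gives 2.0 for `orfA` and -2.0 for `orfB`. Because each ORF is scaled by the signal from a fixed genomic standard rather than by another RNA sample, values from different experiments share a common reference.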

12.
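Biological sequence data in public databases are mined using different clustering approaches, with the aim of understanding the relationships among the sequences within a broader and deeper framework. Taking the mRNA sequences corresponding to Parkinson's disease-related genes as an example, the score of a pairwise sequence alignment is used to define the distance between sequences. To address the discrepancies between different cluster analyses, two different methods, fuzzy clustering and hierarchical clustering, are each used for the cluster analysis. The classifications on which the different clustering methods agree provide support for the functional classification of genes and a reference for further revealing the biological knowledge and regularities contained in biological sequences.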

13.

BACKGROUND: Many different cluster methods are frequently used in gene expression data analysis to find groups of co-expressed genes. However, cluster algorithms with the ability to visualize the resulting clusters are usually preferred. The visualization of gene clusters gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results.

14.
MOTIVATION: With the emergence of genome-wide expression profiling data sets, the guilt by association (GBA) principle has been a cornerstone for deriving gene functional interpretations in silico. Given the limited success of traditional methods for producing clusters of genes with great amounts of functional similarity, new data-mining algorithms are required to fully exploit the potential of high-throughput genomic approaches. RESULTS: Ontology-based pattern identification (OPI) is a novel data-mining algorithm that systematically identifies expression patterns that best represent existing knowledge of gene function. Instead of relying on a universal threshold of expression similarity to define functionally related groups of genes, OPI finds the optimal analysis settings that yield gene expression patterns and gene lists that best predict gene function using the principle of GBA. We applied OPI to a publicly available gene expression data set on the life cycle of the malarial parasite Plasmodium falciparum and systematically annotated genes for 320 functional categories based on current Gene Ontology annotations. An ontology-based hierarchical tree of the 320 categories provided a systems-wide biological view of this important malarial parasite.

15.
MOTIVATION: Over the last decade, a large variety of clustering algorithms have been developed to detect coregulatory relationships among genes from microarray gene expression data. Model-based clustering approaches have emerged as statistically well-grounded methods, but the properties of these algorithms when applied to large-scale data sets are not always well understood. An in-depth analysis can reveal important insights about the performance of the algorithm, the expected quality of the output clusters, and the possibilities for extracting more relevant information out of a particular data set. RESULTS: We have extended an existing algorithm for model-based clustering of genes to simultaneously cluster genes and conditions, and used three large compendia of gene expression data for Saccharomyces cerevisiae to analyze its properties. The algorithm uses a Bayesian approach and a Gibbs sampling procedure to iteratively update the cluster assignment of each gene and condition. For large-scale data sets, the posterior distribution is strongly peaked on a limited number of equiprobable clusterings. A GO annotation analysis shows that these local maxima are all biologically equally significant, and that simultaneously clustering genes and conditions performs better than only clustering genes and assuming independent conditions. A collection of distinct equivalent clusterings can be summarized as a weighted graph on the set of genes, from which we extract fuzzy, overlapping clusters using a graph spectral method. The cores of these fuzzy clusters contain tight sets of strongly coexpressed genes, while the overlaps exhibit relations between genes showing only partial coexpression. AVAILABILITY: GaneSh, a Java package for coclustering, is available under the terms of the GNU General Public License from our website at http://bioinformatics.psb.ugent.be/software

16.
Identifying which genes and which gene sets are differentially expressed (DE) under two experimental conditions are both key questions in microarray analysis. Although closely related and seemingly similar, they cannot replace each other, due to their own importance and merits in scientific discoveries. Existing approaches have been developed to address only one of the two questions. Further, most of the methods for detecting DE genes rely purely on gene expression analysis, without using the information about gene functional grouping. Methods for detecting altered gene sets often use a two-step procedure, of which the first step conducts differential expression analysis using expression data only, and the second step takes results from the first step and examines whether each predefined gene set is overrepresented by DE genes through some testing procedure. Such a sequential analysis might cause information loss, because the second step focuses on summary results without using the entire expression data. Here, we propose a Bayesian joint modeling approach to address the two key questions in parallel, which incorporates the information of functional annotations into expression data analysis and at the same time infers the enrichment of functional groups. Simulation results and analysis of experimental data obtained for E. coli show improved statistical power of our integrated approach in identifying both DE genes and altered gene sets, when compared to conventional methods.

17.
Directed indices for exploring gene expression data (cited 1 time: 0 self-citations, 1 citation by others)
MOTIVATION: Large expression studies with clinical outcome data are becoming available for analysis. An important goal is to identify genes or clusters of genes where expression is related to patient outcome. While clustering methods are useful data exploration tools, they do not directly allow one to relate the expression data to clinical outcome. Alternatively, methods which rank genes based on their univariate significance do not incorporate gene function or relationships to genes that have been previously identified. In addition, after sifting through potentially thousands of genes, algorithms that report summary estimates (e.g. regression coefficients or error rates) should address the potentially large bias introduced by gene selection. RESULTS: We developed a gene index technique that generalizes methods that rank genes by their univariate associations to patient outcome. Genes are ordered based on simultaneously linking their expression both to patient outcome and to a specific gene of interest. The technique can also be used to suggest profiles of gene expression related to patient outcome. A cross-validation method is shown to be important for reducing bias due to adaptive gene selection. The methods are illustrated on a recently collected gene expression data set based on 160 patients with diffuse large cell lymphoma (DLCL).

18.
The class I and II major histocompatibility complex (MHC) genes are apparently subject to evolution by a birth-and-death process. The rate of gene turnover is much slower in the latter genes than in the former. In placental mammals, the class II region can be subdivided into different orthologous subregions or gene clusters (DR, DQ, DO, and DN), but the origins and evolutionary relationships of these gene clusters are not well established. Here we report the results of our study of the times of origin and evolutionary relationships of these gene clusters in mammals. Our analysis suggests that both class II alpha-chain and beta-chain gene clusters are shared by placental mammals and marsupials, but the gene clusters from nonmammalian species are paralogous to mammalian gene clusters. We estimated the times of divergence between gene clusters in placental mammals using the linearized tree and distance regression methods. Our results indicate that most gene clusters originated 170-200 million years (MY) ago, but that DO beta-chain genes diverged from the other beta-chain gene clusters approximately 210-260 MY ago. The phylogenetic trees for the alpha- and beta-chain genes were not congruent, suggesting that the evolutionary history of the class II gene clusters is more complex than previously thought.

19.
We describe for the first time functional clusters of genes that are modulated during the differentiation of osteoclasts. Pathway analysis was applied to gene array data generated from Affymetrix chips hybridized to RNA isolated from RAW264.7 cells exposed to RANK-ligand (RANK-L) for 5 days. This analysis revealed major functional gene clusters that were either up- or down-regulated during osteoclastogenesis. Some of the genes within the clusters have known functions, while others do not. We discuss herein the relevance of these functional gene clusters and their modulation to biological processes underlying the formation, function, and fate of osteoclasts.

20.
Recent whole-genome studies and in-depth expressed sequence tag (EST) analyses have identified most of the developmentally relevant genes in the urochordate, Ciona intestinalis. In this study, we made use of a large-scale oligo-DNA microarray to further investigate and identify genes with specific or correlated expression profiles, and we report global gene expression profiles for about 66% of all the C. intestinalis genes that are expressed during its life cycle. We succeeded in categorizing the data set into 5 large clusters and 49 sub-clusters based on the expression profile of each gene. This revealed the higher order of gene expression profiles during the developmental and aging stages. Furthermore, a combined analysis of microarray data with the EST database revealed the gene groups that were expressed at a specific stage or in a specific organ of the adult. This study provides insights into the complex structure of ascidian gene expression, identifies co-expressed gene groups and marker genes and makes predictions for the biological roles of many uncharacterized genes. This large-scale oligo-DNA microarray for C. intestinalis should facilitate the understanding of global gene expression and gene networks during the development and aging of a basal chordate.
