首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 828 毫秒
1.
MOTIVATION: Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. RESULTS: We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm guarantees to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search of the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places greater than 90% genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.  相似文献   

2.
Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight clusters that focus on their cores or wide clusters that overlap and contain all possibly relevant genes. In this paper, a new clustering paradigm is proposed. In this paradigm, all three eventualities of a gene being exclusively assigned to a single cluster, being assigned to multiple clusters, and being not assigned to any cluster are possible. These possibilities are realised through the primary novelty of the introduction of tunable binarization techniques. Results from multiple clustering experiments are aggregated to generate one fuzzy consensus partition matrix (CoPaM), which is then binarized to obtain the final binary partitions. This is referred to as Binarization of Consensus Partition Matrices (Bi-CoPaM). The method has been tested with a set of synthetic datasets and a set of five real yeast cell-cycle datasets. The results demonstrate its validity in generating relevant tight, wide, and complementary clusters that can meet requirements of different gene discovery studies.  相似文献   

3.
4.
Many bioinformatics problems can be tackled from a fresh angle offered by the network perspective. Directly inspired by metabolic network structural studies, we propose an improved gene clustering approach for inferring gene signaling pathways from gene microarray data. Based on the construction of co-expression networks that consists of both significantly linear and non-linear gene associations together with controlled biological and statistical significance, our approach tends to group functionally related genes into tight clusters despite their expression dissimilarities. We illustrate our approach and compare it to the traditional clustering approaches on a yeast galactose metabolism dataset and a retinal gene expression dataset. Our approach greatly outperforms the traditional approach in rediscovering the relatively well known galactose metabolism pathway in yeast and in clustering genes of the photoreceptor differentiation pathway. AVAILABILITY: The clustering method has been implemented in an R package "GeneNT" that is freely available from: http://www.cran.org.  相似文献   

5.

Background

A goal of systems biology is to analyze large-scale molecular networks including gene expressions and protein-protein interactions, revealing the relationships between network structures and their biological functions. Dividing a protein-protein interaction (PPI) network into naturally grouped parts is an essential way to investigate the relationship between topology of networks and their functions. However, clear modular decomposition is often hard due to the heterogeneous or scale-free properties of PPI networks.

Methodology/Principal Findings

To address this problem, we propose a diffusion model-based spectral clustering algorithm, which analytically solves the cluster structure of PPI networks as a problem of random walks in the diffusion process in them. To cope with the heterogeneity of the networks, the power factor is introduced to adjust the diffusion matrix by weighting the transition (adjacency) matrix according to a node degree matrix. This algorithm is named adjustable diffusion matrix-based spectral clustering (ADMSC). To demonstrate the feasibility of ADMSC, we apply it to decomposition of a yeast PPI network, identifying biologically significant clusters with approximately equal size. Compared with other established algorithms, ADMSC facilitates clear and fast decomposition of PPI networks.

Conclusions/Significance

ADMSC is proposed by introducing the power factor that adjusts the diffusion matrix to the heterogeneity of the PPI networks. ADMSC effectively partitions PPI networks into biologically significant clusters with almost equal sizes, while being very fast, robust and appealing simple.  相似文献   

6.

Background  

In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicates the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selection of a Bayesian classifier with an appropriate level of complexity by evaluation of predictive performance on independent data sets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) functional analysis of the informative genes.  相似文献   

7.
8.
An ensemble framework for clustering protein-protein interaction networks   总被引:3,自引:0,他引:3  
MOTIVATION: Protein-Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. However, the application of traditional clustering algorithms for extracting these modules has not been successful, largely due to the presence of noisy false positive interactions as well as specific topological challenges in the network. RESULTS: In this article, we propose an ensemble clustering framework to address this problem. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

9.
10.
11.
MOTIVATION: Large scale gene expression data are often analysed by clustering genes based on gene expression data alone, though a priori knowledge in the form of biological networks is available. The use of this additional information promises to improve exploratory analysis considerably. RESULTS: We propose constructing a distance function which combines information from expression data and biological networks. Based on this function, we compute a joint clustering of genes and vertices of the network. This general approach is elaborated for metabolic networks. We define a graph distance function on such networks and combine it with a correlation-based distance function for gene expression measurements. A hierarchical clustering and an associated statistical measure is computed to arrive at a reasonable number of clusters. Our method is validated using expression data of the yeast diauxic shift. The resulting clusters are easily interpretable in terms of the biochemical network and the gene expression data and suggest that our method is able to automatically identify processes that are relevant under the measured conditions.  相似文献   

12.
《Fungal Biology Reviews》2018,32(4):249-264
Fungal model species have contributed to many aspects of modern biology, from biochemistry and cell biology to molecular genetics. Nevertheless, only a few genes associated with morphological development in fungi have been functionally characterized in terms of their genetic or molecular interactions. Evolutionary developmental biology in fungi faces challenges from a lack of fossil records and unresolved species phylogeny, to homoplasy associated with simple morphology. Traditionally, reductive approaches use genetic screens to reveal phenotypes from a large number of mutants; the efficiency of these approaches relies on profound prior knowledge of the genetics and biology of the designated development trait—knowledge which is often not available for even well-studied fungal model species. Reductive approaches become less efficient for the study of developmental traits that are regulated quantitatively by more than one gene via networks. Recent advances in genome-wide analysis performed in representative multicellular fungal models and non-models have greatly improved upon the traditional reductive approaches in fungal evo-devo research by providing clues for focused knockout strategies. In particular, genome-wide gene expression data across developmental processes of interest in multiple species can expedite the advancement of integrative synthetic and systems biology strategies to reveal regulatory networks underlying fungal development.  相似文献   

13.
Network responses to DNA damaging agents   总被引:4,自引:0,他引:4  
Begley TJ  Samson LD 《DNA Repair》2004,3(8-9):1123-1132
  相似文献   

14.
Protein--protein interaction maps: a lead towards cellular functions   总被引:6,自引:0,他引:6  
The availability of complete genome sequences now permits the development of tools for functional biology on a proteomic scale. Several experimental approaches or in silico algorithms aim at clustering proteins into networks with biological significance. Among those, the yeast two-hybrid system is the technology of choice to detect protein-protein interactions. Recently, optimized versions were applied at a genomic scale, leading to databases on the web. However, as with any other 'genetic' assay, yeast two-hybrid assays are prone to false positives and false negatives. Here we discuss these various technologies, their general limitations and the potential advances they make possible, especially when in combination with other functional genomics or bioinformatics analyses.  相似文献   

15.
Variation among individuals is a prerequisite of evolution by natural selection. As such, identifying the origins of variation is a fundamental goal of biology. We investigated the link between gene interactions and variation in gene expression among individuals and species using the mammalian limb as a model system. We first built interaction networks for key genes regulating early (outgrowth; E9.5–11) and late (expansion and elongation; E11-13) limb development in mouse. This resulted in an Early (ESN) and Late (LSN) Stage Network. Computational perturbations of these networks suggest that the ESN is more robust. We then quantified levels of the same key genes among mouse individuals and found that they vary less at earlier limb stages and that variation in gene expression is heritable. Finally, we quantified variation in gene expression levels among four mammals with divergent limbs (bat, opossum, mouse and pig) and found that levels vary less among species at earlier limb stages. We also found that variation in gene expression levels among individuals and species are correlated for earlier and later limb development. In conclusion, results are consistent with the robustness of the ESN buffering among-individual variation in gene expression levels early in mammalian limb development, and constraining the evolution of early limb development among mammalian species.  相似文献   

16.
MOTIVATION: Determining orthology relations among genes across multiple genomes is an important problem in the post-genomic era. Identifying orthologous genes can not only help predict functional annotations for newly sequenced or poorly characterized genomes, but can also help predict new protein-protein interactions. Unfortunately, determining orthology relation through computational methods is not straightforward due to the presence of paralogs. Traditional approaches have relied on pairwise sequence comparisons to construct graphs, which were then partitioned into putative clusters of orthologous groups. These methods do not attempt to preserve the non-transitivity and hierarchic nature of the orthology relation. RESULTS: We propose a new method, COCO-CL, for hierarchical clustering of homology relations and identification of orthologous groups of genes. Unlike previous approaches, which are based on pairwise sequence comparisons, our method explores the correlation of evolutionary histories of individual genes in a more global context. COCO-CL can be used as a semi-independent method to delineate the orthology/paralogy relation for a refined set of homologous proteins obtained using a less-conservative clustering approach, or as a refiner that removes putative out-paralogs from clusters computed using a more inclusive approach. We analyze our clustering results manually, with support from literature and functional annotations. Since our orthology determination procedure does not employ a species tree to infer duplication events, it can be used in situations when the species tree is unknown or uncertain. CONTACT: jothi@mail.nih.gov, przytyck@mail.nih.gov SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.  相似文献   

17.
MOTIVATION: In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, which is often referred to as biclustering, allows to identify sets of genes sharing compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and datasets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare with each other with respect to the biological relevance of the clusters as well as with other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of the biclustering method are currently available. RESULTS: First, this paper provides a methodology for comparing and validating biclustering methods that includes a simple binary reference model. Although this model captures the essential features of most biclustering approaches, it is still simple enough to exactly determine all optimal groupings; to this end, we propose a fast divide-and-conquer algorithm (Bimax). Second, we evaluate the performance of five salient biclustering algorithms together with the reference model and a hierarchical clustering method on various synthetic and real datasets for Saccharomyces cerevisiae and Arabidopsis thaliana. The comparison reveals that (1) biclustering in general has advantages over a conventional hierarchical clustering approach, (2) there are considerable performance differences between the tested methods and (3) already the simple reference model delivers relevant patterns within all considered settings.  相似文献   

18.
MOTIVATION: Clustering has been used as a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Many clustering methods have been proposed for clustering gene-expression data, including the hierarchical clustering, k-means clustering and self-organizing map (SOM). However, the conventional methods are limited to identify different shapes of clusters because they use a fixed distance norm when calculating the distance between genes. The fixed distance norm imposes a fixed geometrical shape on the clusters regardless of the actual data distribution. Thus, different distance norms are required for handling the different shapes of clusters. RESULTS: We present the Gustafson-Kessel (GK) clustering method for microarray gene-expression data. To detect clusters of different shapes in a dataset, we use an adaptive distance norm that is calculated by a fuzzy covariance matrix (F) of each cluster in which the eigenstructure of F is used as an indicator of the shape of the cluster. Moreover, the GK method is less prone to falling into local minima than the k-means and SOM because it makes decisions through the use of membership degrees of a gene to clusters. The algorithmic procedure is accomplished by the alternating optimization technique, which iteratively improves a sequence of sets of clusters until no further improvement is possible. To test the performance of the GK method, we applied the GK method and well-known conventional methods to three recently published yeast datasets, and compared the performance of each method using the Saccharomyces Genome Database annotations. The clustering results of the GK method are more significantly relevant to the biological annotations than those of the other methods, demonstrating its effectiveness and potential for clustering gene-expression data. AVAILABILITY: The software was developed using Java language, and can be executed on the platforms that JVM (Java Virtual Machine) is running. It is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at http://dragon.kaist.ac.kr/gk.  相似文献   

19.
Global protein function prediction from protein-protein interaction networks   总被引:20,自引:0,他引:20  
Determining protein function is one of the most challenging problems of the post-genomic era. The availability of entire genome sequences and of high-throughput capabilities to determine gene coexpression patterns has shifted the research focus from the study of single proteins or small complexes to that of the entire proteome. In this context, the search for reliable methods for assigning protein function is of primary importance. There are various approaches available for deducing the function of proteins of unknown function using information derived from sequence similarity or clustering patterns of co-regulated genes, phylogenetic profiles, protein-protein interactions (refs. 5-8 and Samanta, M.P. and Liang, S., unpublished data), and protein complexes. Here we propose the assignment of proteins to functional classes on the basis of their network of physical interactions as determined by minimizing the number of protein interactions among different functional categories. Function assignment is proteome-wide and is determined by the global connectivity pattern of the protein network. The approach results in multiple functional assignments, a consequence of the existence of multiple equivalent solutions. We apply the method to analyze the yeast Saccharomyces cerevisiae protein-protein interaction network. The robustness of the approach is tested in a system containing a high percentage of unclassified proteins and also in cases of deletion and insertion of specific protein interactions.  相似文献   

20.
The molecular complexity of mammalian proteomes demands new methods for mapping the organization of multiprotein complexes. Here, we combine mouse genetics and proteomics to characterize synapse protein complexes and interaction networks. New tandem affinity purification (TAP) tags were fused to the carboxyl terminus of PSD‐95 using gene targeting in mice. Homozygous mice showed no detectable abnormalities in PSD‐95 expression, subcellular localization or synaptic electrophysiological function. Analysis of multiprotein complexes purified under native conditions by mass spectrometry defined known and new interactors: 118 proteins comprising crucial functional components of synapses, including glutamate receptors, K+ channels, scaffolding and signaling proteins, were recovered. Network clustering of protein interactions generated five connected clusters, with two clusters containing all the major ionotropic glutamate receptors and one cluster with voltage‐dependent K+ channels. Annotation of clusters with human disease associations revealed that multiple disorders map to the network, with a significant correlation of schizophrenia within the glutamate receptor clusters. This targeted TAP tagging strategy is generally applicable to mammalian proteomics and systems biology approaches to disease.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号