Similar articles
1.
2.
Co-evolution and co-adaptation in protein networks   Total citations: 2 (self-citations: 0, citations by others: 2)
Juan D  Pazos F  Valencia A 《FEBS letters》2008,582(8):1225-1230
Interacting or functionally related proteins have been repeatedly shown to have similar phylogenetic trees. Two main hypotheses have been proposed to explain this fact. One involves compensatory changes between the two protein families (co-adaptation). The other states that the tree similarity may be an indirect consequence of the involvement of the two proteins in similar cellular processes, which in turn would be reflected by similar evolutionary pressure on the corresponding sequences. There are published data supporting both propositions, and currently the available information is compatible with both hypotheses being true, in a scenario in which both sets of forces are shaping the tree similarity at different levels.

3.
MOTIVATION: Uncovering the protein-protein interaction network is a fundamental step in the quest to understand the molecular machinery of a cell. This motivates the search for efficient computational methods for predicting such interactions. Among the available predictors are those that are based on the co-evolution hypothesis: "evolutionary trees of protein families (that are known to interact) are expected to have similar topologies". Many of these methods are limited by the fact that they can handle only a small number of protein sequences. Also, details on evolutionary tree topology are missing, as they use similarity matrices in lieu of the trees. RESULTS: We introduce MORPH, a new algorithm for predicting protein interaction partners between members of two protein families that are known to interact. Our approach can also be seen as a new method for searching for the best superposition of the corresponding evolutionary trees based on the tree automorphism group. We discuss relevant facts related to the predictability of protein-protein interactions based on their co-evolution. When compared with related computational approaches, our method reduces the search space by approximately 3 × 10^5-fold and at the same time increases the accuracy of predicting correct binding partners.
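The similarity-matrix approach that this abstract contrasts itself with can be illustrated with a small sketch. This is not MORPH; it is a minimal stand-in that assumes each family is summarized by a pairwise distance matrix over the same ordered set of species and scores co-evolution as the Pearson correlation of the two matrices. The function names and the toy matrices are invented for illustration.

```python
from math import sqrt


def upper_triangle(matrix):
    """Return the entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def tree_similarity(dist_a, dist_b):
    """Correlate the pairwise distances of two families (same species order)."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Toy distance matrices (hypothetical values, four species each).
family_a = [[0.0, 0.2, 0.6, 0.7],
            [0.2, 0.0, 0.5, 0.6],
            [0.6, 0.5, 0.0, 0.3],
            [0.7, 0.6, 0.3, 0.0]]
# family_b has the same shape of tree, evolving twice as fast.
family_b = [[0.0, 0.4, 1.2, 1.4],
            [0.4, 0.0, 1.0, 1.2],
            [1.2, 1.0, 0.0, 0.6],
            [1.4, 1.2, 0.6, 0.0]]
# family_c groups the species in the opposite way.
family_c = [[0.0, 0.7, 0.3, 0.2],
            [0.7, 0.0, 0.4, 0.3],
            [0.3, 0.4, 0.0, 0.6],
            [0.2, 0.3, 0.6, 0.0]]

print(tree_similarity(family_a, family_b))  # close to 1: similar trees
print(tree_similarity(family_a, family_c))  # negative: dissimilar trees
```

Because only the matrix entries are compared, the tree topology itself never enters the score, which is the limitation the abstract points out.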

4.
With the huge increase of protein data, an important problem is to estimate, within a large protein family, the number of sensible subsets for subsequent in-depth structural, functional, and evolutionary analyses. To tackle this problem, we developed a new program, Secator, which implements the principle of an ascending hierarchical method using a distance matrix based on a multiple alignment of protein sequences. Dissimilarity values assigned to the nodes of a deduced phylogenetic tree are partitioned by a new stopping rule introduced to automatically determine the significant dissimilarity values. The quality of the clusters obtained by Secator is verified by a separate Jackknife study. The method is demonstrated on 24 large protein families covering a wide spectrum of structural and sequence conservation, and its usefulness and accuracy with real biological data are illustrated on two well-studied protein families (the Sm proteins and the nuclear receptors).

5.
Families and the structural relatedness among globular proteins.   Total citations: 4 (self-citations: 3, citations by others: 1)
Protein structures come in families. Are families “closely knit” or “loosely knit” entities? We describe a measure of relatedness among polymer conformations. Based on weighted distance maps, this measure differs from existing measures mainly in two respects: (1) it is computationally fast, and (2) it can compare any two proteins, regardless of their relative chain lengths or degree of similarity. It does not require finding relative alignments. The measure is used here to determine the dissimilarities between all 12,403 possible pairs of 158 diverse protein structures from the Brookhaven Protein Data Bank (PDB). Combined with minimal spanning trees and hierarchical clustering methods, this measure is used to define structural families. It is also useful for rapidly searching a dataset of protein structures for specific substructural motifs. By using an analogy to distributions of Euclidean distances, we find that protein families are not tightly knit entities.

6.

Background

It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually-validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the necessary manual labor. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity.

Results

In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) get assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust.

Conclusions

We present an automatically generated unbiased method that provides a hierarchical classification of all currently known proteins.

7.

Background

In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. HC algorithms do not actually create clusters, but compute a hierarchical representation of the data set. Usually, a fixed height on the HC tree is used, and each contiguous branch of samples below that height is considered a separate cluster. Due to the fixed-height cutting, those clusters may not unravel significant functional coherence hidden deeper in the tree. Besides that, most existing approaches do not make use of available clinical information to guide cluster extraction from the HC. Thus, the identified subgroups may be difficult to interpret in relation to that information.

Results

We develop a novel framework for decomposing the HC tree into clusters by semi-supervised piecewise snipping. The framework, called guided piecewise snipping, utilizes both molecular data and clinical information to decompose the HC tree into clusters. It cuts the given HC tree at variable heights to find a partition (a set of non-overlapping clusters) which not only represents a structure deemed to underlie the data from which the HC tree is derived, but is also maximally consistent with the supplied clinical data. Moreover, the approach does not require the user to specify the number of clusters prior to the analysis. Extensive results on simulated and multiple medical data sets show that our approach consistently produces more meaningful clusters than the standard fixed-height cut and/or non-guided approaches.

Conclusions

The guided piecewise snipping approach features several novelties and advantages over existing approaches. The proposed algorithm is generic, and can be combined with other algorithms that operate on detected clusters. This approach represents an advancement in several regards: (1) a piecewise tree snipping framework that efficiently extracts clusters by snipping the HC tree possibly at variable heights while preserving the HC tree structure; (2) a flexible implementation allowing a variety of data types for both building and snipping the HC tree, including patient follow-up data like survival as auxiliary information. The data sets and R code are provided as supplementary files. The proposed method is available from Bioconductor as the R-package HCsnip.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0448-1) contains supplementary material, which is available to authorized users.
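The fixed-height cut that this abstract uses as its baseline can be sketched in a few lines. The sketch below is not the guided piecewise snipping method and does not use the HCsnip package; it assumes a dendrogram encoded as nested tuples of the form (merge height, left subtree, right subtree), with sample names as leaves. The encoding, function names, and toy tree are all hypothetical.

```python
def leaves(node):
    """List the sample names under a dendrogram node, left to right."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cut_at_height(node, height):
    """Fixed-height cut: each subtree merged at or below `height` is one cluster."""
    if isinstance(node, str) or node[0] <= height:
        return [leaves(node)]
    _, left, right = node
    return cut_at_height(left, height) + cut_at_height(right, height)


# Toy dendrogram over five samples (hypothetical merge heights).
toy_tree = (4.0,
            (1.0, "s1", "s2"),
            (3.0, (0.5, "s3", "s4"), "s5"))

print(cut_at_height(toy_tree, 2.0))  # [['s1', 's2'], ['s3', 's4'], ['s5']]
print(cut_at_height(toy_tree, 3.5))  # [['s1', 's2'], ['s3', 's4', 's5']]
```

A single threshold applies to every branch here, so a tight subgroup such as s3 and s4 can only be separated from s5 by lowering the cut for the whole tree. A variable-height approach would instead choose the cut per branch.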

8.

Background

Much progress has been made in understanding the 3D structure of proteins using methods such as NMR and X-ray crystallography. The resulting 3D structures are extremely informative, but do not always reveal which sites and residues within the structure are of special importance. Recently, there have been indications that multiple-residue, sub-domain structural relationships within the larger 3D consensus structure of a protein can be inferred from the analysis of the multiple sequence alignment data of a protein family. These intra-dependent clusters of associated sites are used to indicate hierarchical inter-residue relationships within the 3D structure. To reveal the patterns of associations among individual amino acids or sub-domain components within the structure, we apply a k-modes attribute (aligned site) clustering algorithm to the ubiquitin and transthyretin families in order to discover associations among groups of sites within the multiple sequence alignment. We then observe what these associations imply within the 3D structure of these two protein families.

Results

The k-modes site clustering algorithm we developed maximizes the intra-group interdependencies based on a normalized mutual information measure. The clusters formed correspond to sub-structural components or binding and interface locations. Applying this data-directed method to the ubiquitin and transthyretin protein family multiple sequence alignments as a test bed, we located numerous interesting associations of interdependent sites. These clusters were then arranged into cluster tree diagrams which revealed four structural sub-domains within the single domain structure of ubiquitin and a single large sub-domain within transthyretin associated with the interface among transthyretin monomers. In addition, several clusters of mutually interdependent sites were discovered for each protein family, each of which appears to play an important role in the molecular structure and/or function.

Conclusions

Our results demonstrate that the method we present here using a k-modes site clustering algorithm based on interdependency evaluation among sites obtained from a sequence alignment of homologous proteins can provide significant insights into the complex, hierarchical inter-residue structural relationships within the 3D structure of a protein family.
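The interdependency measure named in this abstract, a normalized mutual information between aligned sites, can be sketched as follows. The abstract does not state which normalization the authors use; this sketch assumes mutual information divided by the joint entropy, and the function names and toy alignment columns are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    total = len(symbols)
    return -sum((c / total) * log2(c / total) for c in Counter(symbols).values())


def normalized_mutual_information(col_x, col_y):
    """I(X;Y) / H(X,Y) for two alignment columns of equal length.

    Assumption: normalization by joint entropy; other conventions exist.
    """
    h_x = entropy(col_x)
    h_y = entropy(col_y)
    h_xy = entropy(list(zip(col_x, col_y)))
    if h_xy == 0:
        return 0.0  # both columns are fully conserved, nothing to measure
    return (h_x + h_y - h_xy) / h_xy


# Toy columns: one residue per sequence in a four-sequence alignment.
site_1 = "AAGG"
site_2 = "LLVV"  # changes whenever site_1 changes
site_3 = "LVLV"  # varies independently of site_1

print(normalized_mutual_information(site_1, site_2))  # 1.0
print(normalized_mutual_information(site_1, site_3))  # 0.0
```

A site clustering algorithm in the spirit of the abstract would group columns so that scores like these are high within each group.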

9.
10.
A floristic analysis of the lowland dipterocarp forests of Borneo   Total citations: 4 (self-citations: 0, citations by others: 4)
Aim: To (1) identify floristic regions in the lowland (below 500 m a.s.l.) tropical dipterocarp rain forest of Borneo based on tree genera, (2) determine the characteristic taxa of these regions, (3) study tree diversity patterns within Borneo, and (4) relate the floristic and diversity patterns to abiotic factors such as mean annual rainfall and geographical distance between plots. Location: Lowland tropical dipterocarp rain forest of Borneo. Methods: We used tree (diameter at breast height ≥ 9.8 cm) inventory data from 28 lowland dipterocarp rain forest locations throughout Borneo. From each location six samples of 640 individuals were drawn randomly. With these data we calculated a Sørensen and a Steinhaus similarity matrix for the locations. These matrices were then used in a UPGMA clustering algorithm to determine the floristic relations between the locations (dendrogram). Principal coordinate analysis was used to ordinate the locations. Characteristic taxa for the identified floristic clusters were determined with the use of the INDVAL method of Dufrene & Legendre (1997). Finally, Mantel analysis was applied to determine the influence of mean annual rainfall and geographical distance between plots on floristic composition. Results: A total of 77 families and 363 genera were included in the analysis. On average a random sample of 640 trees from a lowland dipterocarp forest in Borneo contains 41.6 ± 3.8 families and 103.0 ± 12.7 genera. Diversity varied strongly on local scales. On a regional scale, diversity was found to be highest in south-east Borneo and central Sarawak. The most common families were Dipterocarpaceae (21.9% of trees) and Euphorbiaceae (12.2% of trees). The most common genera were Shorea (12.3% of trees) and Syzygium (5.0% of trees). The 28 locations were clustered in geographically distinct floristic regions. This was related to the fact that floristic similarity depended strongly on the geographical distance between plots and similarity in mean annual rainfall. Conclusions: We identified five main floristic regions within the lowland dipterocarp rain forests of Borneo, each of which had its own set of characteristic genera. Mean annual rainfall is an important factor in explaining differences in floristic composition between locations. The influence of geographical distance on floristic similarity between locations is probably related to the fact that abiotic factors change with distance between plots. Borneo's central mountain range generally forms an effective dispersal barrier for the lowland tree flora. Diversity patterns in Borneo are influenced by the mid-domain effect, habitat size and the influence of past climatic changes (ice ages during the Pleistocene).
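The two similarity indices named in the Methods can be written out in a few lines. This is a sketch under the usual textbook definitions, not the authors' code: Sørensen is computed on presence or absence of genera, and Steinhaus on abundances. The genus counts below are invented toy numbers, not data from the study.

```python
def sorensen(counts_a, counts_b):
    """Sørensen similarity on presence/absence: 2 * shared / (n_a + n_b)."""
    genera_a, genera_b = set(counts_a), set(counts_b)
    return 2 * len(genera_a & genera_b) / (len(genera_a) + len(genera_b))


def steinhaus(counts_a, counts_b):
    """Steinhaus similarity on abundances: 2 * sum of minima / total individuals."""
    all_genera = set(counts_a) | set(counts_b)
    shared = sum(min(counts_a.get(g, 0), counts_b.get(g, 0)) for g in all_genera)
    return 2 * shared / (sum(counts_a.values()) + sum(counts_b.values()))


# Hypothetical tree counts per genus at two locations (120 trees each).
location_1 = {"Shorea": 80, "Syzygium": 30, "Dipterocarpus": 10}
location_2 = {"Shorea": 60, "Syzygium": 40, "Hopea": 20}

print(sorensen(location_1, location_2))   # 2 shared genera out of 3 + 3
print(steinhaus(location_1, location_2))  # 0.75
```

Computing either index for every pair of locations gives the similarity matrix that a UPGMA clustering would take as input.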

11.
Assessing reliability of gene clusters from gene expression data   Total citations: 5 (self-citations: 0, citations by others: 5)
The rapid development of microarray technologies has raised many challenging problems in experiment design and data analysis. Although many numerical algorithms have been successfully applied to analyze gene expression data, the effects of variations and uncertainties in measured gene expression levels across samples and experiments have been largely ignored in the literature. In this article, in the context of hierarchical clustering algorithms, we introduce a statistical resampling method to assess the reliability of gene clusters identified from any hierarchical clustering method. Using the clustering trees constructed from the resampled data, we can evaluate the confidence value for each node in the observed clustering tree. A majority-rule consensus tree can be obtained, showing clusters that only occur in a majority of the resampled trees. We illustrate our proposed methods with applications to two published data sets. Although the methods are discussed in the context of hierarchical clustering methods, they can be applied with other cluster-identification methods for gene expression data to assess the reliability of any gene cluster of interest.
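The confidence value per node and the majority-rule filter described here can be sketched once the resampled trees are available. The sketch below skips the resampling and clustering steps and assumes each tree has already been reduced to its set of clusters (gene sets). The function names and the toy clusterings are hypothetical, not taken from the article.

```python
def cluster_support(observed_clusters, resampled_clusterings):
    """Fraction of resampled trees in which each observed cluster reappears."""
    n_trees = len(resampled_clusterings)
    return {
        cluster: sum(cluster in clustering for clustering in resampled_clusterings) / n_trees
        for cluster in observed_clusters
    }


def majority_rule(support, threshold=0.5):
    """Keep clusters that occur in more than `threshold` of the resampled trees."""
    return [cluster for cluster, value in support.items() if value > threshold]


# Clusters of the observed tree, as frozensets of gene names (toy data).
observed = [
    frozenset({"g1", "g2"}),
    frozenset({"g3", "g4"}),
    frozenset({"g1", "g2", "g3"}),
]
# Clusters found in four trees built from resampled data (toy data).
resampled = [
    {frozenset({"g1", "g2"}), frozenset({"g3", "g4"})},
    {frozenset({"g1", "g2"}), frozenset({"g1", "g2", "g3"})},
    {frozenset({"g1", "g2"}), frozenset({"g3", "g4"})},
    {frozenset({"g1", "g3"}), frozenset({"g2", "g4"})},
]

support = cluster_support(observed, resampled)
for cluster, value in support.items():
    print(sorted(cluster), value)
print([sorted(c) for c in majority_rule(support)])  # [['g1', 'g2']]
```

With a strict majority threshold, the cluster seen in exactly half of the resampled trees is dropped. Whether ties count is a design choice that the abstract does not specify.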

12.
Lipocalins constitute a superfamily of extracellular proteins that are found in all three kingdoms of life. Although very divergent in their sequences and functions, they show remarkable similarity in 3-D structures. Lipocalins bind and transport small hydrophobic molecules. Earlier sequence-based phylogenetic studies of lipocalins highlighted that they have a long evolutionary history. However, the molecular and structural basis of their functional diversity is not completely understood. The main objective of the present study is to understand the functional diversity of the lipocalins using a structure-based phylogenetic approach. The present study with 39 protein domains from the lipocalin superfamily suggests that the clusters of lipocalins obtained by structure-based phylogeny correspond well with the functional diversity. The detailed analysis of each of the clusters and sub-clusters reveals that the 39 lipocalin domains cluster based on their mode of ligand binding, though the clustering was performed on the basis of gross domain structure. The outliers in the phylogenetic tree are often from single-member families. The structure-based phylogenetic approach has also provided pointers for assigning putative functions to the domains of unknown function in the lipocalin family. The approach employed in the present study can be used in the future for the functional identification of new lipocalin proteins and may be extended to other protein families where members show poor sequence similarity but high structural similarity.

13.
Estimating the reliability of evolutionary trees   Total citations: 9 (self-citations: 1, citations by others: 8)
Six protein sequences from the same 11 mammalian taxa were used to estimate the accuracy and reliability of phylogenetic trees using real, rather than simulated, data. A tree comparison metric was used to measure the increase in similarity of minimal trees as larger, randomly selected subsets of nucleotide positions were taken. The ratio of the observed to the expected number of incompatibilities for each nucleotide position (character) is a good predictor of the number of changes required at that position on the minimal (most-parsimonious) tree. This allows a higher weighting of nucleotide positions that have changed more slowly and should result in the minimal length tree converging to the correct tree as more sequences are obtained. An estimate was made of the smallest subset of trees that need to be considered to include the actual historical tree for a given set of data. It was concluded that it is possible to give a reasonable estimate of the reliability of the final tree, at least when several sequences are combined. With the present data, resolving the rodent-primate-lagomorph (rabbit) trichotomy is the least certain aspect of the final tree, followed then by establishing the position of dog. In our opinion, it is unreasonable to publish an evolutionary tree derived from sequence data without giving an idea of the reliability of the tree.

14.
Response regulators of bacterial sensory transduction systems generally consist of receiver module domains covalently linked to effector domains. The effector domains include DNA binding and/or catalytic units that are regulated by sensor kinase-catalyzed aspartyl phosphorylation within their receiver modules. Most receiver modules are associated with three distinct families of DNA binding domains, but some are associated with other types of DNA binding domains, with methylated chemotaxis protein (MCP) demethylases, or with sensor kinases. A few exist as independent entities which regulate their target systems by noncovalent interactions. In this study the molecular phylogenies of the receiver modules and effector domains of 49 fully sequenced response regulators and their homologues were determined. The three major, evolutionarily distinct, DNA binding domains found in response regulators were evaluated for their phylogenetic relatedness, and the phylogenetic trees obtained for these domains were compared with those for the receiver modules. Members of one family (family 1) of DNA binding domains are linked to large ATPase domains which usually function cooperatively in the activation of E. coli σ54-dependent promoters or their equivalents in other bacteria. Members of a second family (family 2) always function in conjunction with the E. coli σ70 or its equivalent in other bacteria. A third family of DNA binding domains (family 3) functions by an uncharacterized mechanism involving more than one σ factor. These three domain families utilize distinct helix-turn-helix motifs for DNA binding. The phylogenetic tree of the receiver modules revealed three major and several minor clusters of these domains. The three major receiver module clusters (clusters 1, 2, and 3) generally function with the three major families of DNA binding domains (families 1, 2, and 3, respectively) to comprise three classes of response regulators (classes 1, 2, and 3), although several exceptions exist. The minor clusters of receiver modules were usually, but not always, associated with other types of effector domains. Finally, several receiver modules did not fit into a cluster. It was concluded that receiver modules usually diverged from common ancestral protein domains together with the corresponding effector domains, although domain shuffling, due to intragenic splicing and fusion, must have occurred during the evolution of some of these proteins. Multiple sequence alignments of the 49 receiver modules and their various types of effector domains, together with other homologous domains, allowed definition of regions of striking sequence similarity and degrees of conservation of specific residues. Sequence data were correlated with structure/function when such information was available. These studies should provide guides for extrapolation of results obtained with one response regulator to others as well as for the design of future structure/function analyses. Correspondence to: M.H. Saier, Jr.

15.
We have characterized the relationship between accurate phylogenetic reconstruction and sequence similarity, testing whether high levels of sequence similarity can consistently produce accurate evolutionary trees. We generated protein families with known phylogenies using a modified version of the PAML/EVOLVER program that produces insertions and deletions as well as substitutions. Protein families were evolved over a range of 100-400 point accepted mutations; at these distances 63% of the families shared significant sequence similarity. Protein families were evolved using balanced and unbalanced trees, with ancient or recent radiations. In families sharing statistically significant similarity, about 60% of multiple sequence alignments were 95% identical to true alignments. To compare recovered topologies with true topologies, we used a score that reflects the fraction of clades that were correctly clustered. As expected, the accuracy of the phylogenies was greatest in the least divergent families. About 88% of phylogenies clustered over 80% of clades in families that shared significant sequence similarity, using Bayesian, parsimony, distance, and maximum likelihood methods. However, for protein families with short ancient branches (ancient radiation), only 30% of the most divergent (but statistically significant) families produced accurate phylogenies, and only about 70% of the second most highly conserved families, with median expectation values better than 10^-60, produced accurate trees. These values represent upper bounds on expected tree accuracy for sequences with a simple divergence history; proteins from 700 Giardia families, with a similar range of sequence similarities but considerably more gaps, produced much less accurate trees. For our simulated insertions and deletions, correct multiple sequence alignments did not perform much better than those produced by T-COFFEE, and including sequences with expressed sequence tag-like sequencing errors did not significantly decrease phylogenetic accuracy. In general, although less-divergent sequence families produce more accurate trees, the likelihood of estimating an accurate tree is most dependent on whether radiation in the family was ancient or recent. Accuracy can be improved by combining genes from the same organism when creating species trees or by selecting protein families with the best bootstrap values in comprehensive studies.
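A score of the kind described here, the fraction of true clades that a recovered tree reproduces, can be sketched as follows. The abstract does not give the exact definition used in the study. This version assumes rooted trees written as nested tuples of leaf names and ignores the trivial root clade; the names and toy trees are illustrative only.

```python
def clades(tree):
    """Return the leaf set of every internal node except the root."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    root = walk(tree)
    found.discard(root)  # the root clade is shared by all trees on these leaves
    return found


def fraction_recovered(true_tree, inferred_tree):
    """Share of the true tree's clades that also appear in the inferred tree."""
    true_clades = clades(true_tree)
    return len(true_clades & clades(inferred_tree)) / len(true_clades)


# Toy rooted trees over five taxa.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))

# Clades {A,B,C} and {D,E} are recovered; {A,B} is not.
print(fraction_recovered(true_tree, inferred_tree))
```

Under this definition a tree that "clustered over 80% of clades" would have a score above 0.8.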

16.
Permutations on strings representing gene clusters on genomes have been studied earlier by Uno and Yagiura (2000), Heber and Stoye (2001), Bergeron et al. (2002), Eres et al. (2003), and Schmidt and Stoye (2004), and the idea of a maximal permutation pattern was introduced by Eres et al. (2003). In this paper, we present a new tool for representation and detection of gene clusters in multiple genomes, using PQ trees (Booth and Lueker, 1976): this describes the inner structure and the relations between clusters succinctly, aids in filtering meaningful from apparently meaningless clusters, and also gives a natural and meaningful way of visualizing complex clusters. We identify a minimal consensus PQ tree and prove that it is equivalent to a maximal π pattern (Eres et al., 2003) and each subgraph of the PQ tree corresponds to a nonmaximal permutation pattern. We present a general scheme to handle multiplicity in permutations and also give a linear time algorithm to construct the minimal consensus PQ tree. Further, we demonstrate the results on whole genome datasets. In our analysis of the whole genomes of human and rat, we found about 1.5 million common gene clusters but only about 500 minimal consensus PQ trees; with E. coli K-12 and B. subtilis genomes, we found only about 450 minimal consensus PQ trees out of about 15,000 gene clusters; and when comparing eight different chloroplast genomes, we found only 77 minimal consensus PQ trees out of about 6,700 gene clusters. Further, we show specific instances of functionally related genes in two of the cases.

17.
In order to simplify and meaningfully categorize large sets of protein sequence data, it is commonplace to cluster proteins based on the similarity of those sequences. However, it quickly becomes clear that the sequence flexibility allowed a given protein varies significantly among different protein families. The degree to which sequences are conserved not only differs for each protein family, but also is affected by the phylogenetic divergence of the source organisms. Clustering techniques that use similarity thresholds for protein families do not always allow for these variations and thus cannot be confidently used for applications such as automated annotation and phylogenetic profiling. In this work, we applied a spectral bipartitioning technique to all proteins from 53 archaeal genomes. Comparisons between different taxonomic levels allowed us to study the effects of phylogenetic distances on cluster structure. Likewise, by associating functional annotations and phenotypic metadata with each protein, we could compare our protein similarity clusters with both protein function and associated phenotype. Our clusters can be analyzed graphically and interactively online.

18.

Background

Most studies inferring species phylogenies use sequences from single copy genes or sets of orthologs culled from gene families. For taxa such as plants, with very high levels of gene duplication in their nuclear genomes, this has limited the exploitation of nuclear sequences for phylogenetic studies, such as those available in large EST libraries. One rarely used method of inference, gene tree parsimony, can infer species trees from gene families undergoing duplication and loss, but its performance has not been evaluated at a phylogenomic scale for EST data in plants.

Results

A gene tree parsimony analysis based on EST data was undertaken for six angiosperm model species and Pinus, an outgroup. Although a large fraction of the tentative consensus sequences obtained from the TIGR database of ESTs was assembled into homologous clusters too small to be phylogenetically informative, some 557 clusters contained promising levels of information. Based on maximum likelihood estimates of the gene trees obtained from these clusters, gene tree parsimony correctly inferred the accepted species tree with strong statistical support. A slight variant of this species tree was obtained when maximum parsimony was used to infer the individual gene trees instead.

Conclusion

Despite the complexity of the EST data and the relatively small fraction eventually used in inferring a species tree, the gene tree parsimony method performed well in the face of very high apparent rates of duplication.

19.
The "A Disintegrin And Metalloproteinase" (ADAM) protein family and the "A Disintegrin-like And Metalloproteinase with ThromboSpondin motifs" (ADAMTS) protein family are two related families of human proteins. The similarities and differences between these two families have been investigated using phylogenetic trees and homology modeling. The phylogenetic analysis indicates that the two families are well differentiated, even when only the common metalloprotease domain is taken into account. Within the ADAM family, several proteins are lacking the binding motif for the catalytic zinc in the active site and thus presumably lack any catalytic activity. These proteins tend to cluster within the ADAM phylogenetic tree and are expressed in specific tissues, suggesting a functional differentiation. The present analysis allows us to propose the following: (i) ADAMTS proteins have a conserved role in the human organism as proteases, with some differentiation in terms of substrate specificity; (ii) ADAM proteins can act as proteases and/or mediators of intermolecular interactions; (iii) proteolytically active ADAMs tend to be more ubiquitously expressed than the inactive ones.

20.
Liu J  Hegyi H  Acton TB  Montelione GT  Rost B 《Proteins》2004,56(2):188-200
A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; the National Institutes of Health funded nine of these in the United States. Initiatives differ in the particular subset of "all families" on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: 1) identify all protein domain families from currently five entirely sequenced eukaryotic target organisms based on their sequence homology, 2) discard those families that can be modeled on the basis of structural information already present in the PDB, and 3) target representatives of the remaining families for structure determination. To guarantee that all members of one family share a common fold-like region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, utilizing homology to PrISM, Pfam-A, and SWISS-PROT, chopped the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. Of these fragments, 122,999 appeared to be suitable targets that were grouped into >27,000 singletons and >18,000 multifragment clusters. Thus, our results suggested that it might be necessary to determine >40,000 structures to minimally cover the subset of five eukaryotic proteomes.
