Similar articles
1.
MOTIVATION: Structural genomics projects aim to solve a large number of protein structures with the ultimate objective of representing the entire protein space. The computational challenge is to identify and prioritize a small set of proteins that are likely to have new, currently unknown superfamilies or folds. RESULTS: We develop a method that assigns each protein a likelihood of belonging to a new, as yet undetermined structural superfamily. The method relies on a variant of ProtoNet, an automatic hierarchical classification scheme of all protein sequences from SwissProt. Our results show that proteins that are remote from solved structures in the ProtoNet hierarchy are more likely to belong to new superfamilies. The results are validated against SCOP releases from recent years that account for about half of the solved structures known to date. We show that our new method and the representation of ProtoNet are superior in detecting new targets, compared to our previous method using ProtoMap classification. Furthermore, our method outperforms PSI-BLAST search in detecting potential new superfamilies.
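
The idea that a protein is "remote from solved structures" in a hierarchy can be illustrated with a small sketch. This is not the authors' likelihood model: it assumes a toy tree given as a child-to-parent map, and it scores each protein by the height of the lowest cluster that contains both that protein and at least one solved structure. The function name `remoteness` and the tree layout are hypothetical.

```python
from collections import defaultdict

def remoteness(parent, solved):
    """Score each leaf by the height of its lowest ancestor cluster that
    contains a solved protein (0 = the protein itself is solved).

    parent: dict mapping every non-root node to its parent cluster.
    solved: set of leaf names with a known structure.
    Illustrative assumption only; not the scoring used in the abstract above.
    """
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)

    height = {}
    def node_height(node):
        # Leaves have height 0; a cluster is one above its tallest child.
        if node not in height:
            kids = children.get(node, [])
            height[node] = 1 + max(map(node_height, kids)) if kids else 0
        return height[node]

    # Mark every cluster that has a solved protein somewhere beneath it.
    covered = set()
    for leaf in solved:
        node = leaf
        while node is not None and node not in covered:
            covered.add(node)
            node = parent.get(node)

    scores = {}
    for leaf in parent:
        if children.get(leaf):  # skip internal clusters
            continue
        node = leaf
        while node is not None and node not in covered:
            node = parent.get(node)
        scores[leaf] = node_height(node) if node is not None else float("inf")
    return scores

# Toy hierarchy: P1 and P2 merge first, then P3 joins, then P4.
toy_tree = {"P1": "C1", "P2": "C1", "P3": "C2", "C1": "C2", "P4": "C3", "C2": "C3"}
print(remoteness(toy_tree, {"P1"}))
```

Under this assumption, proteins with the highest scores would be the first candidates for structure determination, since they only join a solved structure near the top of the tree.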

2.

Background

It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually-validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the necessary manual labor. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity.

Results

In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) are assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust.

Conclusions

We present an automatically generated unbiased method that provides a hierarchical classification of all currently known proteins.

3.
A C May 《Proteins》1999,37(1):20-29
Recently, several hierarchical classifications of protein three-dimensional (3D) structures have been published. However, none of them provides any assessment of the validity of a hierarchical representation or tests the individual clusters contained within. In fact, testing of published trees here reveals that they vary in meaning. Protein structure similarity measures are then assessed in terms of the robustness of the resulting trees for 24 protein families. A meaningful tree is defined as one in which all the clusters are found to be reliable according to a jackknife test. With the use of this criterion, a previously published similarity measure described as a "better RMS" is shown in fact to be usually less suited to protein fold classification than normal RMS after superposition. Here the "best" protein structure similarity measure for hierarchical classification (the one that, after clustering, produces the highest number of meaningful trees: 20 for the 24 families) is found to be a new one. This measure includes information on the relationship of a distance at a given aligned position in a pair to the rest of the unique distances at that position in a protein family. There are only 2 families of the 24 tested, the globins (3 trees) and Kazal-type serine proteinase inhibitors (21 trees), in which the topology (branching order) of the meaningful 3D structure-based trees is constant. Thus, a new view of protein family sequence-structure relationships is afforded by comparing meaningful trees for each family. More generally, there is a need for care in interpretation of the results of those molecular biology algorithms that force a tree structure on data without assessing its applicability.

4.
5.
ABSTRACT: BACKGROUND: The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. RESULTS: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related, a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results are obtained in about a day. CONCLUSIONS: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time-consuming aspects of conserved domain database curation.
At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.

6.
Grouping the 20 residues is a classic strategy to discover ordered patterns and insights about the fundamental nature of proteins, their structure, and how they fold. Usually, this categorization is based on the biophysical and/or structural properties of a residue's side-chain group. We extend this approach to understand the effects of side chains on backbone conformation and to perform a knowledge-based classification of amino acids by comparing their backbone phi, psi distributions in different types of secondary structure. At this finer, more specific resolution, torsion angle data are often sparse and discontinuous (especially for nonhelical classes) even though a comprehensive set of protein structures is used. To ensure the precision of Ramachandran plot comparisons, we applied a rigorous Bayesian density estimation method that produces continuous estimates of the backbone phi, psi distributions. Based on this statistical modeling, a robust hierarchical clustering was performed using a divergence score to measure the similarity between plots. There were seven general groups based on the clusters from the complete Ramachandran data: nonpolar/beta-branched (Ile and Val), AsX (Asn and Asp), long (Met, Gln, Arg, Glu, Lys, and Leu), aromatic (Phe, Tyr, His, and Cys), small (Ala and Ser), bulky (Thr and Trp), and, lastly, the singletons of Gly and Pro. At the level of secondary structure (helix, sheet, turn, and coil), these groups remain somewhat consistent, although there are a few significant variations. Besides the expected uniqueness of the Gly and Pro distributions, the nonpolar/beta-branched and AsX clusters were very consistent across all types of secondary structure. Effectively, this consistency across the secondary structure classes implies that side-chain steric effects strongly influence a residue's backbone torsion angle conformation. 
These results help to explain the plasticity of amino acid substitutions on protein structure and should help in protein design and structure evaluation.
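
Comparing two Ramachandran plots with a divergence score can be sketched as follows. The abstract does not say which divergence was used, so the Jensen-Shannon divergence here is an assumption, and the pseudocount-smoothed histogram is only a crude stand-in for the Bayesian density estimation the authors describe. The function names and the grid size are hypothetical.

```python
import math

def histogram(angles, bins=12, pseudocount=1.0):
    """Bin (phi, psi) pairs, given in degrees, on a bins x bins grid.

    The pseudocount keeps sparse regions non-zero; it is a simple
    placeholder for a proper density estimate.
    """
    width = 360.0 / bins
    counts = [pseudocount] * (bins * bins)
    for phi, psi in angles:
        # Angles are periodic, so +180 wraps around to the first bin.
        i = int((phi + 180.0) // width) % bins
        j = int((psi + 180.0) // width) % bins
        counts[i * bins + j] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, at most 1."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy samples: one helix-like and one sheet-like residue population.
helix_like = histogram([(-60.0, -45.0)] * 50)
sheet_like = histogram([(-120.0, 130.0)] * 50)
print(js_divergence(helix_like, sheet_like))
```

A matrix of such pairwise scores over the 20 residue types could then be passed to any hierarchical clustering routine to obtain groupings of the kind listed above.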

7.
Yona G  Linial N  Linial M 《Proteins》1999,37(3):360-378
We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins as well as many novel relations between protein families.
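
Repeating a transitive merge at varying levels of significance can be sketched with a union-find structure. This is a deliberately simplified illustration: it applies transitivity without any restriction (plain single linkage), whereas the abstract stresses that the actual algorithm restricts it. The function name, the e-value cutoffs and the toy pairs are all hypothetical.

```python
def cluster_levels(pairs, thresholds):
    """Return one partition per cutoff, from the strictest to the most permissive.

    pairs: list of (protein_a, protein_b, e_value) similarity hits.
    thresholds: e-value cutoffs; a pair is merged once its e-value <= cutoff.
    Unrestricted single linkage, shown only to illustrate the hierarchy.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _ in pairs:
        find(a)
        find(b)

    ordered = sorted(pairs, key=lambda p: p[2])  # strongest hits first
    levels, idx = [], 0
    for cutoff in sorted(thresholds):
        # Merge every pair that becomes significant at this cutoff.
        while idx < len(ordered) and ordered[idx][2] <= cutoff:
            a, b, _ = ordered[idx]
            parent[find(a)] = find(b)
            idx += 1
        groups = {}
        for x in parent:
            groups.setdefault(find(x), []).append(x)
        levels.append(sorted(sorted(g) for g in groups.values()))
    return levels

hits = [("A", "B", 1e-50), ("B", "C", 1e-10), ("D", "E", 1e-40), ("C", "D", 1e-3)]
for level in cluster_levels(hits, [1e-30, 1e-5, 1e-2]):
    print(level)
```

Each successive partition is a coarsening of the previous one, so stacking them gives a hierarchy in which the strictest level corresponds to subfamily-like groups.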

8.
On the hierarchical classification of G protein-coupled receptors   Cited by: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: G protein-coupled receptors (GPCRs) play an important role in many physiological systems by transducing an extracellular signal into an intracellular response. Over 50% of all marketed drugs are targeted towards a GPCR. There is considerable interest in developing an algorithm that could effectively predict the function of a GPCR from its primary sequence. Such an algorithm is useful not only in identifying novel GPCR sequences but also in characterizing the interrelationships between known GPCRs. RESULTS: An alignment-free approach to GPCR classification has been developed using techniques drawn from data mining and proteochemometrics. A dataset of over 8000 sequences was constructed to train the algorithm. This represents one of the largest GPCR datasets currently available. A predictive algorithm was developed based upon the simplest reasonable numerical representation of the protein's physicochemical properties. A selective top-down approach was developed, which used a hierarchical classifier to assign sequences to subdivisions within the GPCR hierarchy. The predictive performance of the algorithm was assessed against several standard data mining classifiers and further validated against Support Vector Machine-based GPCR prediction servers. The selective top-down approach achieves significantly higher accuracy than standard data mining methods in almost all cases.
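
A top-down hierarchical classifier of the general kind described above can be sketched as follows. The sketch assumes a nearest-centroid rule at every node and toy two-dimensional feature vectors; the abstract's "selective" approach, which may use different classifiers at different nodes, is not reproduced. The class labels, feature values and function names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train(tree, examples):
    """Compute one centroid per node, pooled over all leaf classes beneath it.

    tree: dict mapping a node to its list of child nodes.
    examples: dict mapping each leaf class to its training vectors.
    """
    def collect(node):
        if node in examples:
            return list(examples[node])
        vecs = []
        for child in tree.get(node, []):
            vecs.extend(collect(child))
        return vecs

    nodes = set(tree) | {c for kids in tree.values() for c in kids}
    centroids = {}
    for node in nodes:
        vecs = collect(node)
        if vecs:
            centroids[node] = centroid(vecs)
    return centroids

def classify(tree, centroids, x, root):
    """Walk down from the root, picking the closest child at each level."""
    path, node = [], root
    while tree.get(node):
        node = min(tree[node], key=lambda c: math.dist(x, centroids[c]))
        path.append(node)
    return path

# Toy hierarchy: class A splits into A1 and A2; class B is a leaf.
toy_tree = {"root": ["A", "B"], "A": ["A1", "A2"]}
toy_examples = {
    "A1": [[0.0, 0.0], [0.2, 0.0]],
    "A2": [[0.0, 1.0], [0.2, 1.0]],
    "B": [[5.0, 5.0], [5.0, 6.0]],
}
model = train(toy_tree, toy_examples)
print(classify(toy_tree, model, [0.1, 0.9], "root"))
```

The design point is that each decision only has to separate siblings, so an early mistake cannot be recovered lower down; this is the usual trade-off of top-down schemes against flat classifiers.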

9.
SUMMARY: The R package HCGene (Hierarchical Classification of Genes) implements methods to process and analyze the Gene Ontology and the FunCat taxonomy in order to support the functional classification of genes. HCGene allows the extraction of subgraphs and subtrees related to specific biological problems, the labeling of genes and gene products with multiple and hierarchical functional classes, and the association of different types of bio-molecular data to genes for learning to predict their functions. AVAILABILITY: http://homes.dsi.unimi.it/~valenti/SW/hcgene/download/hcgene_1.0.tar.gz.

10.
11.
An evolving hierarchical family classification for glycosyltransferases   Cited by: 4 (self-citations: 0, citations by others: 4)
Glycosyltransferases are a ubiquitous group of enzymes that catalyse the transfer of a sugar moiety from an activated sugar donor onto saccharide or non-saccharide acceptors. Although many glycosyltransferases catalyse chemically similar reactions, presumably through transition states with substantial oxocarbenium ion character, they display remarkable diversity in their donor, acceptor and product specificity and thereby generate a potentially infinite number of glycoconjugates, oligo- and polysaccharides. We have performed a comprehensive survey of glycosyltransferase-related sequences (over 7200 to date) and present here a classification of these enzymes akin to that proposed previously for glycoside hydrolases, into a hierarchical system of families, clans, and folds. This evolving classification rationalises structural and mechanistic investigation, harnesses information from a wide variety of related enzymes to inform cell biology and overcomes recurrent problems in the functional prediction of glycosyltransferase-related open-reading frames.

12.
We provide a decidable hierarchical classification of first-order recurrent neural networks made up of McCulloch and Pitts cells. This classification is achieved by proving an equivalence result between such neural networks and deterministic Büchi automata, and then translating the Wadge classification theory from the abstract machine to the neural network context. The obtained hierarchy of neural networks is proved to have width 2 and height ω + 1, and a decidability procedure of this hierarchy is provided. Notably, this classification is shown to be intimately related to the attractive properties of the considered networks.

13.
Carbohydrate-active enzymes handle a huge diversity of substrates in a highly selective manner while using only a limited number of available folds. They are therefore subjected to multiple divergent and convergent evolutionary events. This, together with their frequent modularity, makes their functional annotation in genomes difficult in a number of cases. In the present paper, a classification of polysaccharide lyases (the enzymes that cleave polysaccharides using an elimination instead of a hydrolytic mechanism) is presented in full for the first time. Based on the analysis of a large panel of experimentally characterized polysaccharide lyases, we examined the correlation of various enzyme properties with the three levels of the classification: fold, family and subfamily. The resulting hierarchical classification, which should help annotate relevant genes in genomic efforts, is available and constantly updated at the Carbohydrate-Active Enzymes Database (http://www.cazy.org).

14.
Visualizing large hierarchical clusters in hyperbolic space   Cited by: 9 (self-citations: 0, citations by others: 9)
SUMMARY: HyperTree is an application to visualize and navigate large trees in hyperbolic space. It includes color-coding, search mechanisms and navigational aids, as well as focus+context viewing, allowing enormous trees to fit within the fixed space of a computer screen or printed page.

15.
The prediction of transmembrane (TM) helix and topology provides important information about the structure and function of a membrane protein. Due to the experimental difficulties in obtaining a high-resolution model, computational methods are highly desirable. In this paper, we present a hierarchical classification method using support vector machines (SVMs) that integrates selected features by capturing the sequence-to-structure relationship and developing a new scoring function based on membrane protein folding. The proposed approach is evaluated on low- and high-resolution data sets with cross-validation, and the topology (sidedness) prediction accuracy reaches as high as 90%. Our method is also found to correctly predict both the location of TM helices and the topology for 69% of the low-resolution benchmark set. We also test our method for discrimination between soluble and membrane proteins and achieve very low overall false positive (0.5%) and false negative rates (0 to approximately 1.2%). Lastly, the analysis of the scoring function suggests that the topogeneses of single-spanning and multispanning TM proteins have different levels of complexity, and the consideration of interloop topogenic interactions for the latter is the key to achieving better predictions. This method can facilitate the annotation of membrane proteomes to extract useful structural and functional information. It is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/SVMtop.

16.
17.
Revisiting the problem of intron-exon identification, we use a principal component analysis (PCA) to classify DNA sequences and present first results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
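
Turning a DNA sequence into a document vector of word counts can be sketched as below. The word length, the normalization and the function names are assumptions made for illustration; the abstract does not specify them. The non-overlapping mode with `k=6` corresponds to a hexamer frequency count of the kind the abstract uses as a baseline, and the PCA step itself is left out because it needs a linear algebra library.

```python
from collections import Counter

def word_counts(seq, k=6, overlapping=False):
    """Count the k-letter words in a sequence.

    overlapping=False reads words back to back (step k);
    overlapping=True slides the window one base at a time.
    """
    step = 1 if overlapping else k
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))

def to_vector(counts, vocabulary):
    """Relative word frequencies in a fixed vocabulary order."""
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocabulary]

# Toy example with 3-letter words so the output stays readable.
counts = word_counts("ATGATGATGATG", k=3, overlapping=True)
print(counts)
print(to_vector(counts, ["ATG", "TGA", "GAT", "CCC"]))
```

Stacking such vectors for many sequences gives the matrix on which a principal component analysis would then be run.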

18.
19.
20.
VISTRAJ is an application which allows 3D visualization, manipulation and editing of protein conformational space using probabilistic maps of this space called 'trajectory distributions'. Trajectory distributions serve as input to FOLDTRAJ which samples protein structures based on the represented conformational space. VISTRAJ also allows FOLDTRAJ to be used as a tool for homology model creation, and structures may be generated containing post-translationally modified amino acids. AVAILABILITY: Binaries are freely available for non-profit use as part of the FOLDTRAJ package at ftp://ftp.mshri.on.ca/pub/TraDES/foldtraj/.
