Similar literature
 20 similar articles found (search time: 0 ms)
1.
    
In an era of rapid genome sequencing and high-throughput technology, automatic function prediction for a novel sequence is of utmost importance in bioinformatics. While automatic annotation methods based on local alignment searches can be simple and straightforward, they suffer from several drawbacks, including relatively low sensitivity and assignment of incorrect annotations that are not associated with the region of similarity. ProtoNet is a hierarchical organization of the protein sequences in the UniProt database. Although the hierarchy is constructed in an unsupervised automatic manner, it has been shown to be coherent with several biological data sources. We extend the ProtoNet system in order to assign functional annotations automatically. By leveraging the scaffold of the hierarchical classification, the method is able to overcome some frequent annotation pitfalls.

2.
    
PhosphaBase is an ontology-driven database resource containing information on the protein phosphatase family. It is the first public resource dedicated to protein phosphatases, which are enzymes that perform dephosphorylation reactions. In conjunction with the phosphorylation action of protein kinases, phosphatases are involved in important control and communication mechanisms in the cell. They have also been implicated in many human diseases, including diabetes and obesity, cancers, and neurodegenerative conditions. PhosphaBase aims to centralize the growing base of knowledge in the phosphatase research domain. The resource is built around a formal, domain-specific DAML+OIL ontology, and the data are collected from heterogeneous biological sources using Gene Ontology terms as a means of data extraction. The overall ontology-driven architecture provides a robust structure with distinct advantages for sustainability and provides the potential for the development of diagnostic tools, as well as a data repository.

3.
    
The Medium-Chain Dehydrogenase/Reductase Engineering Database (MDRED, http://www.mdred.uni-stuttgart.de) has been established to serve as an analysis tool for a systematic investigation of sequence-structure-function relationships. It includes sequence and structure information of 2684 and 42 medium-chain dehydrogenases/reductases (MDRs), respectively. Although MDRs are very diverse in sequence, they have a conserved tertiary structure. MDRs are assigned to 199 homologous families and 29 superfamilies. For each family, annotated multiple sequence alignments are provided, and functionally relevant residues are annotated. Twenty-five superfamilies were classified as zinc-containing MDRs, four as non-zinc-containing MDRs. For the zinc-containing MDRs, three subclasses were identified by systematic analysis of a variable loop region, the quaternary structure determining loop (QSDL): the class of short, medium, and long QSDL, which include 11, 3, and 5 superfamilies, respectively. The length of the QSDL is predictive of tetramer (short QSDL) and dimer (long QSDL) formation. The class of medium QSDL includes both tetrameric and dimeric MDRs. The shape of the substrate-binding site is highly conserved in all zinc-containing MDRs with the exception of two variable regions, the substrate recognition sites (SRS): two residues located on the QSDL (SRS1) and, for the class of long QSDL, one residue located in the catalytic domain (SRS2). The MDRED is the first online-accessible resource of MDRs that integrates information on sequence, structure, and function. Annotation of functionally relevant residues assists the understanding of sequence-structure-function relationships. Thus, the MDRED serves as a valuable tool to identify potential hotspots for engineering properties such as substrate specificity.

4.
InterPro, an integrated documentation resource for protein families, protein domains, and functional sites, was developed to amalgamate the individual efforts of the PROSITE, PRINTS, Pfam, and ProDom databases. InterPro can be used for the computational functional classification of newly determined amino acid sequences that lack biochemical characterization and for comparative genome analysis. InterPro contains over 3500 entries for more than 1 000 000 hits in SWISS-PROT and TrEMBL. The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. InterPro was used for the complete analysis of the proteome of the pathogenic microorganism Mycobacterium tuberculosis and the comparison with the predicted protein-coding sequences of the complete genomes of Bacillus subtilis and Escherichia coli. It was found that 64.8% of proteins in the proteome of M. tuberculosis matched InterPro entries and can be classified by their functions. The comparison with B. subtilis and E. coli provided information on the most common protein families and domains and on the most highly represented protein families in each organism. Thus, InterPro is a useful tool for general comparison of complete proteomes and their compositions.

5.
The ProDom database is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases. An associated database, ProDom-CG, has been derived as a restriction of ProDom to completely sequenced genomes. The ProDom construction method is based on iterative PSI-BLAST searches and multiple alignments are generated for each domain family. The ProDom web server provides the user with a set of tools to visualise multiple alignments, phylogenetic trees and domain architectures of proteins, as well as a BLAST-based server to analyse new sequences for homologous domains. The comprehensive nature of ProDom makes it particularly useful to help sustain the growth of InterPro.

6.
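Graph clustering can give good results for protein classification, provided that the complex relationships among proteins are first converted into a suitable similarity network that serves as the input to the clustering. This paper proposes a method for constructing such a similarity network based on BLAST searches: starting from the target protein sequence, several rounds of BLAST searches progressively extract from the database the sequences that are directly or indirectly related to the target protein, and these form the association set. The similarity relationships among the sequences in the association set constitute the similarity network, which can serve as the basis for classification by a graph clustering algorithm. Computations on proteins in the Pfam database that are difficult to classify correctly from direct similarity relationships alone show that similarity networks built with this method give fairly satisfactory results.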
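
The round-by-round expansion described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: `search` is a hypothetical stand-in for a single BLAST run, the identifiers and scores in `TOY_HITS` are made up, and the abstract does not specify the number of rounds or any score cutoff.

```python
from typing import Callable, Dict, Set, Tuple


def build_association_set(
    target: str,
    search: Callable[[str], Dict[str, float]],
    rounds: int = 2,
) -> Tuple[Set[str], Dict[Tuple[str, str], float]]:
    """Expand outward from `target` for a fixed number of search rounds.

    `search(query)` is a hypothetical stand-in for one BLAST run and is
    assumed to return {hit_id: similarity_score} for the query sequence.
    """
    members = {target}          # the association set
    frontier = {target}         # sequences to query in the current round
    edges: Dict[Tuple[str, str], float] = {}
    for _ in range(rounds):
        next_frontier: Set[str] = set()
        for query in sorted(frontier):
            for hit, score in search(query).items():
                if hit == query:
                    continue
                # Store each undirected edge once, keeping the best score.
                key = tuple(sorted((query, hit)))
                edges[key] = max(score, edges.get(key, 0.0))
                if hit not in members:
                    members.add(hit)
                    next_frontier.add(hit)
        frontier = next_frontier
        if not frontier:
            break
    return members, edges


# Toy similarity table used instead of a real database search (made-up values).
TOY_HITS = {
    "P1": {"P2": 0.9},
    "P2": {"P1": 0.9, "P3": 0.6},
    "P3": {"P2": 0.6, "P4": 0.5},
    "P4": {"P3": 0.5},
}


def toy_search(query: str) -> Dict[str, float]:
    return TOY_HITS.get(query, {})


members, edges = build_association_set("P1", toy_search, rounds=2)
```

With two rounds the toy run reaches P2 directly and P3 indirectly; the resulting `edges` dictionary plays the role of the similarity network that a graph clustering algorithm would take as input.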

7.
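Advances in research on plant protein kinases   (cited 5 times: 0 self-citations, 5 by others)
In recent years, with the continual refinement of molecular biology techniques and building on research into yeast and animal protein kinases, research on plant protein kinases has made great progress. This article reviews work by researchers in China and abroad over the past decade on the discovery, family classification, phosphorylation processes, and physiological functions of plant protein kinases. Finally, it analyzes the outstanding problems and offers an outlook for future research.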

8.
    
The Short-chain Dehydrogenases/Reductases Engineering Database (SDRED) covers one of the largest known protein families (168 150 proteins). Assignment to the superfamilies of Classical and Extended SDRs was achieved by global sequence similarity and by identification of family-specific sequence motifs. Two standard numbering schemes were established for Classical and Extended SDRs that allow for the determination of conserved amino acid residues, such as cofactor specificity determining positions or superfamily specific sequence motifs. The comprehensive sequence dataset of the SDRED facilitates the refinement of family-specific sequence motifs. The glycine-rich motifs for Classical and Extended SDRs were refined to improve the precision of superfamily classification. In each superfamily, the majority of sequences formed a tightly connected sequence network and belonged to a large homologous family. Despite their different sequence motifs and their different sequence lengths, the two sequence networks of Classical and Extended SDRs are not separate, but connected by edges at a threshold of 40% sequence similarity, indicating that all SDRs belong to a large, connected network. The SDRED is accessible at https://sdred.biocatnet.de/.
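
To make the idea of a sequence network "connected at a threshold" concrete, here is a minimal sketch that groups sequences into connected components using only edges at or above a similarity cutoff. It is not the SDRED pipeline: the function name, the union-find approach, and the toy identifiers and similarity values are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(
    nodes: Iterable[str],
    edges: Iterable[Tuple[str, str, float]],
    threshold: float,
) -> List[List[str]]:
    """Group nodes joined by edges whose similarity is >= threshold."""
    nodes = list(nodes)
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in edges:
        if similarity >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[str, List[str]] = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Made-up example: two tight groups linked by one weaker edge.
toy_nodes = ["c1", "c2", "e1", "e2"]
toy_edges = [("c1", "c2", 0.80), ("e1", "e2", 0.70), ("c2", "e1", 0.45)]

at_40 = connected_components(toy_nodes, toy_edges, threshold=0.40)
at_50 = connected_components(toy_nodes, toy_edges, threshold=0.50)
```

In this toy example the two groups merge into one network at a 0.40 cutoff and separate at 0.50, which mirrors the kind of threshold dependence the abstract reports for Classical and Extended SDRs.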

9.
    
Cai CZ  Han LY  Ji ZL  Chen YZ 《Proteins》2004,55(1):66-76
One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, support vector machines (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. The SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% to 95.7% and 79.0% to 100% respectively. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have a certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classification of distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
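
Since the abstract reports both accuracy and the Matthews correlation coefficient, a short sketch of how the two are computed from a binary confusion matrix may help. The formulas are the standard definitions; the function names and the example counts are made up and are not taken from the paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    Returns 0.0 when any marginal sum is zero, a common convention for the
    otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for one family-versus-rest classifier.
example_accuracy = accuracy(tp=90, tn=80, fp=20, fn=10)
example_mcc = matthews_corrcoef(tp=90, tn=80, fp=20, fn=10)
```

Unlike accuracy, the coefficient ranges from -1 to 1 and stays informative when the positive and negative classes differ greatly in size, which is typical for family-versus-rest classification.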

10.
    
Bostick DL  Shen M  Vaisman II 《Proteins》2004,56(3):487-501
A topological representation of proteins is developed that makes use of two metrics: the Euclidean metric for identifying natural nearest neighboring residues via the Delaunay tessellation in Cartesian space and the distance between residues in sequence space. Using this representation, we introduce a quantitative and computationally inexpensive method for the comparison of protein structural topology. The method ultimately results in a numerical score quantifying the distance between proteins in a heuristically defined topological space. The properties of this scoring scheme are investigated and correlated with the standard Cα distance root-mean-square deviation measure of protein similarity calculated by rigid body structural alignment. The topological comparison method is shown to have a characteristic dependence on protein conformational differences and secondary structure. This distinctive behavior is also observed in the comparison of proteins within families of structural relatives. The ability of the comparison method to successfully classify proteins into classes, superfamilies, folds, and families that are consistent with standard classification methods, both automated and human-driven, is demonstrated. Furthermore, it is shown that the scoring method allows for a fine-grained classification on the family, protein, and species level that agrees very well with currently established phylogenetic hierarchies. This fine classification is achieved without requiring visual inspection of proteins, sequence analysis, or the use of structural superimposition methods. Implications of the method for a fast, automated, topological hierarchical classification of proteins are discussed.

11.

12.
Protein classification artificial neural system.   (cited 2 times: 0 self-citations, 2 by others)
A neural network classification method is developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), has been implemented on a Cray supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function that is similar to the k-tuple method for sequence encoding. A collection of modular back-propagation networks is used to store the large number of sequence patterns. The system has been trained and tested with the first 2,148 of the 8,309 entries of the annotated Protein Identification Resource protein sequence database (release 29). The entries included the electron transfer proteins and the six enzyme groups (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases), with a total of 620 superfamilies. After a total training time of seven Cray central processing unit (CPU) hours, the system has reached a predictive accuracy of 90%. The classification is fast (i.e., 0.1 Cray CPU second per sequence), as it only involves a forward-feeding through the networks. The classification time on a full-scale system embedded with all known superfamilies is estimated to be within 1 CPU second. Although the training time will grow linearly with the number of entries, the classification time is expected to remain low even if there is a 10-100-fold increase of sequence entries. The neural database, which consists of a set of weight matrices of the networks, together with the ProCANS software, can be ported to other computers and made available to the genome community. The rapid and accurate superfamily classification would be valuable to the organization of protein sequence databases and to the gene recognition in large sequencing projects.
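
The n-gram encoding mentioned in this abstract can be illustrated with a simple sketch that maps a sequence of any length to a fixed-length vector of n-gram frequencies, the kind of input a feed-forward network requires. This is an assumption-laden illustration rather than the ProCANS hashing function: the 20-letter alphabet, the base-20 indexing, and the normalization are choices made here.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def ngram_vector(sequence: str, n: int = 2) -> List[float]:
    """Encode a sequence as normalized counts of its overlapping n-grams.

    Each n-gram is mapped to a fixed slot by reading it as a base-20 number,
    so every sequence yields a vector of length 20 ** n.
    """
    size = len(AMINO_ACIDS) ** n
    counts = [0.0] * size
    total = 0
    for start in range(len(sequence) - n + 1):
        gram = sequence[start:start + n]
        if any(ch not in INDEX for ch in gram):
            continue  # skip n-grams containing non-standard residues
        position = 0
        for ch in gram:
            position = position * len(AMINO_ACIDS) + INDEX[ch]
        counts[position] += 1.0
        total += 1
    if total:
        counts = [c / total for c in counts]
    return counts


vector = ngram_vector("ACAC", n=2)  # n-grams: AC, CA, AC
```

Because the vector length depends only on `n` and the alphabet size, classifying a sequence afterwards takes a single forward pass, which is consistent with the fast classification time the abstract reports.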

13.
Mimicking cellular sorting improves prediction of subcellular localization   (cited 27 times: 0 self-citations, 27 by others)
Predicting the native subcellular compartment of a protein is an important step toward elucidating its function. Here we introduce LOCtree, a hierarchical system combining support vector machines (SVMs) and other prediction methods. LOCtree predicts the subcellular compartment of a protein by mimicking the mechanism of cellular sorting and exploiting a variety of sequence and predicted structural features in its input. Currently LOCtree does not predict localization for membrane proteins, since the compositional properties of membrane proteins significantly differ from those of non-membrane proteins. While any information about function can be used by the system, we present estimates of performance that are valid when only the amino acid sequence of a protein is known. When evaluated on a non-redundant test set, LOCtree achieved sustained levels of 74% accuracy for non-plant eukaryotes, 70% for plants, and 84% for prokaryotes. We rigorously benchmarked LOCtree in comparison to the best alternative methods for localization prediction. LOCtree outperformed all other methods in nearly all benchmarks. Localization assignments using LOCtree agreed quite well with data from recent large-scale experiments. Our preliminary analysis of a few entirely sequenced organisms, namely human (Homo sapiens), yeast (Saccharomyces cerevisiae), and weed (Arabidopsis thaliana) suggested that over 35% of all non-membrane proteins are nuclear, about 20% are retained in the cytosol, and that every fifth protein in the weed resides in the chloroplast.

14.
A database of 926 (α + β)-proteins and (α + β)-domains containing abCd units, including 401 non-homologous, was compiled from the Protein Data Bank (total of 2636 PDB entries). A novel structural tree of this structural class of proteins was constructed to include 286 possible polypeptide chain folds. A structural classification of (α + β)-proteins containing the abCd unit was developed on the basis of the structural tree. The database and the structural tree are available at http://strees.protres.ru/.

15.
    
Sael L  Li B  La D  Fang Y  Ramani K  Rustamov R  Kihara D 《Proteins》2008,72(4):1259-1273

16.
Presented here is the development of a semi-rational protein engineering approach that uses information from protein structure coupled with established DNA manipulation techniques to design and create multiple crossover libraries from non-homologous genes. The utility of structure-based combinatorial protein engineering (SCOPE) was demonstrated by its application to two distantly related members of the X-family of DNA polymerases: rat DNA polymerase beta (Pol beta) and African swine fever virus DNA polymerase X (Pol X). These proteins share similar folds but have low sequence identity, and differ greatly in both size and activity. "Equivalent" subdomain elements of structure were designed on the basis of the tertiary structure of Pol beta and the corresponding regions of Pol X were inferred from homology modeling and sequence alignment analysis. Libraries of chimeric genes with up to five crossovers were synthesized in a series of PCR reactions by employing hybrid oligonucleotides that code for variable connections between structural elements. Genetic complementation in Escherichia coli enabled identification of several novel DNA polymerases with enhanced phenotypes. Both the composition of structural elements and the manner in which they were linked were shown to be essential for this property, indicating the importance of these aspects of design.

17.
18.
    
Nick V. Grishin 《Proteins》2015,83(7):1238-1251
ECOD (Evolutionary Classification Of protein Domains) is a comprehensive and up‐to‐date protein structure classification database. The majority of new structures released from the PDB (Protein Data Bank) each week already have close homologs in the ECOD hierarchy and thus can be reliably partitioned into domains and classified by software without manual intervention. However, those proteins that lack confidently detectable homologs require careful analysis by experts. Although many bioinformatics resources rely on expert curation to some degree, specific examples of how this curation occurs and in what cases it is necessary are not always described. Here, we illustrate the manual classification strategy in ECOD by example, focusing on two major issues in protein classification: domain partitioning and the relationship between homology and similarity scores. Most examples show recently released and manually classified PDB structures. We discuss multi‐domain proteins, discordance between sequence and structural similarities, difficulties with assessing homology with scores, and integral membrane proteins homologous to soluble proteins. By timely assimilation of newly available structures into its hierarchy, ECOD strives to provide a most accurate and updated view of the protein structure world as a result of combined computational and expert‐driven analysis. Proteins 2015; 83:1238–1251. © 2015 Wiley Periodicals, Inc.

19.
    
Racolta S  Juhl PB  Sirim D  Pleiss J 《Proteins》2012,80(8):2009-2019
Triterpene cyclases catalyze a broad range of cyclization reactions to form polycyclic triterpenes. Triterpene cyclases that convert squalene to hopene are named squalene-hopene cyclases (SHC) and triterpene cyclases that convert oxidosqualene are named oxidosqualene cyclases (OSC). Many sequences have been published, but there is only one structure available for each of SHCs and OSCs. Although they catalyze a similar reaction, the sequence similarity between SHCs and OSCs is low. A family classification based on phylogenetic analysis revealed 20 homologous families which are grouped into two superfamilies, SHCs and OSCs. Based on this family assignment, the Triterpene Cyclase Engineering Database (TTCED) was established. It integrates available information on sequence and structure of 639 triterpene cyclases as well as on structurally and functionally relevant amino acids. Family specific multiple sequence alignments were generated to identify the functionally relevant residues. Based on sequence alignments, conserved residues in SHCs and OSCs were analyzed and compared to experimentally confirmed mutational data. Functional schematic models of the central cavities of OSCs and SHCs were derived from structure comparison and sequence conservation analysis. These models demonstrate the high similarity of the substrate binding cavity of SHCs and OSCs and the equivalences of the respective residues. The TTCED is a novel source for comprehensive information on the triterpene cyclase family, including a compilation of previously described mutational data. The schematic models present the conservation analysis in a readily available fashion and facilitate the correlation of residues to a specific function or substrate interaction.

20.
Viruses differ markedly in their specificity toward host organisms. Here, we test the level of general sequence adaptation that viruses display toward their hosts. We compiled a representative data set of viruses that infect hosts ranging from bacteria to humans. We consider their respective amino acid and codon usages and compare them among the viruses and their hosts. We show that bacteria‐infecting viruses are strongly adapted to their specific hosts, but that they differ from other unrelated bacterial hosts. Viruses that infect humans, but not those that infect other mammals or aves, show a strong resemblance to most mammalian and avian hosts, in terms of both amino acid and codon preferences. In groups of viruses that infect humans or other mammals, the highest observed level of adaptation of viral proteins to host codon usages is for those proteins that appear abundantly in the virion. In contrast, proteins that are known to participate in host‐specific recognition do not necessarily adapt to their respective hosts. The implication for the potential of viral infectivity is discussed.
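
A comparison of codon usage between a virus and a host can be sketched as below. The abstract does not state which similarity measure the authors used, so cosine similarity is an assumption here, as are the function names and the two short made-up coding sequences.

```python
from collections import Counter
from math import sqrt
from typing import Dict


def codon_frequencies(cds: str) -> Dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if set(c) <= set("ACGT"))
    total = sum(counts.values())
    return {c: k / total for c, k in counts.items()} if total else {}


def cosine_similarity(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Cosine similarity between two codon-frequency profiles (0 to 1)."""
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Made-up toy sequences; real profiles would pool many genes per organism.
host_profile = codon_frequencies("ATGGCTGCTGCC")
virus_profile = codon_frequencies("ATGGCTGCAGCT")
similarity = cosine_similarity(host_profile, virus_profile)
```

A value near 1 would indicate that the virus uses codons in roughly the same proportions as the host; any meaningful comparison would need whole-proteome profiles rather than toy sequences like these.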
