Similar literature
 20 similar articles found (search time: 15 ms)
1.
In order to simplify and meaningfully categorize large sets of protein sequence data, it is commonplace to cluster proteins based on the similarity of those sequences. However, it quickly becomes clear that the sequence flexibility allowed a given protein varies significantly among different protein families. The degree to which sequences are conserved not only differs for each protein family, but also is affected by the phylogenetic divergence of the source organisms. Clustering techniques that use similarity thresholds for protein families do not always allow for these variations and thus cannot be confidently used for applications such as automated annotation and phylogenetic profiling. In this work, we applied a spectral bipartitioning technique to all proteins from 53 archaeal genomes. Comparisons between different taxonomic levels allowed us to study the effects of phylogenetic distances on cluster structure. Likewise, by associating functional annotations and phenotypic metadata with each protein, we could compare our protein similarity clusters with both protein function and associated phenotype. Our clusters can be analyzed graphically and interactively online.
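
The abstract above does not spell out its spectral bipartitioning procedure. As a rough illustration of the general idea only, the sketch below splits a small similarity graph by the sign of the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue). The function name `fiedler_bipartition`, the pure-Python power iteration, and the toy similarity matrix are all assumptions made for this example, not the authors' implementation; the graph is assumed to be connected with at least two nodes.

```python
import math


def fiedler_bipartition(weights, iterations=2000):
    """Split nodes into two groups by the sign of the Fiedler vector.

    weights: symmetric matrix (list of lists) of pairwise similarities.
    Hypothetical helper for illustration; assumes a connected graph.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Graph Laplacian L = D - W.
    lap = [[(degree[i] if i == j else 0.0) - weights[i][j] for j in range(n)]
           for i in range(n)]
    # Power iteration on (shift*I - L) favours the smallest eigenvalues of L.
    # Removing the mean projects out the constant eigenvector (eigenvalue 0),
    # so the iteration converges to the Fiedler vector instead.
    shift = 2.0 * max(degree) + 1.0
    vec = [float(i) - (n - 1) / 2.0 for i in range(n)]  # deterministic start
    for _ in range(iterations):
        mean = sum(vec) / n
        vec = [v - mean for v in vec]
        vec = [shift * vec[i] - sum(lap[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    group_a = [i for i in range(n) if vec[i] < 0]
    group_b = [i for i in range(n) if vec[i] >= 0]
    return group_a, group_b


# Toy example: two tight triangles (0-1-2 and 3-4-5) joined by one weak link.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(fiedler_bipartition(toy))
```

Applying such a split recursively to each resulting group would give a hierarchy of clusters without fixing a single similarity threshold in advance, which is the property the abstract emphasizes.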

2.
SMART: a web-based tool for the study of genetically mobile domains   Total citations: 61, self-citations: 2, citations by others: 59
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures (http://SMART.embl-heidelberg.de). More than 400 domain families found in signalling, extra-cellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

3.
The Swiss-Prot protein knowledgebase provides manually annotated entries for all species, but concentrates on the annotation of entries from model organisms to ensure the presence of high quality annotation of representative members of all protein families. A specific Plant Protein Annotation Program (PPAP) was started to cope with the increasing amount of data produced by the complete sequencing of plant genomes. Its main goal is the annotation of proteins from the model plant organism Arabidopsis thaliana. In addition to bibliographic references, experimental results, computed features and sometimes even contradictory conclusions, direct links to specialized databases connect amino acid sequences with the current knowledge in plant sciences. As protein families and groups of plant-specific proteins are regularly reviewed to keep up with current scientific findings, we hope that the wealth of information of Arabidopsis origin accumulated in our knowledgebase, and the numerous software tools provided on the Expert Protein Analysis System (ExPASy) web site might help to identify and reveal the function of proteins originating from other plants. Recently, a single, centralized, authoritative resource for protein sequences and functional information, UniProt, was created by joining the information contained in Swiss-Prot, Translation of the EMBL nucleotide sequence (TrEMBL), and the Protein Information Resource-Protein Sequence Database (PIR-PSD). A rising problem is that an increasing number of nucleotide sequences are not being submitted to the public databases, and thus the proteins inferred from such sequences will have difficulties finding their way to the Swiss-Prot or TrEMBL databases.

4.
5.
Functional annotation is seldom straightforward with complexities arising due to functional divergence in protein families or functional convergence between non‐homologous protein families, leading to mis‐annotations. An enzyme may contain multiple domains and not all domains may be involved in a given function, adding to the complexity in function annotation. To address this, we use binding site information from bound cognate ligands and catalytic residues, since it can help in resolving fold‐function relationships at a finer level and with higher confidence. A comprehensive database of 2,020 fold‐function‐binding site relationships has been systematically generated. A network‐based approach is employed to capture the complexity in these relationships, from which different types of associations are deciphered, that identify versatile protein folds performing diverse functions, same function associated with multiple folds and one‐to‐one relationships. Binding site similarity networks integrated with fold, function, and ligand similarity information are generated to understand the depth of these relationships. Apart from the observed continuity in the functional site space, network properties of these revealed versatile families with topologically different or dissimilar binding sites and structural families that perform very similar functions. As a case study, subtle changes in the active site of a set of evolutionarily related superfamilies are studied using these networks. Tracing of such similarities in evolutionarily related proteins provides clues into the transition and evolution of protein functions. Insights from this study will be helpful in accurate and reliable functional annotations of uncharacterized proteins, poly‐pharmacology, and designing enzymes with new functional capabilities. Proteins 2017; 85:1319–1335. © 2017 Wiley Periodicals, Inc.

6.
The transition metals nickel and cobalt, essential components of many enzymes, are taken up by specific transport systems of several different types. We integrated in silico and in vivo methods for the analysis of various protein families containing both nickel and cobalt transport systems in prokaryotes. For functional annotation of genes, we used two comparative genomic approaches: identification of regulatory signals and analysis of the genomic positions of genes encoding candidate nickel/cobalt transporters. The nickel-responsive repressor NikR regulates many nickel uptake systems, though the NikR-binding signal is divergent in various taxonomic groups of bacteria and archaea. B(12) riboswitches regulate most of the candidate cobalt transporters in bacteria. The nickel/cobalt transporter genes are often colocalized with genes for nickel-dependent or coenzyme B(12) biosynthesis enzymes. Nickel/cobalt transporters of different families, including the previously known NiCoT, UreH, and HupE/UreJ families of secondary systems and the NikABCDE ABC-type transporters, showed a mosaic distribution in prokaryotic genomes. In silico analyses identified CbiMNQO and NikMNQO as the most widespread groups of microbial transporters for cobalt and nickel ions. These unusual uptake systems contain an ABC protein (CbiO or NikO) but lack an extracytoplasmic solute-binding protein. Experimental analysis confirmed metal transport activity for three members of this family and demonstrated significant activity for a basic module (CbiMN) of the Salmonella enterica serovar Typhimurium transporter.

7.
Membrane proteins serve as cellular gatekeepers, regulators, and sensors. Prior studies have explored the functional breadth and evolution of proteins and families of particular interest, such as the diversity of transport-associated membrane protein families in prokaryotes and eukaryotes, the composition of integral membrane proteins, and family classification of all human G-protein coupled receptors. However, a comprehensive analysis of the content and evolutionary associations between membrane proteins and families in a diverse set of genomes is lacking. Here, a membrane protein annotation pipeline was developed to define the integral membrane genome and associations between 21,379 proteins from 34 genomes; most, but not all, of these proteins belong to 598 defined families. The pipeline was used to provide target input for a structural genomics project that successfully cloned, expressed, and purified 61 of our first 96 selected targets in yeast. Furthermore, the methodology was applied (1) to explore the evolutionary history of the substrate-binding transmembrane domains of the human ABC transporter superfamily, (2) to identify the multidrug resistance-associated membrane proteins in whole genomes, and (3) to identify putative new membrane protein families.

8.
9.
The fast development of next generation sequencing (NGS) has dramatically increased the application of metagenomics in various fields. Functional annotation is a major step in metagenomic studies. Fast annotation of functional genes has been a challenge because of the deluge of NGS data and expanding databases. A hybrid annotation pipeline proposed previously for taxonomic assignments was evaluated in this study for the annotation of specific functional genes in metagenomic sequences, such as antibiotic resistance genes, arsenic resistance genes and key genes in nitrogen metabolism. The hybrid approach using UBLAST and BLASTX is 44–177 times faster than direct BLASTX when annotating against the small protein database for the specific functional genes, at the cost of missing a small portion (<1.8%) of target sequences compared with direct BLASTX hits. Unlike direct BLASTX, the time required for annotating specific functional genes with the hybrid annotation pipeline depends on the abundance of the target genes. Thus this hybrid annotation pipeline is more suitable for annotating specific functional genes than for comprehensive functional gene annotation.

10.
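An evolutionary analysis method for determining orthologous relationships   Total citations: 1, self-citations: 0, citations by others: 1
How to correctly determine orthologous and paralogous relationships between genes remains a key problem in genome functional annotation and comparative genomics that still awaits a better solution. In earlier work, an evolutionary analysis method was used to resolve ortholog/paralog relationships within multigene families; here the evolutionary analysis method for determining orthologous relationships is developed in full. Case studies of 44 homologous protein families indicate that, compared with the popular COG method (Clusters of Orthologous Groups of proteins), this method can determine orthologous relationships in the general case and can accurately annotate the molecular functions of genomes.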

11.
Public sequence databases contain information on the sequence, structure and function of proteins. Genome sequencing projects have led to a rapid increase in protein sequence information, but reliable, experimentally verified, information on protein function lags a long way behind. To address this deficit, functional annotation in protein databases is often inferred by sequence similarity to homologous, annotated proteins, with the attendant possibility of error. Now, the functional annotation in these homologous proteins may itself have been acquired through sequence similarity to yet other proteins, and it is generally not possible to determine how the functional annotation of any given protein has been acquired. Thus the possibility of chains of misannotation arises, a process we term 'error percolation'. With some simple assumptions, we develop a dynamical probabilistic model for these misannotation chains. By exploring the consequences of the model for annotation quality it is evident that this iterative approach leads to a systematic deterioration of database quality.
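
The abstract above does not give the equations of its model, so the following is only a toy sketch of how chained annotation transfer can erode quality, not the authors' model. It assumes (all hypothetical choices) that a database starts with a set of correct seed entries, that each new entry is experimentally verified with probability `p_experimental` and is then always correct, and that otherwise its annotation is copied from a uniformly chosen existing entry, with a fresh error introduced with probability `transfer_error`. The function tracks the expected fraction of correct entries.

```python
def simulate_annotation_quality(seed_entries, new_entries,
                                p_experimental, transfer_error):
    """Expected fraction of correct annotations as a database grows.

    Toy model under the assumptions stated above; every parameter name
    here is illustrative.
    """
    total = float(seed_entries)
    correct = float(seed_entries)  # seed entries are assumed correct
    history = [correct / total]
    for _ in range(new_entries):
        frac_correct = correct / total
        # A copied annotation is right only if its source was right
        # and the transfer itself introduced no error.
        p_correct = (p_experimental
                     + (1.0 - p_experimental)
                     * (1.0 - transfer_error) * frac_correct)
        correct += p_correct
        total += 1.0
        history.append(correct / total)
    return history


history = simulate_annotation_quality(
    seed_entries=100, new_entries=1000,
    p_experimental=0.1, transfer_error=0.1)
print(round(history[0], 3), round(history[-1], 3))
```

Under these assumptions the expected correct fraction falls steadily from 1.0 toward the fixed point p / (1 - (1 - p)(1 - e)), where p is `p_experimental` and e is `transfer_error`. This mirrors, in a very simplified way, the systematic deterioration the abstract describes.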

12.
MOTIVATION: It is commonly believed that sequence determines structure, which in turn determines function. However, the presence of many proteins with the same structural fold but different functions suggests that global structure and function do not always correlate well. RESULTS: We propose a method for accurate functional annotation, based on identification of functional signatures from structural alignments (FSSA) using the Structural Classification of Proteins (SCOP) database. The FSSA method is superior at function discrimination and classification compared with several methods that directly inherit functional annotation information from homology inference, such as Smith-Waterman, PSI-BLAST, hidden Markov models and structure comparison methods, for a large number of structural fold families. Our results indicate that the contributions of amino acid residue types and positions to structure and function are largely separable for proteins in multi-functional fold families.

13.
The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003   Total citations: 56, self-citations: 4, citations by others: 52
The SWISS-PROT protein knowledgebase (http://www.expasy.org/sprot/ and http://www.ebi.ac.uk/swissprot/) connects amino acid sequences with the current knowledge in the Life Sciences. Each protein entry provides an interdisciplinary overview of relevant information by bringing together experimental results, computed features and sometimes even contradictory conclusions. Detailed expertise that goes beyond the scope of SWISS-PROT is made available via direct links to specialised databases. SWISS-PROT provides annotated entries for all species, but concentrates on the annotation of entries from human (the HPI project) and other model organisms to ensure the presence of high quality annotation for representative members of all protein families. Part of the annotation can be transferred to other family members, as is already done for microbes by the High-quality Automated and Manual Annotation of microbial Proteomes (HAMAP) project. Protein families and groups of proteins are regularly reviewed to keep up with current scientific findings. Complementarily, TrEMBL strives to comprise all protein sequences that are not yet represented in SWISS-PROT, by incorporating a perpetually increasing level of mostly automated annotation. Researchers are welcome to contribute their knowledge to the scientific community by submitting relevant findings to SWISS-PROT at swiss-prot@expasy.org.

14.
Evolution of the Rab family of small GTP-binding proteins.   Total citations: 33, self-citations: 0, citations by others: 33
Rab proteins are small GTP-binding proteins that form the largest family within the Ras superfamily. Rab proteins regulate vesicular trafficking pathways, behaving as membrane-associated molecular switches. Here, we have identified the complete Rab families in the Caenorhabditis elegans (29 members), Drosophila melanogaster (29), Homo sapiens (60) and Arabidopsis thaliana (57), and we defined criteria for annotation of this protein family in each organism. We studied sequence conservation patterns and observed that the RabF motifs and the RabSF regions previously described in mammalian Rabs are conserved across species. This is consistent with conserved recognition mechanisms by general regulators and specific effectors. We used phylogenetic analysis and other approaches to reconstruct the multiplication of the Rab family and observed that this family shows a strict phylogeny of function as opposed to a phylogeny of species. Furthermore, we observed that Rabs co-segregating in phylogenetic trees show a pattern of similar cellular localisation and/or function. Therefore, animal and fungi Rab proteins can be grouped in "Rab functional groups" according to their segregating patterns in phylogenetic trees. These functional groups reflect similarity of sequence, localisation and/or function, and may also represent shared ancestry. Rab functional groups can help the understanding of the functional evolution of the Rab family in particular and vesicular transport in general, and may be used to predict general functions for novel Rab sequences.

15.
Ubiquitin E3 ligases are a diverse family of protein complexes that mediate the ubiquitination and subsequent proteolytic turnover of proteins in a highly specific manner. Among the several classes of ubiquitin E3 ligases, the Skp1-Cullin-F-box (SCF) class is generally comprised of three 'core' subunits: Skp1 and Cullin, plus at least one F-box protein (FBP) subunit that imparts specificity for the ubiquitination of selected target proteins. Recent genetic and biochemical evidence in Arabidopsis thaliana suggests that post-translational turnover of proteins mediated by SCF complexes is important for the regulation of diverse developmental and environmental response pathways. In this report, we extend upon a previous annotation of the Arabidopsis Skp1-like (ASK) and FBP gene families to include the Cullin family of proteins. Analysis of the protein interaction profiles involving the products of all three gene families suggests a functional distinction between ASK proteins in that selected members of the protein family interact generally while others interact more specifically with members of the F-box protein family. Analysis of the interaction of Cullins with FBPs indicates that CUL1 and CUL2, but not CUL3A, persist as components of selected SCF complexes, suggesting some degree of functional specialization for these proteins. Yeast two-hybrid analyses also revealed binary protein interactions between selected members of the FBP family in Arabidopsis. These and related results are discussed in terms of their implications for subunit composition, stoichiometry and functional diversity of SCF complexes in Arabidopsis.

16.
MOTIVATION: Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional Multiple Sequence Alignment (MSA) based methods find difficulties when faced with heterogeneous groups of proteins. However, even many families of proteins that do share a common domain contain instances of several other domains, without any common underlying linear ordering. Ignoring this modularity may lead to poor or even false classification results. An automated method that can analyze a group of proteins into the sequence domains it contains is therefore highly desirable. RESULTS: We apply a novel method to the problem of protein domain detection. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. A Variable Memory Markov (VMM) model is built using a Prediction Suffix Tree (PST) data structure for each group of segments. Refinement is achieved by letting the PSTs compete over the segments, and a deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. We show that regions of similar statistics correlate well with protein sequence domains, by matching a unique signature to each domain. This is done in a fully automated manner, and does not require or attempt an MSA. Several representative cases are analyzed. We identify a protein fusion event, refine an HMM superfamily classification into the underlying families the HMM cannot separate, and detect all 12 instances of a short domain in a group of 396 sequences. CONTACT: jill@cs.huji.ac.il; tishby@cs.huji.ac.il.

17.
MOTIVATION: Identifier (ID) mapping establishes links between various biological databases and is an essential first step for molecular data integration and functional annotation. ID mapping allows diverse molecular data on genes and proteins to be combined and mapped to functional pathways and ontologies. We have developed comprehensive protein-centric ID mapping services providing mappings for 90 IDs derived from databases on genes, proteins, pathways, diseases, structures, protein families, protein interaction, literature, ontologies, etc. The services are widely used and have been regularly updated since 2006. AVAILABILITY: www.uniprot.org/mapping and proteininformation-resource.org/pirwww/search/idmapping.shtml CONTACT: huang@dbi.udel.edu.

18.
19.
20.
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. For the proteome of Arabidopsis, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
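
The abstract above describes greedy chaining of domain hits followed by grouping sequences by shared architecture, but gives no details of the selection rule. The sketch below illustrates the general pattern only and is not GFam's code: the rule "prefer longer hits, reject anything overlapping an already chosen hit", the tuple layout of a hit, and the function names `greedy_architecture` and `families_by_architecture` are all assumptions made for this example. The connected component step for un-annotated regions is omitted.

```python
from collections import defaultdict


def greedy_architecture(hits):
    """Pick non-overlapping domain hits greedily and return their order.

    hits: list of (start, end, domain_name) with inclusive coordinates.
    Assumed rule: longer hits win; ties are broken by start position.
    """
    chosen = []
    for start, end, name in sorted(
            hits, key=lambda h: (-(h[1] - h[0] + 1), h[0])):
        # Keep the hit only if it overlaps none of the hits chosen so far.
        if all(end < s or start > e for s, e, _ in chosen):
            chosen.append((start, end, name))
    chosen.sort()  # order domains from N- to C-terminus
    return tuple(name for _, _, name in chosen)


def families_by_architecture(annotations):
    """Group sequence IDs that share the same domain architecture."""
    families = defaultdict(list)
    for seq_id, hits in annotations.items():
        families[greedy_architecture(hits)].append(seq_id)
    return dict(families)


# Hypothetical domain hits for three sequences.
example = {
    "p1": [(1, 100, "A"), (90, 120, "X"), (150, 200, "B")],
    "p2": [(5, 80, "A"), (120, 180, "B")],
    "p3": [(1, 50, "C")],
}
print(families_by_architecture(example))
```

In this toy input the short hit `X` on `p1` overlaps the longer hit `A` and is dropped, so `p1` and `p2` end up in the same family with architecture `A` followed by `B`, while `p3` forms its own family.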
