Similar articles
20 similar articles found; search time 15 ms
1.

Background

Sequence homology considerations, widely used to transfer functional annotation to uncharacterized protein sequences, require special precautions in the case of non-globular sequence segments, including membrane-spanning stretches composed of non-polar residues. Simple, quantitative criteria are desirable for identifying transmembrane helices (TMs) that must be included in, or should be excluded from, the starting sequence segments of similarity searches aimed at finding distant homologues.

Results

We found that there are two types of TMs in membrane-associated proteins. On the one hand, there are so-called simple TMs with elevated hydrophobicity, low sequence complexity and extraordinary enrichment in long aliphatic residues. They merely serve as membrane-anchoring devices. In contrast, so-called complex TMs have lower hydrophobicity, higher sequence complexity and some functional residues. These TMs have additional roles besides membrane anchoring, such as intra-membrane complex formation, ligand binding or a catalytic role. Simple and complex TMs can occur in both single- and multi-membrane-spanning proteins, in essentially any type of topology. Whereas simple TMs have the potential to confuse searches for sequence homologues and to generate unrelated hits with seemingly convincing statistical significance, complex TMs contain essential evolutionary information.

Conclusion

For extending the homology concept onto membrane proteins, we provide a necessary quantitative criterion to distinguish simple TMs (and a sufficient criterion for complex TMs) in query sequences prior to their usage in homology searches based on assessment of hydrophobicity and sequence complexity of the TM sequence segments.

Reviewers

This article was reviewed by Shamil Sunyaev, L. Aravind and Arcady Mushegian.
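The abstract above distinguishes simple from complex TMs by hydrophobicity and sequence complexity. The following is a minimal sketch of that idea, not the authors' published criterion: it assumes the Kyte-Doolittle hydropathy scale and Shannon entropy as stand-ins for the two measures, and the cutoffs `min_hydro` and `max_entropy` are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values (assumed stand-in; the abstract names no scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(segment: str) -> float:
    """Average hydropathy over all residues of the segment."""
    return sum(KD[residue] for residue in segment) / len(segment)


def shannon_entropy(segment: str) -> float:
    """Residue-composition entropy in bits; low values mean low complexity."""
    counts = Counter(segment)
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def classify_tm(segment: str, min_hydro: float = 2.5, max_entropy: float = 2.5) -> str:
    """Label a TM segment 'simple' or 'complex' using hypothetical cutoffs."""
    segment = segment.upper()
    is_simple = (
        mean_hydrophobicity(segment) >= min_hydro
        and shannon_entropy(segment) <= max_entropy
    )
    return "simple" if is_simple else "complex"


if __name__ == "__main__":
    print(classify_tm("LLLLLLLLLLVVVVVVVVVV"))
    print(classify_tm("LSGTNYAFWHCLMTGQPVIS"))
```

In a search pipeline, segments labelled "simple" under such a rule would be candidates for masking before the query is submitted, whereas "complex" segments would be kept.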

2.
The Pfam Protein Families Database (total citations: 17; self-citations: 0; citations by others: 17)
Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the World Wide Web in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgb.ki.se/Pfam/, in France at http://pfam.jouy.inra.fr/ and in the US at http://pfam.wustl.edu/. The latest version (6.6) of Pfam contains 3071 families, which match 69% of proteins in SWISS-PROT 39 and TrEMBL 14. Structural data, where available, have been utilised to ensure that Pfam families correspond with structural domains, and to improve domain-based annotation. Predictions of non-domain regions are now also included. In addition to secondary structure, Pfam multiple sequence alignments now contain active site residue mark-up. New search tools, including taxonomy search and domain query, greatly add to the functionality and usability of the Pfam resource.

3.
Protein domains are generally thought to correspond to units of evolution. New research raises questions about how such domains are defined with bioinformatics tools and sheds light on how evolution has enabled partial domains to be viable. With the rapid expansion in the number of determined protein sequences - over 92 million in UniProt in March 2015 - an ever-increasing number of biologists are using bioinformatics tools for annotation of these sequences. One widely used strategy is to identify occurrences of Pfam families within the sequence of interest [1]. A Pfam family is a multiple sequence alignment of the occurrences of a particular domain both in different species and in different regions of the same protein. The concept underpinning Pfam is that proteins typically comprise one or more domains (regions), each of which is an evolutionary unit that generally has a well-defined biological function. A significant sequence similarity between a query protein and a Pfam family provides the basis for annotations. Two recent articles [2,3] in Genome Biology evaluate the implications of a query sequence matching only part of a Pfam family, which is an intriguing finding, given that a Pfam family is considered to be an evolutionary unit.

4.
MOTIVATION: Multi-domain proteins have evolved by insertions or deletions of distinct protein domains. Tracing the history of a certain domain combination can be important for functional annotation of multi-domain proteins, and for understanding the function of individual domains. In order to analyze the evolutionary history of the domains in modular proteins it is desirable to inspect a phylogenetic tree based on sequence divergence with the modular architecture of the sequences superimposed on the tree. RESULT: A Java applet, NIFAS, that integrates graphical domain schematics for each sequence in an evolutionary tree was developed. NIFAS retrieves domain information from the Pfam database and uses CLUSTAL W to calculate a tree for a given Pfam domain. The tree can be displayed with symbolic bootstrap values, and to allow the user to focus on a part of the tree, the layout can be altered by swapping nodes, changing the outgroup, and showing/collapsing subtrees. NIFAS is integrated with the Pfam database and is accessible over the internet (http://www.cgr.ki.se/Pfam). As an example, we use NIFAS to analyze the evolution of domains in Protein Kinases C.

5.
The vesicular glutamate transporters (VGLUTs) are responsible for packaging glutamate into synaptic vesicles, and are part of a family of structurally related proteins that mediate organic anion transport. Standard computer-based predictions of transmembrane domains have led to divergent topological models, indicating the need for experimentally derived predictions. Here we present data on the topology of the VGLUT ortholog from Drosophila melanogaster (DVGLUT). Using immunofluorescence assays of DVGLUT transiently localized to the plasma membrane of heterologously transfected cells, we have determined the accessibility of epitope tags inserted into the lumenal/extracellular face of the protein. Using immunoisolation, we have identified complementary tagged sites that face the cytoplasm. Our data show that DVGLUT contains 10 hydrophobic regions that completely span the membrane (TMs 1-10) and that the amino and carboxyl termini are cytosolic. Importantly, between TMs 4 and 5 is an unforeseen cytosolic loop of some 50 residues. Other domains exposed to the cytosol include loops between TMs 6-7 and 8-9, and regions C-terminal to TM2 and N-terminal to TM3. Between TM2 and 3 is a potentially hydrophobic, but topologically ambiguous region. Lumenal domains include sequences between TMs 1-2, 3-4, 5-6, 7-8 and 9-10. These data provide a basis for determining structure-function relationships for DVGLUT and other related proteins.

6.
Background

Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact, units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2).

Results

We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts.

Conclusions

Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein’s gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-015-0656-7) contains supplementary material, which is available to authorized users.
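The 50% threshold mentioned in this abstract (domain annotations shorter than half of the family model length) can be illustrated with a short sketch. The `DomainHit` record and its field names are hypothetical, and measuring coverage in model coordinates is an assumption; the abstract does not say exactly how length was measured, nor how split, bounded and unbounded partials were told apart, so that classification is not attempted here.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """Hypothetical record for one domain annotation on a protein."""
    family: str
    hmm_from: int      # first matched model position (1-based, inclusive)
    hmm_to: int        # last matched model position (inclusive)
    model_length: int  # total number of positions in the family model

    def coverage(self) -> float:
        """Fraction of the family model covered by this hit."""
        return (self.hmm_to - self.hmm_from + 1) / self.model_length


def is_partial(hit: DomainHit, max_fraction: float = 0.5) -> bool:
    """Flag hits covering less than max_fraction of the model."""
    return hit.coverage() < max_fraction


if __name__ == "__main__":
    hits = [
        DomainHit("example_family", 1, 40, 100),
        DomainHit("example_family", 1, 90, 100),
    ]
    for hit in hits:
        print(hit.family, round(hit.coverage(), 2), is_partial(hit))
```

Flagged hits would then be candidates for the kind of follow-up the abstract recommends, such as checking whether the underlying gene model is incomplete.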

7.
TIGRFAMs is a collection of manually curated protein families consisting of hidden Markov models (HMMs), multiple sequence alignments, commentary, Gene Ontology (GO) assignments, literature references and pointers to related TIGRFAMs, Pfam and InterPro models. These models are designed to support both automated and manually curated annotation of genomes. TIGRFAMs contains models of full-length proteins and shorter regions at the levels of superfamilies, subfamilies and equivalogs, where equivalogs are sets of homologous proteins conserved with respect to function since their last common ancestor. The scope of each model is set by raising or lowering cutoff scores and choosing members of the seed alignment to group proteins sharing specific function (equivalog) or more general properties. The overall goal is to provide information with maximum utility for the annotation process. TIGRFAMs is thus complementary to Pfam, whose models typically achieve broad coverage across distant homologs but end at the boundaries of conserved structural domains. The database currently contains over 1600 protein families. TIGRFAMs is available for searching or downloading at www.tigr.org/TIGRFAMs.

8.
9.
Pfam contains multiple alignments and hidden Markov model based profiles (HMM-profiles) of complete protein domains. The definition of domain boundaries, family members and alignment is done semi-automatically based on expert knowledge, sequence similarity, other protein family databases and the ability of HMM-profiles to correctly identify and align the members. Release 2.0 of Pfam contains 527 manually verified families which are available for browsing and on-line searching via the World Wide Web in the UK at http://www.sanger.ac.uk/Pfam/ and in the US at http://genome.wustl.edu/Pfam/. Pfam 2.0 matches one or more domains in 50% of Swissprot-34 sequences, and 25% of a large sample of predicted proteins from the Caenorhabditis elegans genome.

10.
11.
E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as alternative to the EVD.
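The "lower of the two" rule described in this abstract can be made concrete with a small sketch. The functional forms used here are assumptions: a logistic bound P = 1 / (1 + 2^score) on a bit score and a Gumbel-type EVD tail P = 1 - exp(-exp(-lambda (score - mu))). The abstract itself names only "a logistic function" and "an EVD", and the parameter values in the usage example are hypothetical.

```python
import math


def logistic_pvalue(score: float) -> float:
    """Assumed logistic bound on the P-value of a bit score."""
    if score > 1000:   # avoid float overflow in 2.0 ** score
        return 0.0
    if score < -1000:
        return 1.0
    return 1.0 / (1.0 + 2.0 ** score)


def evd_pvalue(score: float, mu: float, lam: float) -> float:
    """Assumed Gumbel-type tail probability with location mu and scale lam."""
    y = -lam * (score - mu)
    if y > 700:        # exp(y) would overflow; the tail probability is ~1
        return 1.0
    return -math.expm1(-math.exp(y))


def evalue(score: float, mu: float, lam: float, n_targets: int):
    """Return (E-value, name of the statistic that supplied it)."""
    p_logistic = logistic_pvalue(score)
    p_evd = evd_pvalue(score, mu, lam)
    if p_evd < p_logistic:
        return n_targets * p_evd, "evd"
    return n_targets * p_logistic, "logistic"


if __name__ == "__main__":
    # Hypothetical model parameters, chosen only to show the switch.
    for s in (10.0, 30.0):
        print(s, evalue(s, mu=-20.0, lam=0.25, n_targets=1000))
```

Under these assumed forms the logistic term decays like 2^(-score) while the EVD tail decays like exp(-lambda * score), so for lambda below ln 2 the logistic value eventually becomes the smaller one as scores grow, which is the kind of switch the abstract reports.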

12.
Identification of bacterial and archaeal counterparts to eukaryotic ion channels has greatly facilitated studies of structural biophysics of the channels. Often, searches based only on sequence alignment tools are inadequate for discovering such distant bacterial and archaeal counterparts. We address the discovery of bacterial and archaeal members of the Pentameric Ligand-Gated Ion Channel (pLGIC) family by a combination of four computational methods. One domain-based method involves retrieval of proteins with pLGIC-relevant domains by matching those domains to previously established domain templates in the InterPro family of databases. The second domain-based method involves searches using ungapped de-novo motifs discovered by MEME, which were trained with well characterized members of the pLGIC family. The third and fourth methods involve the use of two sequence alignment search algorithms, BLASTp and psiBLAST respectively. The sequences returned from all methods were screened by requiring the correct topology for pLGICs and the return of an annotated member of this family among the first ten hits of a BLASTp search against a comprehensive database of eukaryotic proteins. We found the domain-based searches to have high specificity but low sensitivity, while the sequence alignment methods have higher sensitivity but lower specificity. The four methods together discovered 69 putative bacterial and archaeal members of the pLGIC family. We ranked and divided the 69 proteins into groups according to the similarity of their domain compositions with known eukaryotic pLGICs. One especially notable group is more closely related to eukaryotic pLGICs than to any other known protein family, and has the overall topology of pLGICs, but the functional domains its members contain are sufficiently different from those found in known pLGICs that they do not score very well against the pLGIC domain templates.
We conclude that multiple methods used in a coordinated fashion outperform any single method for identifying likely distant bacterial and archaeal proteins that may provide useful models for important eukaryotic channel function. We note also that the methods used here are largely standard and readily accessible. The novelty is in the effectiveness of a strategy that combines these methods for identifying bacterial and archaeal relatives of this family. Therefore the paper may serve as a template for a broad group of workers to reliably identify bacterial and archaeal counterparts to eukaryotic proteins.
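The combination strategy in this abstract (pool the candidates from several search methods, then apply common screens) might be sketched as follows. Everything here is illustrative: the method names are labels only, the protein identifiers are made up, and the two screening predicates are hypothetical placeholders for the topology check and the "family member among the first ten BLASTp hits" check, whose actual implementations the abstract does not describe.

```python
from typing import Callable, Dict, Iterable, Set


def combine_and_screen(
    hits_by_method: Dict[str, Iterable[str]],
    has_expected_topology: Callable[[str], bool],
    top_hits_include_family: Callable[[str], bool],
) -> Dict[str, Set[str]]:
    """Pool candidate IDs from all methods, then keep those passing both screens.

    Returns a mapping from protein ID to the set of methods that found it.
    """
    found_by: Dict[str, Set[str]] = {}
    for method, protein_ids in hits_by_method.items():
        for protein_id in protein_ids:
            found_by.setdefault(protein_id, set()).add(method)
    return {
        protein_id: methods
        for protein_id, methods in found_by.items()
        if has_expected_topology(protein_id) and top_hits_include_family(protein_id)
    }


if __name__ == "__main__":
    # Toy inputs: made-up IDs, and screens backed by simple lookup sets.
    hits = {
        "interpro": ["P1"],
        "meme": ["P1", "P2"],
        "blastp": ["P2", "P3", "P4"],
        "psiblast": ["P3", "P4", "P5"],
    }
    topology_ok = {"P1", "P2", "P3", "P5"}
    family_in_top_ten = {"P1", "P2", "P3", "P4"}
    kept = combine_and_screen(
        hits,
        lambda pid: pid in topology_ok,
        lambda pid: pid in family_in_top_ten,
    )
    for pid in sorted(kept):
        print(pid, sorted(kept[pid]))
```

Keeping the set of methods per candidate makes it easy to see which proteins were found only by the more sensitive alignment searches and which were also supported by the more specific domain-based searches.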

13.
Although the intact chaperonin machinery is needed to rescue natural substrate proteins (SPs) under non-permissive conditions, the "minichaperone" alone, containing only the isolated apical domain of GroEL, can assist folding of a certain class of proteins. To understand the annealing function of the minichaperone, we have carried out molecular dynamics simulations in the NPT ensemble totaling 300 ns for four systems; namely, the isolated strongly binding peptide (SBP), the minichaperone, and the SBP and a weakly binding peptide (WBP) in complex with the minichaperone. The SBP, which is structureless in isolation, adopts a beta-hairpin conformation in complex with the minichaperone, suggesting that favorable non-specific interactions of the SPs confined to helices H and I of the apical domains can induce local secondary structures. Comparison of the dynamical fluctuations of the apo and the liganded forms of the minichaperone shows that the stability (needed for SP capture) involves favorable hydrophobic interactions and hydrogen bond network formation between the SBP and WBP, and helices H and I. The release of the SP, which is required for the annealing action, involves water-mediated interactions of the charged residues at the ends of H and I helices. The simulation results are consistent with a transient binding release (TBR) model for the annealing action of the minichaperone. According to the TBR model, SP annealing occurs in two stages. In the first stage the SP is captured by the apical domain. This is followed by SP release (by thermal fluctuations) that places it in a different region of the energy landscape from which it can partition rapidly to the native state with probability Phi or be trapped in another misfolded state. The process of binding and release can result in enhancement of the native state yield. The TBR model suggests "that any cofactor that can repeatedly bind and release SPs can be effective in assisting protein folding."
By comparing the structures of the non-chaperone alpha-casein (which has no sequence similarity with the apical domain) and the minichaperone and the hydrophobicity profiles, we show that alpha-casein has a pair of helices with sequence and structural profiles similar to those of H and I. Based on this comparison we identify residues that stabilize (destabilize) alpha-casein-protein complexes. This suggests that alpha-casein assists folding by the TBR mechanism.
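The yield enhancement attributed to repeated binding and release can be illustrated with a short calculation. This is a simplifying sketch, not a result from the abstract: it assumes that every round sends the same fraction Phi of the still-misfolded population to the native state, that rounds are independent, and that native molecules are not recaptured, which gives a cumulative yield of 1 - (1 - Phi)^n after n rounds.

```python
def native_yield(phi: float, rounds: int) -> float:
    """Cumulative native fraction after `rounds` bind-and-release cycles.

    Assumes each round independently folds a fraction `phi` of the
    molecules that are still misfolded (a simplification).
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie between 0 and 1")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return 1.0 - (1.0 - phi) ** rounds


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(native_yield(0.1, n), 4))
```

Even with a small per-round probability, the cumulative yield grows with the number of cycles under these assumptions, which is consistent with the abstract's statement that any cofactor able to bind and release SPs repeatedly can assist folding.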

14.
STI1‐domains are present in a variety of co‐chaperone proteins and are required for the transfer of hydrophobic clients in various cellular processes. The domains were first identified in the yeast Sti1 protein where they were referred to as DP1 and DP2. Based on hidden Markov model searches, this domain had previously been found in other proteins including the mammalian co‐chaperone SGTA, the DNA damage response protein Rad23, and the chloroplast import protein Tic40. Here, we refine the domain definition and carry out structure‐based sequence alignment of STI1‐domains showing conservation of five amphipathic helices. Upon examination of these identified domains, we identify a preceding helix 0 and unifying sequence properties, determine new molecular models, and recognize that STI1‐domains nearly always occur in pairs. The similarity at the sequence, structure, and molecular levels likely supports a unified functional role.

15.
MOTIVATION: Since protein domains are the units of evolution, databases of domain signatures such as ProDom or Pfam enable both a sensitive and selective sequence analysis. However, manually curated databases have a low coverage and automatically generated ones often miss relationships which have not yet been discovered between domains or cannot display similarities between domains which have drifted apart. METHODS: We present a tool which makes use of the fact that overall domain arrangements are often conserved. AIDAN (Automated Improvement of Domain ANnotations) identifies potential annotation artifacts and domains which have drifted apart. The underlying database supplements ProDom and is interfaced by a graphical tool allowing the localization of single domain deletions or annotations which have been falsely made by the automated procedure. AVAILABILITY: http://www.uni-muenster.de/Evolution/ebb/Services/AIDAN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

16.
We have previously shown that the Saccharomyces cerevisiae cell adhesion protein alpha-agglutinin has sequence characteristics of immunoglobulin-like proteins and have successfully modeled residues 200-325, based on the structure of immunoglobulin variable-type domains. Alignments matching residues 20-200 of alpha-agglutinin with domains I and II of members of the CD2/CD4 subfamily of the immunoglobulin superfamily showed >80% conservation of key residues despite low sequence similarity overall. Three-dimensional models of two alpha-agglutinin domains constructed on the basis of these alignments were shown to conform to peptide mapping data and biophysical properties of alpha-agglutinin. In addition, the residue volume and surface accessibility characteristics of these models resembled those of the well-packed structures of related proteins. Residue-by-residue analysis showed that packing and accessibility anomalies were largely confined to glycosylated and protease-susceptible loop regions of the domains. Surface accessibility of hydrophobic residues was typical of proteins with extensive domain interactions, a finding compatible with the hydrodynamic properties of alpha-agglutinin and the hydrophobic nature of binding to its peptide ligand a-agglutinin. The procedures used to align the alpha-agglutinin sequence and test the quality of the model may be applicable to other proteins, especially those that resist crystallization because of extensive glycosylation.

17.
Simulating the change of protein sequences over time in a biologically realistic way is fundamental for a broad range of studies with a focus on evolution. It is, thus, problematic that typically simulators evolve individual sites of a sequence identically and independently. More realistic simulations are possible; however, they are often prohibited by limited knowledge concerning site-specific evolutionary constraints or functional dependencies between amino acids. As a consequence, a protein's functional and structural characteristics are rapidly lost in the course of simulated evolution. Here, we present REvolver (www.cibiv.at/software/revolver), a program that simulates protein sequence alteration such that evolutionarily stable sequence characteristics, like functional domains, are maintained. For this purpose, REvolver recruits profile hidden Markov models (pHMMs) for parameterizing site-specific models of sequence evolution in an automated fashion. pHMMs derived from alignments of homologous proteins or protein domains capture information regarding which sequence sites remained conserved over time and where in a sequence insertions or deletions are more likely to occur. Thus, they describe constraints on the evolutionary process acting on these sequences. To demonstrate the performance of REvolver as well as its applicability in large-scale simulation studies, we evolved the entire human proteome up to 1.5 expected substitutions per site. Simultaneously, we analyzed the preservation of Pfam and SMART domains in the simulated sequences over time. REvolver preserved 92% of the Pfam domains originally present in the human sequences. This value drops to 15% when traditional models of amino acid sequence evolution are used. Thus, REvolver represents a significant advance toward a realistic simulation of protein sequence evolution on a proteome-wide scale. 
Further, REvolver facilitates the simulation of a protein family with a user-defined domain architecture at the root.

18.
Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also subserve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.
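The filtering step in this abstract (keeping motifs that appear on a single branch of the EC hierarchy) can be sketched as below. The motif extraction itself (MEX) is not reproduced; the input mapping from motif to the EC numbers of the enzymes containing it is toy data invented for illustration, and the function name and `level` parameter are hypothetical.

```python
from typing import Dict, List


def specific_peptides(motif_hits: Dict[str, List[str]], level: int) -> Dict[str, str]:
    """Keep motifs whose enzymes all share one EC branch at the given level.

    motif_hits maps a motif to the EC numbers of the enzymes that contain it.
    level is the depth of the EC hierarchy to compare (1 to 4).
    Returns a mapping from each retained motif to its single EC branch.
    """
    result: Dict[str, str] = {}
    for motif, ec_numbers in motif_hits.items():
        branches = {".".join(ec.split(".")[:level]) for ec in ec_numbers}
        if len(branches) == 1:
            result[motif] = branches.pop()
    return result


if __name__ == "__main__":
    # Toy data: motif strings and EC assignments are illustrative only.
    toy_hits = {
        "GDSGGP": ["3.4.21.4", "3.4.21.1"],
        "GAGLAG": ["1.1.1.1", "2.7.1.1"],
    }
    print(specific_peptides(toy_hits, level=3))
    print(specific_peptides(toy_hits, level=4))
```

A motif can be specific at a shallow level of the hierarchy without being specific at a deeper one, which is why the level is passed explicitly in this sketch.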

20.
Database searches can fail to detect all truly homologous sequences, particularly when dealing with short, highly sequence diverse protein families. Here, using microtubule interacting and transport (MIT) domains as an example, we have applied an approach of profile-profile matching followed by ab initio structure modelling to the detection of true homologues in the borderline significant zone of database searches. Novel MIT domains were confidently identified in USP54, containing an apparently inactive ubiquitin carboxyl-terminal hydrolase domain, a katanin-like ATPase KATNAL1, and an uncharacterized protein containing a VPS9 domain. As a proof of principle, we have confirmed the novel MIT annotation for USP54 by in vitro profiling of binding to CHMP proteins.

Structured summary

USP8 binds: CHMPs 1A, 1B, 2A, 2B, 4C
USP54 binds: CHMPs 1B, 2A, 2B, 4C, 6
