Similar literature
1.
Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We provide an assessment of a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences in detecting remote relationships between proteins. In this approach, a 'first generation' search queries a database with PSI-BLAST, which itself involves multiple rounds of iteration, to detect an initial set of homologues for a protein. We then propagate a 'second generation' search in the database, involving multiple runs of PSI-BLAST that use each of the homologues identified in the previous generation as queries, to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations that is continued until no new hits are detectable. We present an assessment of the coverage of this 'cascaded' intermediate sequence search on diverse folds and find that searches for up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST, detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and that non-trivial relationships between proteins can be detected effectively. Such a propagation of searches maximizes the chances of detecting distant homologies by effectively scanning protein "fold space".
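The generation-by-generation propagation described in this abstract can be sketched as a breadth-first expansion. This is only an illustrative sketch, not the authors' implementation: `cascade_search`, the `search` callback, and the toy hit table are hypothetical names, and the callback stands in for one complete PSI-BLAST run that returns the identifiers of its significant hits.

```python
from typing import Callable, Dict, Iterable


def cascade_search(query: str,
                   search: Callable[[str], Iterable[str]],
                   max_generations: int = 3) -> Dict[str, int]:
    """Return {sequence_id: generation in which it was first detected}.

    `search` is a hypothetical stand-in for one full PSI-BLAST run; it takes a
    sequence identifier and returns the identifiers of its significant hits.
    """
    found = {query: 0}          # the query itself is generation 0
    frontier = [query]          # sequences used as queries in this generation
    for generation in range(1, max_generations + 1):
        next_frontier = []
        for seq_id in frontier:
            for hit in search(seq_id):
                if hit not in found:        # only new homologues propagate
                    found[hit] = generation
                    next_frontier.append(hit)
        if not next_frontier:               # no new hits: the cascade has converged
            break
        frontier = next_frontier
    return found


# Toy hit table (invented for illustration, not real search results).
toy_hits = {"Q": ["A", "B"], "A": ["Q", "C"], "B": ["A"], "C": ["D"], "D": []}
result = cascade_search("Q", lambda seq_id: toy_hits.get(seq_id, []))
```

In the toy table, `C` is reachable only through the intermediate `A`, which is the situation the cascade is meant to exploit. The default cap of three generations mirrors the abstract's observation that up to three generations detect most known homologues.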

3.

Background

Development of sensitive sequence search procedures for the detection of distant relationships between proteins at the superfamily/fold level is still a big challenge. The intermediate sequence search approach is the most frequently employed manner of identifying remote homologues effectively. In this study, serine proteases of the prolyl oligopeptidase, rhomboid and subtilisin protein families were examined using plant serine proteases from two genomes, A. thaliana and O. sativa, as queries, together with 13 other families of unrelated folds, to identify the distant homologues that could not be obtained using PSI-BLAST.

Methodology/Principal Findings

We have proposed to start with multiple queries of classical serine protease members to identify remote homologues in families, using a rigorous approach like Cascade PSI-BLAST. We found that classical sequence-based approaches, like PSI-BLAST, showed very low sequence coverage in identifying plant serine proteases. The algorithm was applied to an enriched sequence database of homologous domains, and we obtained an overall average coverage of 88% at the family level and 77% at the superfamily or fold level, along with a specificity of ∼100% and a Matthews correlation coefficient of 0.91. A similar approach was also implemented on 13 other protein families representing every structural class in the SCOP database. Further investigation with statistical tests, like jackknifing, helped us to better understand the influence of neighbouring protein families.
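For readers unfamiliar with the reported measures, the sketch below shows how coverage (sensitivity), specificity, and the Matthews correlation coefficient are conventionally computed from a confusion matrix. The function names and the counts are invented for illustration and are not taken from the study.

```python
import math


def coverage(tp: int, fn: int) -> float:
    """Fraction of true homologues that were detected (sensitivity)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-homologues that were correctly rejected."""
    return tn / (tn + fp)


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented counts, purely to exercise the formulas.
tp, fn, tn, fp = 8, 2, 85, 5
cov = coverage(tp, fn)
spec = specificity(tn, fp)
mcc = matthews_cc(tp, tn, fp, fn)
```

Unlike coverage or specificity alone, the Matthews coefficient uses all four cells of the confusion matrix, so it stays informative when non-homologues vastly outnumber homologues.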

Conclusions/Significance

Our study suggests that employing multiple queries from a family in Cascade PSI-BLAST searches is useful for predicting distant relationships effectively, even at the superfamily level. We have proposed a generalized strategy to cover all the distant members of a particular family using multiple query sequences. Our findings reveal that prior selection of sequences as queries and the presence of neighbouring families can be important for covering the search space effectively in minimal computational time. This study also provides an understanding of the ‘bridging’ role of related families.

4.
An automatic sequence search and analysis protocol (DomainFinder) based on PSI-BLAST and IMPALA, and using conservative thresholds, has been developed for reliably integrating gene sequences from GenBank into their respective structural families within the CATH domain database (http://www.biochem.ucl.ac.uk/bsm/cath_new). DomainFinder assigns a new gene sequence to a CATH homologous superfamily provided that PSI-BLAST identifies a clear relationship to at least one other Protein Data Bank sequence within that superfamily. This has resulted in an expansion of the CATH protein family database (CATH-PFDB v1.6) from 19,563 domain structures to 176,597 domain sequences. A further 50,000 putative homologous relationships can be identified using less stringent cut-offs and these relationships are maintained within neighbour tables in the CATH Oracle database, pending further evidence of their suggested evolutionary relationship. Analysis of the CATH-PFDB has shown that only 15% of the sequence families are close enough to a known structure for reliable homology modeling. IMPALA/PSI-BLAST profiles have been generated for each of the sequence families in the expanded CATH-PFDB and a web server has been provided so that new sequences may be scanned against the profile library and be assigned to a structure and homologous superfamily.

5.
MOTIVATION: Many studies have shown that database searches using position-specific score matrices (PSSMs) or profiles as queries are more effective at identifying distant protein relationships than are searches that use simple sequences as queries. One popular program for constructing a PSSM and comparing it with a database of sequences is Position-Specific Iterated BLAST (PSI-BLAST). RESULTS: This paper describes a new software package, IMPALA, designed for the complementary procedure of comparing a single query sequence with a database of PSI-BLAST-generated PSSMs. We illustrate the use of IMPALA to search a database of PSSMs for protein folds, and one for protein domains involved in signal transduction. IMPALA's sensitivity to distant biological relationships is very similar to that of PSI-BLAST. However, IMPALA employs a more refined analysis of statistical significance and, unlike PSI-BLAST, guarantees the output of the optimal local alignment by using the rigorous Smith-Waterman algorithm. Also, it is considerably faster when run with a large database of PSSMs than is BLAST or PSI-BLAST when run against the complete non-redundant protein database.

6.
Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so align correctly a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/ SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. 
In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17% and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.
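The linking rule in this abstract (two SCOP sequences are linked when their nr searches share at least one hit) can be illustrated with an inverted index. This is a minimal sketch under stated assumptions: the function name and identifiers are hypothetical, the hit sets are invented, and the same-superfamily restriction used in the study is left out for brevity.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def link_by_shared_intermediates(
        hits_by_query: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each linked query pair to the intermediate sequences they share.

    `hits_by_query` maps a query identifier to the set of database sequences
    its profile search reported as hits.
    """
    # Invert the mapping: for every database hit, which queries found it?
    queries_by_hit = defaultdict(set)
    for query, hits in hits_by_query.items():
        for hit in hits:
            queries_by_hit[hit].add(query)

    # Any two queries that found the same hit are linked through it.
    links = defaultdict(set)
    for hit, queries in queries_by_hit.items():
        for a, b in combinations(sorted(queries), 2):
            links[(a, b)].add(hit)
    return dict(links)


# Invented hit sets for three hypothetical domain sequences.
toy = {"d1": {"nr1", "nr2"}, "d2": {"nr2", "nr3"}, "d3": {"nr4"}}
links = link_by_shared_intermediates(toy)
```

Here `d1` and `d2` are linked through `nr2`, while `d3` stays unlinked because none of its hits is shared.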

7.
Protein homology detection by HMM-HMM comparison
MOTIVATION: Protein homology detection and sequence alignment are at the basis of protein structure prediction, function prediction and evolution. RESULTS: We have generalized the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs. We present a method for detecting distant homologous relationships between proteins based on this approach. The method (HHsearch) is benchmarked together with BLAST, PSI-BLAST, HMMER and the profile-profile comparison tools PROF_SIM and COMPASS, in an all-against-all comparison of a database of 3691 protein domains from SCOP 1.63 with pairwise sequence identities below 20%. Sensitivity: When the predicted secondary structure is included in the HMMs, HHsearch is able to detect between 2.7 and 4.2 times more homologs than PSI-BLAST or HMMER and between 1.44 and 1.9 times more than COMPASS or PROF_SIM for a rate of false positives of 10%. Approximately half of the improvement over the profile-profile comparison methods is attributable to the use of profile HMMs in place of simple profiles. Alignment quality: Higher sensitivity is mirrored by an increased alignment quality. HHsearch produced 1.2, 1.7 and 3.3 times more good alignments ('balanced' score >0.3) than the next best method (COMPASS), and 1.6, 2.9 and 9.4 times more than PSI-BLAST, at the family, superfamily and fold level, respectively. Speed: HHsearch scans a query of 200 residues against 3691 domains in 33 s on an AMD64 2GHz PC. This is 10 times faster than PROF_SIM and 17 times faster than COMPASS.

8.
Iyer LM, Koonin EV, Aravind L. Proteins 2001;43(2):134-144
With a protein structure comparison, an iterative database search with sequence profiles, and a multiple-alignment analysis, we show that two domains with the helix-grip fold, the star-related lipid-transfer (START) domain of the MLN64 protein and the birch allergen, are homologous. They define a large, previously underappreciated superfamily that we call the START superfamily. In addition to the classical START domains that are primarily involved in eukaryotic signaling mediated by lipid binding and the birch antigen family that consists of plant proteins implicated in stress/pathogen response, the START superfamily includes bacterial polyketide cyclases/aromatases (e.g., TcmN and WhiE VI) and two families of previously uncharacterized proteins. The identification of this domain provides a structural prediction of an important class of enzymes involved in polyketide antibiotic synthesis and allows the prediction of their active site. It is predicted that all START domains contain a similar ligand-binding pocket. Modifications of this pocket determine the ligand-binding specificity and may also be the basis for at least two distinct enzymatic activities, those of a cyclase/aromatase and an RNase. Thus, the START domain superfamily is a rare case of the adaptation of a protein fold with a conserved ligand-binding mode for both a broad variety of catalytic activities and noncatalytic regulatory functions. Proteins 2001;43:134-144.

9.
George RA, Heringa J. Proteins 2002;48(4):672-681
Protein sequences containing more than one structural domain are problematic when used in homology searches where they can either stop an iterative database search prematurely or cause an explosion of a search to common domains. We describe a method, DOMAINATION, that infers domains and their boundaries in a query sequence from local gapped alignments generated using PSI-BLAST. Through a new technique to recognize domain insertions and permutations, DOMAINATION submits delineated domains as successive database queries in further iterative steps. Assessed over a set of 452 multidomain proteins, the method predicts structural domain boundaries with an overall accuracy of 50% and improves finding distant homologies by 14% compared with PSI-BLAST. DOMAINATION is available as a web based tool at http://mathbio.nimr.mrc.ac.uk, and the source code is available from the authors upon request.

10.
Horvath MM, Grishin NV. Proteins 2001;42(2):230-236
Discovering distant evolutionary relationships between proteins requires detecting subtle similarities. Here we use a combination of sequence and structure analysis to show that the C-terminal domain of Escherichia coli HPII catalase with available spatial structure is a divergent member of the type I glutamine amidotransferase (GAT) superfamily. GAT-containing proteins include many biosynthetic enzymes such as E. coli carbamoyl phosphate synthetase and anthranilate synthase. Typical GAT domains have Rossmann fold-like topology and possess a catalytic triad similar to that of proteases. The C-terminal domain of HPII catalase has the GAT Rossmann fold but lacks the triad and therefore loses enzymatic activity. In addition, we detect significant sequence similarity between thiJ domains, some of which are known to have protease activity, and typical GAT proteins. Evolutionary tree analysis of the entire GAT superfamily indicates that the HPII catalase is more closely related to thiJ domains than to classical GAT domains and is likely to have evolved from a thiJ-like protein. This work illustrates the strength of sequence-based profile analysis techniques coupled with structural superpositions in developing an evolutionarily relevant classification of protein structures. Proteins 2001;42:230-236.

11.
During the course of our large-scale genome analysis, a conserved domain, currently detectable only in the genomes of Drosophila melanogaster, Caenorhabditis elegans and Anopheles gambiae, has been identified. The function of this domain is currently unknown, and no functional annotation is provided for it in the publicly available genomic, protein family and sequence databases. A search for homologues of this domain in the non-redundant sequence database using PSI-BLAST identified a distant relationship between this family and the alkaline phosphatase-like superfamily, which includes the families of aryl sulfatase, N-acetylgalactosamine-4-sulfatase, alkaline phosphatase and 2,3-bisphosphoglycerate-independent phosphoglycerate mutase (iPGM). Fold recognition procedures showed that this new domain could adopt a 3-D fold similar to that of this superfamily. Most of the phosphatases and sulfatases of this superfamily are characterized by the functional residues Ser and Cys, respectively, in topologically equivalent positions. This functionally important site aligns with Ser/Thr in the members of the new family. Additionally, a set of residues responsible for a metal-binding site in phosphatases and sulfatases is conserved in the new family. The in-depth analysis suggests that the new family could possess phosphatase activity.

12.
Seven protein structure comparison methods and two sequence comparison programs were evaluated on their ability to detect either protein homologs or domains with the same topology (fold) as defined by the CATH structure database. The structure alignment programs Dali, Structal, Combinatorial Extension (CE), VAST, and Matras were tested along with SGM and PRIDE, which calculate a structural distance between two domains without aligning them. We also tested two sequence alignment programs, SSEARCH and PSI-BLAST. Depending upon the level of selectivity and error model, structure alignment programs can detect roughly twice as many homologous domains in CATH as sequence alignment programs. Dali finds the most homologs, 321-533 of 1120 possible true positives (28.7%-45.7%), at an error rate of 0.1 errors per query (EPQ), whereas PSI-BLAST finds 365 true positives (32.6%), regardless of the error model. At an EPQ of 1.0, Dali finds 42%-70% of possible homologs, whereas Matras finds 49%-57%; PSI-BLAST finds 36.9%. However, Dali achieves >84% coverage before the first error for half of the families tested. Dali and PSI-BLAST find 9.2% and 5.2%, respectively, of the 7056 possible topology pairs at an EPQ of 0.1, and 19.5% and 5.9% at an EPQ of 1.0. Most statistical significance estimates reported by the structural alignment programs overestimate the significance of an alignment by orders of magnitude when compared with the actual distribution of errors. These results help quantify the statistical distinction between analogous and homologous structures, and provide a benchmark for structure comparison statistics.
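The "errors per query" (EPQ) criterion used in this benchmark can be illustrated as follows. The sketch assumes one common convention, which the abstract does not spell out: all hits from all queries are pooled, ranked by score, and true positives are counted until the number of false positives exceeds EPQ times the number of queries. The function name and the toy data are hypothetical.

```python
from typing import Iterable, Tuple


def coverage_at_epq(scored_pairs: Iterable[Tuple[float, bool]],
                    n_queries: int,
                    n_possible_true: int,
                    epq: float) -> float:
    """Fraction of possible true relationships found at a given EPQ.

    `scored_pairs` holds (score, is_true_relationship) for every reported hit,
    where a higher score means a more significant hit (an assumption here).
    """
    allowed_errors = epq * n_queries
    true_found = 0
    errors = 0
    for _score, is_true in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if is_true:
            true_found += 1
        else:
            errors += 1
            if errors > allowed_errors:   # error budget exhausted: stop counting
                break
    return true_found / n_possible_true


# Invented pooled hit list for two hypothetical queries.
pairs = [(9.0, True), (8.0, True), (7.0, False),
         (6.0, True), (5.0, False), (4.0, True)]
cov_strict = coverage_at_epq(pairs, n_queries=2, n_possible_true=5, epq=0.0)
cov_half = coverage_at_epq(pairs, n_queries=2, n_possible_true=5, epq=0.5)
cov_one = coverage_at_epq(pairs, n_queries=2, n_possible_true=5, epq=1.0)
```

Setting `epq=0.0` corresponds to the "coverage before the first error" figure quoted in the abstract. Larger EPQ values trade more false positives for higher coverage.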

13.
Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.

14.
The recently developed PSI-BLAST method for sequence database search and methods for motif analysis were used to define and expand a superfamily of enzymes with an unusual nucleotide-binding fold, referred to as palmate, or ATP-grasp fold. In addition to D-alanine-D-alanine ligase, glutathione synthetase, biotin carboxylase, and carbamoyl phosphate synthetase, enzymes with known three-dimensional structures, the ATP-grasp domain is predicted in the ribosomal protein S6 modification enzyme (RimK), urea amidolyase, tubulin-tyrosine ligase, and three enzymes of purine biosynthesis. All these enzymes possess ATP-dependent carboxylate-amine ligase activity, and their catalytic mechanisms are likely to include acylphosphate intermediates. The ATP-grasp superfamily also includes succinate-CoA ligase (both ADP-forming and GDP-forming variants), malate-CoA ligase, and ATP-citrate lyase, enzymes with a carboxylate-thiol ligase activity, and several uncharacterized proteins. These findings significantly extend the variety of the substrates of ATP-grasp enzymes and the range of biochemical pathways in which they are involved, and demonstrate the complementarity between structural comparison and powerful methods for sequence analysis.

15.
MOTIVATION: Currently, the most accurate fold-recognition method is to perform profile-profile alignments and estimate the statistical significance of those alignments by calculating a Z-score or E-value. Although this scheme is reliable in recognizing relatively close homologs related at the family level, it has difficulty in finding the remote homologs that are related at the superfamily or fold level. RESULTS: In this paper, we present an alternative method to estimate the significance of the alignments. The alignment between a query protein and a template of length n in the fold library is transformed into a feature vector of length n + 1, which is then evaluated by a support vector machine (SVM). The output from the SVM is converted to a posterior probability that a query sequence is related to a template, given the SVM output. Results show that the new method performs significantly better than PSI-BLAST and profile-profile alignment with the Z-score scheme. While PSI-BLAST and the Z-score scheme detect 16 and 20% of superfamily-related proteins, respectively, at 90% specificity, the new method detects 46% of these proteins, a more than 2-fold increase in sensitivity. More significantly, at the fold level, the new method can detect 14% of remotely related proteins at 90% specificity, a remarkable result considering that the other methods can detect almost none at the same level of specificity.

16.
The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously and the probable effects of substitutions in key functional residues carefully assessed.

17.
18.
The O-linked GlcNAc transferases (OGTs) are a recently characterized group of largely eukaryotic enzymes that add a single beta-N-acetylglucosamine moiety to specific serine or threonine hydroxyls. In humans, this process may be part of a sugar regulation mechanism or cellular signaling pathway that is involved in many important diseases, such as diabetes, cancer, and neurodegeneration. However, no structural information about the human OGT exists, except for the identification of tetratricopeptide repeats (TPR) at the N terminus. The locations of substrate binding sites are unknown and the structural basis for this enzyme's function is not clear. Here, remote homology is reported between the OGTs and a large group of diverse sugar processing enzymes, including proteins with known structure such as glycogen phosphorylase, UDP-GlcNAc 2-epimerase, and the glycosyl transferase MurG. This relationship, in conjunction with amino acid similarity spanning the entire length of the sequence, implies that the fold of the human OGT consists of two Rossmann-like domains C-terminal to the TPR region. A conserved motif in the second Rossmann domain points to the UDP-GlcNAc donor binding site. This conclusion is supported by a combination of statistically significant PSI-BLAST hits, consensus secondary structure predictions, and a fold recognition hit to MurG. Additionally, iterative PSI-BLAST database searches reveal that proteins homologous to the OGTs form a large and diverse superfamily that is termed GPGTF (glycogen phosphorylase/glycosyl transferase). Up to one-third of the 51 functional families in the CAZY database, a glycosyl transferase classification scheme based on catalytic residue and sequence homology considerations, can be unified through this common predicted fold. 
GPGTF homologs constitute a substantial fraction of known proteins: 0.4% of all non-redundant sequences and about 1% of proteins in the Escherichia coli genome are found to belong to the GPGTF superfamily.

19.
Searches using position-specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is typically generated using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches, the reference sequence is the same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and the divergence of the sequences used in the construction of a PSSM have a major influence on the performance of the multiple-profile-based search approach. We also find that a simple parameter, defined as the number of PSSMs of a family that are hit by a query divided by the total number of PSSMs in the family, can effectively distinguish true positives from false positives in the multiple-profile search approach.
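The discriminating parameter described at the end of this abstract is a simple ratio, sketched below. The function name, the PSSM and family identifiers, and the family sizes are hypothetical, and the abstract gives no threshold, so none is applied here.

```python
from collections import Counter
from typing import Dict, Iterable


def family_hit_fraction(hit_pssms: Iterable[str],
                        family_of: Dict[str, str],
                        family_sizes: Dict[str, int]) -> Dict[str, float]:
    """For each family with at least one hit, return hits / total PSSMs.

    `hit_pssms` lists the PSSM identifiers matched by a single query,
    `family_of` maps a PSSM identifier to its family, and `family_sizes`
    gives the total number of PSSMs built for each family.
    """
    # Use a set so a PSSM reported twice is counted only once.
    counts = Counter(family_of[pssm] for pssm in set(hit_pssms))
    return {family: counts[family] / family_sizes[family] for family in counts}


# Invented example: famA has 4 PSSMs, famB has 5.
family_of = {"p1": "famA", "p2": "famA", "p3": "famA", "p4": "famA",
             "q1": "famB", "q2": "famB", "q3": "famB", "q4": "famB", "q5": "famB"}
family_sizes = {"famA": 4, "famB": 5}
fractions = family_hit_fraction(["p1", "p2", "p3", "q1", "p1"],
                                family_of, family_sizes)
```

In this toy case the query hits three of the four `famA` profiles but only one of the five `famB` profiles. On the abstract's reasoning, the `famB` match would be the more likely false positive.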

20.
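The use of computers for sequence analysis of nucleic acids and proteins is a relatively recent development in molecular biology research, and the technique is increasingly applied to the large body of accumulated sequence data. A protein functional domain is a structural domain within a protein molecule that folds independently into a defined structure and performs a specific function; all molecules that share the same class of functional domain are collectively called a protein superfamily. Based on an analysis of the immunoglobulin (Ig) superfamily and its domain sequences, this paper establishes a method for detecting protein functional domains by pattern-matching analysis of the residue composition of conserved segments of the domain. The method first identifies the conserved segments of a class of functional domain by multiple sequence alignment, then statistically analyses the amino acid residue composition at each position of the known conserved segments, and finally evaluates the statistical significance of the residue composition of a candidate sequence by matching it against these statistics, thereby determining whether the domain is present. The advantage of the method is that it can not only detect molecules already known to contain a given class of functional domain but may also discover new molecules containing that domain, allowing their function to be inferred.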

