首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
MOTIVATION: Target selection strategies for structural genomic projects must be able to prioritize gene regions on the basis of significant sequence similarity with proteins that have already been structurally determined. With the rapid development of protein comparison software a robust prioritization scheme should be independent of the choice of algorithm and be able to incorporate different sequence similarity thresholds. RESULTS: A robust target selection strategy has been developed that can assign a priority level to all genes in any genome. Structural assignments to genome sequences are calculated at two thresholds and six levels (1-6) describe the prioritization of all whole genes and partial gene regions. This simple two-threshold approach can be implemented with any fold recognition or homology detection algorithms. The results for 10 genomes are presented using the SSEARCH and PSI-BLAST programs. AVAILABILITY: Programs are available on request from the authors.  相似文献   

2.
We have developed a new method and program, SARF2, for fast comparison of protein structures, which can detect topological as well as nontopological similarities. The method searches for large ensembles of secondary structure elements, which are mutually compatible in two proteins. These ensembles consist of small fragments of Cα-trace, similarly arranged in three-dimensional space in two proteins, but not necessarily equally-ordered along the polypeptide chains. The program SARF2 is available for everyone through the World-Wide Web (WWW). We have performed an exhaustive pairwise comparison of all the entries from a recent issue of the Protein Data Bank (PDB) and report here on the results of an automated hierarchical cluster analysis. In addition, we report on several new cases of significant structural resemblance between proteins. To this end, a new definition of the significance of structural similarity is introduced, which effectively distinguishes the biologically meaningful equivalences from those occurring by chance. Analyzing the distribution of sequence similarity in significant structural matches, we show that sequence similarity as low as 20% in structurally-prealigned proteins can be a strong indication for the biological relevance of structural similarity. © 1996 Wiley-Liss, Inc.  相似文献   

3.
Clustering protein sequences--structure prediction by transitive homology.   总被引:2,自引:0,他引:2  
MOTIVATION: It is widely believed that for two proteins Aand Ba sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary condition for structural similarity, the question remains what other criteria can be used to identify remote homologues. Transitivity refers to the concept of deducing a structural similarity between proteins A and C from the existence of a third protein B, such that A and B as well as B and C are homologues, as ascertained if the sequence identity between A and B as well as that between B and C is above the aforementioned threshold. It is not fully understood if transitivity always holds and whether transitivity can be extended ad infinitum. RESULTS: We developed a graph-based clustering approach, where transitivity plays a crucial role. We determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm. This data was transformed into a directed graph, where protein sequences constitute vertices. A directed edge was drawn from vertex A to vertex B if the sequences A and B showed similarity, scaled with respect to the self-similarity of A, above a fixed threshold. Transitivity was important in the clustering process, as intermediate sequences were used, limited though by the requirement of having directed paths in both directions between proteins linked over such sequences. The length dependency-implied by the self-similarity-of the scaling of the alignment scores appears to be an effective criterion to avoid clustering errors due to multi-domain proteins. To deal with the resulting large graphs we have developed an efficient library. Methods include the novel graph-based clustering algorithm capable of handling multi-domain proteins and cluster comparison algorithms. Structural Classification of Proteins (SCOP) was used as an evaluation data set for our method, yielding a 24% improvement over pair-wise comparisons in terms of detecting remote homologues. AVAILABILITY: The software is available to academic users on request from the authors. CONTACT: e.bolten@science-factory.com; schliep@zpr.uni-koeln.de; s.schneckener@science-factory.com; d.schomburg@uni-koeln.de; schrader@zpr.uni-koeln.de. SUPPLEMENTARY INFORMATION: http://www.zaik.uni-koeln.de/~schliep/ProtClust.html.  相似文献   

4.
MOTIVATION: The evolution of protein sequences can be described by a stepwise process, where each step involves changes of a few amino acids. In a similar manner, the evolution of protein folds can be at least partially described by an analogous process, where each step involves comparatively simple changes affecting few secondary structure elements. A number of such evolution steps, justified by biologically confirmed examples, have previously been proposed by other researchers. However, unlike the situation with sequences, as far as we know there have been no attempts to estimate the comparative probabilities for different kinds of such structural changes. RESULTS: We have tried to assess the comparative probabilities for a number of known structural changes, and to relate the probabilities of such changes with the distance between protein sequences. We have formalized these structural changes using a topological representation of structures (TOPS), and have developed an algorithm for measuring structural distances that involve few evolutionary steps. The probabilities of structural changes then were estimated on the basis of all-against-all comparisons of the sequence and structure of protein domains from the CATH-95 representative set. The results obtained are reasonably consistent for a number of different data subsets and permit the identification of several 'most popular' types of evolutionary changes in protein structure. The results also suggest that alterations in protein structure are more likely to occur when the sequence similarity is >10% (the average similarity being approximately 6% for the data sets employed in this study), and that the distribution of probabilities of structural changes is fairly uniform within the interval of 15-50% sequence similarity. AVAILABILITY: The algorithms have been implemented on the Windows operating system in C++ and using the Borland Visual Component Library. The source code is available on request from the first author. The data sets used for this study (representative sets of protein domains, matrices of sequence similarities and structural distances) are available on http://bioinf.mii.lu.lv/epsrc_project/struct_ev.html.  相似文献   

5.
6.
The FSSP database of structurally aligned protein fold families.   总被引:17,自引:0,他引:17       下载免费PDF全文
L Holm  C Sander 《Nucleic acids research》1994,22(17):3600-3609
FSSP (families of structurally similar proteins) is a database of structural alignments of proteins in the Protein Data Bank (PDB). The database currently contains an extended structural family for each of 330 representative protein chains. Each data set contains structural alignments of one search structure with all other structurally significantly similar proteins in the representative set (remote homologs, < 30% sequence identity), as well as all structures in the Protein Data Bank with 70-30% sequence identity relative to the search structure (medium homologs). Very close homologs (above 70% sequence identity) are excluded as they rarely have marked structural differences. The alignments of remote homologs are the result of pairwise all-against-all structural comparisons in the set of 330 representative protein chains. All such comparisons are based purely on the 3D co-ordinates of the proteins and are derived by automatic (objective) structure comparison programs. The significance of structural similarity is estimated based on statistical criteria. The FSSP database is available electronically from the EMBL file server and by anonymous ftp (file transfer protocol).  相似文献   

7.
MOTIVATION: Protein families can be defined based on structure or sequence similarity. We wanted to compare two protein family databases, one based on structural and one on sequence similarity, to investigate to what extent they overlap, the similarity in definition of corresponding families, and to create a list of large protein families with unknown structure as a resource for structural genomics. We also wanted to increase the sensitivity of fold assignment by exploiting protein family HMMs. RESULTS: We compared Pfam, a protein family database based on sequence similarity, to Scop, which is based on structural similarity. We found that 70% of the Scop families exist in Pfam while 57% of the Pfam families exist in Scop. Most families that occur in both databases correspond well to each other, but in some cases they are different. Such cases highlight situations in which structure and sequence approaches differ significantly. The comparison enabled us to compile a list of the largest families that do not occur in Scop; these are suitable targets for structure prediction and determination, and may be useful to guide projects in structural genomics. It can be noted that 13 out of the 20 largest protein families without a known structure are likely transmembrane proteins. We also exploited Pfam to increase the sensitivity of detecting homologs of proteins with known structure, by comparing query sequences to Pfam HMMs that correspond to Scop families. For SWISSPROT+TREMBL, this yielded an increase in fold assignment from 31% to 42% compared to using FASTA only. This method assigned a structure to 22% of the proteins in Saccharomyces cerevisiae, 24% in Escherichia coli, and 16% in Methanococcus jannaschii.  相似文献   

8.
The wealth of protein sequence and structure data is greater than ever, thanks to the ongoing Genomics and Structural Genomics projects. The information available through such efforts needs to be analysed by new methods that combine both databases. One important result of genomic sequence analysis is the inference of functional homology among proteins. Until recently sequence similarity comparison was the only method for homologue inference. The new fold recognition approach reviewed in this paper enhances sequence comparison methods by including structural information in the process of protein comparison. This additional information often allows for the detection of similarities that cannot be found by methods that only use sequence information.  相似文献   

9.
A probabilistic measure for alignment-free sequence comparison   总被引:3,自引:0,他引:3  
MOTIVATION: Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. RESULTS: The method was tested against six DNA sequences, which are the thrA, thrB and thrC genes of the threonine operons from Escherichia coli K-12 and from Shigella flexneri; and one random sequence having the same base composition as thrA from E.coli. These results were compared with those obtained from CLUSTAL W algorithm (alignment-based) and the chaos game representation (alignment-free). The method was further tested against a more complex set of 40 DNA sequences and compared with other existing sequence similarity measures (alignment-free). AVAILABILITY: All datasets and computer codes written in MATLAB are available upon request from the first author.  相似文献   

10.
MOTIVATION: The number of protein families has been estimated to be as small as 1000. Recent study shows that the growth in discovery of novel structures that are deposited into PDB and the related rate of increase of SCOP categories are slowing down. This indicates that the protein structure space will be soon covered and thus we may be able to derive most of remaining structures by using the known folding patterns. Present tertiary structure prediction methods behave well when a homologous structure is predicted, but give poorer results when no homologous templates are available. At the same time, some proteins that share twilight-zone sequence identity can form similar folds. Therefore, determination of structural similarity without sequence similarity would be beneficial for prediction of tertiary structures. RESULTS: The proposed PFRES method for automated protein fold classification from low identity (<35%) sequences obtains 66.4% and 68.4% accuracy for two test sets, respectively. PFRES obtains 6.3-12.4% higher accuracy than the existing methods. The prediction accuracy of PFRES is shown to be statistically significantly better than the accuracy of competing methods. Our method adopts a carefully designed, ensemble-based classifier, and a novel, compact and custom-designed feature representation that includes nearly 90% less features than the representation of the most accurate competing method (36 versus 283). The proposed representation combines evolutionary information by using the PSI-BLAST profile-based composition vector and information extracted from the secondary structure predicted with PSI-PRED. AVAILABILITY: The method is freely available from the authors upon request.  相似文献   

11.
12.
Lin HN  Notredame C  Chang JM  Sung TY  Hsu WL 《PloS one》2011,6(12):e27872
Most sequence alignment tools can successfully align protein sequences with higher levels of sequence identity. The accuracy of corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in the similarity measurement between aligned residues using substitution matrices, Fourier transform, sophisticated profile-profile functions, or consistency-based approaches, more recently.In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words found across a sizeable fraction of the considered dataset, and supported by evolutionary analysis. These words are then used to define a position specific substitution matrix that better reflects the biological significance of local similarity. The experiment results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non- similar proteins. We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity. SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.  相似文献   

13.
GeneRAGE: a robust algorithm for sequence clustering and domain detection   总被引:9,自引:0,他引:9  
MOTIVATION: Efficient, accurate and automatic clustering of large protein sequence datasets, such as complete proteomes, into families, according to sequence similarity. Detection and correction of false positive and negative relationships with subsequent detection and resolution of multi-domain proteins. RESULTS: A new algorithm for the automatic clustering of protein sequence datasets has been developed. This algorithm represents all similarity relationships within the dataset in a binary matrix. Removal of false positives is achieved through subsequent symmetrification of the matrix using a Smith-Waterman dynamic programming alignment algorithm. Detection of multi-domain protein families and further false positive relationships within the symmetrical matrix is achieved through iterative processing of matrix elements with successive rounds of Smith-Waterman dynamic programming alignments. Recursive single-linkage clustering of the corrected matrix allows efficient and accurate family representation for each protein in the dataset. Initial clusters containing multi-domain families, are split into their constituent clusters using the information obtained by the multi-domain detection step. This algorithm can hence quickly and accurately cluster large protein datasets into families. Problems due to the presence of multi-domain proteins are minimized, allowing more precise clustering information to be obtained automatically. AVAILABILITY: GeneRAGE (version 1.0) executable binaries for most platforms may be obtained from the authors on request. The system is available to academic users free of charge under license.  相似文献   

14.
MOTIVATION: Protein structure classification has been recognized as one of the most important research issues in protein structure analysis. A substantial number of methods for the classification have been proposed, and several databases have been constructed using these methods. Since some proteins with very similar sequences may exhibit structural diversities, we have proposed PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB), which strategy of selection is based not only on sequence similarity but also on structural similarity. Forty-eight representative sets whose similarity criteria were predetermined were made available over the World Wide Web (WWW). However, the sets were insufficient in number to satisfy users researching protein structures by various methods. RESULT: We have improved the system for PDB-REPRDB so that the user may obtain a quick selection of representative chains from PDB. The selection of representative chains can be dynamically configured according to the user's requirement. The WWW interface provides a large degree of freedom in setting parameters, such as cut-off scores of sequence and structural similarity. This paper describes the method we use to classify chains and select the representatives in the system. We also describe the interface used to set the parameters.  相似文献   

15.
Proteins that contain similar structural elements often have analogous functions regardless of the degree of sequence similarity or structure connectivity in space. In general, protein structure comparison (PSC) provides a straightforward methodology for biologists to determine critical aspects of structure and function. Here, we developed a novel PSC technique based on angle-distance image (A-D image) transformation and matching, which is independent of sequence similarity and connectivity of secondary structure elements (SSEs). An A-D image is constructed by utilizing protein secondary structure information. According to various types of SSEs, the mutual SSE pairs of the query protein are classified into three different types of sub-images. Subsequently, corresponding sub-images between query and target protein structures are compared using modified cross-correlation approaches to identify the similarity of various patterns. Structural relationships among proteins are displayed by hierarchical clustering trees, which facilitate the establishment of the evolutionary relationships between structure and function of various proteins.Four standard testing datasets and one newly created dataset were used to evaluate the proposed method. The results demonstrate that proteins from these five datasets can be categorized in conformity with their spatial distribution of SSEs. Moreover, for proteins with low sequence identity that share high structure similarity, the proposed algorithms are an efficient and effective method for structural comparison.  相似文献   

16.
MOTIVATION: Local structure segments (LSSs) are small structural units shared by unrelated proteins. They are extensively used in protein structure comparison, and predicted LSSs (PLSSs) are used very successfully in ab initio folding simulations. However, predicted or real LSSs are rarely exploited by protein sequence comparison programs that are based on position-by-position alignments. RESULTS: We developed a SEgment Alignment algorithm (SEA) to compare proteins described as a collection of predicted local structure segments (PLSSs), which is equivalent to an unweighted graph (network). Any specific structure, real or predicted corresponds to a specific path in this network. SEA then uses a network matching approach to find two most similar paths in networks representing two proteins. SEA explores the uncertainty and diversity of predicted local structure information to search for a globally optimal solution. It simultaneously solves two related problems: the alignment of two proteins and the local structure prediction for each of them. On a benchmark of protein pairs with low sequence similarity, we show that application of the SEA algorithm improves alignment quality as compared to FFAS profile-profile alignment, and in some cases SEA alignments can match the structural alignments, a feat previously impossible for any sequence based alignment methods.  相似文献   

17.
Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence-structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers both sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ~1300 structures would need to be determined to provide a sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/.  相似文献   

18.
Zhou H  Zhou Y 《Proteins》2005,58(2):321-328
Recognizing structural similarity without significant sequence identity has proved to be a challenging task. Sequence-based and structure-based methods as well as their combinations have been developed. Here, we propose a fold-recognition method that incorporates structural information without the need of sequence-to-structure threading. This is accomplished by generating sequence profiles from protein structural fragments. The structure-derived sequence profiles allow a simple integration with evolution-derived sequence profiles and secondary-structural information for an optimized alignment by efficient dynamic programming. The resulting method (called SP(3)) is found to make a statistically significant improvement in both sensitivity of fold recognition and accuracy of alignment over the method based on evolution-derived sequence profiles alone (SP) and the method based on evolution-derived sequence profile and secondary structure profile (SP(2)). SP(3) was tested in SALIGN benchmark for alignment accuracy and Lindahl, PROSPECTOR 3.0, and LiveBench 8.0 benchmarks for remote-homology detection and model accuracy. SP(3) is found to be the most sensitive and accurate single-method server in all benchmarks tested where other methods are available for comparison (although its results are statistically indistinguishable from the next best in some cases and the comparison is subjected to the limitation of time-dependent sequence and/or structural library used by different methods.). In LiveBench 8.0, its accuracy rivals some of the consensus methods such as ShotGun-INBGU, Pmodeller3, Pcons4, and ROBETTA. SP(3) fold-recognition server is available on http://theory.med.buffalo.edu.  相似文献   

19.
Studying similarities in protein molecules has become a fundamental activity in much of biology and biomedical research, for which methods such as multiple sequence alignments are widely used. Most methods available for such comparisons cater to studying proteins which have clearly recognizable evolutionary relationships but not to proteins that recognize the same or similar ligands but do not share similarities in their sequence or structural folds. In many cases, proteins in the latter class share structural similarities only in their binding sites. While several algorithms are available for comparing binding sites, there are none for deriving structural motifs of the binding sites, independent of the whole proteins. We report the development of SiteMotif, a new algorithm that compares binding sites from multiple proteins and derives sequence-order independent structural site motifs. We have tested the algorithm at multiple levels of complexity and demonstrate its performance in different scenarios. We have benchmarked against 3 current methods available for binding site comparison and demonstrate superior performance of our algorithm. We show that SiteMotif identifies new structural motifs of spatially conserved residues in proteins, even when there is no sequence or fold-level similarity. We expect SiteMotif to be useful for deriving key mechanistic insights into the mode of ligand interaction, predict the ligand type that a protein can bind and improve the sensitivity of functional annotation.  相似文献   

20.
Current analyses of protein sequence/structure relationships have focused on expected similarity relationships for structurally similar proteins. To survey and explore the basis of these relationships, we present a general sequence/structure map that covers all combinations of similarity/dissimilarity relationships and provide novel energetic analyses of these relationships. To aid our analysis, we divide protein relationships into four categories: expected/unexpected similarity (S and S(?)) and expected/unexpected dissimilarity (D and D(?)) relationships. In the expected similarity region S, we show that trends in the sequence/structure relation can be derived based on the requirement of protein stability and the energetics of sequence and structural changes. Specifically, we derive a formula relating sequence and structural deviations to a parameter characterizing protein stiffness; the formula fits the data reasonably well. We suggest that the absence of data in region S(?) (high structural but low sequence similarity) is due to unfavorable energetics. In contrast to region S, region D(?) (high sequence but low structural similarity) is well-represented by proteins that can accommodate large structural changes. Our analyses indicate that there are several categories of similarity relationships and that protein energetics provide a basis for understanding these relationships.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号