Similar articles
Found 20 similar articles (search time: 31 ms)
1.
2.
MOTIVATION: Sequence alignment techniques have been developed into extremely powerful tools for identifying the folding families and function of proteins in newly sequenced genomes. For a sufficiently low sequence identity it is necessary to incorporate additional structural information to positively detect homologous proteins. We have carried out an extensive analysis of the effectiveness of incorporating secondary structure information directly into the alignments for fold recognition and identification of distant protein homologs. A secondary structure similarity matrix based on a database of three-dimensionally aligned proteins was first constructed. An iterative application of dynamic programming was used which incorporates linear combinations of amino acid and secondary structure sequence similarity scores. Initially, only primary sequence information is used. Subsequently, contributions from secondary structure are phased in and new homologous proteins are positively identified if their scores are consistent with the predetermined error rate. RESULTS: We used the SCOP40 database, where only PDB sequences that have 40% homology or less are included, to calibrate homology detection by the combined amino acid and secondary structure sequence alignments. Combining predicted secondary structure with sequence information results in an 8-15% increase in homology detection within SCOP40 relative to the pairwise alignments using only amino acid sequence data at an error rate of 0.01 errors per query; a 35% increase is observed when the actual secondary structure sequences are used. Incorporating predicted secondary structure information in the analysis of six small genomes yields an improvement in the homology detection of approximately 20% over SSEARCH pairwise alignments, but no improvement in the total number of homologs detected over PSI-BLAST, at an error rate of 0.01 errors per query.
However, because the pairwise alignments based on combinations of amino acid and secondary structure similarity are different from those produced by PSI-BLAST and the error rates can be calibrated, it is possible to combine the results of both searches. An additional 25% relative improvement in the number of genes identified at an error rate of 0.01 is observed when the data is pooled in this way. Similarly for the SCOP40 dataset, PSI-BLAST detected 15% of all possible homologs, whereas the pooled results increased the total number of homologs detected to 19%. These results are compared with recent reports of homology detection using sequence profiling methods. AVAILABILITY: Secondary structure alignment homepage at http://lutece.rutgers.edu/ssas CONTACT: anders@rutchem.rutgers.edu; ronlevy@lutece.rutgers.edu Supplementary Information: Genome sequence/structure alignment results at http://lutece.rutgers.edu/ss_fold_predictions.
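
The abstract above describes dynamic programming over a linear combination of amino acid and secondary structure similarity scores, with the secondary structure contribution phased in over iterations. A minimal sketch of that idea follows. It is not the authors' implementation: the match/mismatch scores are toy stand-ins for real similarity matrices, the global (Needleman-Wunsch) formulation and linear gap penalty are assumptions, and the names `combined_score`, `align_score` and `weight` are hypothetical.

```python
def combined_score(aa_a, aa_b, ss_a, ss_b, weight):
    """Linear combination of amino acid and secondary structure similarity.

    The +1/-1 values are toy stand-ins (an assumption) for a real
    substitution matrix and a real secondary structure similarity matrix.
    """
    aa = 1.0 if aa_a == aa_b else -1.0
    ss = 1.0 if ss_a == ss_b else -1.0
    return (1.0 - weight) * aa + weight * ss


def align_score(seq_a, ss_a, seq_b, ss_b, weight, gap=-2.0):
    """Global alignment score by dynamic programming with a linear gap penalty.

    seq_* are amino acid strings, ss_* are secondary structure strings of
    the same length (for example H, E, C). weight=0 uses sequence only.
    """
    n, m = len(seq_a), len(seq_b)
    prev = [j * gap for j in range(m + 1)]  # row 0: all gaps
    for i in range(1, n + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            match = prev[j - 1] + combined_score(
                seq_a[i - 1], seq_b[j - 1], ss_a[i - 1], ss_b[j - 1], weight
            )
            curr[j] = max(match, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[m]


if __name__ == "__main__":
    # Phase in the secondary structure contribution step by step.
    for w in (0.0, 0.25, 0.5):
        print(w, align_score("AAAA", "HHHH", "CCCC", "HHHH", w))
```

With `weight=0` the score depends on the amino acid sequence alone; raising the weight lets two sequences with dissimilar residues but matching secondary structure score higher, which is the effect the abstract relies on for distant homologs.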

3.
Newly determined protein structures are classified as belonging to a new fold if the structures are sufficiently dissimilar from all other so far known protein structures. To analyze structural similarities of proteins, structure alignment tools are used. We demonstrate that the usage of nonsequential structure alignment tools, which neglect the polypeptide chain connectivity, can yield structure alignments with significant similarities between proteins of known three-dimensional structure and newly determined protein structures that possess a new fold. The recently introduced protein structure alignment tool, GANGSTA, is specialized to perform nonsequential alignments with proper assignment of the secondary structure types by focusing on helices and strands only. In the new version, GANGSTA+, the underlying algorithms were completely redesigned, yielding enhanced quality of structure alignments, offering alignment against a larger database of protein structures, and being more efficient. We applied DaliLite, TM-align, and GANGSTA+ to three protein crystal structures considered to be novel folds. Applying GANGSTA+ to these novel folds, we find proteins in the ASTRAL40 database that possess significant structural similarities, although the alignments are nonsequential and in some cases involve secondary structure elements aligned in reverse orientation. A web server is available at http://agknapp.chemie.fu-berlin.de/gplus for pairwise alignment, visualization, and database comparison.

4.
The current pace of structural biology now means that protein three-dimensional structure can be known before protein function, making methods for assigning homology via structure comparison of growing importance. Previous research has suggested that sequence similarity after structure-based alignment is one of the best discriminators of homology and often functional similarity. Here, we exploit this observation, together with a merger of protein structure and sequence databases, to predict distant homologous relationships. We use the Structural Classification of Proteins (SCOP) database to link sequence alignments from the SMART and Pfam databases. We thus provide new alignments that could not be constructed easily in the absence of known three-dimensional structures. We then extend the method of Murzin (1993b) to assign statistical significance to sequence identities found after structural alignment and thus suggest the best link between diverse sequence families. We find that several distantly related protein sequence families can be linked with confidence, showing the approach to be a means for inferring homologous relationships and thus possible functions when proteins are of known structure but of unknown function. The analysis also finds several new potential superfamilies, where inspection of the associated alignments and superimpositions reveals conservation of unusual structural features or co-location of conserved amino acids and bound substrates. We discuss implications for Structural Genomics initiatives and for improvements to sequence comparison methods.

5.
MOTIVATION: Propagating functional annotations to sequence-similar, presumably homologous proteins lies at the heart of the bioinformatics industry. Correct propagation is crucially dependent on the accurate identification of subtle sequence motifs that are conserved in evolution. The evolutionary signal can be difficult to detect because functional sites may consist of non-contiguous residues while segments in-between may be mutated without affecting fold or function. RESULTS: Here, we report a novel graph clustering algorithm in which all known protein sequences simultaneously self-organize into hypothetical multiple sequence alignments. This eliminates noise so that non-contiguous sequence motifs can be tracked down between extremely distant homologues. The novel data structure enables fast sequence database searching methods which are superior to profile-profile comparison at recognizing distant homologues. This study will boost the leverage of structural and functional genomics and opens up new avenues for data mining a complete set of functional signature motifs. AVAILABILITY: http://www.bioinfo.biocenter.helsinki.fi/gtg. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

6.

Background  

The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated structural alignments for evolutionarily related, sequentially distant proteins.

7.
The epoxide hydrolases and haloalkane dehalogenases database (EH/HD) integrates sequence and structure of a highly diverse protein family, including mainly the Asp-hydrolases of EHs and HDs but also proteins, such as Ser-hydrolases non-heme peroxidases, prolyl iminopeptidases and 2-hydroxymuconic semialdehyde hydrolases. These proteins have a highly conserved structure, but display a remarkable diversity in sequence and function. A total of 305 protein entries were assigned to 14 homologous families, forming two superfamilies. Annotated multisequence alignments and phylogenetic trees are provided for each homologous family and superfamily. Experimentally derived structures of 19 proteins are superposed and consistently annotated. Sequence and structure of all 305 proteins were systematically analysed. Thus, deeper insight is gained into the role of highly conserved sequence motifs and structural elements. AVAILABILITY: The EH/HD database is available at http://www.led.uni-stuttgart.de

8.
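The use of computers for sequence analysis of nucleic acids and proteins is a relatively recent development in molecular biology research, and the technique is increasingly applied to the large and growing body of accumulated sequence data. A protein functional region is a structural domain within a protein molecule that can fold independently into a defined structure and carry out a specific function; all molecules that share the same type of functional region are collectively called a protein superfamily. Through an analysis of the immunoglobulin (Ig) superfamily and the sequences of its functional regions, this paper establishes a method for detecting protein functional regions by pattern matching on the residue composition of their conserved segments. The method first identifies the conserved segments of a given class of functional region by multiple sequence alignment, then statistically analyzes the amino acid residue composition at each position of the known conserved segments, and finally, by matching against these statistics, calculates the statistical significance of the residue composition of the query sequence and thereby determines whether the functional region is present. The advantage of the method is that it can not only detect molecules already known to contain a given class of functional region but may also discover new molecules that contain it, allowing their function to be inferred.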
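
The three steps in the abstract above (collect aligned conserved segments, tabulate residue composition per position, score a query against those statistics) can be sketched as a position-specific profile scan. This is only an illustration under stated assumptions: the abstract does not give its statistic, so the log-odds score against a uniform background, the pseudocount, and the names `build_profile` and `best_window` are all hypothetical choices, and the example segments are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(segments, pseudocount=1.0):
    """Per-position log-odds of residue frequencies vs. a uniform background.

    segments: equal-length, already aligned conserved segments (no gaps).
    The uniform background and the pseudocount are assumptions.
    """
    length = len(segments[0])
    background = 1.0 / len(AMINO_ACIDS)
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seg[pos] for seg in segments)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Return (score, start) of the highest-scoring window in the sequence.

    Assumes the sequence contains only the 20 standard residue letters.
    """
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        score = sum(profile[k][sequence[start + k]] for k in range(width))
        best = max(best, (score, start))
    return best


if __name__ == "__main__":
    # Made-up conserved segments: C and W fixed, middle position variable.
    profile = build_profile(["CAW", "CGW", "CSW"])
    print(best_window(profile, "AAACAWAAA"))
```

A positive best-window score means the window resembles the conserved segment more than the background does; turning that score into a significance level would need a null distribution, which the abstract does not specify.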

9.
Structure comparison is widely used to quantify protein relationships. Although there are several approaches to calculate structural similarity, specifying significance thresholds for similarity metrics is difficult due to the inherent likeness of common secondary structure elements. In this study, metal co‐factor location is used to assess the biological relevance of structural alignments. The distance between the centroids of bound co‐factors adds a chemical and function‐relevant constraint to the structural superimposition of two proteins. This additional dimension can be used to define cut‐off values for discriminating valid and spurious alignments in large alignment sets. The hypothesis underlying our approach is that metal coordination sites constrain structural evolution, thus revealing functional relationships between distantly related proteins. A comparison of three related nitrogenases shows the sequence and fold constraints imposed on the protein structures up to 18 Å away from the centers of their bound metal clusters. Proteins 2014; 82:648–656. © 2013 Wiley Periodicals, Inc.
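
The constraint described above, the distance between the centroids of bound co-factors after superimposition, reduces to a short calculation. The sketch below assumes both co-factors' atom coordinates are already expressed in the superimposed frame (the superposition itself is not shown), and the function names and the cut-off value passed by the caller are hypothetical, not taken from the paper.

```python
import math


def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) atom coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def cofactor_centroid_distance(cofactor_a, cofactor_b):
    """Distance between co-factor centroids, in the units of the input.

    Both coordinate lists must already be in the same (superimposed) frame.
    """
    return math.dist(centroid(cofactor_a), centroid(cofactor_b))


def passes_cutoff(cofactor_a, cofactor_b, cutoff):
    """True if the centroid distance is within a caller-chosen cut-off."""
    return cofactor_centroid_distance(cofactor_a, cofactor_b) <= cutoff


if __name__ == "__main__":
    cluster_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # centroid (1, 0, 0)
    cluster_b = [(1.0, 3.0, 4.0)]                   # centroid (1, 3, 4)
    print(cofactor_centroid_distance(cluster_a, cluster_b))
```

An alignment whose co-factor centroids end up far apart after superimposition would be flagged as likely spurious; the appropriate cut-off has to be calibrated on the alignment set in question.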

10.
The structure of Vibrio cholerae protein VC0424 was determined by NMR spectroscopy. VC0424 belongs to a conserved family of bacterial proteins of unknown function (COG 3076). The structure has an alpha-beta sandwich architecture consisting of two layers: a four-stranded antiparallel beta-sheet and three side-by-side alpha-helices. The secondary structure elements have the order alpha-beta-alpha-beta-beta-alpha-beta along the sequence. This fold is the same as the ferredoxin-like fold, except with an additional long N-terminal helix, making it a variation on this common motif. A cluster of conserved surface residues on the beta-sheet side of the protein forms a pocket that may be important for the biological function of this conserved family of proteins.

11.
RNA recognition: towards identifying determinants of specificity.   Total citations: 56, self-citations: 0, citations by others: 56
Members of a family of proteins containing a conserved approximately 80-amino acid RNA recognition motif (RRM) bind specifically to a wide variety of RNA molecules. Structural studies, in combination with sequence alignments, indicate the structural context of both conserved and non-conserved elements in the motif. These analyses suggest that all RRM proteins share a common fold and a similar protein-RNA interface, and that non-conserved residues contribute additional contacts for sequence-specific RNA recognition.

12.
All tick proteins assigned to the lipocalin family lack the structurally conserved regions (SCRs) that are characteristic of the kernel lipocalins and can thus be classified as outliers. These tick proteins have been assigned to the tick lipocalin family based on database searches that indicated homology between tick sequences and the fact that the histamine binding protein (HBP2) from the hard tick Rhipicephalus appendiculatus (Ixodidae) shows structural similarity to the lipocalin fold. Sequence identity between kernel and outlier lipocalins falls below 20% and the question raised is whether the outlier and kernel lipocalins are truly homologous. More specifically, in the case of the tick lipocalins, the question is whether their structural fold is derived from the lipocalin fold or whether convergent evolution resulted in the generation of the basic lipocalin-like fold, which consists of an eight-stranded continuous anti-parallel beta-barrel terminated by a C-terminal alpha-helix that lies parallel to the barrel. The current study determined the gene structure for HBP2 and for TSGP1, TSGP2 and TSGP4, lipocalins identified from the soft tick Ornithodoros savignyi (Argasidae). All tick lipocalins have four introns (A-D) with conserved positions and phases within the tick lipocalin sequence alignment. The positions and phase information are also conserved with regard to the rest of the lipocalin family. Phylogenetic analysis using this information shows conclusively that tick lipocalins are evolutionarily related to the rest of the lipocalin family. Tick lipocalins are grouped within a monophyletic clade that indicates a monophyletic origin within the tick lineage and also group with the other arthropod lipocalins in a larger clade. Phylogenetic analysis of sequence alignments based on conserved secondary structure of the lipocalin fold supports the conclusions from the gene structure trees.
These results indicate that exon-intron arrangement can be useful for the inclusion of outlier lipocalins within the larger lipocalin family.

13.
3D-Hit is a fast scanning method for detecting structural similarities between proteins. The algorithm is based on a hashing function, which decomposes proteins into segments of 13 residues. The scanning procedures start with assigning a set of similar segments from the database to each segment in the query protein. These initial hits are expanded by two iterations of structural superposition of larger segments of 99 and 299 residues. The method generates an alignment for the query protein by concatenating partial structural alignments.

14.
Multiple sequence alignments become biologically meaningful only if conserved and functionally important residues and preserved secondary structural elements can be identified at equivalent positions. This is particularly important for transmembrane proteins like G-protein coupled receptors (GPCRs) with seven transmembrane helices. TM-MOTIF is a software package and an effective alignment viewer to identify and display conserved motifs and amino acid substitutions (AAS) at each position of the aligned set of homologous sequences of GPCRs. The key feature of the package is to display the predicted membrane topology for seven transmembrane helices in seven colours (VIBGYOR colouring scheme) and to map the identified motifs onto their respective helix/loop regions. It is an interactive package which provides options to the user to submit a query or a pre-aligned set of GPCR sequences to align with a reference sequence, like rhodopsin, whose structure has been solved experimentally. It also provides the possibility to identify the nearest homologue from the available inbuilt GPCR or Olfactory Receptor cluster dataset whose association is already known for its receptor type. AVAILABILITY: The database is available for free at mini@ncbs.res.in.

15.
Linking similar proteins structurally is a challenging task that may help in finding the novel members of a protein family. In this respect, identification of conserved sequences can facilitate understanding and classifying the exact role of proteins. However, the exact role of these conserved elements cannot be elucidated without structural and physicochemical information. In this work, we present a novel desktop application MotViz designed for searching and analyzing the conserved sequence segments within protein structure. With MotViz, the user can extract a complete list of sequence motifs from loaded 3D structures, annotate the motifs structurally and analyze their physicochemical properties. The conservation value calculated for an individual motif can be visualized graphically. To check the efficiency, predicted motifs from the data sets of 9 protein families were analyzed and the MotViz algorithm was more efficient in comparison to other online motif prediction tools. Furthermore, a database was also integrated for storing, retrieving and performing the detailed functional annotation studies. In summary, MotViz effectively predicts motifs with high sensitivity and simultaneously visualizes them in 3D structures. Moreover, MotViz is user-friendly with optimized graphical parameters and better processing speed due to the inclusion of a database at the back end. MotViz is available at http://www.fi-pk.com/motviz.html.

16.
C Sander, R Schneider. Proteins, 1991, 9(1):56-68
The database of known protein three-dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The database of known sequences, currently at more than 12,000 proteins, is two orders of magnitude larger than the database of known structures. (2) The currently most powerful method of predicting protein structures is model building by homology. (3) Structural homology can be inferred from the level of sequence similarity. (4) The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity, and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology-derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modeled explicitly. The database effectively increases the number of known protein structures by a factor of five to more than 1800. The results may be useful in assessing the structural significance of matches in sequence database searches, in deriving preferences and patterns for structure prediction, in elucidating the structural role of conserved residues, and in modeling three-dimensional detail by homology.

17.
The PSI-BLAST algorithm has been acknowledged as one of the most powerful tools for detecting remote evolutionary relationships by sequence considerations only. This has been demonstrated by its ability to recognize remote structural homologues and by the greatest coverage it enables in annotation of a complete genome. Although recognizing the correct fold of a sequence is of major importance, the accuracy of the alignment is crucial for the success of modeling one sequence by the structure of its remote homologue. Here we assess the accuracy of PSI-BLAST alignments on a stringent database of 123 structurally similar, sequence-dissimilar pairs of proteins, by comparing them to the alignments defined on a structural basis. Each protein sequence is compared to a nonredundant database of the protein sequences by PSI-BLAST. Whenever a pair member detects its pair-mate, the positions that are aligned both in the sequential and structural alignments are determined, and the alignment sensitivity is expressed as the percentage of these positions out of the structural alignment. Fifty-two sequences detected their pair-mates (for 16 pairs the success was bi-directional when either pair member was used as a query). The average percentage of correctly aligned residues per structural alignment was 43.5 ± 2.2%. Other properties of the alignments were also examined, such as the sensitivity vs. specificity and the change in these parameters over consecutive iterations. Notably, there is an improvement in alignment sensitivity over consecutive iterations, reaching an average of 50.9 ± 2.5% within the five iterations tested in the current study.
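
The alignment sensitivity defined above (positions aligned identically in both the sequence and the structural alignment, as a percentage of the structural alignment) can be computed directly from the two sets of aligned residue pairs. The sketch below assumes each alignment is represented as a collection of (query_index, target_index) pairs; that representation and the function name are illustrative choices, not taken from the paper.

```python
def alignment_sensitivity(sequence_pairs, structural_pairs):
    """Percentage of structurally aligned pairs reproduced by the sequence alignment.

    Each argument is an iterable of (query_index, target_index) residue pairs.
    """
    structural = set(structural_pairs)
    if not structural:
        raise ValueError("structural alignment contains no aligned pairs")
    shared = structural & set(sequence_pairs)
    return 100.0 * len(shared) / len(structural)


if __name__ == "__main__":
    structural = [(1, 1), (2, 2), (3, 3), (4, 5)]
    sequence = [(1, 1), (2, 2), (3, 4), (9, 9)]
    print(alignment_sensitivity(sequence, structural))
```

Pairs that appear only in the sequence alignment do not lower this measure; they would count against specificity instead, which the abstract treats as a separate property.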

18.
The CATH database of domain structures has been used to explore the structural variation of homologous domains in 294 well populated domain structure superfamilies, each containing at least three sequence diverse relatives. Our analyses confirm some previously detected trends relating sequence divergence to structural variation but for a much larger dataset and in some superfamilies the new data reveal exceptional structural variation. Use of a new algorithm (2DSEC) to analyse variability in secondary structure compositions across a superfamily sheds new light on how structures evolve. 2DSEC detects inserted secondary structures that embellish the core of conserved secondary structures found throughout the superfamily. Analysis showed that for 56% of highly populated superfamilies (>9 sequence diverse relatives), there are twofold or more increases in the numbers of secondary structures in some relatives. In some families fivefold increases occur, sometimes modifying the fold of the domain. Manual inspection of secondary structure insertions or embellishments in 48 particularly variable superfamilies revealed that although these insertions were usually discontiguous in the sequence they were often co-located in 3D resulting in a larger structural motif that often modified the geometry of the active site or the surface conformation promoting diverse domain partnerships and protein interactions. These observations, supported by automatic analysis of all well populated CATH families, suggest that accretion of small secondary structure insertions may provide a simple mechanism for evolving new functions in diverse relatives. Some layered domain architectures (e.g. mainly-beta and alpha-beta sandwiches) that recur highly in the genomes more frequently exploit these types of embellishments to modify function. In these architectures, aggregation occurs most often at the edges, top or bottom of the beta-sheets. 
Information on structural variability across domain superfamilies has been made available through the CATH Dictionary of Homologous Structures (DHS).

19.
MOTIVATION: Many evolutionarily distant, but functionally meaningful links between proteins come to light through comparison of spatial structures. Most programs that assess structural similarity compare two proteins to each other and find regions in common between them. Structural classification experts look for a particular structural motif instead. Programs base similarity scores on superposition or closeness of either Cartesian coordinates or inter-residue contacts. Experts pay more attention to the general orientation of the main chain and mutual spatial arrangement of secondary structural elements. There is a need for a computational tool to find proteins with the same secondary structures, topological connections and spatial architecture, regardless of subtle differences in 3D coordinates. RESULTS: We developed ProSMoS, a Protein Structure Motif Search program that emulates an expert. Starting from a spatial structure, the program uses previously delineated secondary structural elements. A meta-matrix of interactions between the elements (parallel or antiparallel), taking into account the handedness of connections (left or right) and other features (e.g. element lengths and hydrogen bonds), is constructed prior to or during the searches. All structures are reduced to such meta-matrices that contain just enough information to define a protein fold, but this definition remains very general and deviations in 3D coordinates are tolerated. The user supplies a meta-matrix for a structural motif of interest, and ProSMoS finds all proteins in the protein data bank (PDB) that match the meta-matrix. ProSMoS performance is compared to other programs and is illustrated on a beta-Grasp motif. A brief analysis of all beta-Grasp-containing proteins is presented. Program availability: ProSMoS is freely available for non-commercial use from ftp://iole.swmed.edu/pub/ProSMoS.

20.
The information required to generate a protein structure is contained in its amino acid sequence, but how three-dimensional information is mapped onto a linear sequence is still incompletely understood. Multiple structure alignments of similar protein structures have been used to investigate conserved sequence features but contradictory results have been obtained, due, in large part, to the absence of objective criteria to be used in the construction of sequence profiles and in the quantitative comparison of alignment results. Here, we report a new procedure for multiple structure alignment and use it to construct structure-based sequence profiles for similar proteins. The definition of "similar" is based on the structural alignment procedure and on the protein structural distance (PSD) described in paper I of this series, which offers an objective measure for protein structure relationships. Our approach is tested in two well-studied groups of proteins: serine proteases and Ig-like proteins. It is demonstrated that the quality of a sequence profile generated by a multiple structure alignment is quite sensitive to the PSD used as a threshold for the inclusion of proteins in the alignment. Specifically, if the proteins included in the aligned set are too distant in structure from one another, there will be a dilution of information and patterns that are relevant to a subset of the proteins are likely to be lost. In order to understand better how the same three-dimensional information can be encoded in seemingly unrelated sequences, structure-based sequence profiles are constructed for subsets of proteins belonging to nine superfolds. We identify patterns of relatively conserved residues in each subset of proteins. It is demonstrated that the most conserved residues are generally located in the regions where tertiary interactions occur and that are relatively conserved in structure.
Nevertheless, the conservation patterns are relatively weak in all cases studied, indicating that structure-determining factors that do not require a particular sequential arrangement of amino acids, such as secondary structure propensities and hydrophobic interactions, are important in encoding protein fold information. In general, we find that similar structures can fold without having a set of highly conserved residue clusters or a well-conserved sequence profile; indeed, in some cases there is no apparent conservation pattern common to structures with the same fold. Thus, when a group of proteins exhibits a common and well-defined sequence pattern, it is more likely that these sequences have a close evolutionary relationship rather than the similarities having arisen from the structural requirements of a given fold.
