首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit from the information that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together with the generally higher rate of synonymous mutations over non-synonymous ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level and it is for this purpose we have constructed the program RevTrans. RevTrans constructs a multiple DNA alignment by: (i) translating the DNA; (ii) aligning the resulting peptide sequences; and (iii) building a multiple DNA alignment by 'reverse translation' of the aligned protein sequences. In the resulting DNA alignment, gaps occur in groups of three corresponding to entire codons, and analogous codon positions are therefore always lined up. These features are useful when constructing multiple DNA alignments for phylogenetic analysis. RevTrans also accepts user-provided protein alignments for greater control of the alignment process. The RevTrans web server is freely available at http://www.cbs.dtu.dk/services/RevTrans/.  相似文献   

2.
MOTIVATION: In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the 'twilight zone' of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure. RESULTS: HMMSUM (HMMSTR-based substitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequence-structure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM matrices or structure-based substitution matrices SDM and HSDM when validated against remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.  相似文献   

3.
Shatsky M  Nussinov R  Wolfson HJ 《Proteins》2006,62(1):209-217
Routinely used multiple-sequence alignment methods use only sequence information. Consequently, they may produce inaccurate alignments. Multiple-structure alignment methods, on the other hand, optimize structural alignment by ignoring sequence information. Here, we present an optimization method that unifies sequence and structure information. The alignment score is based on standard amino acid substitution probabilities combined with newly computed three-dimensional structure alignment probabilities. The advantage of our alignment scheme is in its ability to produce more accurate multiple alignments. We demonstrate the usefulness of the method in three applications: 1) computing more accurate multiple-sequence alignments, 2) analyzing protein conformational changes, and 3) computation of amino acid structure-sequence conservation with application to protein-protein docking prediction. The method is available at http://bioinfo3d.cs.tau.ac.il/staccato/.  相似文献   

4.
Amino acid residues that are involved in functional interactions in proteins have strong evolutionary pressure to remain unchanged and consequently their substitution patterns are different from those that are noninteracting. To characterize and quantify the differences between amino acid substitution patterns due to structural restraints and those under functional restraints, we have made a comparative analysis of families of homologous proteins. Residues classified as having the same amino acid type, secondary structure, accessibility, and side-chain hydrogen bonds are shown to be better conserved if they are close to the active site. We have focused on enzyme families for this analysis since they have functional sites that are easily defined by their catalytic residues. We have derived new sets of environment-specific substitution tables, which we term function-dependent environment-specific substitution tables, where amino acid residues are classified according to their distance from the functional sites. The residues that are within a distance of 9 A from the active site have distinct amino acid substitution patterns when compared to the other sites. The function-dependent environment-specific substitution tables have been tested using the sequence-structure homology recognition program FUGUE and the results compared with the recognition performance obtained using the standard environment-specific substitution tables. Significant improvements are obtained in both recognition performance and alignment accuracy using the function-dependent environment-specific substitution tables (P-value = 0.02, according to the Wilcoxon signed rank test for alignment accuracy). The alignments near the active site are greatly improved with pronounced improvements at lower percentage identities (less than 30%).  相似文献   

5.
Tillier ER  Biro L  Li G  Tillo D 《Proteins》2006,63(4):822-831
Approaches for the determination of interacting partners from different protein families (such as ligands and their receptors) have made use of the property that interacting proteins follow similar patterns and relative rates of evolution. Interacting protein partners can then be predicted from the similarity of their phylogenetic trees or evolutionary distances matrices. We present a novel method called Codep, for the determination of interacting protein partners by maximizing co-evolutionary signals. The order of sequences in the multiple sequence alignments from two protein families is determined in such a manner as to maximize the similarity of substitution patterns at amino acid sites in the two alignments and, thus, phylogenetic congruency. This is achieved by maximizing the total number of interdependencies of amino acids sites between the alignments. Once ordered, the corresponding sequences in the two alignments indicate the predicted interacting partners. We demonstrate the efficacy of this approach with computer simulations and in analyses of several protein families. A program implementing our method, Codep, is freely available to academic users from our website: http://www.uhnresearch.ca/labs/tillier/.  相似文献   

6.
To learn more about the evolutionary origins of Escherichia coli genes, we surveyed systematically for extended sequence similarities among the 1,264 amino acid sequences encoded by chromosomal genes of E. coli K-12 in SwissProt release 26 by using the FASTA program and imposing the following criteria: (i) alignment of segments at least 100 amino acids long and (ii) at least 20% amino acid identity. Altogether, 624 extended alignments meeting the two criteria were identified, corresponding to 577 protein sequences (45.6% of the 1,264 E. coli protein sequences) that had an extended alignment with at least one other E. coli protein sequence. To exclude alignments of questionable biological significance, we imposed a high threshold on the number of gaps allowed in each of the 624 extended alignments, giving us a subset of 464 proteins. The population of 464 alignments has the following characteristics expressed as median values of the group: 254 amino acids in the alignment, representing 86% of the length of the protein, 33% of the amino acids in the alignment being identical, and 1.1 gaps introduced per 100 amino acids of alignment. Where functions are known, nearly all pairs consist of functionally related proteins. This implies that the sequence similarity we detected has biological meaning and did not arise by chance. That a major fraction of E. coli proteins form extended alignments strongly suggests the predominance of duplication and divergence of ancestral genes in the evolution of E. coli genes. The range of degrees of similarity shows that some genes originated more recently than others. There is no evidence of genome doubling in the past, since map distances between genes of sequence-related proteins show no coherent pattern of favored separations.  相似文献   

7.
Pairwise alignment incorporating dipeptide covariation   总被引:1,自引:0,他引:1  
MOTIVATION: Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrices that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations and by assessing the ability of this algorithm to detect remote homologies. RESULTS: Our analysis indicates that local correlations between substitutions are not strong on the average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation.  相似文献   

8.
MOTIVATION: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution. RESULTS: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment, an alignment of DNA sequences that contain the coding sequence of the Pfam alignment when they can be recovered (overall, 82.9% of sequences taken from Pfam) and the alignment of amino acid sequences restricted to only those sequences for which a DNA sequence could be recovered. Each of the alignments has an estimate of the phylogenetic tree associated with it. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach.  相似文献   

9.
Melo F  Marti-Renom MA 《Proteins》2006,63(4):986-995
Reduced or simplified amino acid alphabets group the 20 naturally occurring amino acids into a smaller number of representative protein residues. To date, several reduced amino acid alphabets have been proposed, which have been derived and optimized by a variety of methods. The resulting reduced amino acid alphabets have been applied to pattern recognition, generation of consensus sequences from multiple alignments, protein folding, and protein structure prediction. In this work, amino acid substitution matrices and statistical potentials were derived based on several reduced amino acid alphabets and their performance assessed in a large benchmark for the tasks of sequence alignment and fold assessment of protein structure models, using as a reference frame the standard alphabet of 20 amino acids. The results showed that a large reduction in the total number of residue types does not necessarily translate into a significant loss of discriminative power for sequence alignment and fold assessment. Therefore, some definitions of a few residue types are able to encode most of the relevant sequence/structure information that is present in the 20 standard amino acids. Based on these results, we suggest that the use of reduced amino acid alphabets may allow to increasing the accuracy of current substitution matrices and statistical potentials for the prediction of protein structure of remote homologs.  相似文献   

10.
Protein sequence alignments are more reliable the shorter the evolutionary distance. Here, we align distantly related proteins using many closely spaced intermediate sequences as stepping stones. Such transitive alignments can be generated between any two proteins in a connected set, whether they are direct or indirect sequence neighbors in the underlying library of pairwise alignments. We have implemented a greedy algorithm, MaxFlow, using a novel consistency score to estimate the relative likelihood of alternative paths of transitive alignment. In contrast to traditional profile models of amino acid preferences, MaxFlow models the probability that two positions are structurally equivalent and retains high information content across large distances in sequence space. Thus, MaxFlow is able to identify sparse and narrow active-site sequence signatures which are embedded in high-entropy sequence segments in the structure based multiple alignment of large diverse enzyme superfamilies. In a challenging benchmark based on the urease superfamily, MaxFlow yields better reliability and double coverage compared to available sequence alignment software. This promises to increase information returns from functional and structural genomics, where reliable sequence alignment is a bottleneck to transferring the functional or structural characterization of model proteins to entire protein superfamilies.  相似文献   

11.
Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of blast.  相似文献   

12.
The alignments of the amino acid sequences of inorganic pyrophosphatase (PPase) from Saccharomyces cerevisiae (Y1-PPase, 286 amino acids) and Escherichia coli (E-PPase, 175 amino acids) are examined in the light of crystallographic and chemical modification results placing specific amino acid residues at the active site of the yeast enzyme. The major results are: (1) the full E-PPase sequence aligns within residues 28-225 of Y1-PPase, raising the possibility that the N-terminal and C-terminal portions of Y1-PPase may not be essential for activity, and (2) that whereas the overall identity between the two sequences is only modest (22-27% depending on the choice of alignment parameters), of some 17 putative active site residues, 14-16 are identical between Y-PPase and E-PPase. PPase thus appears to be an example of enzymes from widely divergent species that conserve common functional elements within the context of substantial overall sequence variation.  相似文献   

13.
We study to what degree patterns of amino acid substitution vary between genes using two models of protein-coding gene evolution. The first divides the amino acids into groups, with one substitution rate for pairs of residues in the same group and a second for those in differing groups. Unlike previous applications of this model, the groups themselves are estimated from data by simulated annealing. The second model makes substitution rates a function of the physical and chemical similarity between two residues. Because we model the evolution of coding DNA sequences as opposed to protein sequences, artifacts arising from the differing numbers of nucleotide substitutions required to bring about various amino acid substitutions are avoided. Using 10 alignments of related sequences (five of orthologous genes and five gene families), we do find differences in substitution patterns. We also find that, although patterns of amino acid substitution vary temporally within the history of a gene, variation is not greater in paralogous than in orthologous genes. Improved understanding of such gene-specific variation in substitution patterns may have implications for applications such as sequence alignment and phylogenetic inference.  相似文献   

14.
The current pace of structural biology now means that protein three-dimensional structure can be known before protein function, making methods for assigning homology via structure comparison of growing importance. Previous research has suggested that sequence similarity after structure-based alignment is one of the best discriminators of homology and often functional similarity. Here, we exploit this observation, together with a merger of protein structure and sequence databases, to predict distant homologous relationships. We use the Structural Classification of Proteins (SCOP) database to link sequence alignments from the SMART and Pfam databases. We thus provide new alignments that could not be constructed easily in the absence of known three-dimensional structures. We then extend the method of Murzin (1993b) to assign statistical significance to sequence identities found after structural alignment and thus suggest the best link between diverse sequence families. We find that several distantly related protein sequence families can be linked with confidence, showing the approach to be a means for inferring homologous relationships and thus possible functions when proteins are of known structure but of unknown function. The analysis also finds several new potential superfamilies, where inspection of the associated alignments and superimpositions reveals conservation of unusual structural features or co-location of conserved amino acids and bound substrates. We discuss implications for Structural Genomics initiatives and for improvements to sequence comparison methods.  相似文献   

15.
Multiple sequence alignment is an essential tool in many areas of biological research, and the accuracy of an alignment can strongly affect the accuracy of a downstream application such as phylogenetic analysis, identification of functional motifs, or polymerase chain reaction primer design. The heads or tails (HoT) method (Landan G, Graur D. 2007. Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol. 24:1380-1383.) assesses the consistency of an alignment by comparing the alignment of a set of sequences with the alignment of the same set of sequences written in reverse order. This study shows that HoT scores and the alignment accuracies are positively correlated, so alignments with higher HoT scores are preferable. However, HoT scores are overestimates of alignment accuracy in general, with the extent of overestimation depending on the method used for multiple sequence alignment.  相似文献   

16.
17.
Plastids are descended from a cyanobacterial symbiosis which occurred over 1.2 billion years ago. During the course of endosymbiosis, most genes were lost from the cyanobacterium's genome and many were relocated to the host nucleus through endosymbiotic gene transfer (EGT). The issue of how many genes were acquired through EGT in different plant lineages is unresolved. Here, we report the genome-wide frequency of gene acquisitions from cyanobacteria in 4 photosynthetic eukaryotes--Arabidopsis, rice, Chlamydomonas, and the red alga Cyanidioschyzon--by comparison of the 83,138 proteins encoded in their genomes with 851,607 proteins encoded in 9 sequenced cyanobacterial genomes, 215 other reference prokaryotic genomes, and 13 reference eukaryotic genomes. The analyses entail 11,569 phylogenies inferred with both maximum likelihood and Neighbor-Joining approaches. Because each phylogenetic result is dependent not only upon the reconstruction method but also upon the site patterns in the underlying alignment, we investigated how the reliability of site pattern generation via alignment affects our results: if the site patterns in an alignment differ depending upon the order in which amino acids are introduced into multiple sequence alignment--N- to C-terminal versus C- to N-terminal--then the phylogenetic result is likely to be artifactual. Excluding unreliable alignments by this means, we obtain a conservative estimate, wherein about 14% of the proteins examined in each plant genome indicate a cyanobacterial origin for the corresponding nuclear gene, with higher proportions (17-25%) observed among the more reliable alignments. The identification of cyanobacterial genes in plant genomes affords access to an important question: from which type of cyanobacterium did the ancestor of plastids arise? Among the 9 cyanobacterial genomes sampled, Nostoc sp. PCC7120 and Anabaena variabilis ATCC29143 were found to harbor collections of genes which are-in terms of presence/absence and sequence similarity-more like those possessed by the plastid ancestor than those of the other 7 cyanobacterial genomes sampled here. This suggests that the ancestor of plastids might have been an organism more similar to filamentous, heterocyst-forming (nitrogen-fixing) representatives of section IV recognized in Stanier's cyanobacterial classification. Members of section IV are very common partners in contemporary symbiotic associations involving endosymbiotic cyanobacteria, which generally provide nitrogen to their host, consistent with suggestions that fixed nitrogen supplied by the endosymbiont might have played an important role during the origin of plastids.  相似文献   

18.
Alignment of nucleotide and/or amino acid sequences is a fundamental component of sequence‐based molecular phylogenetic studies. Here we examined how different alignment methods affect the phylogenetic trees that are inferred from the alignments. We used simulations to determine how alignment errors can lead to systematic biases that affect phylogenetic inference from those sequences. We compared four approaches to sequence alignment: progressive pairwise alignment, simultaneous multiple alignment of sequence fragments, local pairwise alignment and direct optimization. When taking into account branch support, implied alignments produced by direct optimization were found to show the most extreme behaviour (based on the alignment programs for which nearly equivalent alignment parameters could be set) in that they provided the strongest support for the correct tree in the simulations in which it was easy to resolve the correct tree and the strongest support for the incorrect tree in our long‐branch‐attraction simulations. When applied to alignment‐sensitive process partitions with different histories, direct optimization showed the strongest mutual influence between the process partitions when they were aligned and phylogenetically analysed together, which makes detecting recombination more difficult. Simultaneous alignment performed well relative to direct optimization and progressive pairwise alignment across all simulations. Rather than relying upon methods that integrate alignment and tree search into a single step without accounting for alignment uncertainty, as with implied alignments, we suggest that simultaneous alignment using the similarity criterion, within the context of information available on biological processes and function, be applied whenever possible for sequence‐based phylogenetic analyses.  相似文献   

19.
For applications such as comparative modelling one major issue is the reliability of sequence alignments. Reliable regions in alignments can be predicted using sub-optimal alignments of the same pair of sequences. Here we show that reliable regions in alignments can also be predicted from multiple sequence profile information alone.Alignments were created for a set of remotely related pairs of proteins using five different test methods. Structural alignments were used to assess the quality of the alignments and the aligned positions were scored using information from the observed frequencies of amino acid residues in sequence profiles pre-generated for each template structure. High-scoring regions of these profile-derived alignment scores were a good predictor of reliably aligned regions.These profile-derived alignment scores are easy to obtain and are applicable to any alignment method. They can be used to detect those regions of alignments that are reliably aligned and to help predict the quality of an alignment. For those residues within secondary structure elements, the regions predicted as reliably aligned agreed with the structural alignments for between 92% and 97.4% of the residues. In loop regions just under 92% of the residues predicted to be reliable agreed with the structural alignments. The percentage of residues predicted as reliable ranged from 32.1% for helix residues to 52.8% for strand residues.This information could also be used to help predict conserved binding sites from sequence alignments. Residues in the template that were identified as binding sites, that aligned to an identical amino acid residue and where the sequence alignment agreed with the structural alignment were in highly conserved, high scoring regions over 80% of the time. This suggests that many binding sites that are present in both target and template sequences are in sequence-conserved regions and that there is the possibility of translating reliability to binding site prediction.  相似文献   

20.
MOTIVATION: Current projects for the massive characterization of proteomes are generating protein sequences and structures with unknown function. The difficulty of experimentally determining functionally important sites calls for the development of computational methods. The first techniques, based on the search for fully conserved positions in multiple sequence alignments (MSAs), were followed by methods for locating family-dependent conserved positions. These rely on the functional classification implicit in the alignment for locating these positions related with functional specificity. The next obvious step, still scarcely explored, is to detect these positions using a functional classification different from the one implicit in the sequence relationships between the proteins. Here, we present two new methods for locating functional positions which can incorporate an arbitrary external functional classification which may or may not coincide with the one implicit in the MSA. The Xdet method is able to use a functional classification with an associated hierarchy or similarity between functions to locate positions related to that classification. The MCdet method uses multivariate statistical analysis to locate positions responsible for each one of the functions within a multifunctional family. RESULTS: We applied the methods to different cases, illustrating scenarios where there is a disagreement between the functional and the phylogenetic relationships, and demonstrated their usefulness for the phylogeny-independent prediction of functional positions.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号