首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 203 毫秒
1.
Substitution matrices have been useful for sequence alignment and protein sequence comparisons. The BLOSUM series of matrices, which had been derived from a database of alignments of protein blocks, improved the accuracy of alignments previously obtained from the PAM-type matrices estimated from only closely related sequences. Although BLOSUM matrices are scoring matrices now widely used for protein sequence alignments, they do not describe an evolutionary model. BLOSUM matrices do not permit the estimation of the actual number of amino acid substitutions between sequences by correcting for multiple hits. The method presented here uses the Blocks database of protein alignments, along with the additivity of evolutionary distances, to approximate the amino acid substitution probabilities as a function of actual evolutionary distance. The PMB (Probability Matrix from Blocks) defines a new evolutionary model for protein evolution that can be used for evolutionary analyses of protein sequences. Our model is directly derived from, and thus compatible with, the BLOSUM matrices. The model has the additional advantage of being easily implemented.  相似文献   

2.

Background  

Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, which often involves some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is here estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used.  相似文献   

3.
Empirical models of substitution are often used in protein sequence analysis because the large alphabet of amino acids requires that many parameters be estimated in all but the simplest parametric models. When information about structure is used in the analysis of substitutions in structured RNA, a similar situation occurs. The number of parameters necessary to adequately describe the substitution process increases in order to model the substitution of paired bases. We have developed a method to obtain substitution rate matrices empirically from RNA alignments that include structural information in the form of base pairs. Our data consisted of alignments from the European Ribosomal RNA Database of Bacterial and Eukaryotic Small Subunit and Large Subunit Ribosomal RNA ( Wuyts et al. 2001. Nucleic Acids Res. 29:175-177; Wuyts et al. 2002. Nucleic Acids Res. 30:183-185). Using secondary structural information, we converted each sequence in the alignments into a sequence over a 20-symbol code: one symbol for each of the four individual bases, and one symbol for each of the 16 ordered pairs. Substitutions in the coded sequences are defined in the natural way, as observed changes between two sequences at any particular site. For given ranges (windows) of sequence divergence, we obtained substitution frequency matrices for the coded sequences. Using a technique originally developed for modeling amino acid substitutions ( Veerassamy, Smith, and Tillier. 2003. J. Comput. Biol. 10:997-1010), we were able to estimate the actual evolutionary distance for each window. The actual evolutionary distances were used to derive instantaneous rate matrices, and from these we selected a universal rate matrix. The universal rate matrices were incorporated into the Phylip Software package ( Felsenstein 2002. http://evolution.genetics.washington.edu/phylip.html), and we analyzed the ribosomal RNA alignments using both distance and maximum likelihood methods. The empirical substitution models performed well on simulated data, and produced reasonable evolutionary trees for 16S ribosomal RNA sequences from sequenced Bacterial genomes. Empirical models have the advantage of being easily implemented, and the fact that the code consists of 20 symbols makes the models easily incorporated into existing programs for protein sequence analysis. In addition, the models are useful for simulating the evolution of RNA sequence and structure simultaneously.  相似文献   

4.
HIV-1 subtype phylogeny is investigated using a previously developed computational model of natural amino acid site substitutions. This model, based on Boltzmann statistics and Metropolis kinetics, involves an order of magnitude fewer adjustable parameters than traditional substitution matrices and deals more effectively with the issue of protein site heterogeneity. When optimized for sequences of HIV-1 envelope (env) proteins from a few specific subtypes, our model is more likely to describe the evolutionary record for other subtypes than are methods using a single substitution matrix, even a matrix optimized over the same data. Pairwise distances are calculated between various probabilistic ancestral subtype sequences, and a distance matrix approach is used to find the optimal phylogenetic tree. Our results indicate that the relationships between subtypes B, C, and D and those between subtypes A and H may be closer than previously thought.  相似文献   

5.
Measuring evolutionary distances between DNA or protein sequences forms the basis of many applications in computational biology and evolutionary studies. Of particular interest are distances based on synonymous substitutions, since these substitutions are considered to be under very little selection pressure and therefore assumed to accumulate in an almost clock-like manner. SynPAM, the method presented here, allows the estimation of distances between coding DNA sequences based on synonymous codon substitutions. The problem of estimating an accurate distance from the observed substitution pattern is solved by maximum-likelihood with empirical codon substitution matrices employed for the underlying Markov model. Comparisons with established measures of synonymous distance indicate that SynPAM has less variance and yields useful results over a longer time range.  相似文献   

6.
The evolutionary selection forces acting on a protein are commonly inferred using evolutionary codon models by contrasting the rate of synonymous to nonsynonymous substitutions. Most widely used models are based on theoretical assumptions and ignore the empirical observation that distinct amino acids differ in their replacement rates. In this paper, we develop a general method that allows assimilation of empirical amino acid replacement probabilities into a codon-substitution matrix. In this way, the resulting codon model takes into account not only the transition-transversion bias and the nonsynonymous/synonymous ratio, but also the different amino acid replacement probabilities as specified in empirical amino acid matrices. Different empirical amino acid replacement matrices, such as secondary structure-specific matrices or organelle-specific matrices (e.g., mitochondria and chloroplasts), can be incorporated into the model, making it context dependent. Using a diverse set of coding DNA sequences, we show that the novel model better fits biological data as compared with either mechanistic or empirical codon models. Using the suggested model, we further analyze human immunodeficiency virus type 1 protease sequences obtained from drug-treated patients and reveal positive selection in sites that are known to confer drug resistance to the virus.  相似文献   

7.
Measuring evolutionary distances between DNA or protein sequences forms the basis of many applications in computational biology and evolutionary studies. Of particular interest are distances based on synonymous substitutions since these substitutions are considered to be under very little selection pressure and therefore assumed to accumulate in an almost clock-like manner. SynPAM, the method presented here, allows the estimation of distances between coding DNA sequences based on synonymous codon substitutions. The problem of estimating an accurate distance from the observed substitution pattern is solved by maximum likelihood with empirical codon substitution matrices employed for the underlying Markov model. Comparisons with established measures of synonymous distance indicate that SynPAM has less variance and yields useful results over a longer time range.  相似文献   

8.
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.  相似文献   

9.
Proteins evolve under a myriad of biophysical selection pressures that collectively control the patterns of amino acid substitutions. These evolutionary pressures are sufficiently consistent over time and across protein families to produce substitution patterns, summarized in global amino acid substitution matrices such as BLOSUM, JTT, WAG, and LG, which can be used to successfully detect homologs, infer phylogenies, and reconstruct ancestral sequences. Although the factors that govern the variation of amino acid substitution rates have received much attention, the influence of thermodynamic stability constraints remains unresolved. Here we develop a simple model to calculate amino acid substitution matrices from evolutionary dynamics controlled by a fitness function that reports on the thermodynamic effects of amino acid mutations in protein structures. This hybrid biophysical and evolutionary model accounts for nucleotide transition/transversion rate bias, multi‐nucleotide codon changes, the number of codons per amino acid, and thermodynamic protein stability. We find that our theoretical model accurately recapitulates the complex yet universal pattern observed in common global amino acid substitution matrices used in phylogenetics. These results suggest that selection for thermodynamically stable proteins, coupled with nucleotide mutation bias filtered by the structure of the genetic code, is the primary driver behind the global amino acid substitution patterns observed in proteins throughout the tree of life.  相似文献   

10.
Amino acid sequences of peptides are often inferred from their amino acid compositions by comparison with homologous peptides of known sequence. The probabilities are considered that by such an approach errors are made due to the occurrence of balanced double changes, i.e. reciprocal substitutions, between two homologous peptides of identical compositions. Formulae are derived for the calculation of these probabilities, depending on peptide length and evolutionary distance. However, such calculations requiring too much computer time, the probabilities for reciprocal substitutions are estimated by simulation of evolutionary changes in peptides. It can be concluded from the resulting data that for many purposes the possible errors in amino acid sequences partially inferred from amino acid compositions are acceptably small.  相似文献   

11.
Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time t since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when t approximately 1/2 |lambda(max)|(-1) logN, where lambda(max) is the eigenvalue of the substitution rate matrix with the largest absolute value and N is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when t approximately 1/2 |lambda(1)|(-1) logN, where lambda(1) is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than those based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.  相似文献   

12.
A protein alignment scoring system sensitive at all evolutionary distances   总被引:1,自引:0,他引:1  
Summary Protein sequence alignments generally are constructed with the aid of a substitution matrix that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a log-odds matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.  相似文献   

13.
We develop here an analytical evolution model based on a dinucleotide mutation matrix 16 x 16 with six substitution parameters associated with the three types of substitutions in the two dinucleotide sites. It generalizes the previous models based on the nucleotide mutation matrices 4 x 4. It determines at some time t the exact occurrence probabilities of dinucleotides mutating randomly according to these six substitution parameters. Furthermore, several properties and two applications of this model allow to derive 16 evolutionary analytical solutions of dinucleotides and also a dinucleotide phylogenetic distance. Finally, based on this mathematical model, the SED (Stochastic Evolution of Dinucleotides) web server has been developed for deriving evolutionary analytical solutions of dinucleotides.  相似文献   

14.
本文根据脊髓灰质炎病毒3个型别参考毒株的基因组核苷酸序列和蛋白质氨基酸序列资料,首次试用Kimura的分子进化理论和计算方法,推算出脊髓灰质炎病毒型间的进化距离、分歧进化时间及病毒蛋白质氨基酸的替换率。结果表明:(1)型间毒株相互进化距离大致相等;(2)三个型病毒是由一个共同祖先病毒在距今约1—2千年以前几乎同时分歧进化而来;(3)型间毒株蛋白质氨基酸替换率也大致相等。  相似文献   

15.
Amino acid substitutions in evolutionarily related proteins have been studied from a structural point of view. We consider here that an amino acid al in a protein p1 has been replaced by the amino acid a2 in the structurally similar protein p2 if, after superposition of the p1 and p2 structures, the a1 and a2 C alpha atoms are no more than 1.2 A apart. Thirty-two proteins, grouped in 11 classes, have been analysed by this method. This produced 2860 amino acid pairs (substitutions), which were analysed by multi-dimensional statistical methods. The main results are as follows: (1) according to the observed exchangeability of amino acid side-chains, only four groups (strong clusters) could be delineated; (i) Ile and Val, (ii) Leu and Met, (iii) Lys, Arg and Gln, and (iv) Tyr and Phe. The other residues could not be classified. (2) The matrix of distances between amino acids, or scoring matrix, determined from this study, is different from any other published matrix. (3) Except for the distance matrices based on the chemical properties of amino acid side-chains, which can be grouped together, all other published matrices are different from one another. (4) The distance matrix determined in this study seems to be very efficient for aligning distantly related protein sequences.  相似文献   

16.
There has been considerable interest in the problem of making maximum likelihood (ML) evolutionary trees which allow insertions and deletions. This problem is partly one of formulation: how does one define a probabilistic model for such trees which treats insertion and deletion in a biologically plausible manner? A possible answer to this question is proposed here by extending the concept of a hidden Markov model (HMM) to evolutionary trees. The model, called a tree-HMM, allows what may be loosely regarded as learnable affine-type gap penalties for alignments. These penalties are expressed in HMMs as probabilities of transitions between states. In the tree-HMM, this idea is given an evolutionary embodiment by defining trees of transitions. Just as the probability of a tree composed of ungapped sequences is computed, by Felsenstein's method, using matrices representing the probabilities of substitutions of residues along the edges of the tree, so the probabilities in a tree-HMM are computed by substitution matrices for both residues and transitions. How to define these matrices by a ML procedure using an algorithm that learns from a database of protein sequences is shown here. Given these matrices, one can define a tree-HMM likelihood for a set of sequences, assuming a particular tree topology and an alignment of the sequences to the model. If one could efficiently find the alignment which maximizes (or comes close to maximizing) this likelihood, then one could search for the optimal tree topology for the sequences. An alignment algorithm is defined here which, given a particular tree topology, is guaranteed to increase the likelihood of the model. Unfortunately, it fails to find global optima for realistic sequence sets. Thus further research is needed to turn the tree-HMM into a practical phylogenetic tool.  相似文献   

17.
The amino acid sequences of proteins provide rich information for inferring distant phylogenetic relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from amino acid sequences is also important because the rate matrix can be used to develop scoring matrices for sequence alignment. Here we use a continuous time Markov process to model the substitution rates of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate our method using simulated artificial protein sequences. Because different local regions such as binding surfaces and the protein interior core experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces. In addition, the rest of the proteins on the binding surfaces also have very different substitution rates from residues. Based on these findings, we further develop a method for protein function prediction by surface matching using scoring matrices derived from estimated substitution rates for residues located on the binding surfaces. We show with examples that our method is effective in identifying functionally related proteins that have overall low sequence identity, a task known to be very challenging.  相似文献   

18.
MOTIVATION: In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the 'twilight zone' of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure. RESULTS: HMMSUM (HMMSTR-based substitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequence-structure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM matrices or structure-based substitution matrices SDM and HSDM when validated against remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.  相似文献   

19.
Mitochondrial DNA (mtDNA) sequences are widely used for inferring the phylogenetic relationships among species. Clearly, the assumed model of nucleotide or amino acid substitution used should be as realistic as possible. Dependence among neighboring nucleotides in a codon complicates modeling of nucleotide substitutions in protein-encoding genes. It seems preferable to model amino acid substitution rather than nucleotide substitution. Therefore, we present a transition probability matrix of the general reversible Markov model of amino acid substitution for mtDNA-encoded proteins. The matrix is estimated by the maximum likelihood (ML) method from the complete sequence data of mtDNA from 20 vertebrate species. This matrix represents the substitution pattern of the mtDNA-encoded proteins and shows some differences from the matrix estimated from the nuclear-encoded proteins. The use of this matrix would be recommended in inferring trees from mtDNA-encoded protein sequences by the ML method. Received: 3 May 1995 / Accepted: 31 October 1995  相似文献   

20.
Miyazawa S 《PloS one》2011,6(3):e17244

Background

Empirical substitution matrices represent the average tendencies of substitutions over various protein families by sacrificing gene-level resolution. We develop a codon-based model, in which mutational tendencies of codon, a genetic code, and the strength of selective constraints against amino acid replacements can be tailored to a given gene. First, selective constraints averaged over proteins are estimated by maximizing the likelihood of each 1-PAM matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution matrices. Then, selective constraints specific to given proteins are approximated as a linear function of those estimated from the empirical substitution matrices.

Results

Akaike information criterion (AIC) values indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices significantly better. Also, the ML estimates of transition-transversion bias obtained from these empirical matrices are not so large as previously estimated. The selective constraints are characteristic of proteins rather than species. However, their relative strengths among amino acid pairs can be approximated not to depend very much on protein families but amino acid pairs, because the present model, in which selective constraints are approximated to be a linear function of those estimated from the JTT/WAG/LG/KHG matrices, can provide a good fit to other empirical substitution matrices including cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial proteins.

Conclusions/Significance

The present codon-based model with the ML estimates of selective constraints and with adjustable mutation rates of nucleotide would be useful as a simple substitution model in ML and Bayesian inferences of molecular phylogenetic trees, and enables us to obtain biologically meaningful information at both nucleotide and amino acid levels from codon and protein sequences.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号