Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
The amino acid sequences of proteins provide rich information for inferring distant phylogenetic relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from amino acid sequences is also important because the rate matrix can be used to develop scoring matrices for sequence alignment. Here we use a continuous time Markov process to model the substitution rates of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate our method using simulated artificial protein sequences. Because different local regions such as binding surfaces and the protein interior core experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces. In addition, residues on the binding surfaces also have very different substitution rates from residues in the rest of the protein. Based on these findings, we further develop a method for protein function prediction by surface matching using scoring matrices derived from estimated substitution rates for residues located on the binding surfaces. We show with examples that our method is effective in identifying functionally related proteins that have overall low sequence identity, a task known to be very challenging.
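To make the continuous time Markov model concrete, the sketch below turns a substitution rate matrix Q into transition probabilities P(t) = exp(Qt). It is only an illustration: the three-state alphabet, the rate values in `TOY_Q`, and the function names are assumptions made here, not the 20-state matrix or the estimation procedure of the abstract.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=30, squarings=10):
    """Approximate P(t) = exp(Q t) by scaling and squaring.

    Q*t is divided by 2**squarings, exponentiated with a truncated
    Taylor series, and the result is squared `squarings` times.
    """
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms):
        # term becomes a**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical rates for a toy three-state alphabet; each row sums to zero.
TOY_Q = [
    [-0.9, 0.6, 0.3],
    [0.2, -0.5, 0.3],
    [0.1, 0.4, -0.5],
]

if __name__ == "__main__":
    for row in transition_matrix(TOY_Q, 1.0):
        print([round(x, 4) for x in row])
```

Each row of the returned matrix is a probability distribution over the states reached after time t. In practice a library routine for the matrix exponential would normally replace this hand-written series.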

2.
A 3.1-kb intergenic DNA fragment located between the psi beta-globin and delta-globin genes in the beta-globin gene cluster was cloned from gorilla, orangutan, rhesus monkey, and spider monkey, and the nucleotide sequence of each fragment was determined. The phylogeny of these four sequences, together with two previously published allelic sequences from humans and one from chimpanzee, was constructed, and the accumulation of mutations in the region was analyzed. The sites of base substitutions are not evenly distributed within the region: two Alu repeats have accumulated 0.21 ± 0.02 substitutions/site, compared with 0.15 ± 0.008 substitutions/site in the remainder of the fragment. The occurrence of substitutions at neighboring sites is more frequent than would be expected if they were independent. The observed excesses disappear when ancestral -CG- dinucleotide sites are excluded. The phylogenetic relationships of the sequences indicate that the human sequence shares a most recent coancestor with the chimpanzee sequence. The data also show that great apes have accumulated fewer mutations in this part of the genome than has the rhesus monkey. The relative rates of accumulation of 12 kinds of nucleotide substitution in the region during primate evolution are asymmetric in the DNA strands. From these rates of accumulation, the origin of a simple stretch of sequence near the 3' end of the 3.1-kb fragment was deduced to be a sequence comprising 50% T and 50% C on one strand. The two oppositely oriented Alu sequences in the 3.1-kb region were inserted at their present positions before the divergence of the New-World monkeys from other lineages. Our analysis shows that the nucleotide sequences of the two Alu repeats in spider monkey are unexpectedly similar both to each other and to the deduced ancestral sequence of Alu repeats. The data suggest that there has been some type of recombinational event between the spider monkey Alu repeats but that it was not a simple gene conversion.

3.
Differential rates of nucleotide substitution among different gene segments and between distinct evolutionary lineages are well documented among mitochondrial genes and are likely a consequence of locus-specific selective constraints that delimit mutational divergence over evolutionary time. We compared sequence variation of 18 homologous loci (15 coding genes and 3 parts of the control region) among 10 mammalian mitochondrial DNA genomes, which allowed us to describe different mitochondrial evolutionary patterns and to produce an estimation of the relative order of gene divergence. The relative rates of divergence of mitochondrial DNA genes in the family Felidae were estimated by comparing their divergence from homologous counterpart genes included in nuclear mitochondrial DNA (Numt, pronounced "new might"), a genomic fossil that represents an ancient transfer of 7.9 kb of mitochondrial DNA to the nuclear genome of an ancestral species of the domestic cat (Felis catus). Phylogenetic analyses of mitochondrial (mtDNA) sequences with multiple outgroup species were conducted to date the ancestral node common to the Numt and the cytoplasmic (Cymt) mtDNA genes and to calibrate the rate of sequence divergence of mitochondrial genes relative to nuclear homologous counterparts. By setting the fastest substitution rate as strictly mutational, an empirical "selective retardation index" is computed to quantify the sum of all constraints, selective and otherwise, that limit sequence divergence of mitochondrial gene sequences over time.

4.
We propose two approximate methods (one based on parsimony and one on pairwise sequence comparison) for estimating the pattern of nucleotide substitution and a parsimony-based method for estimating the gamma parameter for variable substitution rates among sites. The matrix of substitution rates that represents the substitution pattern can be recovered through its relationship with the observable matrix of site pattern frequencies in pairwise sequence comparisons. In the parsimony approach, the ancestral sequences reconstructed by the parsimony algorithm were used, and the two sequences compared are those at the ends of a branch in the phylogenetic tree. The method for estimating the gamma parameter was based on a reinterpretation of the numbers of changes at sites inferred by parsimony. Three data sets were analyzed to examine the utility of the approximate methods compared with the more reliable likelihood methods. The new methods for estimating the substitution pattern were found to produce estimates quite similar to those obtained from the likelihood analyses. The new method for estimating the gamma parameter was effective in reducing the bias in conventional parsimony estimates, although it also overestimated the parameter. The approximate methods are computationally very fast and appear useful for analyzing large data sets, for which use of the likelihood method requires excessive computation.

5.
Akashi H, Goel P, John A. PLoS ONE 2007, 2(10): e1065
Reliable inference of ancestral sequences can be critical to identifying both patterns and causes of molecular evolution. Robustness of ancestral inference is often assumed among closely related species, but tests of this assumption have been limited. Here, we examine the performance of inference methods for data simulated under scenarios of codon bias evolution within the Drosophila melanogaster subgroup. Genome sequence data for multiple, closely related species within this subgroup make it an important system for studying molecular evolutionary genetics. The effects of asymmetric and lineage-specific substitution rates (i.e., varying levels of codon usage bias and departures from equilibrium) on the reliability of ancestral codon usage were investigated. Maximum parsimony inference, which has been widely employed in analyses of Drosophila codon bias evolution, was compared to an approach that attempts to account for uncertainty in ancestral inference by weighting ancestral reconstructions by their posterior probabilities. The latter approach employs maximum likelihood estimation of rate and base composition parameters. For equilibrium and most non-equilibrium scenarios that were investigated, the probabilistic method appears to generate reliable ancestral codon bias inferences for molecular evolutionary studies within the D. melanogaster subgroup. These reconstructions are more reliable than parsimony inference, especially when codon usage is strongly skewed. However, inference biases are considerable for both methods under particular departures from stationarity (i.e., when adaptive evolution is prevalent). Reliability of inference can be sensitive to branch lengths, asymmetry in substitution rates, and the locations and nature of lineage-specific processes within a gene tree. Inference reliability, even among closely related species, can be strongly affected by (potentially unknown) patterns of molecular evolution in lineages ancestral to those of interest.

6.
Most of the sophisticated methods to estimate evolutionary divergence between DNA sequences assume that the two sequences have evolved with the same pattern of nucleotide substitution after their divergence from their most recent common ancestor (homogeneity assumption). If this assumption is violated, the evolutionary distance estimated will be biased, which may result in biased estimates of divergence times and substitution rates, and may lead to erroneous branching patterns in the inferred phylogenies. Here we present a simple modification for existing distance estimation methods to relax the assumption of substitution pattern homogeneity among lineages when analyzing DNA and protein sequences. Results from computer simulations and empirical data analyses for human and mouse genes are presented to demonstrate that the proposed modification reduces the estimation bias considerably and that the modified method performs much better than the LogDet methods, which do not require the homogeneity assumption in estimating the number of substitutions per site. We also discuss the relationship of the substitution and mutation rate estimates when the substitution pattern is not the same in the lineages leading to the two sequences compared.

7.
Goonesekere NC, Lee B. Proteins 2008, 71(2): 910-919
Sequence homology detection relies on score matrices, which reflect the frequency of amino acid substitutions observed in a dataset of homologous sequences. The substitution matrices in popular use today are usually constructed without consideration of the structural context in which the substitution takes place. Here, we present amino acid substitution matrices specific to the particular polar-nonpolar environment of the amino acid. As expected, these matrices [context-specific substitution matrices (CSSMs)] show striking differences from the popular BLOSUM62 matrix, which does not include structural information. When incorporated into BLAST and PSI-BLAST, CSSM outperformed BLOSUM matrices as assessed by ROC curve analyses of the number of true and false hits and by the accuracy of the sequence alignments to the hit sequences. These findings are also of relevance to profile-profile-based methods of homology detection, since CSSMs may help build a better profile. Profiles generated for protein sequences in PDB using CSSM-PSI-BLAST will be made available for searching via RPSBLAST through our web site http://lmbbi.nci.nih.gov/.

8.
Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time $t$ since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when $t \approx \tfrac{1}{2}\,|\lambda_{\max}|^{-1}\log N$, where $\lambda_{\max}$ is the eigenvalue of the substitution rate matrix with the largest absolute value and $N$ is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when $t \approx \tfrac{1}{2}\,|\lambda_{1}|^{-1}\log N$, where $\lambda_{1}$ is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than that of estimates based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.
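A familiar instance of a log transform formula is the Jukes-Cantor distance, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The abstract does not name this model; it is used here as an assumed, simplest-case illustration of why such formulae can diverge on finite sequences, and the function name is hypothetical.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance from the observed proportion p of differing sites.

    Returns math.inf when the argument of the logarithm is not positive,
    which is where the log transform formula diverges.
    """
    arg = 1.0 - 4.0 * p / 3.0
    if arg <= 0.0:
        return math.inf
    return -0.75 * math.log(arg)


if __name__ == "__main__":
    # As p approaches 3/4 the estimate grows without bound, so sampling
    # noise in a short alignment can push it to infinity.
    for p in (0.10, 0.50, 0.74, 0.76):
        print(p, jc69_distance(p))
```

Because p is estimated from a finite number of sites, two long-diverged sequences can show p at or above 3/4 by chance alone, which is the failure mode the abstract describes.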

9.
Because they are considered rare, balanced polymorphisms are often discounted as crucial constituents of genome‐wide variation in sequence diversity. Despite its perceived rarity, however, long‐term balancing selection can elevate genetic diversity and significantly affect observed divergence between species. Here, we discuss how ancestral balanced polymorphisms can be “sieved” by the speciation process, which sorts them unequally across descendant lineages. After speciation, ancestral balancing selection is revealed by genomic regions of high divergence between species. This signature, which resembles that of other evolutionary processes, can potentially confound genomic studies of population divergence and inferences of “islands of speciation.”

10.
Divergence time and substitution rate are seriously confounded in phylogenetic analysis, making it difficult to estimate divergence times when the molecular clock (rate constancy among lineages) is violated. This problem can be alleviated to some extent by analyzing multiple gene loci simultaneously and by using multiple calibration points. While different genes may have different patterns of evolutionary rate change, they share the same divergence times. Indeed, the fact that each gene may violate the molecular clock differently leads to the advantage of simultaneous analysis of multiple loci. Multiple calibration points provide the means for characterizing the local evolutionary rates on the phylogeny. In this paper, we extend previous likelihood models of local molecular clock for estimating species divergence times to accommodate multiple calibration points and multiple genes. Heterogeneity among different genes in evolutionary rate and in substitution process is accounted for by the models. We apply the likelihood models to analyze two mitochondrial protein-coding genes, cytochrome oxidase II and cytochrome b, to estimate divergence times of Malagasy mouse lemurs and related outgroups. The likelihood method is compared with the Bayes method of Thorne et al. (1998, Mol. Biol. Evol. 15:1647-1657), which uses a probabilistic model to describe the change in evolutionary rate over time and uses the Markov chain Monte Carlo procedure to derive the posterior distribution of rates and times. Our likelihood implementation has the drawbacks of failing to accommodate uncertainties in fossil calibrations and of requiring the researcher to classify branches on the tree into different rate groups. Both problems are avoided in the Bayes method. Despite the differences in the two methods, however, data partitions and model assumptions had the greatest impact on date estimation. 
The three codon positions have very different substitution rates and evolutionary dynamics, and assumptions in the substitution model affect date estimation in both likelihood and Bayes analyses. The results demonstrate that the separate analysis is unreliable, with dates variable among codon positions and between methods, and that the combined analysis is much more reliable. When the three codon positions were analyzed simultaneously under the most realistic models using all available calibration information, the two methods produced similar results. The divergence of the mouse lemurs is dated to be around 7-10 million years ago, indicating a surprisingly early species radiation for such a morphologically uniform group of primates.

11.
We have used three reference sequences representative of bacterial drug resistance pumps and sugar transport proteins to collect the 91 most closely related sequences from a composite, nonredundant protein sequence database. Having eliminated certain very close relatives, the remainder were subjected to analysis and alignment by using two different similarity matrices: one of these was a matrix based on structural conservation of amino acid residues in proteins of known conformation and the other was based on the more familiar mutational matrix. Unrooted similarity trees for these proteins were constructed for each matrix and compared. A systematic analysis of the differences between these trees was undertaken and the sequences were analyzed for the presence or absence of certain sequence motifs. The results show that the clades created by the two methods are broadly comparable but that there are some clusters of sequences that are significantly different. Further analysis confirmed that (1) the sequences collected by this objective method are all known or putative 12-helix (in some cases reported as 14-helix) transmembrane proteins, (2) there is evidence for few cases of an origin based on gene duplication, (3) the bacterial drug resistance pumps are distributed in more than one clade and cannot be regarded as a definitive subset of these proteins, and (4) the diversity is such that there is no evidence of a single ancestral protein. The possible extension of the methods to other cases of divergent protein sequences is discussed.

12.
Inferring speciation times under an episodic molecular clock (cited 5 times: 0 self-citations, 5 by others)
We extend our recently developed Markov chain Monte Carlo algorithm for Bayesian estimation of species divergence times to allow variable evolutionary rates among lineages. The method can use heterogeneous data from multiple gene loci and accommodate multiple fossil calibrations. Uncertainties in fossil calibrations are described using flexible statistical distributions. The prior for divergence times for nodes lacking fossil calibrations is specified by use of a birth-death process with species sampling. The prior for lineage-specific substitution rates is specified using either a model with autocorrelated rates among adjacent lineages (based on a geometric Brownian motion model of rate drift) or a model with independent rates among lineages specified by a log-normal probability distribution. We develop an infinite-sites theory, which predicts that when the amount of sequence data approaches infinity, the width of the posterior credibility interval and the posterior mean of divergence times form a perfect linear relationship, with the slope indicating uncertainties in time estimates that cannot be reduced by sequence data alone. Simulations are used to study the influence of among-lineage rate variation and the number of loci sampled on the uncertainty of divergence time estimates. The analysis suggests that posterior time estimates typically involve considerable uncertainties even with an infinite amount of sequence data, and that the reliability and precision of fossil calibrations are critically important to divergence time estimation. We apply our new algorithms to two empirical data sets and compare the results with those obtained in previous Bayesian and likelihood analyses. The results demonstrate the utility of our new algorithms.

13.
Phylogenetic inference: how much evolutionary history is knowable? (cited 5 times: 2 self-citations, 3 by others)
In order to reconstruct phylogenetic trees from extremely dissimilar sequences it is necessary to estimate accurately the extent of sequence divergence. In this paper a new method of sequence analysis, Markov triple analysis, is developed for determining the relative frequencies of nucleotide substitutions within the three branches of a three-taxon dendrogram. Assuming that nucleotide sites are independently and identically distributed and assuming a Markov model for nucleotide (or protein) evolution, it is shown that the unique Markov matrices can be reconstructed given only the joint probability distribution relating three taxa. (In the much simpler case involving only two taxa and two character states, Markov matrices can also be reconstructed, provided symmetry assumptions are placed on the elements of the matrices.) The method is illustrated using sequence data from the combined first and second codon positions derived from complete human, mouse, and cow mitochondrial sequences.

14.
A compound Poisson process for relaxing the molecular clock (cited 18 times: 0 self-citations, 18 by others)
Huelsenbeck JP, Larget B, Swofford D. Genetics 2000, 154(4): 1879-1892
The molecular clock hypothesis remains an important conceptual and analytical tool in evolutionary biology despite the repeated observation that the clock hypothesis does not perfectly explain observed DNA sequence variation. We introduce a parametric model that relaxes the molecular clock by allowing rates to vary across lineages according to a compound Poisson process. Events of substitution rate change are placed onto a phylogenetic tree according to a Poisson process. When an event of substitution rate change occurs, the current rate of substitution is modified by a gamma-distributed random variable. Parameters of the model can be estimated using Bayesian inference. We use Markov chain Monte Carlo integration to evaluate the posterior probability distribution because the posterior probability involves high dimensional integrals and summations. Specifically, we use the Metropolis-Hastings-Green algorithm with 11 different move types to evaluate the posterior distribution. We demonstrate the method by analyzing a complete mtDNA sequence data set from 23 mammals. The model presented here has several potential advantages over other models that have been proposed to relax the clock because it is parametric and does not assume that rates change only at speciation events. This model should prove useful for estimating divergence times when substitution rates vary across lineages.
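The rate-change process described above can be simulated along a single branch in a few lines. This is a sketch under stated assumptions, not the authors' implementation: it assumes that "modified" means the current rate is multiplied by the gamma draw, it covers one branch rather than a whole tree, and the function name and all parameter values are hypothetical.

```python
import random


def simulate_branch_rates(initial_rate, branch_length, event_rate,
                          shape, scale, rng):
    """Simulate substitution-rate changes along one branch.

    Waiting times between rate-change events are exponential with rate
    `event_rate` (a Poisson process). At each event the current rate is
    multiplied by a Gamma(shape, scale) draw (assumed interpretation).
    Returns a list of (time, rate) pairs, starting at time 0.
    """
    history = [(0.0, initial_rate)]
    t = 0.0
    rate = initial_rate
    while True:
        t += rng.expovariate(event_rate)
        if t >= branch_length:
            break
        rate *= rng.gammavariate(shape, scale)
        history.append((t, rate))
    return history


if __name__ == "__main__":
    rng = random.Random(2000)
    # shape * scale = 1, so each multiplier has mean 1.
    for time, rate in simulate_branch_rates(1.0, 2.0, 3.0, 2.0, 0.5, rng):
        print(round(time, 3), round(rate, 3))
```

Passing a seeded `random.Random` instance keeps runs reproducible. Extending this to a tree would mean handing the final rate of each branch to its descendant branches.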

15.
Probabilities of monophyly, paraphyly, and polyphyly of two-species gene genealogies are computed for modest sample sizes and compared for two different Λ coalescent processes. Coalescent processes belonging to the Λ coalescent family admit asynchronous multiple mergers of active ancestral lineages. Assigning a timescale to the time of divergence becomes a central issue when different populations have different coalescent processes running on different timescales. Clade probabilities in single populations are also computed, which can be useful for testing for taxonomic distinctiveness of an observed set of monophyletic lineages. The coalescence rates of multiple merger coalescent processes are functions of coalescent parameters. The effect of coalescent parameters on the probabilities studied depends on the coalescent process and on whether the population is ancestral or derived. The probability of reciprocal monophyly tends to be somewhat lower, when associated with a Λ coalescent, under the null hypothesis that two groups come from the same population. However, even for fairly recent divergence times, the probability of monophyly tends to be higher as a function of the number of generations for coalescent processes that admit multiple mergers, and is sensitive to the parameter of one of the example processes.

16.
17.
A series of new results useful to the study of DNA sequences using Markov models of substitution are presented with proofs. General time-reversible distances can be extended to accommodate any fixed distribution of rates across sites by replacing the logarithmic function of a matrix with the inverse of a moment generating function. Estimators are presented assuming a gamma distribution, the inverse Gaussian distribution, or a mixture of either of these with invariant sites. Also considered are the different ways invariant sites may be removed and how these differences may affect estimated distances. Through collaboration, we implemented these distances into PAUP* in 1994. The variance of these new distances is approximated via the delta method. It is also shown how to predict the divergence expected for a pair of sequences given a rate matrix and a distribution of rates across sites, allowing iterated ML estimates of distances under any reversible model. A simple test of whether a rate matrix is time reversible is also presented. These new methods are used to estimate the divergence time of humans and chimps from mtDNA sequence data. These analyses support suggestions that the human lineage has an enhanced transition rate relative to other hominoids. These studies also show that transversion distances differ substantially from the overall distances which are dominated by transitions. Transversions alone apparently suggest a very recent divergence time for humans versus chimps and/or a very old (>16 myr) divergence time for humans versus orangutans. This work illustrates graphically ways to interpret the reliability of distance-based transformations, using the corrected transition to transversion ratio returned for pairs of sequences which are successively more diverged.
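The idea of swapping the logarithm for an inverse moment generating function is easiest to see in the scalar Jukes-Cantor case rather than the full matrix setting of the abstract. For gamma-distributed rates with mean 1 and shape α, the moment generating function is M(x) = (1 - x/α)^(-α), and inverting it gives the well-known distance d = (3/4) α [(1 - 4p/3)^(-1/α) - 1]. The sketch below assumes this simplified one-parameter model; the function name is hypothetical.

```python
import math


def gamma_jc69_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates across sites.

    p     : observed proportion of differing sites
    alpha : shape of the gamma distribution (mean rate fixed at 1)
    The log of the equal-rates formula is replaced by the inverse of the
    gamma moment generating function.
    """
    arg = 1.0 - 4.0 * p / 3.0
    if arg <= 0.0:
        return math.inf
    return 0.75 * alpha * (arg ** (-1.0 / alpha) - 1.0)


if __name__ == "__main__":
    p = 0.3
    # Smaller alpha means stronger rate variation and a larger correction;
    # a very large alpha approaches the equal-rates value.
    for alpha in (0.5, 1.0, 5.0, 1e6):
        print(alpha, gamma_jc69_distance(p, alpha))
    print("equal rates", -0.75 * math.log(1.0 - 4.0 * p / 3.0))
```

The same substitution applied to a matrix logarithm, with a different generating function for each rate distribution, is the generalization the abstract describes.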

19.
Summary: Under the assumption of unequal rates of nucleotide substitution among the three positions of codons, mathematical formulas are derived for the probability that a restriction site observed at time 0 in a DNA sequence will be present at time t, the probability that a new restriction site will emerge at a particular place in two descendant sequences, and the proportion of identical restriction sites between two such sequences. All three quantities are shown to be larger for the case of unequal rates than for the case of equal rates. As a consequence, estimates of nucleotide divergence (2t) between two sequences based on restriction-site data tend to be lower than the actual values if the assumption of equal rates is made but does not actually hold. However, the degree of underestimation is slight if 2t is 0.10 or smaller. The underestimation can be serious when 2t becomes 0.20 or larger, particularly if the substitution rates at the three codon positions are very different or if transitional substitution occurs more frequently than transversional substitution. The underestimation is more serious for four-base enzymes than for six-base enzymes.

20.
Summary: Actin genic regions were isolated and characterized from the heterokont-flagellated protists, Achlya bisexualis (Oomycota) and Costaria costata (Chromophyta). Restriction enzyme and cloning experiments suggested that the genes are present in a single copy and sequence determinations revealed the existence of two introns in the C. costata actin genic region. Phylogenetic analyses of actin genic regions using distance matrix and maximum parsimony methods confirmed the close evolutionary relationship of A. bisexualis and C. costata suggested by ribosomal DNA (rDNA) sequence comparisons and reproductive cell ultrastructure. The higher fungi, green plants, and animals were seen as monophyletic groups; however, a precise order of branching for these assemblages could not be determined. Phylogenetic frameworks inferred from comparisons of rRNAs were used to assess rates of evolution in actin genic regions of diverse eukaryotes. Actin genic regions had nonuniform rates of nucleotide substitution in different lineages. Comparison of rates of actin and rDNA sequence divergence indicated that actin genic regions evolve 2.0 and 5.3 times faster in higher fungi and flowering plants, respectively, than their rDNA sequences. Conversely, animal actins evolve at approximately one-fifth the rate of their rDNA sequences.
