首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time t since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when t approximately 1/2 |lambda(max)|(-1) logN, where lambda(max) is the eigenvalue of the substitution rate matrix with the largest absolute value and N is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when t approximately 1/2 |lambda(1)|(-1) logN, where lambda(1) is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than those based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.  相似文献   

2.
We develop a new approach to estimate a matrix of pairwise evolutionary distances from a codon-based alignment based on a codon evolutionary model. The method first computes a standard distance matrix for each of the three codon positions. Then these three distance matrices are weighted according to an estimate of the global evolutionary rate of each codon position and averaged into a unique distance matrix. Using a large set of both real and simulated codon-based alignments of nucleotide sequences, we show that this approach leads to distance matrices that have a significantly better treelikeness compared to those obtained by standard nucleotide evolutionary distances. We also propose an alternative weighting to eliminate the part of the noise often associated with some codon positions, particularly the third position, which is known to induce a fast evolutionary rate. Simulation results show that fast distance-based tree reconstruction algorithms on distance matrices based on this codon position weighting can lead to phylogenetic trees that are at least as accurate as, if not better, than those inferred by maximum likelihood. Finally, a well-known multigene dataset composed of eight yeast species and 106 codon-based alignments is reanalyzed and shows that our codon evolutionary distances allow building a phylogenetic tree which is similar to those obtained by non-distance-based methods (e.g., maximum parsimony and maximum likelihood) and also significantly improved compared to standard nucleotide evolutionary distance estimates.  相似文献   

3.
SWAPSC: sliding window analysis procedure to detect selective constraints   总被引:3,自引:0,他引:3  
Sliding-window analysis procedure to detect selective constraints (SWAPSC) is a software system to dissect the constraints on the evolution of protein-coding genes. The program estimates rates of nucleotide substitutions at specific codon regions in each branch of a phylogenetic tree. The program uses several sets of simulated sequence alignments to estimate the probability of synonymous and non-synonymous nucleotide substitutions. Thereafter, a statistical analysis is conducted to determine the optimum window size to detect selective constraints. Finally, the optimum window size is slid along the real alignment and a test for significance of the estimated number of synonymous and non-synonymous nucleotide substitutions in each sliding step is conducted. A number of friendly useful output files is generated. AVAILABILITY: SWAPSC is available at http://www.may.ie/academic/biology/staff/mfmolecevolandbioinf.shtml distribution versions for both Linux and Windows operating systems are available, including manual and example files.  相似文献   

4.
New methods for estimating the numbers of synonymous and nonsynonymous substitutions per site were developed. The methods are unweighted pathway methods based on Kimura's two-parameter model. Computer simulations were conducted to evaluate the accuracies of the new methods, Nei and Gojobori's (NG) method, Miyata and Yasunaga's (MY) method, Li, Wu, and Luo's (LWL) method, and Pamilo, Bianchi, and Li's (PBL) method. The following results were obtained: (1) The NG, MY, and LWL methods give overestimates of the number of synonymous substitutions and underestimates of the number of nonsynonymous substitutions. The major cause for the biased estimation is that these three methods underestimate the number of synonymous sites and overestimate the number of nonsynonymous sites. (2) The PBL method gives better estimates of the numbers of synonymous and nonsynonymous substitutions than those obtained by the NG, MY, and LWL methods. (3) The new methods also give better estimates of the numbers of synonymous and nonsynonymous substitutions than those obtained by the NG, MY, and LWL methods. In addition, estimates of the numbers of synonymous and nonsynonymous sites obtained by the new methods are reasonably accurate. (4) In some cases, the new methods and the PBL method give biased estimates of substitution numbers. However, from the number of nucleotide substitutions at the third position of codons, we can examine whether estimates obtained by the new methods are good or not, whereas we cannot make an examination of estimates obtained by the PBL method. (5) When there are strong transition/transversion and nucleotide-frequency biases like mitochondrial genes, all of the above methods give biased estimates of substitution numbers. In such cases, Kondo et al.'s method is recommended to be used for estimating the number of synonymous substitutions, although their method cannot estimate the number of nonsynonymous substitutions and is time-consuming. These results, particularly result (1), call for reexaminations of some genes. This is because evolutionary pictures of genes have often been discussed on the basis of results obtained by the NG, MY, and LWL methods, which are favorable for the neutral theory of molecular evolution.  相似文献   

5.
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.  相似文献   

6.
Current phylogenetic methods attempt to account for evolutionary rate variation across characters in a matrix. This is generally achieved by the use of sophisticated evolutionary models, combined with dense sampling of large numbers of characters. However, systematic biases and superimposed substitutions make this task very difficult. Model adequacy can sometimes be achieved at the cost of adding large numbers of free parameters, with each parameter being optimized according to some criterion, resulting in increased computation times and large variances in the model estimates. In this study, we develop a simple approach that estimates the relative evolutionary rate of each homologous character. The method that we describe uses the similarity between characters as a proxy for evolutionary rate. In this article, we work on the premise that if the character-state distribution of a homologous character is similar to many other characters, then this character is likely to be relatively slowly evolving. If the character-state distribution of a homologous character is not similar to many or any of the rest of the characters in a data set, then it is likely to be the result of rapid evolution. We show that in some test cases, at least, the premise can hold and the inferences are robust. Importantly, the method does not use a "starting tree" to make the inference and therefore is tree independent. We demonstrate that this approach can work as well as a maximum likelihood (ML) approach, though the ML method needs to have a known phylogeny, or at least a very good estimate of that phylogeny. We then demonstrate some uses for this method of analysis, including the improvement in phylogeny reconstruction for both deep-level and recent relationships and overcoming systematic biases such as base composition bias. Furthermore, we compare this approach to two well-established methods for reweighting or removing characters. These other methods are tree-based and we show that they can be systematically biased. We feel this method can be useful for phylogeny reconstruction, understanding evolutionary rate variation, and for understanding selection variation on different characters.  相似文献   

7.
Here we present a new sliding window-based method specially designed to detect selective constraints in specific regions of a multiple protein-coding sequence alignment. In contrast to previous window-based procedures, our method is based on a nonarbitrary statistical approach to find the appropriate codon-window size to test deviations of synonymous (dS) and nonsynonymous (dN) nucleotide substitutions from the expectation. The probabilities of dN and dS are obtained from simulated data and used to detect significant deviations of dN and dS in a specific window region of the real sequence alignment. The nonsynonymous-to-synonymous rate ratio (w = dN/dS) was used to highlight selective constraints in any window wherein dS or dN was significantly different from the expectation. In these significant windows, w and its variance [V(w)] were calculated and used to test the neutral hypothesis. Computer simulations showed that the method is accurate even for highly divergent sequences. The main advantages of the new method are that it (i) uses a statistically appropriate window size to detect different selective patterns, (ii) is computationally less intensive than maximum likelihood methods, and (iii) detects saturation of synonymous sites, which can give deviations from neutrality. Hence, it allows the analysis of highly divergent sequences and the test of different alternative hypothesis as well. The application of the method to different human immunodeficiency virus type 1 and to foot-and-mouth disease virus genes confirms the action of positive selection on previously described regions as well as on new regions.  相似文献   

8.

Background

Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, which cannot be used as objective distance metrics. As a result one relies on crude measures, like the p- or log-det distances, or makes explicit, and often too simplistic, a priori assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI). MI is, in principle, an objective and model independent similarity measure, but it is not widely used in this context and no algorithm for extracting MI from a given alignment (without assuming an evolutionary model) is known. MI can be estimated without alignments, by concatenating and zipping sequences, but so far this has only produced estimates with uncontrolled errors, despite the fact that the normalized compression distance based on it has shown promising results.

Results

We describe a simple approach to get robust estimates of MI from global pairwise alignments. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. For animal mitochondrial DNA our approach uses the alignments made by popular global alignment algorithms to produce MI estimates that are strikingly close to estimates obtained from the alignment free methods mentioned above. We point out that, due to the fact that it is not additive, normalized compression distance is not an optimal metric for phylogenetics but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances.

Conclusions

Several versions of MI based distances outperform conventional distances in distance-based phylogeny. Even a simplified version based on single letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments. It strongly suggests that information theory concepts can be exploited further in sequence analysis.  相似文献   

9.
Several studies have demonstrated high levels of sequence conservation in noncoding DNA compared between two species (e.g., human and mouse), and interpreted this conservation as evidence for functional constraints. If this interpretation is correct, it suggests the existence of a hidden class of abundant regulatory elements. However, much of the noncoding sequence conserved between two species may result from chance or from small-scale heterogeneity in mutation rates. Stronger inferences are expected from sequence comparisons using more than two taxa, and by testing for spatial patterns of conservation in addition to primary sequence similarity. We used a Bayesian local alignment method to compare approximately 10 kb of intron sequence from nine genes in a pairwise manner between human, whale, and seal to test whether the degree and pattern of conservation is consistent with neutral divergence. Comparison of the three sets of conserved gapless pairwise blocks revealed the following patterns: The proportion of identical intron nucleotides averaged 47% in pairwise comparisons and 28% across the three taxa. Proportions of conserved sequence were similar in unique sequence and general mammalian repetitive elements. We simulated sequence evolution under a neutral model using published estimates of substitution rate heterogeneity for noncoding DNA and found pairwise identity at 33% and three-taxon identity at 16% of nucleotide sites. Spatial patterns of primary sequence conservation were also nonrandomly distributed within introns. Overall, segments of intron sequence closer to flanking exons were significantly more conserved than interior intron sequence. This level of intron sequence conservation is above that expected by chance and strongly suggests that intron sequences are playing a larger functional role in gene regulation than previously realized.  相似文献   

10.
The Haseman and Elston (H-E) method uses a simple linear regression to model the squared trait difference of sib pairs with the shared allele identical by descent (IBD) at marker locus for linkage testing. Under this setting, the squared mean-corrected trait sum is also linearly related to the IBD sharing. However, the resulting slope estimate for either model is not efficient. In this report, we propose a simple linkage test that optimally uses information from the estimates of both models. We also demonstrate that the new test is more powerful than both the traditional one and the recently revisited H-E methods.  相似文献   

11.
MOTIVATION: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution. RESULTS: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment, an alignment of DNA sequences that contain the coding sequence of the Pfam alignment when they can be recovered (overall, 82.9% of sequences taken from Pfam) and the alignment of amino acid sequences restricted to only those sequences for which a DNA sequence could be recovered. Each of the alignments has an estimate of the phylogenetic tree associated with it. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach.  相似文献   

12.
Identifying adaptive genetic divergence among populations from genome scans   总被引:26,自引:0,他引:26  
The identification of signatures of natural selection in genomic surveys has become an area of intense research, stimulated by the increasing ease with which genetic markers can be typed. Loci identified as subject to selection may be functionally important, and hence (weak) candidates for involvement in disease causation. They can also be useful in determining the adaptive differentiation of populations, and exploring hypotheses about speciation. Adaptive differentiation has traditionally been identified from differences in allele frequencies among different populations, summarised by an estimate of FST. Low outliers relative to an appropriate neutral population-genetics model indicate loci subject to balancing selection, whereas high outliers suggest adaptive (directional) selection. However, the problem of identifying statistically significant departures from neutrality is complicated by confounding effects on the distribution of FST estimates, and current methods have not yet been tested in large-scale simulation experiments. Here, we simulate data from a structured population at many unlinked, diallelic loci that are predominantly neutral but with some loci subject to adaptive or balancing selection. We develop a hierarchical-Bayesian method, implemented via Markov chain Monte Carlo (MCMC), and assess its performance in distinguishing the loci simulated under selection from the neutral loci. We also compare this performance with that of a frequentist method, based on moment-based estimates of FST. We find that both methods can identify loci subject to adaptive selection when the selection coefficient is at least five times the migration rate. Neither method could reliably distinguish loci under balancing selection in our simulations, even when the selection coefficient is twenty times the migration rate.  相似文献   

13.
14.
The recurrent fixation of newly arising, beneficial mutations in a species reduces levels of linked neutral variability. Models positing frequent weakly beneficial substitutions or, alternatively, rare, strongly selected substitutions predict similar average effects on linked neutral variability, if the product of the rate and strength of selection is held constant. We propose an approximate Bayesian (ABC) polymorphism-based estimator that can be used to distinguish between these models, and apply it to multi-locus data from Drosophila melanogaster. We investigate the extent to which inference about the strength of selection is sensitive to assumptions about the underlying distributions of the rates of substitution and recombination, the strength of selection, heterogeneity in mutation rate, as well as the population''s demographic history. We show that assuming fixed values of selection parameters in estimation leads to overestimates of the strength of selection and underestimates of the rate. We estimate parameters for an African population of D. melanogaster (ŝ∼2E−03, ) and compare these to previous estimates. Finally, we show that surveying larger genomic regions is expected to lend much more discriminatory power to the approach. It will thus be of great interest to apply this method to emerging whole-genome polymorphism data sets in many taxa.  相似文献   

15.

Background

This paper describes a new MSA tool called PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies sequences into two types: normally related and distantly related. For normally related sequences, it uses an adaptive approach to construct the guide tree needed for progressive alignment; it first estimates the input’s discrepancy by computing the standard deviation of their percent identities, and based on this estimate, it chooses the better method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree and uses instead some non-progressive alignment method to generate the alignment.

Results

To evaluate PnpProbs, we have compared it with thirteen other popular MSA tools, and PnpProbs has the best alignment scores in all but one test. We have also used it for phylogenetic analysis, and found that the phylogenetic trees constructed from PnpProbs’ alignments are closest to the model trees.

Conclusions

By combining the strength of the progressive and non-progressive alignment methods, we have developed an MSA tool called PnpProbs. We have compared PnpProbs with thirteen other popular MSA tools and our results showed that our tool usually constructed the best alignments.
  相似文献   

16.
17.
To test whether gaps resulting from sequence alignment contain phylogenetic signal concordant with those of base substitutions, we analyzed the occurrence of indel mutations upon a well-resolved, substitution-based tree for three nuclear genes in bumble bees (Bombus, Apidae: Bombini). The regions analyzed were exon and intron sequences of long-wavelength rhodopsin (LW Rh), arginine kinase (ArgK), and elongation factor-1alpha (EF-1alpha) F2 copy genes. LW Rh intron had only a few uninformative gaps, ArgK intron had relatively long gaps that were easily aligned, and EF-1alpha intron had many short gaps, resulting in multiple optimal alignments. The unambiguously aligned gaps within ArgK intron sequences showed no homoplasy upon the substitution-based tree, and phylogenetic signals within ambiguously aligned regions of EF-1alpha intron were highly congruent with those of base substitutions. We further analyzed the contribution of gap characters to phylogenetic reconstruction by incorporating them in parsimony analysis. Inclusion of gap characters consistently improved support for nodes recovered by substitutions, and inclusion of ambiguously aligned regions of EF-1alpha intron resolved several additional nodes, most of which were apical on the phylogeny. We conclude that gaps are an exceptionally reliable source of phylogenetic information that can be used to corroborate and refine phylogenies hypothesized by base substitutions, at least at lower taxonomic levels. At present, full use of gaps in phylogenetic reconstruction is best achieved in parsimony analysis, pending development of well-justified and generally applicable methods for incorporating indels in explicitly model-based methods.  相似文献   

18.
Dispersal is a key parameter of adaptation, invasion and persistence. Yet standard population genetics inference methods hardly distinguish it from drift and many species cannot be studied by direct mark‐recapture methods. Here, we introduce a method using rates of change in cline shapes for neutral markers to estimate contemporary dispersal. We apply it to the devastating banana pest Mycosphaerella fijiensis, a wind‐dispersed fungus for which a secondary contact zone had previously been detected using landscape genetics tools. By tracking the spatio‐temporal frequency change of 15 microsatellite markers, we find that σ, the standard deviation of parent–offspring dispersal distances, is 1.2 km/generation1/2. The analysis is further shown robust to a large range of dispersal kernels. We conclude that combining landscape genetics approaches to detect breaks in allelic frequencies with analyses of changes in neutral genetic clines offers a powerful way to obtain ecologically relevant estimates of dispersal in many species.  相似文献   

19.
MOTIVATION: A quantitative study of molecular evolutionary events such as substitutions, insertions and deletions from closely related genomes requires (1) an accurate multiple sequence alignment program and (2) a method to annotate the insertions and deletions that explain the 'gaps' in the alignment. Although the former requirement has been extensively addressed, the latter problem has received little attention, especially in a comprehensive probabilistic framework. RESULTS: Here, we present Indelign, a program that uses a probabilistic evolutionary model to compute the most likely scenario of insertions and deletions consistent with an input multiple alignment. It is also capable of modifying the given alignment so as to obtain a better agreement with the evolutionary model. We find close to optimal performance and substantial improvement over alternative methods, in tests of Indelign on synthetic data. We use Indelign to analyze regulatory sequences in Drosophila, and find an excess of insertions over deletions, which is different from what has been reported for neutral sequences. AVAILABILITY: The Indelign program may be downloaded from the website http://veda.cs.uiuc.edu/indelign/ SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.  相似文献   

20.
A "Long Indel" model for evolutionary sequence alignment   总被引:7,自引:0,他引:7  
We present a new probabilistic model of sequence evolution, allowing indels of arbitrary length, and give sequence alignment algorithms for our model. Previously implemented evolutionary models have allowed (at most) single-residue indels or have introduced artifacts such as the existence of indivisible "fragments." We compare our algorithm to these previous methods by applying it to the structural homology dataset HOMSTRAD, evaluating the accuracy of (1) alignments and (2) evolutionary time estimates. With our method, it is possible (for the first time) to integrate probabilistic sequence alignment, with reliability indicators and arbitrary gap penalties, in the same framework as phylogenetic reconstruction. Our alignment algorithm requires that we evaluate the likelihood of any specific path of mutation events in a continuous-time Markov model, with the event times integrated out. To this effect, we introduce a "trajectory likelihood" algorithm (Appendix A). We anticipate that this algorithm will be useful in more general contexts, such as Markov Chain Monte Carlo simulations.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号