Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth-death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program dnamlepsilon improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
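The peeling algorithm mentioned in this abstract can be sketched for a single alignment column. The following is only an illustration under stated assumptions: it uses the plain four-state Jukes-Cantor substitution model (not the gap-augmented rate matrix the abstract describes), a fixed three-leaf tree ((a, b), c), and hypothetical branch lengths. All function names are invented for this sketch.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for branch length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def leaf_partial(base):
    """Partial likelihood vector of a leaf: 1 for the observed base, else 0."""
    return [1.0 if b == base else 0.0 for b in BASES]

def combine(children):
    """Peeling step: children is a list of (partial_vector, branch_length).
    Returns the partial likelihood vector of the parent node."""
    parent = [1.0] * 4
    for partial, t in children:
        p = jc69_matrix(t)
        for x in range(4):
            parent[x] *= sum(p[x][y] * partial[y] for y in range(4))
    return parent

def site_likelihood(a, b, c, t_a=0.1, t_b=0.2, t_ab=0.05, t_c=0.3):
    """Likelihood of one column on the tree ((a, b), c).
    Branch lengths are hypothetical defaults chosen for illustration."""
    inner = combine([(leaf_partial(a), t_a), (leaf_partial(b), t_b)])
    root = combine([(inner, t_ab), (leaf_partial(c), t_c)])
    return sum(0.25 * r for r in root)  # uniform root frequencies under JC69

def total_probability():
    """Sanity check: likelihoods over all 64 leaf patterns should sum to 1."""
    return sum(site_likelihood(a, b, c) for a, b, c in product(BASES, repeat=3))
```

Each internal node is visited once, so the cost per column grows linearly with the number of nodes; this is the efficiency the abstract refers to.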

2.
Insertions and deletions are responsible for gaps in aligned nucleotide sequences, but they have usually been ignored when the number of nucleotide substitutions was estimated. We compared six sets of nuclear and mitochondrial noncoding DNA sequences of primates and obtained estimates of the evolutionary rate of insertion and deletion. The maximum-parsimony principle was applied to locate insertions and deletions on a given phylogenetic tree. Deletions were about twice as frequent as insertions for nuclear DNA, and single-nucleotide insertions and deletions were the most frequent among all events. The rate of insertion and deletion was found to be rather constant among branches of the phylogenetic tree, and the rate (approximately 2.0/kb/Myr) for mitochondrial DNA was found to be much higher than that (approximately 0.2/kb/Myr) for nuclear DNA. The rates of nucleotide substitution were about 10 times higher than the rate of insertion and deletion for both nuclear and mitochondrial DNA.

3.
Nucleotide insertions and deletions (indels) are responsible for gaps in sequence alignments. Indels are one of the major sources of evolutionary change at the molecular level. We have examined the patterns of insertions and deletions in 19 mammalian genomes, and found that deletion events are more common than insertions in the mammalian genomes. The numbers of both insertions and deletions decrease rapidly as gap length increases, and single-nucleotide indels are the most frequent among all indel events. The frequencies of both insertions and deletions can be described well by a power law. Key words: insertion, deletion, gap, indel, mammalian genome.

4.
Opinions split when it comes to the significance and thus the weighting of indel characters as phylogenetic markers. This paper attempts to test the phylogenetic information content of indels and nucleotide substitutions by proposing an a priori weighting system for non-protein-coding genes. Theoretically, the system rests on a weighting scheme which is based on a falsificationist approach to cladistic inference. It assigns weights to insertions, deletions and nucleotide substitutions according to their specific number of identical classes of potential falsifiers, resulting in the following system: nucleotide substitutions weight = 3, deletions of n nucleotides weight = (2n–1), and insertions of n nucleotides weight = (5n–1). This weighting system and the utility of indels as phylogenetic markers are tested against a suitable data set of 18S rDNA sequences of Diptera and Strepsiptera taxa together with other Metazoa species. The indels support the same clades as the nucleotide substitution data, and the application of the weighting system increases the corresponding consistency indices of the differentially weighted character types. As a consequence, applying the weighting system seems to be reasonable, and indels appear to be good phylogenetic markers.
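The weights stated in this abstract (substitution = 3, deletion of n nucleotides = 2n–1, insertion of n nucleotides = 5n–1) can be written as a small lookup. This is a sketch only; the function name and the string labels for event types are hypothetical and do not come from the paper.

```python
def event_weight(kind, n=1):
    """A priori character weight for one event, following the formulas
    quoted in the abstract. `kind` labels are hypothetical; `n` is the
    indel length in nucleotides and is ignored for substitutions."""
    if n < 1:
        raise ValueError("indel length must be at least 1")
    if kind == "substitution":
        return 3
    if kind == "deletion":
        return 2 * n - 1
    if kind == "insertion":
        return 5 * n - 1
    raise ValueError(f"unknown event type: {kind}")

# Example: weights for indels of length 1 to 3.
for length in (1, 2, 3):
    print(length, event_weight("deletion", length), event_weight("insertion", length))
```

Under these formulas a single-nucleotide deletion weighs less than a substitution (1 versus 3), while a single-nucleotide insertion weighs more (4 versus 3).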

5.
Inching toward reality: An improved likelihood model of sequence evolution   Cited by: 3 (self-citations: 0, citations by others: 3)
Summary: Our previous evolutionary model is generalized to permit approximate treatment of multiple-base insertions and deletions as well as regional heterogeneity of substitution rates. Parameter estimation and alignment procedures that incorporate these generalizations are developed. Simulations are used to assess the accuracy of the parameter estimation procedure and an example of an inferred alignment is included. Offprint requests to: J.L. Thorne

6.
The horizontal gene transfer (HGT) being inferred within prokaryotic genomes appears to be sufficiently massive that many scientists think it may have effectively obscured much of the history of life recorded in DNA. Here, we demonstrate that the tree of life can be reconstructed even in the presence of extensive HGT, provided the processes of genome evolution are properly modeled. We show that the dynamic deletions and insertions of genes that occur during genome evolution, including those introduced by HGT, may be modeled using techniques similar to those used to model nucleotide substitutions that occur during sequence evolution. In particular, we show that appropriately designed general Markov models are reasonable tools for reconstructing genome evolution. These studies indicate that, provided genomes contain sufficiently many genes and that the Markov assumptions are met, it is possible to reconstruct the tree of life. We also consider the fusion of genomes, a process not encountered in gene sequence evolution, and derive a method for the identification and reconstruction of genome fusion events. Genomic reconstructions of a well-defined classical four-genome problem, the root of the multicellular animals, show that the method, when used in conjunction with paralinear/logdet distances, performs remarkably well and is relatively unaffected by the recently discovered big genome artifact.

7.
Qian B, Goldstein RA. Proteins 2001, 45(1):102-104
Protein sequence alignment has become a widely used method in the study of newly sequenced proteins. Most sequence alignment methods use an affine gap penalty to assign scores to insertions and deletions. Although affine gap penalties represent the relative ease of extending a gap compared with initializing a gap, they are still an obvious oversimplification of the real processes that occur during sequence evolution. To improve the efficiency of sequence alignment methods and to obtain a better understanding of the process of sequence evolution, we wanted to find a more accurate model of insertions and deletions in homologous proteins. In this work, we extract the probability of a gap occurrence and the resulting gap length distribution in distantly related proteins (sequence identity < 25%) using alignments based on their common structures. We observe a distribution of gaps that can be fitted with a multiexponential with four distinct components. The results suggest new approaches to modeling insertions and deletions in sequence alignments.
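For readers unfamiliar with the affine gap penalty this abstract starts from, a minimal sketch follows. It assumes one common convention, in which a gap of length L costs one opening penalty plus an extension penalty for each additional position; other conventions charge the extension for every position. The parameter values are hypothetical and are not taken from the paper.

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of a single gap of `length` residues under an affine penalty:
    gap_open for the first position, gap_extend for each further one.
    Default values are illustrative assumptions only."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)

def linear_gap_cost(length, per_residue=10):
    """Linear penalty shown for comparison: every gapped position costs the same."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return per_residue * length

# One gap of length 5 is cheaper than five separate gaps of length 1
# under the affine scheme, but not under the linear one.
print(affine_gap_cost(5), 5 * affine_gap_cost(1))
print(linear_gap_cost(5), 5 * linear_gap_cost(1))
```

The affine form implies a geometric-like preference over gap lengths, which is the simplification the abstract contrasts with its fitted four-component multiexponential distribution.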

8.
MOTIVATION: Bayesian analysis is one of the most popular methods in phylogenetic inference. The most commonly used methods fix a single multiple alignment and consider only substitutions as phylogenetically informative mutations, though alignments and phylogenies should be inferred jointly as insertions and deletions also carry informative signals. Methods addressing these issues have been developed only recently, and so far there has not been a user-friendly program with a graphical interface that implements these methods. RESULTS: We have developed an extendable software package in the Java programming language that samples from the joint posterior distribution of phylogenies, alignments and evolutionary parameters by applying the Markov chain Monte Carlo method. The package also offers tools for efficient on-the-fly summarization of the results. It has a graphical interface to configure, start and supervise the analysis, to track the status of the Markov chain and to save the results. The background model for insertions and deletions can be combined with any substitution model. It is easy to add new substitution models to the software package as plugins. The samples from the Markov chain can be summarized in several ways, and new postprocessing plugins may also be installed.

9.
Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have independent rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of rate classes, where the number of classes is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene and organism specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes.

10.
SUMMARY: BAli-Phy is a Bayesian posterior sampler that employs Markov chain Monte Carlo to explore the joint space of alignment and phylogeny given molecular sequence data. Simultaneous estimation eliminates bias toward inaccurate alignment guide-trees, employs more sophisticated substitution models during alignment and automatically utilizes information in shared insertion/deletions to help infer phylogenies. AVAILABILITY: Software is available for download at http://www.biomath.ucla.edu/msuchard/bali-phy.

11.
Sequence alignment underpins common tasks in molecular biology, including genome annotation, molecular phylogenetics, and homology modeling. Fundamental to sequence alignment is the placement of gaps, which represent character insertions or deletions. We assessed the ability of a generalized affine gap cost model to reliably detect remote protein homology and to produce high-quality alignments. Generalized affine gap alignment with optimal gap parameters performed as well as the traditional affine gap model in remote homology detection. Evaluation of alignment quality showed that the generalized affine model aligns fewer residue pairs than the traditional affine model but achieves significantly higher per-residue accuracy. We conclude that generalized affine gap costs should be used when alignment accuracy carries more importance than aligned sequence length.

12.
Summary: The common but generally overlooked problem of how best to construct phylogenies from orthologous amino acid sequences, when their alignment requires the placement therein of gaps denoting insertions/deletions in the evolutionary history of their genes since their common ancestor, has been studied. Three diverse methods were examined: 1. each missing residue in a gap is weighted as equivalent to the average number of minimum nucleotide replacements in known conjugate amino acid pairs of those same two sequences, which weight necessarily differs for each pair of sequences; 2. each missing residue in a gap is weighted as equivalent to a fixed number of nucleotide replacements; and 3. each gap, regardless of length, is weighted as equivalent to a fixed number of nucleotide replacements. For the flavodoxins, each method yielded a different best tree, suggesting that the choice of method may be crucial. For the plant ferredoxins, all methods give results inconsistent with botanical classification, suggesting that the sequences may not all be orthologous. For the bacterial ferredoxins, the method was less germane than the actual weight used, five different best trees being obtained depending upon the weight. The best tree for all ferredoxins (prokaryotic plus eukaryotic) combined proved to be greatly dependent upon the gap locations, with several reasonable alignments yielding different best trees. They also suggest that functional equivalence may well prove to be a poor guide to which residues have a common ancestral codon. The rubredoxin sequences show that a partial internal gene duplication occurred in the Pseudomonas line, probably very soon after its divergence from the other genera. Together, the results clearly indicate that the phylogenetic answer one gets may greatly depend upon how one treats the gaps but they fail to indicate what treatment may be best. This results partly from the fact that the phylogenies of the taxa represented are not known with sufficient confidence to be sure when the procedures are performing best.

13.
Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators is biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the de facto standard approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the de facto standard estimators.

14.
To study the mechanisms for local evolutionary changes in DNA sequences involving slippage-type insertions and deletions, an alignment approach is explored that can consider the posterior probabilities of alignment models. Various patterns of insertion and deletion that can link the ancestor and descendant sequences are proposed and evaluated by simulation and compared by the Markov chain Monte Carlo (MCMC) method. Analyses of pseudogenes reveal that the introduction of the parameters that control the probability of slippage-type events markedly augments the probability of the observed sequence evolution, arguing that a cryptic involvement of slippage occurrences is manifested as insertions and deletions of short nucleotide segments. Strikingly, approximately 80% of insertions in human pseudogenes and approximately 50% of insertions in murid pseudogenes are likely to be caused by the slippage-mediated process, as represented by BC in ABCD → ABCBCD. We suggest that, in both humans and murids, even very short repetitive motifs, such as CAGCAG, CACACA, and CCCC, have approximately 10- to 15-fold susceptibility to insertions and deletions, compared to nonrepetitive sequences. Our protocol, namely, indel-MCMC, thus seems to be a reasonable approach for statistical analyses of the early phase of microsatellite evolution.

15.
Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within N-tuples of adjacent sites, their ability to accommodate overlapping N-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. Overlapping tuples are efficiently handled by assuming Markov dependence of the observed bases at each site on those at the N - 1 preceding sites, and the required conditional probabilities are computed with an extension of Felsenstein's algorithm. Estimated substitution rates based on a data set of about 160,000 noncoding sites in mammalian genomes indicate a pronounced CpG effect, but they also suggest a complex overall pattern of context-dependent substitution, comprising a variety of subtle effects. Estimates based on about 3 million sites in coding regions demonstrate that amino acid substitution rates can be learned at the nucleotide level, and suggest that context effects across codon boundaries are significant.

16.
Abstract: The sequence variation within the group I intron in five Naegleria spp. was studied and compared with the sequence variation within the flanking small subunit ribosomal DNA. Considerable sequence divergence was observed in the introns as well as in the rDNA. In the intron, deletions and insertions are only detected in the sequence contributing to the secondary structure, not in the open reading frame. Most of the sequence variation is detected in the unpaired loops. In the case of nucleotide substitution in helices, compensating base pair changes were observed. The sequence variation does not induce variation in the secondary structure model. The phylogenetic tree based on the intron sequences is similar to the tree based on the flanking rDNA sequences. This observation indicates that the intron might have been acquired at an early stage in evolution, and lost in the majority of Naegleria spp.

17.
Selecting the best-fit model of nucleotide substitution   Cited by: 2 (self-citations: 0, citations by others: 2)
Despite the relevant role of models of nucleotide substitution in phylogenetics, choosing among different models remains a problem. Several statistical methods for selecting the model that best fits the data at hand have been proposed, but their absolute and relative performance has not yet been characterized. In this study, we compare under various conditions the performance of different hierarchical and dynamic likelihood ratio tests, and of Akaike and Bayesian information methods, for selecting best-fit models of nucleotide substitution. We specifically examine the role of the topology used to estimate the likelihood of the different models and the importance of the order in which hypotheses are tested. We do this by simulating DNA sequences under a known model of nucleotide substitution and recording how often this true model is recovered by the different methods. Our results suggest that model selection is reasonably accurate and indicate that some likelihood ratio test methods perform overall better than the Akaike or Bayesian information criteria. The tree used to estimate the likelihood scores does not influence model selection unless it is a randomly chosen tree. The order in which hypotheses are tested, and the complexity of the initial model in the sequence of tests, influence model selection in some cases. Model fitting in phylogenetics has been suggested for many years, yet many authors still arbitrarily choose their models, often using the default models implemented in standard computer programs for phylogenetic estimation. We show here that a best-fit model can be readily identified. Consequently, given the relevance of models, model fitting should be routine in any phylogenetic analysis that uses models of evolution.
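The three selection criteria compared in this abstract have standard textbook definitions, sketched below. The log-likelihoods, parameter counts and alignment length in the example are hypothetical numbers chosen for illustration, not results from the study, and the function names are invented for this sketch.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood

def lrt_statistic(log_likelihood_null, log_likelihood_alt):
    """Likelihood ratio test statistic for nested models: 2 (ln L1 - ln L0).
    It is usually compared against a chi-square distribution whose degrees
    of freedom equal the difference in free parameters."""
    return 2 * (log_likelihood_alt - log_likelihood_null)

# Hypothetical fits of a simpler and a richer substitution model
# to the same 1000-site alignment.
simple = {"lnL": -1000.0, "k": 1}
rich = {"lnL": -995.0, "k": 5}

print("AIC:", aic(simple["lnL"], simple["k"]), aic(rich["lnL"], rich["k"]))
print("BIC:", bic(simple["lnL"], simple["k"], 1000), bic(rich["lnL"], rich["k"], 1000))
print("LRT:", lrt_statistic(simple["lnL"], rich["lnL"]))
```

BIC penalizes extra parameters more heavily than AIC once the alignment has more than about eight sites (ln n > 2), so the two criteria can disagree on the same pair of fits.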

18.
The phylogenetic diversification of Hexapoda is still not fully understood. Morphological and molecular analyses have resulted in partly contradicting hypotheses. In molecular analyses, 18S sequences are the most frequently employed, but it appears that 18S sequences do not contain enough phylogenetic signals to resolve basal relationships of hexapod lineages. Until recently, character interdependence in these data has never been treated seriously, though possibly accounting for the occurrence of biased results. However, software packages are readily available which can incorporate information on character interdependence within a Bayesian approach. Accounting for character covariation derived from a hexapod consensus secondary structure model and applying mixed DNA/RNA substitution models, our Bayesian analysis of 321 hexapod sequences yielded a partly robust tree that depicts many hexapod relationships congruent with morphological considerations. It appears that the application of mixed DNA/RNA models removes many of the anomalies seen in previous studies. We focus on basal hexapod relationships for which unambiguous results are missing. In particular, the strong support for a “Chiastomyaria” clade (Ephemeroptera+Neoptera) obtained in Kjer's [2004. Aligned 18S and insect phylogeny. Syst. Biol. 53, 1–9] study of 18S sequences could not be confirmed by our analysis. The hexapod tree can be rooted with monophyletic Entognatha but not with a clade Ellipura (Collembola+Protura). Compared to previously published contributions, accounting for character interdependence in analyses of rRNA data presents an improvement of phylogenetic resolution. We suggest that an integration of explicit clade-specific rRNA structural refinements is not only possible but an important step in the optimization of substitution models dealing with rRNA data.

19.
Exact and heuristic algorithms for the Indel Maximum Likelihood Problem.   Cited by: 1 (self-citations: 0, citations by others: 1)
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing the most likely scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, which we call the Indel Maximum Likelihood Problem (IMLP), is an important step toward the reconstruction of ancestral genomic sequences, and is important for studying evolutionary processes, genome function, adaptation and convergence. We solve the IMLP using a new type of tree hidden Markov model whose states correspond to single-base evolutionary scenarios and where transitions model dependencies between neighboring columns. The standard Viterbi and Forward-backward algorithms are optimized to produce the most likely ancestral reconstruction and to compute the level of confidence associated with specific regions of the reconstruction. A heuristic is presented to make the method practical for large data sets, while retaining an extremely high degree of accuracy. The methods are illustrated on a 1-Mb alignment of the CFTR regions from 12 mammals.

20.
Using real sequence data, we evaluate the adequacy of assumptions made in evolutionary models of nucleotide substitution and the effects that these assumptions have on estimation of evolutionary trees. Two aspects of the assumptions are evaluated. The first concerns the pattern of nucleotide substitution, including equilibrium base frequencies and the transition/transversion-rate ratio. The second concerns the variation of substitution rates over sites. The maximum-likelihood estimate of tree topology appears quite robust to both these aspects of the assumptions of the models, but evaluation of the reliability of the estimated tree by using simpler, less realistic models can be misleading. Branch lengths are underestimated when simpler models of substitution are used, but the underestimation caused by ignoring rate variation over nucleotide sites is much more serious. The goodness of fit of a model is reduced by ignoring spatial rate variation, but unrealistic assumptions about the pattern of nucleotide substitution can lead to an extraordinary reduction in the likelihood. It seems that evolutionary biologists can obtain accurate estimates of certain evolutionary parameters even with an incorrect phylogeny, while systematists cannot get the right tree with confidence even when a realistic, and more complex, model of evolution is assumed.
