Similar articles
 Found 20 similar articles (search time: 940 ms)
1.
Phylogenetic studies based on DNA sequences typically ignore the potential occurrence of recombination, which may produce different alignment regions with different evolutionary histories. Traditional phylogenetic methods assume that a single history underlies the data. If recombination is present, can we expect the inferred phylogeny to represent any of the underlying evolutionary histories? We examined this question by applying traditional phylogenetic reconstruction methods to simulated recombinant sequence alignments. The effect of recombination on phylogeny estimation depended on the relatedness of the sequences involved in the recombinational event and on the extent of the different regions with different phylogenetic histories. Given the topologies examined here, when the recombinational event was ancient, or when recombination occurred between closely related taxa, one of the two phylogenies underlying the data was generally inferred. In this scenario, the evolutionary history corresponding to the majority of the positions in the alignment was generally recovered. Very different results were obtained when recombination occurred recently among divergent taxa. In this case, when the recombinational breakpoint divided the alignment into two regions of similar length, a phylogeny that was different from any of the true phylogenies underlying the data was inferred.

2.
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.
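
The quantity that inverse parametric alignment tries to shrink can be illustrated with a minimal Python sketch. It assumes a simple linear gap penalty and a match/mismatch scoring scheme, which is a simplification and not necessarily the scoring function used in the study; the function names and default parameter values are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def score_given_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score of a fixed example alignment given as two equal-length gapped rows."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def alignment_error(row_a, row_b, **params):
    """Gap between the optimal score and the example's score (never negative)."""
    seq_a = row_a.replace("-", "")
    seq_b = row_b.replace("-", "")
    optimal = global_alignment_score(seq_a, seq_b, **params)
    return optimal - score_given_alignment(row_a, row_b, **params)
```

Calling `alignment_error` on the same example rows with different `gap` values shows how strongly the error depends on the chosen penalty; a parameter search would look for values that keep this error small on average over all examples.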

3.
Molecular sequence data have become prominent tools for phylogenetic relationship inference, particularly useful in the analysis of highly diverse taxonomic orders. Ribosomal RNA sequences provide markers that can be used in the study of phylogeny, because their function and structure have been conserved to a large extent throughout the evolutionary history of organisms. These sequences are inferred from cloned or enzymatically amplified gene sequences, or determined by direct RNA sequencing. The first step of the phylogenetic interpretation of nucleic acid sequence variations requires proper alignment of corresponding sequences from various organisms. Best alignment based on similarity criteria is greatly reinforced, in the case of ribosomal RNAs, by secondary structure homologies. Distance matrix methods to infer evolutionary trees are based on the assumption that the phylogenetic distance between each pair of organisms is proportional to the number of nucleotide substitution events. Computed tree inference methods usually take into consideration the possibility of unequal mutation rates among lineages. Divergence times can be estimated on the tree, provided that at least one lineage has been dated by fossil records. We have utilized this approach based on ribosomal RNA sequence comparison to investigate the phylogenetic relationship between dinoflagellates and other eukaryotic protists, and to refine controversial phylogenies of the class Dinophyceae.
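
As a rough illustration of the distance matrix idea, the following Python sketch computes pairwise distances from aligned sequences. The Jukes-Cantor correction is an assumption chosen here for simplicity; the abstract does not say which substitution model was used, and the function names are hypothetical.

```python
import math


def p_distance(s1, s2):
    """Proportion of differing sites, ignoring columns where either row has a gap."""
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(aligned):
    """Map each ordered pair of taxon names to its corrected distance."""
    return {
        (a, b): jukes_cantor(p_distance(aligned[a], aligned[b]))
        for a in aligned
        for b in aligned
    }
```

The correction matters because the raw proportion of differing sites underestimates the number of substitution events once multiple changes hit the same site.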

4.
Alignment of nucleotide and/or amino acid sequences is a fundamental component of sequence‐based molecular phylogenetic studies. Here we examined how different alignment methods affect the phylogenetic trees that are inferred from the alignments. We used simulations to determine how alignment errors can lead to systematic biases that affect phylogenetic inference from those sequences. We compared four approaches to sequence alignment: progressive pairwise alignment, simultaneous multiple alignment of sequence fragments, local pairwise alignment and direct optimization. When taking into account branch support, implied alignments produced by direct optimization were found to show the most extreme behaviour (based on the alignment programs for which nearly equivalent alignment parameters could be set) in that they provided the strongest support for the correct tree in the simulations in which it was easy to resolve the correct tree and the strongest support for the incorrect tree in our long‐branch‐attraction simulations. When applied to alignment‐sensitive process partitions with different histories, direct optimization showed the strongest mutual influence between the process partitions when they were aligned and phylogenetically analysed together, which makes detecting recombination more difficult. Simultaneous alignment performed well relative to direct optimization and progressive pairwise alignment across all simulations. Rather than relying upon methods that integrate alignment and tree search into a single step without accounting for alignment uncertainty, as with implied alignments, we suggest that simultaneous alignment using the similarity criterion, within the context of information available on biological processes and function, be applied whenever possible for sequence‐based phylogenetic analyses.

5.
The phylogeny of carabid tribes is examined with sequences of 18S ribosomal DNA from eighty-four carabids representing forty-seven tribes, and fifteen outgroup taxa. Parsimony, distance and maximum likelihood methods are used to infer the phylogeny. Although many clades established with morphological evidence are present in all analyses, many of the basal relationships in carabids vary from analysis to analysis. These deeper relationships are also sensitive to variation in the sequence alignment under different alignment conditions. There is moderate evidence against the monophyly of Migadopini + Amarotypini, Scaritini + Clivinini, Bembidiini and Brachinini. Psydrini are not monophyletic, and consist of three distinct lineages (Psydrus, Laccocenus and a group of austral psydrines, from the Southern Hemisphere consisting of all the subtribes excluding Psydrina). The austral psydrines are related to Harpalinae plus Brachinini. The placements of many lineages, including Gehringia, Apotomus, Omophron, Psydrus and Cymbionotum, are unclear from these data. One unexpected placement, suggested with moderate support, is Loricera as the sister group to Amarotypus. Trechitae plus Patrobini form a monophyletic group. Brachinini probably form the sister group to Harpalinae, with the latter containing Pseudomorpha, Morion and Cnemalobus. The most surprising, well supported result is the placement of four lineages (Cicindelinae, Rhysodinae, Paussinae and Scaritini) as near relatives of Harpalinae + Brachinini. Because these four lineages all have divergent 18S rDNA, and thus have long basal branches, parametric bootstrapping was conducted to determine if their association and placement could be the result of long branch attraction. Simulations on model trees indicate that, although their observed association might be due to long branch attraction, there was no evidence that their placement near Harpalinae could be so explained. These simulations also suggest that 18S rDNA might not be sufficient to infer basal carabid relationships.

6.

Background: Non-parametric bootstrapping is a widely used statistical procedure for assessing confidence of model parameters based on the empirical distribution of the observed data [1] and, as such, it has become a common method for assessing tree confidence in phylogenetics [2]. Traditional non-parametric bootstrapping does not weight each tree inferred from resampled (i.e., pseudo-replicated) sequences. Hence, the quality of these trees is not taken into account when computing bootstrap scores associated with the clades of the original phylogeny. As a consequence, traditionally, the trees with different bootstrap support or those providing a different fit to the corresponding pseudo-replicated sequences (the fit quality can be expressed through the LS, ML or parsimony score) contribute in the same way to the computation of the bootstrap support of the original phylogeny.
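
The traditional, unweighted procedure described here can be sketched in Python as follows. The tree-building step is deliberately left as a caller-supplied function: `infer_clades` is a hypothetical placeholder that takes an alignment (a dict of name to sequence) and returns a set of clades, each a `frozenset` of taxon names. Nothing in this sketch comes from the paper's own software.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement to build one pseudo-replicate."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, infer_clades, n_replicates=100, seed=0):
    """Fraction of replicates whose tree contains each clade of the original tree.

    Every replicate tree counts equally, regardless of how well it fits
    its pseudo-replicated data (the unweighted, traditional scheme).
    """
    rng = random.Random(seed)
    reference = infer_clades(alignment)
    counts = {clade: 0 for clade in reference}
    for _ in range(n_replicates):
        replicate_clades = infer_clades(bootstrap_replicate(alignment, rng))
        for clade in reference:
            if clade in replicate_clades:
                counts[clade] += 1
    return {clade: counts[clade] / n_replicates for clade in reference}
```

A weighted variant would replace the `+= 1` increment with a per-replicate weight derived from a fit score; how such weights should be defined is the subject of the study and is not attempted here.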

7.
We applied a novel strategy to infer sequence circularity and complete assembly of four mitochondrial genomes (mitogenomes) of the frog families Bufonidae (Melanophryniscus moreirae), Dendrobatidae (Hyloxalus subpunctatus and Phyllobates terribilis), and Scaphiopodidae (Scaphiopus holbrookii). These are the first complete mitogenomes of these four genera and Scaphiopodidae. We assembled mitogenomes from short genomic sequence reads using a baiting and iterative mapping strategy followed by a new ad hoc mapping strategy developed to test for assembly circularization. To assess the quality of the inferred circularization, we used Bowtie2 alignment scores and a new per‐position sequence coverage value (which we named “connectivity”). Permutation tests with 400 iterations per specimen and 1% or 5% chance of mutation at the ends of the putative circular sequences showed that the proposed method is highly sensitive, with a single nucleotide insertion or deletion being sufficient for circularity to be rejected. False positives comprised only 2% of all observations and possessed significantly lower alignment scores. The size, gene content, and gene arrangement of each mitogenome differed among the species but matched the expectations for their clades. We argue that basic studies on circular sequences can benefit from the results and bioinformatics procedures introduced here, especially when closely related references are lacking.

8.
Direct optimization (DO) of 126 nuclear‐encoded SSU rRNA diatom sequences was conducted. The optimal phylogeny indicated several unique relationships with respect to those recovered from a maximum likelihood (ML) analysis of an alignment based on maximizing primary and secondary structural similarity between 126 nuclear‐encoded SSU rRNA diatom sequences (Medlin and Kaczmarska, 2004). Dividing diatoms into the subdivisions Coscinodiscophytina and Bacillariophytina was not supported by the DO phylogeny, due to the paraphyly of the former. The same pertains to Coscinodiscophyceae, Mediophyceae, Thalassiosira, Fragilaria and Amphora. The ordinal‐level classification of the diatoms proposed by Round et al. (1990) was for the most part found to be unsupported. The DO phylogeny represented a more rigorous hypothesis than the ML tree because DO maximized character congruence during the homology testing (i.e., alignment/tree search) process whereas the non‐phylogenetic similarity‐based alignment used in the ML analysis did not. The above statement is supported by “controlled” parsimony analyses of 35 sequences, which strongly suggested that dissimilarities in the DO and ML tree structure were due to the specific homology testing approach used. It could not be precluded that differences in taxon sampling and the use of a dissimilar optimality criterion contributed to discrepancies in the structure of the optimal ML and DO trees.

9.
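Sequence alignment is an important task in bioinformatics; it can reveal functional, structural, and evolutionary information in biological sequences. The biological significance of an alignment result is closely tied to the chosen scoring functions for matches, mismatches, insertions, deletions, and gaps. We introduce a parametric sequence alignment method that treats the optimal alignment as a function of the weights and penalties, so that the effect of parameter choice on the optimal alignment can be obtained systematically. We then apply the method to RNA sequence alignment and analyze how different parameter choices affect the alignment results. Finally, we point out applications of parametric sequence alignment algorithms and directions for future development.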

10.
Ascidians exhibit two different modes of development. A tadpole larva is formed during urodele development, whereas the larval phase is modified or absent during anural development. Anural development is restricted to a small number of species in one or possibly two ascidian families and is probably derived from ancestors with urodele development. Anural and urodele ascidians constitute a model system in which to study the evolution of development, but the phylogeny of anural development has not been resolved. Classification based on larval characters suggests that anural species are monophyletic, whereas classification according to adult morphology suggests they are polyphyletic. In the present study, we have inferred the origin of anural development using rDNA sequences. The central region of 18S rDNA and the hypervariable D2 loop of 28S rDNA were amplified from the genomic DNA of anural and urodele ascidian species by the polymerase chain reaction and sequenced. Phylogenetic trees inferred from 18S rDNA sequences of 21 species placed anural developers into two discrete groups corresponding to the Styelidae and Molgulidae, suggesting that anural development evolved independently in these families. Furthermore, the 18S rDNA trees inferred at least four independent origins of anural development in the family Molgulidae. Phylogenetic trees inferred from the D2 loop sequences of 13 molgulid species confirmed the 18S rDNA phylogeny. Anural development appears to have evolved rapidly because some anural species are placed as closely related sister groups to urodele species. The phylogeny inferred from rDNA sequences is consistent with molgulid systematics according to adult morphology and supports the polyphyletic origin of anural development in ascidians. Correspondence to: W.R. Jeffery

11.
The reconstruction of phylogenetic history is predicated on being able to accurately establish hypotheses of character homology, which involves sequence alignment for studies based on molecular sequence data. In an empirical study investigating nucleotide sequence alignment, we inferred phylogenetic trees for 43 species of the Apicomplexa and 3 of Dinozoa based on complete small-subunit rDNA sequences, using six different multiple-alignment procedures: manual alignment based on the secondary structure of the 18S rRNA molecule, and automated similarity-based alignment algorithms using the PileUp, ClustalW, TreeAlign, MALIGN, and SAM computer programs. Trees were constructed using neighbor-joining, weighted-parsimony, and maximum-likelihood methods. All of the multiple sequence alignment procedures yielded the same basic structure for the estimate of the phylogenetic relationship among the taxa, which presumably represents the underlying phylogenetic signal. However, the placement of many of the taxa was sensitive to the alignment procedure used; and the different alignments produced trees that were on average more dissimilar from each other than did the different tree-building methods used. The multiple alignments from the different procedures varied greatly in length, but aligned sequence length was not a good predictor of the similarity of the resulting phylogenetic trees. We also systematically varied the gap weights (the relative cost of inserting a new gap into a sequence or extending an already-existing gap) for the ClustalW program, and this produced alignments that were at least as different from each other as those produced by the different alignment algorithms. Furthermore, there was no combination of gap weights that produced the same tree as that from the structure alignment, in spite of the fact that many of the alignments were similar in length to the structure alignment. We also investigated the phylogenetic information content of the helical and nonhelical regions of the rDNA, and conclude that the helical regions are the most informative. We therefore conclude that many of the literature disagreements concerning the phylogeny of the Apicomplexa are probably based on differences in sequence alignment strategies rather than differences in data or tree-building methods.

12.
Finding correct species relationships using phylogeny reconstruction based on molecular data is dependent on several empirical and technical factors. These include the choice of DNA sequence from which phylogeny is to be inferred, the establishment of character homology within a sequence alignment, and the phylogeny algorithm used. Nevertheless, sequencing and phylogeny tools provide a way of testing certain hypotheses regarding the relationship among the organisms for which phenotypic characters demonstrate conflicting evolutionary information. The protozoan family Sarcocystidae is one such group for which molecular data have been applied phylogenetically to resolve questionable relationships. However, analyses carried out to date, particularly based on small-subunit ribosomal DNA, have not resolved all of the relationships within this family. Analysis of more than one gene is necessary in order to obtain a robust species signal, and some DNA sequences may not be appropriate in terms of their phylogenetic information content. With this in mind, we tested the informativeness of our chosen molecule, the large-subunit ribosomal DNA (lsu rDNA), by using subdivisions of the sequence in phylogenetic analysis through PAUP, fastDNAml, and neighbor joining. The segments of sequence applied correspond to areas of higher nucleotide variation in a secondary-structure alignment involving 21 taxa. We found that subdivision of the entire lsu rDNA is inappropriate for phylogenetic analysis of the Sarcocystidae. There are limited informative nucleotide sites in the lsu rDNA for certain clades, such as the one encompassing the subfamily Toxoplasmatinae. Consequently, the removal of any segment of the alignment compromises the final tree topology. We also tested the effect of using two different alignment procedures (CLUSTAL W and the structure alignment using DCSE) and three different tree-building methods on the final tree topology. This work shows that congruence between different methods in the formation of clades may be a feature of robust topology; however, a sequence alignment based on primary structure may not be comparing homologous nucleotides even though the expected topology is obtained. Our results support previous findings showing the paraphyly of the current genera Sarcocystis and Hammondia and again bring into question the relationships of Sarcocystis muris, Isospora felis, and Neospora caninum. In addition, results based on phylogenetic analysis of the structure alignment suggest that Sarcocystis zamani and Sarcocystis singaporensis, which have reptilian definitive hosts, are monophyletic with Sarcocystis species using mammalian definitive hosts if the genus Frenkelia is synonymized with Sarcocystis.

13.
Many outstanding questions about dinoflagellate evolution can potentially be resolved by establishing a robust phylogeny. To do this, we generated a data set of mitochondrial cytochrome b (cob) and mitochondrial cytochrome c oxidase 1 (cox1) from a broad range of dinoflagellates. Maximum likelihood, maximum parsimony, and Bayesian methods were used to infer phylogenies from these genes separately and as a concatenated alignment with and without small subunit (SSU) rDNA sequences. These trees were largely congruent in topology with previously published phylogenies but revealed several unexpected results. Prorocentrum benthic and planktonic species previously placed in different clusters formed a monophyletic group in all trees, suggesting that the Prorocentrales is a monophyletic group. More strikingly, our analyses placed Amphidinium and Heterocapsa as early splits among dinoflagellates that diverged after the emergence of O. marina. This affiliation received strong bootstrap support, but these lineages exhibited relatively long branches. The approximately unbiased (AU) test was used to assess this result using a three-gene (cob + cox1 + SSU rDNA) DNA data set and the inferred tree. This analysis showed that forcing Amphidinium or Heterocapsa to relatively more derived positions in the phylogeny resulted in significantly lower likelihood scores, consistent with the phylogenies. The position of these lineages needs to be further verified. Reviewing Editor: Dr. Martin Kreitman

14.
Liang LJ, Weiss RE. Biometrics, 2007, 63(3): 733-741
Phylogenetic modeling is computationally challenging and most phylogeny models fit a single phylogeny to a single set of molecular sequences. Individual phylogenetic analyses are typically performed independently using publicly available software that fits a computationally intensive Bayesian model using Markov chain Monte Carlo (MCMC) simulation. We develop a Bayesian hierarchical semiparametric regression model to combine multiple phylogenetic analyses of HIV-1 nucleotide sequences and estimate parameters of interest within and across analyses. We use a mixture of Dirichlet processes as a prior for the parameters to relax inappropriate parametric assumptions and to ensure the prior distribution for the parameters is continuous. We use several reweighting algorithms for combining completed MCMC analyses to shrink parameter estimates while adjusting for data set-specific covariates. This avoids constructing a large complex model involving all the original data, which would be computationally challenging and would require rewriting the existing stand-alone software.

15.
Over 3000 microbial (bacterial and archaeal) genomes have been made publicly available to date, providing an unprecedented opportunity to examine evolutionary genomic trends and offering valuable reference data for a variety of other studies such as metagenomics. The utility of these genome sequences is greatly enhanced when we have an understanding of how they are phylogenetically related to each other. Therefore, we here describe our efforts to reconstruct the phylogeny of all available bacterial and archaeal genomes. We identified 24 single-copy, ubiquitous genes suitable for this phylogenetic analysis. We used two approaches to combine the data for the 24 genes. First, we concatenated alignments of all genes into a single alignment from which a Maximum Likelihood (ML) tree was inferred using RAxML. Second, we used a relatively new approach to combining gene data, Bayesian Concordance Analysis (BCA), as implemented in the BUCKy software, in which the results of 24 single-gene phylogenetic analyses are used to generate a “primary concordance” tree. A comparison of the concatenated ML tree and the primary concordance (BUCKy) tree reveals that the two approaches give similar results, relative to a phylogenetic tree inferred from the 16S rRNA gene. After comparing the results and the methods used, we conclude that the current best approach for generating a single phylogenetic tree, suitable for use as a reference phylogeny for comparative analyses, is to perform a maximum likelihood analysis of a concatenated alignment of conserved, single-copy genes.
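
The concatenation step of the first approach might look like the Python sketch below. It is an illustration only: the function name is hypothetical, and padding a taxon that lacks a gene with gap characters is an assumption made here for robustness (the study selected genes that are ubiquitous, so such padding may never have been needed). Tree inference itself is not shown.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) end to end."""
    taxa = sorted(set().union(*(gene.keys() for gene in gene_alignments)))
    parts = {taxon: [] for taxon in taxa}
    for gene in gene_alignments:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon missing from this gene is padded with gaps.
            parts[taxon].append(gene.get(taxon, "-" * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

The resulting supermatrix has one row per taxon and can be handed to any tree inference program that accepts a single alignment.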

16.
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into subclasses using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.

17.
We consider the effects of fully or partially random sequences on the estimation of four-taxon phylogenies. Fully or partially random sequences occur when whole subsets of sequences or some sites for subsets of sequences are independent of sequence data for the other taxa. Random sequences can be a consequence of misalignment or because sites evolve at very fast rates in some portions of a tree, a situation that occurs especially in analyses involving deep divergence times. One might reasonably speculate that random sites will only add noise to the estimation of a phylogeny. We show that in the case that a random sequence is added to a three-taxon alignment, it is more likely to be a neighbor of the sequence corresponding to the longest branch in the three-taxon tree. Surprisingly, when only about half of the sites show randomness, a long-branch-repels form of small sample bias occurs, and when a minority of sites show randomness this becomes a long-branch-attraction bias again. The most serious bias, one that does not vanish with increasing sequence length, occurs when more than one sequence is partially random. If there is a large amount of overlap in the random sites for two sequences, those two sequences will be attracted to each other; otherwise, they will repel each other. Random sequences or sites can, therefore, cause complicated biases in phylogenetic inference. We suggest performing analyses with and without potentially saturated sequences and/or misaligned sites, to check that these biases are not affecting the inferred branching pattern. [Reviewing Editor: Dr. J. Rasmus Nielson]

18.

Background: Two central problems in computational biology are the determination of the alignment and phylogeny of a set of biological sequences. The traditional approach to this problem is to first build a multiple alignment of these sequences, followed by a phylogenetic reconstruction step based on this multiple alignment. However, alignment and phylogenetic inference are fundamentally interdependent, and ignoring this fact leads to biased and overconfident estimations. Whether the main interest be in sequence alignment or phylogeny, a major goal of computational biology is the co-estimation of both.

19.
We describe a novel model and algorithm for simultaneously estimating multiple molecular sequence alignments and the phylogenetic trees that relate the sequences. Unlike current techniques that base phylogeny estimates on a single estimate of the alignment, we take alignment uncertainty into account by considering all possible alignments. Furthermore, because the alignment and phylogeny are constructed simultaneously, a guide tree is not needed. This sidesteps the problem in which alignments created by progressive alignment are biased toward the guide tree used to generate them. Joint estimation also allows us to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) to group sister taxa in the phylogeny. Our indel model makes use of affine gap penalties and considers indels of multiple letters. We make the simplifying assumption that the indel process is identical on all branches. As a result, the probability of a gap is independent of branch length. We use a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. We describe a new MCMC transition kernel that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Our software implementation can estimate alignment uncertainty and we describe a method for summarizing this uncertainty in a single plot.

20.
The process of inferring phylogenetic trees from molecular sequences almost always starts with a multiple alignment of these sequences but can also be based on methods that do not involve multiple sequence alignment. Very little is known about the accuracy with which such alignment-free methods recover the correct phylogeny or about the potential for increasing their accuracy. We conducted a large-scale comparison of ten alignment-free methods, among them one new approach that does not calculate distances and a faster variant of our pattern-based approach; all distance-based alignment-free methods are freely available from http://www.bioinformatics.org.au (as Python package decaf+py). We show that most methods exhibit a higher overall reconstruction accuracy in the presence of high among-site rate variation. Under all conditions that we considered, variants of the pattern-based approach were significantly better than the other alignment-free methods. The new pattern-based variant achieved a speed-up of an order of magnitude in the distance calculation step, accompanied by a small loss of tree reconstruction accuracy. A method of Bayesian inference from k-mers did not improve on classical alignment-free (and distance-based) methods but may still offer other advantages due to its Bayesian nature. We found the optimal word length k of word-based methods to be stable across various data sets, and we provide parameter ranges for two different alphabets. The influence of these alphabets was analyzed to reveal a trade-off in reconstruction accuracy between long and short branches. We have mapped the phylogenetic accuracy for many alignment-free methods, among them several recently introduced ones, and increased our understanding of their behavior in response to biologically important parameters. In all experiments, the pattern-based approach emerged as superior, at the expense of higher resource consumption. Nonetheless, no alignment-free method that we examined recovers the correct phylogeny as accurately as does an approach based on maximum-likelihood distance estimates of multiply aligned sequences.
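
To give a sense of what a word-based (k-mer) alignment-free distance looks like, here is a minimal Python sketch using the Euclidean distance between k-mer frequency profiles. It is a generic textbook-style measure chosen for illustration; it is not the pattern-based approach from the study and does not use the decaf+py package. The function names and the default `k=3` are hypothetical.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Relative frequency of every overlapping word of length k in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def kmer_distance(s1, s2, k=3):
    """Euclidean distance between the k-mer frequency profiles of two sequences."""
    p, q = kmer_profile(s1, k), kmer_profile(s2, k)
    words = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))
```

Because no alignment is computed, the sequences may differ in length; a pairwise matrix of such distances can then be passed to a distance-based tree method.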
