Similar literature
 Found 20 similar documents; search took 500 ms
1.
It is well known among phylogeneticists that adding an extra taxon (e.g. species) to a data set can alter the structure of the optimal phylogenetic tree in surprising ways. However, little is known about this “rogue taxon” effect. In this paper we characterize the behavior of balanced minimum evolution (BME) phylogenetics on data sets of this type using tools from polyhedral geometry. First we show that for any distance matrix there exist distances to a “rogue taxon” such that the BME-optimal tree for the data set with the new taxon does not contain any nontrivial splits (bipartitions) of the optimal tree for the original data. Second, we prove a theorem which restricts the topology of BME-optimal trees for data sets of this type, thus showing that a rogue taxon cannot have an arbitrary effect on the optimal tree. Third, we computationally construct polyhedral cones that give complete answers for BME rogue taxon behavior when our original data fits a tree on four, five, and six taxa. We use these cones to derive sufficient conditions for rogue taxon behavior for four taxa, and to understand the frequency of the rogue taxon effect via simulation.
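To make the BME criterion in the abstract above concrete, the following minimal sketch scores the three possible four-taxon topologies using Pauplin's balanced tree-length formula, in which a pair of leaves separated by p internal nodes is weighted by 2^(-p). The function names and the dictionary-based distance encoding are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's polyhedral analysis of rogue taxa.

```python
def bme_length_quartet(dist, split):
    """Balanced minimum evolution length of the quartet tree ab|cd.

    dist  -- dict mapping frozenset({x, y}) to the distance between x and y
    split -- ((a, b), (c, d)): the two cherries of the quartet
    """
    (a, b), (c, d) = split
    # Pairs inside a cherry pass through one internal node: weight 1/2.
    within = dist[frozenset((a, b))] + dist[frozenset((c, d))]
    # Pairs on opposite sides pass through two internal nodes: weight 1/4.
    across = sum(dist[frozenset((x, y))] for x in (a, b) for y in (c, d))
    return 0.5 * within + 0.25 * across


def best_quartet(dist, taxa):
    """Return the split with the smallest BME length among the three quartets."""
    a, b, c, d = taxa
    splits = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(splits, key=lambda s: bme_length_quartet(dist, s))
```

For a distance matrix that fits the tree AB|CD exactly, the BME length of that topology equals the sum of its edge lengths, and the other two topologies score strictly higher.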

2.
A central task in the study of molecular evolution is the reconstruction of a phylogenetic tree from sequences of current-day taxa. The most established approach to tree reconstruction is maximum likelihood (ML) analysis. Unfortunately, searching for the maximum likelihood phylogenetic tree is computationally prohibitive for large data sets. In this paper, we describe a new algorithm that uses Structural Expectation Maximization (EM) for learning maximum likelihood phylogenetic trees. This algorithm is similar to the standard EM method for edge-length estimation, except that during iterations of the Structural EM algorithm the topology is improved as well as the edge length. Our algorithm performs iterations of two steps. In the E-step, we use the current tree topology and edge lengths to compute expected sufficient statistics, which summarize the data. In the M-step, we search for a topology that maximizes the likelihood with respect to these expected sufficient statistics. We show that searching for better topologies inside the M-step can be done efficiently, as opposed to standard methods for topology search. We prove that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge. This convergence point, however, can be a suboptimal one. To escape from such "local optima," we further enhance our basic EM procedure by incorporating moves in the flavor of simulated annealing. We evaluate these new algorithms on both synthetic and real sequence data and show that for protein sequences even our basic algorithm finds more plausible trees than existing methods for searching maximum likelihood phylogenies. Furthermore, our algorithms are dramatically faster than such methods, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.

3.
A method is presented for removing recent homoplastic events from a phylogenetic tree. This “topiary pruning” method produces a series of progressively modified duplicates of the original set of data, from which more and more of the most recent substitutions have been removed. The edited sets of data have increased amounts of information per remaining taxon, while similar but randomized data sets subjected to topiary pruning do not. The ability of topiary pruning to “unscramble” artificial data sets that have high levels of homoplasy is demonstrated, and is shown to be similar in its effects to the weighting method of Kluge and Farris (1969), although with the additional advantage of reducing the number of taxa to the point where bootstrapping is feasible. Pruning and weighting used together produce closer approximations to the “true” tree than either method used separately. It is further shown that in these artificial data sets midpoint rooting is more likely to be accurate than outgroup rooting. When pruning and weighting are applied to the extensive sets of mitochondrial DNA data of Cann et al. (1987) and Vigilant et al. (1991), trees result that have deep branch points, some of which lead to entirely African branches. In the case of the Vigilant et al. data, the three African branches have bootstrap values between 0.94 and 1.0, and the consensus and bootstrap midpoint roots also have high bootstrap values and occur on these African branches near their junction. An African origin of the human mitochondrial tree is not proved by this approach, particularly since sequences from non-African groups are underrepresented in current data sets, but it is rendered more likely.

4.
Crocodylian systematics has long been confounded by conflicting hypotheses of higher level relationships—although molecular data sets strongly supported the sister-taxon relationship of Tomistoma and Gavialis, morphological data sets placed Gavialis as sister to all other living taxa. One of the perceived difficulties in interpreting morphological character evolution on the molecular tree is the extensive character reversal occurring in Gavialinae, the mechanism of which has yet to be explained. Here, we provide evidence of gavialine-specific atavistic characters from East Asian “tomistomines” Penghusuchus pani and Toyotamaphimeia machikanensis. These taxa exhibit a mosaic assembly of “tomistomine” and gavialine features, which fill the gap between the two longirostrine groups. Although the parsimony analysis of morphological data (69 taxa, 254 characters) still supports the previous morphological hypothesis, the alternative tree that was forced to fit the molecular hypothesis was insignificantly (5/954 steps; 0.52%) longer than the unconstrained tree, suggesting that morphological evolution can also be interpreted on the molecular tree. Although the problem of stratigraphic gaps remains, future studies may be directed to resolving the interrelationships within Gavialoidea, a large longirostrine group of crocodylians, in the molecular tree context.

5.
In a cladistic analysis of Recent seed plants, Loconte and Stevenson (1990) obtained results that conflict with our 1986 analysis of both extant and fossil groups and argued that fossil data had led us to incorrect conclusions. To explore this result and the general influence of fossils on phylogeny reconstruction, we assembled new “Recent” and “Complete” (extant plus fossil) data sets incorporating new data, advances in treatment of characters, and those changes of Loconte and Stevenson that we consider valid. Our Recent analysis yields only one most parsimonious tree, that of Loconte and Stevenson, in which conifers are linked with Gnetales and angiosperms (anthophytes), rather than with Ginkgo, as in our earlier Recent and Complete analyses. However, the shortest trees derived from our Complete analysis show five arrangements of extant groups, including that of Loconte and Stevenson and our previous arrangements, suggesting that the result obtained from extant taxa alone may be misleading. This increased ambiguity occurs because features that appear to unite extant conifers and anthophytes are seen as convergences when fossil taxa are interpolated between them. All trees found in the Complete analysis lead to inferences on character evolution that conflict with those that would be drawn from Recent taxa alone (e.g., origin of anthophytes from plants with a “seed fern” morphology). These results imply that conclusions on many aspects of seed plant phylogeny are premature; new evidence, which is most likely to come from the fossil record, is needed to resolve the uncertainties.

6.
Consensus is elusive regarding the phylogenetic relationships among neornithine (crown clade) birds. The ongoing debate over their deep divergences is despite recent increases in available molecular sequence data and the publication of several larger morphological data sets. In the present study, the phylogenetic relationships among 43 neornithine higher taxa are addressed using a data set of 148 osteological and soft tissue characters, which is one of the largest to date. The Mesozoic non‐neornithine birds Apsaravis, Hesperornis, and Ichthyornis are used as outgroup taxa for this analysis. Thus, for the first time, a broad array of morphological characters (including both cranial and postcranial characters) are analyzed for an ingroup densely sampling Neornithes, with crown clade outgroups used to polarize these characters. The strict consensus cladogram of two most parsimonious trees resultant from 1000 replicate heuristic searches (random stepwise addition, tree‐bisection‐reconnection) recovered several previously identified clades; the at‐one‐time contentious clades Galloanseres (waterfowl, fowl, and allies) and Palaeognathae were supported. Most notably, our analysis recovered monophyly of Neoaves, i.e., all neognathous birds to the exclusion of the Galloanseres, although this clade was weakly supported. The recently proposed sister taxon relationship between Steatornithidae (oilbird) and Trogonidae (trogons) was recovered. The traditional taxon “Falconiformes” (Cathartidae, Sagittariidae, Accipitridae, and Falconidae) was not found to be monophyletic, as Strigiformes (owls) are placed as the sister taxon of (Falconidae + Accipitridae). Monophyly of the traditional “Gruiformes” (cranes and allies) and “Ciconiiformes” (storks and allies) was also not recovered. The primary analysis resulted in support for a sister group relationship between Gaviidae (loons) and Podicipedidae (grebes)—foot‐propelled diving birds that share many features of the pelvis and hind limb. Exclusion of Gaviidae and reanalysis of the data set, however, recovered the sister group relationship between Phoenicopteridae (flamingos) and grebes recently proposed from molecular sequence data.

7.
Phylogeny reconstruction is a difficult computational problem, because the number of possible solutions increases with the number of included taxa. For example, for only 14 taxa, there are more than seven trillion possible unrooted phylogenetic trees. For this reason, phylogenetic inference methods commonly use clustering algorithms (e.g., the neighbor-joining method) or heuristic search strategies to minimize the amount of time spent evaluating nonoptimal trees. Even heuristic searches can be painfully slow, especially when computationally intensive optimality criteria such as maximum likelihood are used. I describe here a different approach to heuristic searching (using a genetic algorithm) that can tremendously reduce the time required for maximum-likelihood phylogenetic inference, especially for data sets involving large numbers of taxa. Genetic algorithms are simulations of natural selection in which individuals are encoded solutions to the problem of interest. Here, labeled phylogenetic trees are the individuals, and differential reproduction is effected by allowing the number of offspring produced by each individual to be proportional to that individual's rank likelihood score. Natural selection increases the average likelihood in the evolving population of phylogenetic trees, and the genetic algorithm is allowed to proceed until the likelihood of the best individual ceases to improve over time. An example is presented involving rbcL sequence data for 55 taxa of green plants. The genetic algorithm described here required only 6% of the computational effort required by a conventional heuristic search using tree bisection/reconnection (TBR) branch swapping to obtain the same maximum-likelihood topology.
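The growth in the number of tree topologies mentioned in the abstract above can be checked with the standard counting formulas for binary trees on n labeled leaves: (2n − 5)!! unrooted and (2n − 3)!! rooted topologies. The sketch below is illustrative only, and the function names are assumptions. Under these formulas, 14 taxa give about 3.2 × 10^11 unrooted and about 7.9 × 10^12 rooted topologies, so the "seven trillion" figure quoted above matches the rooted count for 14 taxa.

```python
def odd_double_factorial(m):
    """Product of the odd integers 1 * 3 * ... * m (1 if m < 3)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def num_unrooted_trees(n):
    """Number of unrooted binary topologies on n >= 3 labeled leaves: (2n - 5)!!"""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n - 5)


def num_rooted_trees(n):
    """Number of rooted binary topologies on n >= 2 labeled leaves: (2n - 3)!!"""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n - 3)
```

Python integers are arbitrary precision, so the counts stay exact even for hundreds of taxa.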

8.
It has been claimed that blending processes such as trade and exchange have always been more important in the evolution of cultural similarities and differences among human populations than the branching process of population fissioning. In this paper, we report the results of a novel comparative study designed to shed light on this claim. We fitted the bifurcating tree model that biologists use to represent the relationships of species to 21 biological data sets that have been used to reconstruct the relationships of species and/or higher level taxa and to 21 cultural data sets. We then compared the average fit between the biological data sets and the model with the average fit between the cultural data sets and the model. Given that the biological data sets can be confidently assumed to have been structured by speciation, which is a branching process, our assumption was that, if cultural evolution is dominated by blending processes, the fit between the bifurcating tree model and the cultural data sets should be significantly worse than the fit between the bifurcating tree model and the biological data sets. Conversely, if cultural evolution is dominated by branching processes, the fit between the bifurcating tree model and the cultural data sets should be no worse than the fit between the bifurcating tree model and the biological data sets. We found that the average fit between the cultural data sets and the bifurcating tree model was not significantly different from the fit between the biological data sets and the bifurcating tree model. This indicates that the cultural data sets are not less tree-like than are the biological data sets. As such, our analysis does not support the suggestion that blending processes have always been more important than branching processes in cultural evolution. We conclude from this that, rather than deciding how cultural evolution has proceeded a priori, researchers need to ascertain which model or combination of models is relevant in a particular case and why.

9.
The hyperdiverse genus Sarcophaga Meigen, with about 890 valid species arranged within 169 subgenera, accounts for almost half of the diversity of the subfamily Sarcophaginae. Current phylogenetic hypotheses for this genus are poorly supported or based on small taxon sets, or both. Here, we use molecular data from the genes COI and 28S to reconstruct the phylogeny of Sarcophaga based on the most comprehensive sampling for the group to date: 144 species from 47 subgenera, including representatives from all regional faunas for the first time. Of the total sequences of Sarcophaga used in the present study, 94.7% were newly generated. The secondary structure of the D1–D3 expansion segments of 28S is presented for the first time for the family Sarcophagidae, and is used in a multiple sequence alignment. Branch support and tree resolution increased remarkably through rogue taxa identification and exclusion. Rogue behaviour was explained mostly as a missing data problem. The RogueNaRok web service and the algorithms chkmoves, IterPCR and prunmajor implemented in the computer program TNT were equally good at identifying critical rogue species, but chkmoves and IterPCR also identified rogue clades. Pruning rogues increased the number of monophyletic subgenera in consensus trees from one to six out of 19 subgenera with more than one representative species. Bayesian inference, maximum‐likelihood and parsimony analyses recovered more monophyletic subgenera after the removal of rogue taxa, with parsimony showing the largest improvements in branch support and resolution. Although with low support, Nearctic taxa were found to be the earliest diverging lineages, followed by a subsequent diversification of Old World faunas, which is in agreement with currently available evidence of a New World origin and early diversification of Sarcophaga.

10.
For the last 2 decades, supertree reconstruction has been an active field of research and has seen the development of a large number of major algorithms. Because of the growing popularity of the supertree methods, it has become necessary to evaluate the performance of these algorithms to determine which are the best options (especially with regard to the supermatrix approach that is widely used). In this study, seven of the most commonly used supertree methods are investigated by using a large empirical data set (in terms of number of taxa and molecular markers) from the worldwide flowering plant family Sapindaceae. Supertree methods were evaluated using several criteria: similarity of the supertrees with the input trees, similarity between the supertrees and the total evidence tree, level of resolution of the supertree and computational time required by the algorithm. Additional analyses were also conducted on a reduced data set to test if the performance levels were affected by the heuristic searches rather than the algorithms themselves. Based on our results, two main groups of supertree methods were identified: on the one hand, the matrix representation with parsimony (MRP), MinFlip, and MinCut methods performed well according to our criteria, whereas the average consensus, split fit, and most similar supertree methods showed a poorer performance or at least did not behave the same way as the total evidence tree. Results for the super distance matrix, that is, the most recent approach tested here, were promising with at least one derived method performing as well as MRP, MinFlip, and MinCut. The output of each method was only slightly improved when applied to the reduced data set, suggesting a correct behavior of the heuristic searches and a relatively low sensitivity of the algorithms to data set sizes and missing data. Results also showed that the MRP analyses could reach a high level of quality even when using a simple heuristic search strategy, with the exception of MRP with Purvis coding scheme and reversible parsimony. The future of supertrees lies in the implementation of a standardized heuristic search for all methods and the increase in computing power to handle large data sets. The latter would prove to be particularly useful for promising approaches such as the maximum quartet fit method, which still requires substantial computing power.

11.
Phylogenies—the evolutionary histories of groups of organisms—play a major role in representing the interrelationships among biological entities. Many methods for reconstructing and studying such phylogenies have been proposed, almost all of which assume that the underlying history of a given set of species can be represented by a binary tree. Although many biological processes can be effectively modeled and summarized in this fashion, others cannot: recombination, hybrid speciation, and horizontal gene transfer result in networks of relationships rather than trees of relationships. In previous works, we formulated a maximum parsimony (MP) criterion for reconstructing and evaluating phylogenetic networks, and demonstrated its quality on biological as well as synthetic data sets. In this paper, we provide further theoretical results as well as a very fast heuristic algorithm for the MP criterion of phylogenetic networks. In particular, we provide a novel combinatorial definition of phylogenetic networks in terms of “forbidden cycles,” and provide detailed hardness and hardness of approximation proofs for the “small” MP problem. We demonstrate the performance of our heuristic in terms of time and accuracy on both biological and synthetic data sets. Finally, we explain the difference between our model and a similar one formulated by Nguyen et al., and describe the implications of this difference on the hardness and approximation results.

12.
Selecting a non‐redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non‐redundant training sets for sequence and structural models or selection of “operational taxonomic units” from metagenomics data. Previous methods for this task, such as CD‐HIT, PISCES, and UCLUST, apply a heuristic threshold‐based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
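As a rough illustration of representative-set selection by submodular optimization, the sketch below runs the classic greedy algorithm on a facility-location objective, f(S) = Σ_i max_{j∈S} sim(i, j), which is one common monotone submodular function. Whether this objective matches the one used in the paper is an assumption, as are the function name and the plain list-of-lists similarity matrix, which must hold non-negative values.

```python
def greedy_representatives(similarity, k):
    """Greedily choose up to k item indices maximizing the facility-location
    objective f(S) = sum_i max_{j in S} similarity[i][j].

    similarity -- square list of lists with non-negative entries (assumed input)
    """
    n = len(similarity)
    chosen = []
    # best_cover[i]: highest similarity between item i and any chosen item.
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_item = -1.0, None
        for j in range(n):
            if j in chosen:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(similarity[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_item = gain, j
        chosen.append(best_item)
        best_cover = [max(best_cover[i], similarity[i][best_item]) for i in range(n)]
    return chosen
```

For monotone submodular objectives under a cardinality constraint, this greedy rule is known to achieve at least a (1 − 1/e) fraction of the optimal value, which is the kind of guarantee that threshold-based heuristics lack.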

13.
Microarray-CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not provide a biological status for the detected segments. We propose a new model for this segmentation/clustering problem, combining a segmentation model with a mixture model. We present a new hybrid algorithm called dynamic programming-expectation maximization (DP-EM) to estimate the parameters of the model by maximum likelihood. This algorithm combines DP and the EM algorithm. We also propose a model selection heuristic to select the number of clusters and the number of segments. An example of our procedure is presented, based on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.
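The dynamic programming half of such a segmentation approach can be sketched as follows: given a one-dimensional profile and a fixed number of segments, find the breakpoints that minimize the total within-segment squared error. This is a generic least-squares segmentation, not the authors' DP-EM algorithm; it omits the mixture model and the EM step entirely, and the function name and return format are assumptions.

```python
def segment(signal, num_segments):
    """Split signal into num_segments contiguous pieces minimizing the total
    within-segment squared error. Returns (cost, segment start indices)."""
    n = len(signal)
    # Prefix sums of values and squares give O(1) segment costs.
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        """Squared error of signal[i:j] around its own mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting signal[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i
    # Walk the back-pointers to recover where each segment starts.
    starts, j = [], n
    for k in range(num_segments, 0, -1):
        j = back[k][j]
        starts.append(j)
    return best[num_segments][n], starts[::-1]
```

The triple loop runs in O(K n²) time for K segments and n points, which is acceptable for a single chromosome arm but would need pruning for very long profiles.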

14.
To explore the feasibility of parsimony analysis for large data sets, we conducted heuristic parsimony searches and bootstrap analyses on separate and combined DNA data sets for 190 angiosperms and three outgroups. Separate data sets of 18S rDNA (1,855 bp), rbcL (1,428 bp), and atpB (1,450 bp) sequences were combined into a single matrix 4,733 bp in length. Analyses of the combined data set show great improvements in computer run times compared to those of the separate data sets and of the data sets combined in pairs. Six searches of the 18S rDNA + rbcL + atpB data set were conducted; in all cases TBR branch swapping was completed, generally within a few days. In contrast, TBR branch swapping was not completed for any of the three separate data sets, or for the pairwise combined data sets. These results illustrate that it is possible to conduct a thorough search of tree space with large data sets, given sufficient signal. In this case, and probably most others, sufficient signal for a large number of taxa can only be obtained by combining data sets. The combined data sets also have higher internal support for clades than the separate data sets, and more clades receive bootstrap support of ≥ 50% in the combined analysis than in analyses of the separate data sets. These data suggest that one solution to the computational and analytical dilemmas posed by large data sets is the addition of nucleotides, as well as taxa.

15.
Supertree methods are used to construct a large tree over a large set of taxa from a set of small trees over overlapping subsets of the complete taxa set. Since accurate reconstruction methods are currently limited to a maximum of a few dozen taxa, the use of a supertree method in order to construct the tree of life is inevitable. Supertree methods are broadly divided according to the input trees: When the input trees are unrooted, the basic reconstruction unit is a quartet tree. In this case, the basic decision problem of whether there exists a tree that agrees with all quartets is NP-complete. On the other hand, when the input trees are rooted, the basic reconstruction unit is a rooted triplet and the above decision problem has a polynomial time algorithm. When there is no tree that agrees with all triplets, it would be desirable to find the tree that agrees with the maximum number of triplets; however, this optimization problem was shown to be NP-hard. Current heuristic approaches perform min cut on a graph representing the triplet inconsistency and return a tree that is guaranteed to satisfy some required properties. In this work, we present a different heuristic approach that guarantees the properties provided by the current methods and give experimental evidence that it significantly outperforms currently used methods. This method is based on a divide-and-conquer approach, where the min cut in the divide step is replaced by a max cut in a variant of the same graph. The latter is achieved by a lightweight semidefinite programming-like heuristic that leads to very fast running times.
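The polynomial-time decision procedure for rooted triplets mentioned in the abstract above is, in the standard literature, the BUILD algorithm of Aho, Sagiv, Szymanski, and Ullman; the abstract itself does not name it, so that identification is an assumption here. A minimal sketch follows, with triplets encoded as ((a, b), c) meaning "a and b are closer to each other than either is to c"; the encoding, function name, and nested-tuple output are illustrative choices. It is not the max-cut heuristic the paper proposes.

```python
def build(leaves, triplets):
    """Return a nested-tuple rooted tree consistent with every triplet
    ((a, b), c), read as ab|c, or None if no such tree exists."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    if len(leaves) == 2:
        return tuple(sorted(leaves))

    # Union-find over the current leaf set.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Only triplets whose three leaves all lie in this set constrain it.
    relevant = [t for t in triplets if {t[0][0], t[0][1], t[1]} <= leaves]
    for (a, b), _ in relevant:
        parent[find(a)] = find(b)  # a and b must share a child of the root

    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None  # the triplets cannot all be satisfied

    subtrees = []
    for component in components.values():
        subtree = build(component, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)
```

When the triplets conflict, the leaf graph stays connected at some level and the procedure returns None; the min-cut and max-cut heuristics discussed in the abstract address exactly that case by deciding which connections to break.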

16.
The evolutionary history of certain species such as polyploids is modeled by a generalization of phylogenetic trees called multi-labeled phylogenetic trees, or MUL trees for short. One problem that relates to inferring a MUL tree is how to construct the smallest possible MUL tree that is consistent with a given set of rooted triplets, or SMRT problem for short. This problem is NP-hard. There is one algorithm for the SMRT problem which is exact and runs in time, where is the number of taxa. In this paper, we show that the SMRT does not seem to be an appropriate solution from the biological point of view. Indeed, we present a heuristic algorithm named MTRT for this problem and execute it on some real and simulated datasets. The results of MTRT show that triplets alone cannot provide enough information to infer the true MUL tree. So, it is inappropriate to infer a MUL tree using triplet information alone and considering the minimum number of duplications. Finally, we introduce some new problems which are more suitable from the biological point of view.

17.
Phylogenomic studies aim to build phylogenies from large sets of homologous genes. Such "genome-sized" data require fast methods, because of the typically large numbers of taxa examined. In this framework, distance-based methods are useful for exploratory studies and building a starting tree to be refined by a more powerful maximum likelihood (ML) approach. However, estimating evolutionary distances directly from concatenated genes gives poor topological signal as genes evolve at different rates. We propose a novel method, named super distance matrix (SDM), which follows the same line as average consensus supertree (ACS; Lapointe and Cucumel, 1997) and combines the evolutionary distances obtained from each gene into a single distance supermatrix to be analyzed using a standard distance-based algorithm. SDM deforms the source matrices, without modifying their topological message, to bring them as close as possible to each other; these deformed matrices are then averaged to obtain the distance supermatrix. We show that this problem is equivalent to the minimization of a least-squares criterion subject to linear constraints. This problem has a unique solution which is obtained by solving a linear system. As this system is sparse, solving it in practice requires O(n^a k^a) time, where n is the number of taxa, k the number of matrices, and a < 2, which allows the distance supermatrix to be quickly obtained. Several uses of SDM are proposed, from fast exploratory studies to more accurate approaches requiring heavier computing time. Using simulations, we show that SDM is a relevant alternative to the standard matrix representation with parsimony (MRP) method, notably when the taxa sets of the different genes have low overlap. We also show that SDM can be used to build an excellent starting tree for an ML approach, which both reduces the computing time and increases the topological accuracy. We use SDM to analyze the data set of Gatesy et al. (2002, Syst. Biol. 51: 652-664) that involves 48 genes of 75 placental mammals. The results indicate that these genes have strong rate heterogeneity and confirm the simulation conclusions.

18.
Inference of haplotypes is important in genetic epidemiology studies. However, all large genotype data sets contain errors, owing to fallible, inexpensive genotyping machines and shortcomings in genotype‐scoring software, and these errors can have an enormous impact on haplotype inference. In this article, we propose two novel strategies to reduce the impact of genotyping errors on haplotype inference. The first method makes use of double sampling. For each individual, the “GenoSpectrum”, which consists of all possible genotypes and their corresponding likelihoods, is computed. The second method is a genotype clustering algorithm based on multi‐genotyping data, which also assigns a “GenoSpectrum” to each individual. We then describe two hybrid EM algorithms (called DS‐EM and MG‐EM) that perform haplotype inference based on the “GenoSpectrum” of each individual obtained by double sampling and multi‐genotyping data. Both simulated data sets and a quasi real‐data set demonstrate that our proposed methods perform well in different situations and outperform the conventional EM algorithm and the HMM algorithm proposed by Sun, Greenwood, and Neal (2007, Genetic Epidemiology 31, 937–948) when the genotype data sets have errors.

19.
Morphological characters from sabethine mosquitoes were coded from larvae, pupae and adults, and life-stage partitions were evaluated to determine the contribution of each to the topology of a combined cladogram. Initial tests failed to find congruence between characters partitioned by life stage. However, when components from the combined analysis were tested using reduced taxon sets, a high degree of concordance between partitions was observed. A procedure for assessing individual life-stage contribution is employed, in which exhaustive searches are used to explore all possible arrangements for each of the selected components. Seven of the 10 components examined were able to recover the combined topology with a reduced taxon set. Congruent arrangements of taxa were typically observed for two or more life stages, although partitioned data were less resolved and frequently included aberrant topologies (those not supported by other partitioned or combined reduced taxon tree sets). In addition, none of the partitioned data sets gave robust results for all tests, suggesting that studies which emphasize character data from single life stages may support misleading arrangements of taxa. One component on the combined cladogram was not supported by any of the life-stage partitions when analysed separately. These results are complementary to the “total evidence” approach, and demonstrate that partitions of data are useful for examining suites of characters which may cause some components of the “total

20.
Phylogenetic comparative methods have long considered phylogenetic signal as a source of statistical bias in the correlative analysis of biological traits. However, the main life-history strategies existing in a set of taxa are often combinations of life history traits that are inherently phylogenetically structured. In this paper, we present a method for identifying evolutionary strategies from large sets of biological traits, using phylogeny as a source of meaningful historical and ecological information. Our methodology extends a multivariate method developed for the analysis of spatial patterns, and relies on finding combinations of traits that are phylogenetically autocorrelated. Using extensive simulations, we show that our method efficiently uncovers phylogenetic structures with respect to various tree topologies, and remains powerful in cases where a large majority of traits are not phylogenetically structured. Our methodology is illustrated using empirical data, and implemented in the adephylo package for the free software R.
