Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
This study describes novel algorithms for searching for most parsimonious trees. These algorithms are implemented as a parsimony computer program, PARSIGAL, which performs well even with difficult data sets. For high level search, PARSIGAL uses an evolutionary optimization algorithm, which feeds good tree candidates to a branch-swapping local search procedure. This study also describes an extremely fast method of recomputing state sets for binary characters (additive or nonadditive characters with two states), based on packing 32 characters into a single memory word and recomputing the tree simultaneously for all 32 characters using fast bitwise logical operations. The operational principles of PARSIGAL are quite different from those previously published for other parsimony computer programs. Hence it is conceivable that PARSIGAL may be able to locate islands of trees that are different from those that are easily located with existing parsimony computer programs.
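The bit-packing idea in this abstract can be illustrated with a minimal sketch; it is not PARSIGAL's code. It assumes a rooted binary tree written as nested 2-tuples of taxon names and stores each node's state sets for 32 binary characters as two 32-bit masks (`has0`, `has1`). The names `pack`, `fitch_join`, and `tree_length` are hypothetical. One Fitch step then costs a handful of bitwise operations for all 32 characters at once.

```python
MASK = (1 << 32) - 1  # one 32-bit word holds 32 binary characters


def pack(states):
    """Pack up to 32 binary states (0, 1, or None for missing) into two masks.

    Bit i of has0 / has1 says whether state 0 / state 1 is in the state set of
    character i. Unused bit positions are treated as missing (both bits set),
    so they can never contribute a step.
    """
    has0 = has1 = MASK
    for i, s in enumerate(states):
        if s == 1:
            has0 &= ~(1 << i)  # state 0 is excluded
        elif s == 0:
            has1 &= ~(1 << i)  # state 1 is excluded
    return has0, has1


def fitch_join(a, b):
    """Fitch down-pass for two child nodes, 32 characters at a time."""
    i0 = a[0] & b[0]                 # characters whose intersection keeps 0
    i1 = a[1] & b[1]                 # characters whose intersection keeps 1
    empty = ~(i0 | i1) & MASK        # empty intersection: one step each
    # For binary characters an empty intersection means {0} vs {1},
    # so the union is {0, 1}: set both bits at those positions.
    return (i0 | empty, i1 | empty), bin(empty).count("1")


def tree_length(tree, leaf_sets):
    """Total number of steps on a rooted binary tree of nested 2-tuples."""
    def down(node):
        if isinstance(node, str):
            return leaf_sets[node], 0
        (left, left_steps), (right, right_steps) = down(node[0]), down(node[1])
        joined, steps = fitch_join(left, right)
        return joined, left_steps + right_steps + steps
    return down(tree)[1]
```

A data set with more than 32 binary characters would simply use several such word pairs per node.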

2.
We consider the problem of reconstructing near-perfect phylogenetic trees using binary character states (referred to as BNPP). A perfect phylogeny assumes that every character mutates at most once in the evolutionary tree, yielding an algorithm for binary character states that is computationally efficient but not robust to imperfections in real data. A near-perfect phylogeny relaxes the perfect phylogeny assumption by allowing at most a constant number of additional mutations. We develop two algorithms for constructing optimal near-perfect phylogenies and provide empirical evidence of their performance. The first simple algorithm is fixed parameter tractable when the number of additional mutations and the number of characters that share four gametes with some other character are constants. The second, more involved algorithm for the problem is fixed parameter tractable when only the number of additional mutations is fixed. We have implemented both algorithms and shown them to be extremely efficient in practice on biologically significant data sets. This work proves the BNPP problem fixed parameter tractable and provides the first practical phylogenetic tree reconstruction algorithms that find guaranteed optimal solutions while being easily implemented and computationally feasible for data sets of biologically meaningful size and complexity.

3.
New algorithms for calculating the most parsimonious state sets for polytomies under Fitch parsimony are described. Because they are based on state set operations, these algorithms can be extended for optimization of several characters in parallel, thus increasing speed by a significant factor. This speed increase may facilitate analysis of molecular data sets, many of which contain hundreds of taxa, thousands of multistate nonadditive characters, and numerous polytomies.
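As a rough illustration of state-set optimization at a polytomy, the sketch below uses Hartigan-style counting for one nonadditive character; this is a standard textbook generalization of Fitch's down-pass and is not claimed to be the algorithm of the paper. It assumes a tree written as nested tuples of any arity and a hypothetical function name `hartigan_length`. At each node the states found in the largest number of children are kept, and the node adds (number of children minus that count) steps.

```python
from collections import Counter


def hartigan_length(tree, leaf_states):
    """Return (state_set, steps) for one unordered character.

    tree        -- a taxon name, or a tuple of two or more subtrees
    leaf_states -- maps taxon name to an iterable of observed states,
                   e.g. "A" or "AG" for an ambiguous observation
    """
    if isinstance(tree, str):
        return frozenset(leaf_states[tree]), 0

    counts = Counter()  # in how many children's state sets each state occurs
    steps = 0
    for child in tree:
        child_set, child_steps = hartigan_length(child, leaf_states)
        steps += child_steps
        counts.update(child_set)

    best = max(counts.values())
    state_set = frozenset(s for s, k in counts.items() if k == best)
    # Every child lacking the chosen state costs one extra step.
    return state_set, steps + len(tree) - best
```

For a node with exactly two children this reduces to the usual Fitch rule: intersection if it is non-empty, otherwise union plus one step.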

4.
Phylogeny reconstruction is a difficult computational problem, because the number of possible solutions increases with the number of included taxa. For example, for only 14 taxa, there are more than seven trillion possible unrooted phylogenetic trees. For this reason, phylogenetic inference methods commonly use clustering algorithms (e.g., the neighbor-joining method) or heuristic search strategies to minimize the amount of time spent evaluating nonoptimal trees. Even heuristic searches can be painfully slow, especially when computationally intensive optimality criteria such as maximum likelihood are used. I describe here a different approach to heuristic searching (using a genetic algorithm) that can tremendously reduce the time required for maximum-likelihood phylogenetic inference, especially for data sets involving large numbers of taxa. Genetic algorithms are simulations of natural selection in which individuals are encoded solutions to the problem of interest. Here, labeled phylogenetic trees are the individuals, and differential reproduction is effected by allowing the number of offspring produced by each individual to be proportional to that individual's rank likelihood score. Natural selection increases the average likelihood in the evolving population of phylogenetic trees, and the genetic algorithm is allowed to proceed until the likelihood of the best individual ceases to improve over time. An example is presented involving rbcL sequence data for 55 taxa of green plants. The genetic algorithm described here required only 6% of the computational effort required by a conventional heuristic search using tree bisection/reconnection (TBR) branch swapping to obtain the same maximum-likelihood topology.
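The growth in the number of trees can be checked with the standard counting formulas for fully resolved (bifurcating) trees with n labeled tips: (2n − 5)!! unrooted and (2n − 3)!! rooted. The function names below are hypothetical. Under these formulas, 14 taxa give about 3.2 × 10^11 unrooted trees; the "more than seven trillion" figure in the abstract appears to correspond to the rooted count for 14 taxa, which equals the unrooted count for 15 taxa.

```python
def odd_double_factorial(k):
    """Product 1 * 3 * 5 * ... * k for odd k (returns 1 when k <= 1)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3 labeled tips."""
    return odd_double_factorial(2 * n_taxa - 5)


def rooted_trees(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2 labeled tips."""
    return odd_double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (10, 14, 15):
        print(n, unrooted_trees(n), rooted_trees(n))
```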

5.
Recent studies have shown that addition or deletion of taxa from a data matrix can change the estimate of phylogeny. I used 29 data sets from the literature to examine the effect of taxon sampling on phylogeny estimation within data sets. I then used multiple regression to assess the effect of number of taxa, number of characters, homoplasy, strength of support, and tree symmetry on the sensitivity of data sets to taxonomic sampling. Sensitivity to sampling was measured by mapping characters from a matrix of culled taxa onto optimal trees for that reduced matrix and onto the pruned optimal tree for the entire matrix, then comparing the length of the reduced tree to the length of the pruned complete tree. Within-data-set patterns can be described by a second-order equation relating fraction of taxa sampled to sensitivity to sampling. Multiple regression analyses found number of taxa to be a significant predictor of sensitivity to sampling; retention index, number of informative characters, total support index, and tree symmetry were nonsignificant predictors. I derived a predictive regression equation relating fraction of taxa sampled and number of taxa potentially sampled to sensitivity to taxonomic sampling and calculated values for this equation within the bounds of the variables examined. The length difference between the complete tree and a subsampled tree was generally small (average difference of 0-2.9 steps), indicating that subsampling taxa is probably not an important problem for most phylogenetic analyses using up to 20 taxa.

6.
Even when the maximum likelihood (ML) tree is a better estimate of the true phylogenetic tree than those produced by other methods, the result of a poor ML search may be no better than that of a more thorough search under some faster criterion. The ability to find the globally optimal ML tree is therefore important. Here, I compare a range of heuristic search strategies (and their associated computer programs) in terms of their success at locating the ML tree for 20 empirical data sets with 14 to 158 sequences and 411 to 120,762 aligned nucleotides. Three distinct topics are discussed: the success of the search strategies in relation to certain features of the data, the generation of starting trees for the search, and the exploration of multiple islands of trees. As a starting tree, there was little difference among the neighbor-joining tree based on absolute differences (including the BioNJ tree), the stepwise-addition parsimony tree (with or without nearest-neighbor-interchange (NNI) branch swapping), and the stepwise-addition ML tree. The latter produced the best ML score on average but was orders of magnitude slower than the alternatives. The BioNJ tree was second best on average. As search strategies, star decomposition and quartet puzzling were the slowest and produced the worst ML scores. The DPRml, IQPNNI, MultiPhyl, PhyML, PhyNav, and TreeFinder programs with default options produced qualitatively similar results, each locating a single tree that tended to be in an NNI suboptimum (rather than the global optimum) when the data set had low phylogenetic information. For such data sets, there were multiple tree islands with very similar ML scores. The likelihood surface only became relatively simple for data sets that contained approximately 500 aligned nucleotides for 50 sequences and 3,000 nucleotides for 100 sequences. The RAxML and GARLI programs allowed multiple islands to be explored easily, but both programs also tended to find NNI suboptima. 
A newly developed version of the likelihood ratchet using PAUP* successfully found the peaks of multiple islands, but its speed needs to be improved.

7.
Abstract: Several algorithms to speed up branch swapping searches for most parsimonious trees are described. The method for indirect tree length calculation when moving a clipped clade, based on final states for the divided tree, is expanded to take into account polymorphic characters, and to include the possibility of rejecting several locations as suboptimal by checking just one node. Three different algorithms for faster estimation of final state assignments for the divided tree based on calculations for the whole tree are presented. The first of these is approximate; it uses information from the final state sets for the whole tree. The second is exact, but it is slower than the first and requires more memory; it is based on the union of the state sets of the descendants for each node. The third is also exact; it requires more memory and programming effort than the other two, but it is faster; it is based on final and preliminary state sets for the whole tree ("incremental two-pass optimization"). Efficient ways to derive state assignments for collapsing trees, based on final states for the divided tree, are described. The recently proposed method of "incremental optimization" is discussed. It is likely that searches using that method will be no faster than searches using indirect calculation as originally described, and will be considerably slower than the modified indirect calculation described here. Searches using that method will probably be significantly slowed down when zero-length branches are to be collapsed, since shortcuts for faster collapsing are not directly applicable.

8.
Molecular and morphological data sets have yielded conflicting phylogenies for the Metazoa. So far, no general explanation for the existence of this conflict has been suggested. However, I believe that a neglected aspect of metazoan cladistics has introduced a systematic and substantial bias into morphological phylogenetic analyses. Most characters used for metazoan cladistics are coded as binary absence/presence characters. For most of these characters, the absence states are assumed to be uninformative default plesiomorphies, if they are defined at all. This character coding strategy could seriously underestimate the number of informative apomorphic absences or secondary character losses. Because nodes in morphological metazoan phylogenies are typically supported by relatively small numbers of characters each with a potentially strong impact on tree topology, failure to distinguish between primary absence and secondary loss of characters before a cladistic analysis may mislead morphological cladistics. This may falsely suggest conflict with molecular phylogenies, which are not sensitive to this bias. To test the existence of this bias, I compare the phylogenetic placement of a variety of metazoan taxa in molecular and morphological trees. In all instances investigated here, phylogenetic conflict can be resolved by allowing for secondary loss of morphological characters, which were assumed to be primitively absent in cladistic analyses. These findings suggest that we should be cautious in interpreting the results of morphological metazoan cladistic analyses and additionally illustrate the value of a more functional approach to comparative morphology in certain circumstances.

9.
Tree search and its more complicated variant, tree search and simultaneous multiple DNA sequence alignment, are difficult NP-complete optimization problems, which require the application of advanced computational techniques, if large data sets are to be solved within reasonable computation times. Traditionally tree search has been attacked with a search strategy that is best described as multistart hill-climbing; local search by branch swapping has been performed on several different starting trees. Recently a different tree search strategy was tested in the Parsigal parsimony program, which used a combination of evolutionary optimization and local search. Evolutionary optimization algorithms use principles adopted from biological evolution to solve technical optimization tasks. Evolutionary optimization is a stochastic global search method, which means that the method is able to escape local optima, and is in principle able to produce any solution in the search space (although this may take a long time). Local search techniques, such as branch swapping, employ a completely different search strategy; they exploit local information maximally in order to achieve quick improvement in the value of the objective function. However, local search algorithms lack the ability to escape from local optima, which is a fundamental requirement for any search algorithm that aims to be able to discover the global optimum of a multimodal optimization problem. Hence it seems that an optimization strategy combining the good properties of both evolutionary algorithms and local search would be ideal. In this study, aspects of global optimization and local search are discussed, and the method of simulated evolutionary optimization is reviewed in detail. The application of simulated evolutionary optimization to tree search in Parsigal is then reviewed briefly.

10.
Vos RA. Systematic Biology, 2003, 52(3): 368-373
The existence of multiple likelihood maxima necessitates algorithms that explore a large part of the tree space. However, because of computational constraints, stepwise addition-based tree-searching methods do not allow for this exploration in reasonable time. Here, I present an algorithm that increases the speed at which the likelihood landscape can be explored. The iterative algorithm combines the computational speed of distance-based tree construction methods to arrive at approximations of the global optimum with the accuracy of optimality criterion based branch-swapping methods to improve on the result of the starting tree. The algorithm moves between local optima by iteratively perturbing the tree landscape through a process of reweighting randomly drawn samples of the underlying sequence data set. Tests on simulated and real data sets demonstrated that the optimal solution obtained using stepwise addition-based heuristic searches was found faster using the algorithm presented here. Tests on a previously published data set that established the presence of tree islands under maximum likelihood demonstrated that the algorithm identifies the same tree islands in a shorter amount of time than that needed using stepwise addition. The algorithm can be readily applied using standard software for phylogenetic inference.

11.
Likelihood applications have become a central approach for molecular evolutionary analyses since the first computationally tractable treatment two decades ago. Although Felsenstein's original pruning algorithm makes likelihood calculations feasible, it is usually possible to take advantage of repetitive structure present in the data to arrive at even greater computational reductions. In particular, alignment columns with certain similarities have components of the likelihood calculation that are identical and need not be recomputed if columns are evaluated in an optimal order. We develop an algorithm for exploiting this speed improvement via an application of graph theory. The reductions provided by the method depend on both the tree and the data, but typical savings range between 15% and 50%. Real-data examples with time reductions of 80% have been identified. The overhead costs associated with implementing the algorithm are minimal, and they are recovered in all but the smallest data sets. The modifications will provide faster likelihood algorithms, which will allow likelihood methods to be applied to larger sets of taxa and to include more thorough searches of the tree topology space.
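The simplest form of this kind of saving is to evaluate each distinct alignment column only once and weight it by how often it occurs. The sketch below shows only that special case; the method in the abstract goes further by sharing partial computations between columns that are similar but not identical. The alignment is assumed to be a list of equal-length strings (one per taxon), and `site_likelihood` is a hypothetical callable standing in for a pruning-algorithm evaluation of a single column.

```python
from collections import Counter
import math


def compress_columns(alignment):
    """Map each distinct alignment column (a tuple of states) to its count."""
    return Counter(zip(*alignment))


def log_likelihood(alignment, site_likelihood):
    """Sum of per-site log-likelihoods, calling site_likelihood once per pattern.

    site_likelihood -- hypothetical callable: column tuple -> likelihood > 0
    """
    patterns = compress_columns(alignment)
    return sum(count * math.log(site_likelihood(column))
               for column, count in patterns.items())
```

With many taxa, fully identical columns become rarer, which is one reason a finer-grained scheme such as the one described above can save more.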

12.
To tree or not to tree   Total citations: 2 (self-citations: 1, citations by others: 1)
The practice of tracking geographical divergence along a phylogenetic tree has added an evolutionary perspective to biogeographic analysis within single species. In spite of the popularity of phylogeography, there is an emerging problem. Recurrent mutation and recombination both create homoplasy, multiple evolutionary occurrences of the same character that are identical in state but not identical by descent. Homoplasic molecular data are phylogenetically ambiguous. Converting homoplasic molecular data into a tree represents an extrapolation, and there can be myriad candidate trees among which to choose. Derivative biogeographic analyses of 'the tree' are analyses of that extrapolation, and the results depend on the tree chosen. I explore the informational aspects of converting a multicharacter data set into a phylogenetic tree, and then explore what happens when that tree is used for population analysis. Three conclusions follow: (i) some trees are better than others; good trees are true to the data, whereas bad trees are not; (ii) for biogeographic analysis, we should use only good trees, which yield the same biogeographic inference as the phenetic data, but little more; and (iii) the reliable biogeographic inference is inherent in the phenetic data, not the trees.

13.
The “tendency” for homoplasy to appear in closely related taxa has been widely discussed but rarely quantified. This paper proposes statistical tests that examine the topological distribution of homoplasy within characters in phylogenies. They test whether character changes are localized (confined to some subtree), or clustered (occur in proximity to each other), relative to two null models of character evolution. Null Model I assumes that the observed number of character changes are dispersed randomly among the internodes of the tree, whereas Model II weights the probability that an internode contains a change by the length of that internode, estimated by the total number of character changes along that internode. Localization is measured by the largest furthest-neighbor distance between changes, clustering by the mean nearest neighbor distance. Distances are measured either by the number of intervening branches or the number of intervening character changes. Analyses of four cladistic data sets from the literature reveal very few characters that exhibit significant levels of clustering or localization, no more than would be expected by chance. In every data set a majority of characters exhibited at least weak tendencies, but in only one data set was there a significant excess of such characters. The present findings do not provide compelling evidence for the existence of “tendencies” in homoplasy, at least among characters used to reconstruct phylogenies. They should be sought elsewhere, in cladistic analyses of larger scope, probably among a class of characters defined a priori on a structural or functional basis.

14.
Background: The frequency of small subtrees in biological, social, and other types of networks could shed light on the structure, function, and evolution of such networks. However, counting all possible subtrees of a prescribed size can be computationally expensive because of their potentially large number even in small, sparse networks. Moreover, most of the existing algorithms for subtree counting belong to the subtree-centric approaches, which search for a specific single subtree type at a time, potentially taking more time by searching again on the same network. Methods: In this paper, we propose a network-centric algorithm (MTMO) to efficiently count k-size subtrees. Our algorithm is based on the enumeration of all connected sets of k − 1 edges, incorporates a labeled rooted tree data structure in the enumeration process to reduce the number of isomorphism tests required, and uses an array-based indexing scheme to simplify the subtree counting method. Results: The experiments on three representative undirected complex networks show that our algorithm is roughly an order of magnitude faster than existing subtree-centric approaches and than the base network-centric algorithm, which does not use a rooted tree, allowing for counting larger subtrees in larger networks than previously possible. We also show major differences between unicellular and multicellular organisms. In addition, our algorithm is applied to find network motifs based on a pattern growth approach. Conclusions: A network-centric algorithm that allows faster counting of non-induced subtrees is proposed. This enables us to count larger motifs in larger networks than was previously possible.

15.
Phylogenetic tree estimation plays a critical role in a wide variety of molecular studies, including molecular systematics, phylogenetics, and comparative genomics. Finding the optimal tree relating a set of sequences using score-based (optimality criterion) methods, such as maximum likelihood and maximum parsimony, may require all possible trees to be considered, which is not feasible even for modest numbers of sequences. In practice, trees are estimated using heuristics that represent a trade-off between topological accuracy and speed. I present a series of novel algorithms suitable for score-based phylogenetic tree reconstruction that demonstrably improve the accuracy of tree estimates while maintaining high computational speeds. The heuristics function by allowing the efficient exploration of large numbers of trees through novel hill-climbing and resampling strategies. These heuristics, and other computational approximations, are implemented for maximum likelihood estimation of trees in the program Leaphy, and its performance is compared to other popular phylogenetic programs. Trees are estimated from 4059 different protein alignments using a selection of phylogenetic programs and the likelihoods of the tree estimates are compared. Trees estimated using Leaphy are found to have equal to or better likelihoods than trees estimated using other phylogenetic programs in 4004 (98.6%) families and provide a unique best tree that no other program found in 1102 (27.1%) families. The improvement is particularly marked for larger families (80 to 100 sequences), where Leaphy finds a unique best tree in 81.7% of families.

16.
Phylogenetic analysis is becoming an increasingly important tool for biological research. Applications include epidemiological studies, drug development, and evolutionary analysis. Phylogenetic search is a known NP-Hard problem. The size of the data sets which can be analyzed is limited by the exponential growth in the number of trees that must be considered as the problem size increases. A better understanding of the problem space could lead to better methods, which in turn could lead to the feasible analysis of more data sets. We present a definition of phylogenetic tree space and a visualization of this space that shows significant exploitable structure. This structure can be used to develop search methods capable of handling much larger data sets.

17.
Combining data sets with different phylogenetic histories   Total citations: 1 (self-citations: 0, citations by others: 1)
The possibility that two data sets may have different underlying phylogenetic histories (such as gene trees that deviate from species trees) has become an important argument against combining data in phylogenetic analysis. However, two data sets sampled for a large number of taxa may differ in only part of their histories. This is a realistic scenario and one in which the relative advantages of combined, separate, and consensus analysis become much less clear. I propose a simple methodology for dealing with this situation that involves (1) partitioning the available data to maximize detection of different histories, (2) performing separate analyses of the data sets, and (3) combining the data but considering questionable or unresolved those parts of the combined tree that are strongly contested in the separate analyses (and which therefore may have different histories) until a majority of unlinked data sets support one resolution over another. In support of this methodology, computer simulations suggest that (1) the accuracy of combined analysis for recovering the true species phylogeny may exceed that of either of two separately analyzed data sets under some conditions, particularly when the mismatch between phylogenetic histories is small and the estimates of the underlying histories are imperfect (few characters, high homoplasy, or both) and (2) combined analysis provides a poor estimate of the species tree in areas of the phylogenies with different histories but gives an improved estimate in regions that share the same history. Thus, when there is a localized mismatch between the histories of two data sets, the separate, consensus, and combined analyses may all give unsatisfactory results in certain parts of the phylogeny. Similarly, approaches that allow data combination only after a global test of heterogeneity will suffer from the potential failings of either separate or combined analysis, depending on the outcome of the test. 
Excision of conflicting taxa is also problematic, in that doing so may obfuscate the position of conflicting taxa within a larger tree, even when their placement is congruent between data sets. Application of the proposed methodology to molecular and morphological data sets for Sceloporus lizards is discussed.

18.
19.
Abstract: Absolute criteria for evaluating cladistic analyses are useful, not only because cladistic algorithms impose structure, but also because applications of cladistic results demand some assessment of the degree of corroboration of the cladogram. Here, a means of quantitative evaluation is presented based on tree length. The length of the most-parsimonious tree reflects the degree to which the observed characters co-vary such that a single tree topology can explain shared character states among the taxa. This “cladistic covariation” can be quantified by comparing the length of the most parsimonious tree for the observed data set to that found for data sets with random covariation of characters. A random data set is defined as one in which the original number of characters and their character states are maintained, but for each character, the states are randomly reassigned to the taxa. The cladistic permutation tail probability, PTP, is defined as the estimate of the proportion of times that a tree can be found as short or shorter than the original tree. Significant cladistic covariation exists if the PTP is less than a prescribed value, for example, 0.05. In case studies based on molecular and morphological data sets, application of the PTP shows that:
  • 1 In the comparison of four different molecular data sets for orders of mammals, the sequence data set for alpha hemoglobin does not have significant cladistic covariation, while that for alpha crystallin is highly significant. However, when each data set was reduced to the 11 common taxa in order to standardize comparison, reduced levels of cladistic covariation, with no clear superiority of the alpha crystallin data, were found. Morphological data for these 11 taxa had a highly significant PTP, producing a tree roughly congruent with those for the three molecular sets with marginal or significant PTP values. Merging of all data sets, with the exclusion of the poorly structured alpha hemoglobin data, produced a data set with a significant PTP, and provides an estimate of the phylogenetic relationships among these 11 orders of mammals.
  • 2 In an analysis of lactalbumin and lysozyme DNA sequence data for four taxa, purine-pyrimidine coding yields a data set with significant cladistic covariation, while other codings fail. The data for codon position 3 taken alone exhibit the strongest cladistic covariation.
  • 3 A data set based on flavonoids in taxa of Polygonum initially yields a significant PTP; however, deletion of identically scored taxa leaves no significant cladistic covariation.
  • 4 For mitochondrial DNA data on population genome types for four species of the crested newt, there is significant cladistic covariation for the set of all genome types, and among the five mtDNA genome types within one of the species. However, a conditional PTP test that assumes species monophyly shows that no significant cladistic covariation exists among the four species for these data.
  • 5 In an application of the test to a group of freshwater insects, as preliminary to biological monitoring, individual subsets of the taxonomic data representing larval, pupal, and adult stages had non-significant PTPs, while the complete data set showed significant cladistic structure.
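The permutation procedure defined in this abstract can be sketched as follows. The data matrix is assumed to be a list of rows (taxa) of character states, and `shortest_length` is a hypothetical callable that returns the length of the most parsimonious tree for a matrix, for example by wrapping a parsimony program. Counting the observed data as one of the replicates, which gives (hits + 1) / (n_perm + 1), is a common convention assumed here rather than something the abstract specifies.

```python
import random


def permute_characters(matrix, rng):
    """Reassign the states of each character (column) randomly among taxa.

    Each column keeps exactly the same states; only their order changes.
    """
    columns = [list(col) for col in zip(*matrix)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def ptp(matrix, shortest_length, n_perm=99, seed=0):
    """Permutation tail probability for the observed matrix.

    shortest_length -- hypothetical callable: matrix -> minimum tree length
    """
    rng = random.Random(seed)
    observed = shortest_length(matrix)
    hits = sum(
        shortest_length(permute_characters(matrix, rng)) <= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

With `n_perm=99`, the smallest attainable value is 0.01, so a threshold of 0.05 is reachable; far fewer permutations would not be enough to resolve it.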

20.
THE EFFECT OF ORDERED CHARACTERS ON PHYLOGENETIC RECONSTRUCTION   Total citations: 2 (self-citations: 0, citations by others: 2)
Abstract: Morphological structures are likely to undergo more than a single change during the course of evolution. As a result, multistate characters are common in systematic studies and must be dealt with. Particularly interesting is the question of whether multistate characters should be treated as ordered (additive) or unordered (non-additive). In accepting a particular hypothesis of order, numerous others are necessarily rejected. We review some of the criteria often used to order character states and the underlying assumptions inherent in these criteria.
The effects that ordered multistate characters can have on phylogenetic reconstruction are examined using 27 data sets. It has been suggested that hypotheses of character state order are more informative than hypotheses of unorder and may restrict the number of equally parsimonious trees as well as increase tree resolution. Our results indicate that ordered characters can produce more, the same number of, or fewer equally parsimonious trees and can increase, decrease or have no effect on tree resolution. The effect on tree resolution can be a simple gain in resolution or a dramatic change in sister-taxa relationships. In cases where several outgroups are included in the data matrix, hypotheses of order can change character polarities by altering outgroup topology. Ordered characters result in a different topology from unordered characters only when the hierarchy of the cladogram disagrees with the investigator's a priori hypothesis of order. If the best criterion for assessing character evolution is congruence with other characters, the practice of ordering multistate characters is inappropriate.
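The difference between the two treatments comes down to the step cost between states: an unordered character charges 1 for any change, while an ordered (additive) character charges the distance between state numbers. A minimal sketch using a generic Sankoff-style down-pass on a fixed tree makes this concrete; it is illustrative only and not the procedure used in the study. States are assumed to be integers 0..n_states-1, the tree is nested tuples of taxon names, and all function names are hypothetical.

```python
def unordered(s, t):
    """Non-additive cost: any change between different states is one step."""
    return int(s != t)


def ordered(s, t):
    """Additive cost: steps equal the distance between state numbers."""
    return abs(s - t)


def sankoff_length(tree, leaf_state, n_states, cost):
    """Minimum number of steps for one character on a fixed tree."""
    def down(node):
        # Returns, for each possible state at this node, the cheapest cost
        # of the subtree below it.
        if isinstance(node, str):
            return [0 if s == leaf_state[node] else float("inf")
                    for s in range(n_states)]
        total = [0] * n_states
        for child in node:
            child_cost = down(child)
            for s in range(n_states):
                total[s] += min(child_cost[t] + cost(s, t)
                                for t in range(n_states))
        return total
    return min(down(tree))
```

For taxa scored 0, 0, 2, 2 on the tree ((a, b), (c, d)), the unordered treatment needs one step and the ordered treatment two, because moving from state 0 to state 2 must pass through state 1.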
