首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 681 毫秒
1.
A classical result in phylogenetic trees is that a binary phylogenetic tree adhering to the molecular clock hypothesis exists if and only if the matrix of distances between taxa is ultrametric. The ultrametric condition is very restrictive. In this paper we study phylogenetic networks that can be constructed assuming the molecular clock hypothesis. We characterize distance matrices that admit such networks for 3 and 4 taxa. We also design two algorithms for constructing networks optimizing the least-squares fit.  相似文献   

2.
Chor B  Snir S 《Systematic biology》2004,53(6):963-967
Maximum likelihood (ML) is increasingly used as an optimality criterion for selecting evolutionary trees, but finding the global optimum is a hard computational task. Because no general analytic solution is known, numeric techniques such as hill climbing or expectation maximization (EM) are used in order to find optimal parameters for a given tree. So far, analytic solutions were derived only for the simplest model-three-taxa, two-state characters, under a molecular clock. Quoting Ziheng Yang, who initiated the analytic approach,"this seems to be the simplest case, but has many of the conceptual and statistical complexities involved in phylogenetic estimation."In this work, we give general analytic solutions for a family of trees with four-taxa, two-state characters, under a molecular clock. The change from three to four taxa incurs a major increase in the complexity of the underlying algebraic system, and requires novel techniques and approaches. We start by presenting the general maximum likelihood problem on phylogenetic trees as a constrained optimization problem, and the resulting system of polynomial equations. In full generality, it is infeasible to solve this system, therefore specialized tools for the molecular clock case are developed. Four-taxa rooted trees have two topologies-the fork (two subtrees with two leaves each) and the comb (one subtree with three leaves, the other with a single leaf). We combine the ultrametric properties of molecular clock fork trees with the Hadamard conjugation to derive a number of topology dependent identities. Employing these identities, we substantially simplify the system of polynomial equations for the fork. We finally employ symbolic algebra software to obtain closed formanalytic solutions (expressed parametrically in the input data). In general, four-taxa trees can have multiple ML points. In contrast, we can now prove that each fork topology has a unique(local and global) ML point.  相似文献   

3.
Maximum likelihood (ML) is increasingly used as an optimality criterion for selecting evolutionary trees, but finding the global optimum is a hard computational task. Because no general analytic solution is known, numeric techniques such as hill climbing or expectation maximization (EM), are used in order to find optimal parameters for a given tree. So far, analytic solutions were derived only for the simplest model--three taxa, two state characters, under a molecular clock. Four taxa rooted trees have two topologies--the fork (two subtrees with two leaves each) and the comb (one subtree with three leaves, the other with a single leaf). In a previous work, we devised a closed form analytic solution for the ML molecular clock fork. In this work, we extend the state of the art in the area of analytic solutions ML trees to the family of all four taxa trees under the molecular clock assumption. The change from the fork topology to the comb incurs a major increase in the complexity of the underlying algebraic system and requires novel techniques and approaches. We combine the ultrametric properties of molecular clock trees with the Hadamard conjugation to derive a number of topology dependent identities. Employing these identities, we substantially simplify the system of polynomial equations. We finally use tools from algebraic geometry (e.g., Gr?bner bases, ideal saturation, resultants) and employ symbolic algebra software to obtain analytic solutions for the comb. We show that in contrast to the fork, the comb has no closed form solutions (expressed by radicals in the input data). In general, four taxa trees can have multiple ML points. In contrast, we can now prove that under the molecular clock assumption, the comb has a unique (local and global) ML point. (Such uniqueness was previously shown for the fork.).  相似文献   

4.
5.
Evolutionary distinctiveness measures of how evolutionarily isolated a species is relative to other members of its clade. Recently, distinctiveness metrics that explicitly incorporate time have been proposed for conservation prioritization. However, we found that such measures differ qualitatively in how well they capture the total amount of evolution (termed phylogenetic diversity, or PD) represented by a set of species. We used simulation and simple graph theory to explore this relationship with reference to phylogenetic tree shape. Overall, the distinctiveness measures capture more PD on more unbalanced trees and on trees with many splits near the present. The rank order of performance was robust across tree shapes, with apportioning measures performing best and node-based measures performing worst. A sample of 50 ultrametric trees from the literature showed the same patterns. Taken together, this suggests that distinctiveness metrics may be a useful addition to other measures of value for conservation prioritization of species. The simplest measure, the age of a species, performed surprisingly well, suggesting that new measures that focus on tree shape near the tips may provide a transparent alternative to more complicated full-tree approaches.  相似文献   

6.
The accuracy of the Fitch method for reconstructing ancestral states on ultrametric phylogenetic trees is studied. Two recurrence relations for computing the accuracy are given here. Using these relations, we analyze the convergence of the accuracy of the Fitch method for reconstructing the root state on a complete binary tree of 2 n leaves as n goes to infinity, present a closed-form formula for the accuracy on ultrametric comb trees, and provide a lower bound on the accuracy on arbitrary ultrametric phylogenetic trees.  相似文献   

7.
The advent of high-throughput technologies and the concurrent advances in information sciences have led to an explosion in size and complexity of the data sets collected in biological sciences. The biggest challenge today is to assimilate this wealth of information into a conceptual framework that will help us decipher biological functions. A large and complex collection of data, usually called a data cloud, naturally embeds multi-scale characteristics and features, generically termed geometry. Understanding this geometry is the foundation for extracting knowledge from data. We have developed a new methodology, called data cloud geometry-tree (DCG-tree), to resolve this challenge. This new procedure has two main features that are keys to its success. Firstly, it derives from the empirical similarity measurements a hierarchy of clustering configurations that captures the geometric structure of the data. This hierarchy is then transformed into an ultrametric space, which is then represented via an ultrametric tree or a Parisi matrix. Secondly, it has a built-in mechanism for self-correcting clustering membership across different tree levels. We have compared the trees generated with this new algorithm to equivalent trees derived with the standard Hierarchical Clustering method on simulated as well as real data clouds from fMRI brain connectivity studies, cancer genomics, giraffe social networks, and Lewis Carroll''s Doublets network. In each of these cases, we have shown that the DCG trees are more robust and less sensitive to measurement errors, and that they provide a better quantification of the multi-scale geometric structures of the data. As such, DCG-tree is an effective tool for analyzing complex biological data sets.  相似文献   

8.
A central challenge facing the temporal calibration of molecular phylogenies is finding a quantitative method for estimating maximum age constraints on lineage divergence times. Here, I provide such a method. This method requires an ultrametric tree generated without reference to the fossil record. Exploiting the fact that the relative branch lengths on the ultrametric tree are proportional to time, this method identifies the lineage with the greatest proportion of its true temporal range covered by the fossil record. The oldest fossil of this calibration lineage is used as the minimum age constraint. The maximum age constraint is obtained by adding a confidence interval onto the end point of the calibration lineage, thus making it possible to bracket the true divergence times of all lineages on the tree. The approach can also identify fossils that have been grossly misdated or misassigned to the phylogeny. The method assumes that the relative branch lengths on the ultrametric tree are accurate and that fossilization is random. The effect of violations of these assumptions is assessed. This method is simple to use and is illustrated with a reanalysis of Near et al.'s turtle data.  相似文献   

9.
MOTIVATION: Reconstructing evolutionary trees is an important problem in biology. A response to the computational intractability of most of the traditional criteria for inferring evolutionary trees has been a focus on new criteria, particularly quartet-based methods that seek to merge trees derived on subsets of four species from a given species-set into a tree for that entire set. Unfortunately, most of these methods are very sensitive to errors in the reconstruction of the trees for individual quartets of species. A recently developed technique called quartet cleaning can alleviate this difficulty in certain cases by using redundant information in the complete set of quartet topologies for a given species-set to correct such errors. RESULTS: In this paper, we describe two new local vertex quartet cleaning algorithms which have optimal time complexity and error-correction bound, respectively. These are the first known local vertex quartet cleaning algorithms that are optimal with respect to either of these attributes.  相似文献   

10.
Molecular phylogenies are often used to test hypotheses about the tempo and mode of speciation and extinction. One commonly used statistic is Pybus and Harvey's γ, which measures the density of ordered internode distances on an ultrametric tree to infer earlier (negative γ) or later (positive γ) bursts of diversification. However, coalescent theory predicts that γ might be biased toward negative values (inferring early bursts of diversification) when using gene trees rather than species trees. Gene divergences predate species divergences, increasingly so at higher effective population sizes (N(e)), and proportionally more so toward the tips of the tree. Thus, gene trees will have a higher density of older nodes in many cases (particularly at higher N(e)), due to the disproportionate lengthening of terminal branches. This will yield an artifactual signature of early bursts of diversification when estimating γ from gene trees. We simulate gene trees within species trees under both Yule (pure-birth) and birth-death processes, and demonstrate support for these predictions. However, for most realistic estimates of θ in natural populations, gene trees provide relatively good estimates of γ, despite the disproportionate overestimation of younger node ages. This is corroborated with an empirical dataset of North American fence lizards (Sceloporus).  相似文献   

11.
Various DNA sequence-based methods for species delineation have recently been developed to assess the species-richness of highly diverse, neglected invertebrate taxa. These methods, however, need to be tested under a variety of conditions, including the use of different markers and parameters. Here, we explored the species diversity of a species-rich group of braconid parasitoid wasps, the Neotropical genus Notiospathius, including 233 specimens from 10 different countries. We examined sequences of two mitochondrial (mt) (COI, cyt b) and one nuclear (wg) gene fragments. We analysed them separately as well as concatenating the mt data with the general mixed Yule-coalescent (GMYC) model for species delineation using different tree-building methods and parameters for reconstructing ultrametric trees. We evaluated the performance of GMYC analyses by comparing their species delineations with our morphospecies identifications. Reconstructing ultrametric trees with a relaxed lognormal clock rate using the program BEAST gave the most congruent results with morphology for the two mt markers. A tree obtained with wg using the programs MrBayes+Pathd8 had the fewest cases of incongruence with morphology, though the performance of this nuclear marker was considerably lower than that of COI and cyt b. Species delimitation using the coalescent prior to obtain ultrametric trees was morphologically more congruent with COI, whereas the Yule prior was more congruent with cyt b. The analyses concatenating the mt datasets failed to recover some species supported both by morphology and the separate analyses of the mt markers. The highest morphological congruence was obtained with the GMYC analysis on an ultrametric tree reconstructed with cyt b using the relaxed lognormal clock rate and the Yule prior, thus supporting the importance of using alternative markers when the information of the barcoding locus (COI) is not concordant with morphological evidence. Seventy-one species were delimited based on the congruence found among COI, cyt b and morphology. Both mt markers also revealed the existence of seven potential cryptic species. This high species richness from a scattered geographical sampling indicates that there is a remarkable number of Notiospathius species that remains undiscovered.  相似文献   

12.
The Noah's Ark Problem (NAP) is a comprehensive cost-effectiveness methodology for biodiversity conservation that was introduced by Weitzman (1998) and utilizes the phylogenetic tree containing the taxa of interest to assess biodiversity. Given a set of taxa, each of which has a particular survival probability that can be increased at some cost, the NAP seeks to allocate limited funds to conserving these taxa so that the future expected biodiversity is maximized. Finding optimal solutions using this framework is a computationally difficult problem to which a simple and efficient "greedy" algorithm has been proposed in the literature and applied to conservation problems. We show that, although algorithms of this type cannot produce optimal solutions for the general NAP, there are two restricted scenarios of the NAP for which a greedy algorithm is guaranteed to produce optimal solutions. The first scenario requires the taxa to have equal conservation cost; the second scenario requires an ultrametric tree. The NAP assumes a linear relationship between the funding allocated to conservation of a taxon and the increased survival probability of that taxon. This relationship is briefly investigated and one variation is suggested that can also be solved using a greedy algorithm.  相似文献   

13.
A method to assess the cost of character state transformations based on their congruence is proposed. Measuring the distortion of different transformations with a convex increasing function of the number of transformations, and choosing those reconstructions which minimize the distortion for all transformations, may provide a better optimality criterion than the linear functions implemented in currently used methods for optimization. If trees are optimized using such a measure, transformation costs are dynamically determined during reconstructions; this leads to selecting trees implying that the possible state transformations are as reliable as possible. The present method is not iterative (thus avoiding the concern of different final results for different starting points), and it has an explicit optimality criterion. It has a high computational cost; algorithms to lessen the computations required for optimizations and searches are described.  相似文献   

14.
In many applications, one may need to characterize a given network among a large set of base networks, and these networks are large in size and diverse in structure over the search space. In addition, the characterization algorithms are required to have low volatility and with a small circle of uncertainty. For large datasets, these algorithms are computationally intensive and inefficient. However, under the context of network mining, a major concern of some applications is speed. Hence, we are motivated to develop a fast characterization algorithm, which can be used to quickly construct a graph space for analysis purpose. Our approach is to transform a network characterization measure, commonly formulated based on similarity matrices, into simple vector form signatures. We shall show that the similarity matrix can be represented by a dyadic product of two N-dimensional signature vectors; thus the network alignment process, which is usually solved as an assignment problem, can be reduced into a simple alignment problem based on separate signature vectors.  相似文献   

15.
MOTIVATION: Maximum-likelihood analysis of nucleotide and amino acid sequences is a powerful approach for inferring phylogenetic relationships and for comparing evolutionary hypotheses. Because it is a computationally demanding and time-consuming process, most algorithms explore only a minute portion of tree-space, with the emphasis on finding the most likely tree while ignoring the less likely, but not significantly worse, trees. However, when such trees exist, it is equally important to identify them to give due consideration to the phylogenetic uncertainty. Consequently, it is necessary to change the focus of these algorithms such that near optimal trees are also identified. RESULTS: This paper presents the Advanced Stepwise Addition Algorithm for exploring tree-space and two algorithms for generating all binary trees on a set of sequences. The Advanced Stepwise Addition Algorithm has been implemented in TrExML, a phylogenetic program for maximum-likelihood analysis of nucleotide sequences. TrExML is shown to be more effective at finding near optimal trees than a similar program, fastDNAml, implying that TrExML offers a better approach to account for phylogenetic uncertainty than has previously been possible. A program, TreeGen, is also described; it generates binary trees on a set of sequences allowing for extensive exploration of tree-space using other programs. AVAILABILITY: TreeGen, TrExML, and the sequence data used to test the programs are available from the following two WWW sites: http://whitetail.bemidji.msus. edu/trexml/and http://jcsmr.anu.edu.au/dmm/humgen.+ ++html.  相似文献   

16.
Identifying the common structural elements of functionally related RNA sequences (family) is usually based on an alignment of the sequences, which is often subject to human bias and may not be accurate. The resulting covariance model (CM) provides probabilities for each base to covary with another, which allows to support evolutionarily the formation of double helical regions and possibly pseudoknots. The coexistence of alternative folds in RNA, resulting from its dynamic nature, may lead to the potential omission of motifs by CM. To overcome this limitation, we present D-ORB, a system of algorithms that identifies overrepresented motifs in the secondary conformational landscapes of a family when compared to those of unrelated sequences. The algorithms are bundled into an easy-to-use website allowing users to submit a family, and optionally provide unrelated sequences. D-ORB produces a non-pseudoknotted secondary structure based on the overrepresented motifs, a deep neural network classifier and two decision trees. When used to model an Rfam family, D-ORB fits overrepresented motifs in the corresponding Rfam structure; more than a hundred Rfam families have been modeled. The statistical approach behind D-ORB derives the structural composition of an RNA family, making it a valuable tool for analyzing and modeling it. Its easy-to-use interface and advanced algorithms make it an essential resource for researchers studying RNA structure. D-ORB is available at https://d-orb.major.iric.ca/.  相似文献   

17.
The most widely used evolutionary model for phylogenetic trees is the equal-rates Markov (ERM) model. A problem is that the ERM model predicts less imbalance than observed for trees inferred from real data; in fact, the observed imbalance tends to fall between the values predicted by the ERM model and those predicted by the proportional-to-distinguishable-arrangements (PDA) model. Here, a continuous multi-rate (MR) family of evolutionary models is presented which contains entire subfamilies corresponding to both the PDA and ERM models. Furthermore, this MR family covers an entire range from 'completely balanced' to 'completely unbalanced' models. In particular, the MR family contains other known evolutionary models. The MR family is very versatile and virtually free of assumptions on the character of evolution; yet it is highly susceptible to rigorous analyses. In particular, such analyses help to uncover adaptability, quasi-stabilization and prolonged stasis as major possible causes of the imbalance. However, the MR model is functionally simple and requires only three parameters to reproduce the observed imbalance.  相似文献   

18.
Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.

The datasets are available at http://hydra.icgeb.trieste.it/benchmark.  相似文献   


19.
A family of five periplasmic triheme cytochromes (PpcA-E) was identified in Geobacter sulfurreducens, where they play a crucial role by driving electron transfer from the cytoplasm to the cell exterior and assisting the reduction of extracellular acceptors. The thermodynamic characterization of PpcA using NMR and visible spectroscopies was previously achieved under experimental conditions identical to those used for the triheme cytochrome c7 from Desulfuromonas acetoxidans. Under such conditions, attempts to obtain NMR data were complicated by the relatively fast intermolecular electron exchange. This work reports the detailed thermodynamic characterization of PpcB, PpcD, and PpcE under optimal experimental conditions. The thermodynamic characterization of PpcA was redone under these new conditions to allow a proper comparison of the redox properties with those of other members of this family. The heme reduction potentials of the four proteins are negative, differ from each other, and cover different functional ranges. These reduction potentials are strongly modulated by heme-heme interactions and by interactions with protonated groups (the redox-Bohr effect) establishing different cooperative networks for each protein, which indicates that they are designed to perform different functions in the cell. PpcA and PpcD appear to be optimized to interact with specific redox partners involving e/H+ transfer via different mechanisms. Although no evidence of preferential electron transfer pathway or e/H+ coupling was found for PpcB and PpcE, the difference in their working potential ranges suggests that they may also have different physiological redox partners. This is the first study, to our knowledge, to characterize homologous cytochromes from the same microorganism and provide evidence of their different mechanistic and functional properties. These findings provide an explanation for the coexistence of five periplasmic triheme cytochromes in G. sulfurreducens.  相似文献   

20.
We consider the problem of reconstructing near-perfect phylogenetic trees using binary character states (referred to as BNPP). A perfect phylogeny assumes that every character mutates at most once in the evolutionary tree, yielding an algorithm for binary character states that is computationally efficient but not robust to imperfections in real data. A near-perfect phylogeny relaxes the perfect phylogeny assumption by allowing at most a constant number of additional mutations. We develop two algorithms for constructing optimal near-perfect phylogenies and provide empirical evidence of their performance. The first simple algorithm is fixed parameter tractable when the number of additional mutations and the number of characters that share four gametes with some other character are constants. The second, more involved algorithm for the problem is fixed parameter tractable when only the number of additional mutations is fixed. We have implemented both algorithms and shown them to be extremely efficient in practice on biologically significant data sets. This work proves the BNPP problem fixed parameter tractable and provides the first practical phylogenetic tree reconstruction algorithms that find guaranteed optimal solutions while being easily implemented and computationally feasible for data sets of biologically meaningful size and complexity.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号