Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
As an alternative to parsimony analyses, stochastic models have been proposed (Lewis, 2001; Nylander et al., 2004) for morphological characters, so that maximum likelihood or Bayesian analyses may be used for phylogenetic inference. A key feature of these models is that they account for ascertainment bias, in that only varying, or parsimony-informative, characters are observed. However, statistical consistency of such model-based inference requires that the model parameters be identifiable from the joint distribution they entail, and this issue has not been addressed. Here we prove that parameters for several such models, with finite state spaces of arbitrary size, are identifiable, provided the tree has at least eight leaves. If the tree topology is already known, then seven leaves suffice for identifiability of the numerical parameters. The method of proof involves first inferring a full distribution of both parsimony-informative and non-informative pattern joint probabilities from the parsimony-informative ones, using phylogenetic invariants. The failure of identifiability of the tree parameter for four-taxon trees is also investigated.
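
The ascertainment correction mentioned in this abstract can be illustrated on a toy case. The sketch below is not taken from the paper: it assumes a symmetric k-state (Mk-type) model on just two leaves separated by a branch of length `t`, with hypothetical function names, and conditions the pattern probabilities on the character being variable. On two leaves the conditioned distribution does not depend on `t` at all, which is a small-scale example of why identifiability needs enough leaves.

```python
import itertools
import math


def mk_pair_pattern_probs(k, t):
    """Joint probabilities of (state at leaf 1, state at leaf 2) under a
    symmetric k-state model with uniform root distribution (toy assumption)."""
    decay = math.exp(-k * t / (k - 1))
    p_same = 1.0 / k + (k - 1) / k * decay   # probability of no net change
    p_diff = 1.0 / k - decay / k             # probability of one specific change
    return {
        (i, j): (1.0 / k) * (p_same if i == j else p_diff)
        for i, j in itertools.product(range(k), repeat=2)
    }


def condition_on_variable(probs):
    """Keep only variable patterns and renormalise by 1 - P(constant pattern)."""
    variable = {pat: p for pat, p in probs.items() if len(set(pat)) > 1}
    norm = sum(variable.values())
    return {pat: p / norm for pat, p in variable.items()}


short = condition_on_variable(mk_pair_pattern_probs(k=3, t=0.1))
long_ = condition_on_variable(mk_pair_pattern_probs(k=3, t=1.0))
# Both conditioned distributions are uniform over the k*(k-1) variable patterns.
```

Comparing `short` and `long_` shows that the branch length cannot be recovered from variable-only data in this two-leaf setting.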

2.
3.
A method for computing the likelihood of a set of sequences assuming a phylogenetic network as an evolutionary hypothesis is presented. The approach applies directed graphical models to sequence evolution on networks and is a natural generalization of earlier work by Felsenstein on evolutionary trees, including it as a special case. The likelihood computation involves several steps. First, the phylogenetic network is rooted to form a directed acyclic graph (DAG). Then, applying standard models for nucleotide/amino acid substitution, the DAG is converted into a Bayesian network from which the joint probability distribution involving all nodes of the network can be directly read. The joint probability is explicitly dependent on branch lengths and on recombination parameters (prior probability of a parent sequence). The likelihood of the data assuming no knowledge of hidden nodes is obtained by marginalization, i.e., by summing over all combinations of unknown states. As the number of terms increases exponentially with the number of hidden nodes, a Markov chain Monte Carlo procedure (Gibbs sampling) is used to accurately approximate the likelihood by summing over the most important states only. Investigating a human T-cell lymphotropic virus (HTLV) data set and optimizing both branch lengths and recombination parameters, we find that the likelihood of a corresponding phylogenetic network outperforms a set of competing evolutionary trees. In general, except for the case of a tree, the likelihood of a network will be dependent on the choice of the root, even if a reversible model of substitution is applied. Thus, the method also provides a way in which to root a phylogenetic network by choosing a node that produces a most likely network.
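
The marginalization step described above can be sketched for the tree special case only. This is an illustrative example, not the network method of the abstract: it assumes a Jukes-Cantor model, a hypothetical three-leaf rooted tree with made-up branch lengths, and compares a brute-force sum over hidden-node states with Felsenstein-style pruning.

```python
import itertools
import numpy as np


def jc_matrix(t):
    """Jukes-Cantor transition matrix for branch length t (substitutions/site)."""
    e = np.exp(-4.0 * t / 3.0)
    P = np.full((4, 4), 0.25 - 0.25 * e)
    np.fill_diagonal(P, 0.25 + 0.75 * e)
    return P


# Hypothetical rooted tree: root -> (inner, C), inner -> (A, B).
# Keys name the branch leading to that node; lengths are arbitrary.
BRANCH = {"inner": 0.1, "A": 0.2, "B": 0.3, "C": 0.4}


def site_likelihood_bruteforce(obs):
    """Sum the joint probability over every state of the two hidden nodes."""
    P = {name: jc_matrix(t) for name, t in BRANCH.items()}
    total = 0.0
    for r, i in itertools.product(range(4), repeat=2):
        total += (0.25 * P["inner"][r, i] * P["A"][i, obs["A"]]
                  * P["B"][i, obs["B"]] * P["C"][r, obs["C"]])
    return total


def site_likelihood_pruning(obs):
    """Same quantity via partial likelihoods, computed from the leaves upward."""
    P = {name: jc_matrix(t) for name, t in BRANCH.items()}
    inner_partial = P["A"][:, obs["A"]] * P["B"][:, obs["B"]]
    root_partial = (P["inner"] @ inner_partial) * P["C"][:, obs["C"]]
    return float(0.25 * root_partial.sum())


obs = {"A": 0, "B": 1, "C": 0}  # states coded 0..3 for A, C, G, T
print(site_likelihood_bruteforce(obs), site_likelihood_pruning(obs))
```

The brute-force sum has 4^h terms for h hidden nodes, which is the exponential growth the abstract refers to; on a tree, pruning avoids it, whereas on a network the abstract resorts to sampling.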

4.
Performance measures of phylogenetic estimation methods such as accuracy, consistency, and power are an attempt at summarizing an ensemble of a given estimator's behavior. These summaries characterize an ensemble behavior with a single number, leading to a variety of definitions. In particular, the relationships between different performance measures such as accuracy and consistency or accuracy and error depend on the exact definition of these measures. In addition, it is relatively common to use large-sample behavior to infer similar behavior for small samples. In fact, large-sample results such as the claimed asymptotic efficiency of the maximum-likelihood estimator are often uninformative for small samples. Conversely, small-sample behavior using simulations is sometimes used to imply large-sample behavior such as consistency. However, such extrapolation is often difficult. How the performance of a phylogenetic estimator scales with the addition of taxa must be qualified with respect to whether the whole tree is being estimated or a fixed subset of taxa is being estimated. It must also be qualified with respect to how tree models are sampled. Over the ensemble of all possible trees of a given size, the performance of the estimators for the whole tree estimate suffers when the tree size becomes larger. However, under certain models of cladogenesis, the estimate can improve with the addition of taxa. In fact, at all numbers of taxa there are subsets of tree models that are easier to estimate than others. This suggests that with judicious addition or subtraction of taxa we can move from tree models that are more difficult to estimate at one number of taxa to those that are easier to estimate at another number of taxa.

5.
A major problem for the identification of metabolic network models is parameter identifiability, that is, the possibility to unambiguously infer the parameter values from the data. Identifiability problems may be due to the structure of the model, in particular implicit dependencies between the parameters, or to limitations in the quantity and quality of the available data. We address the detection and resolution of identifiability problems for a class of pseudo-linear models of metabolism, so-called linlog models. Linlog models have the advantage that parameter estimation reduces to linear or orthogonal regression, which facilitates the analysis of identifiability. We develop precise definitions of structural and practical identifiability, and clarify the fundamental relations between these concepts. In addition, we use singular value decomposition to detect identifiability problems and reduce the model to an identifiable approximation by a principal component analysis approach. The criterion is adapted to real data, which are frequently scarce, incomplete, and noisy. The test of the criterion on a model with simulated data shows that it is capable of correctly identifying the principal components of the data vector. The application to a state-of-the-art dataset on central carbon metabolism in Escherichia coli yields the surprising result that only 4 out of 31 reactions, and 37 out of 100 parameters, are identifiable. This underlines the practical importance of identifiability analysis and model reduction in the modeling of large-scale metabolic networks. Although our approach has been developed in the context of linlog models, it carries over to other pseudo-linear models, such as generalized mass-action (power-law) models. Moreover, it provides useful hints for the identifiability analysis of more general classes of nonlinear models of metabolism.
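
The idea of using a singular value decomposition to detect identifiability problems in a linear regression can be sketched as follows. This is a generic illustration, not the criterion developed in the abstract: the function name, the tolerance `rel_tol`, and the toy design matrix (whose third column is the sum of the first two) are all assumptions made here.

```python
import numpy as np


def split_parameter_space(X, rel_tol=1e-8):
    """Split parameter space of y = X @ beta into directions the data constrain
    and directions they do not, using the singular values of X."""
    _, s, Vt = np.linalg.svd(X, full_matrices=True)
    # Singular values below rel_tol * largest are treated as numerically zero.
    rank = int(np.sum(s > rel_tol * s.max())) if s.size else 0
    return Vt[:rank], Vt[rank:], s


rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 20))
X = np.column_stack([x1, x2, x1 + x2])   # third regressor is redundant

identifiable, unidentifiable, singular_values = split_parameter_space(X)
# unidentifiable[0] is proportional to (1, 1, -1): adding any multiple of it
# to beta leaves X @ beta unchanged, so that combination cannot be estimated.
```

With noisy real data the cut-off between "small" and "zero" singular values becomes a judgment call, which is presumably why the abstract adapts its criterion to scarce and noisy measurements.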

6.
Statistical models of evolution are algebraic varieties in the space of joint probability distributions on the leaf colorations of a phylogenetic tree. The phylogenetic invariants of a model are the polynomials which vanish on the variety. Several widely used models for biological sequences have transition matrices that can be diagonalized by means of the Fourier transform of an abelian group. Their phylogenetic invariants form a toric ideal in the Fourier coordinates. We determine generators and Gröbner bases for these toric ideals. For the Jukes-Cantor and Kimura models on a binary tree, our Gröbner bases consist of certain explicitly constructed polynomials of degree at most four.

7.
Statistical models of evolution are algebraic varieties in the space of joint probability distributions on the leaf colorations of a phylogenetic tree. The phylogenetic invariants of a model are the polynomials which vanish on the variety. Several widely used models for biological sequences have transition matrices that can be diagonalized by means of the Fourier transform of an Abelian group. Their phylogenetic invariants form a toric ideal in the Fourier coordinates. We determine generators and Gröbner bases for these toric ideals. For the Jukes-Cantor and Kimura models on a binary tree, our Gröbner bases consist of certain explicitly constructed polynomials of degree at most four.

8.
Suchard MA. Genetics 2005, 170(1): 419-431
Horizontal gene transfer (HGT) plays a critical role in evolution across all domains of life with important biological and medical implications. I propose a simple class of stochastic models to examine HGT using multiple orthologous gene alignments. The models function in a hierarchical phylogenetic framework. The top level of the hierarchy is based on a random walk process in "tree space" that allows for the development of a joint probabilistic distribution over multiple gene trees and an unknown, but estimable species tree. I consider two general forms of random walks. The first form is derived from the subtree prune and regraft (SPR) operator that mirrors the observed effects that HGT has on inferred trees. The second form is based on walks over complete graphs and offers numerically tractable solutions for an increasing number of taxa. The bottom level of the hierarchy utilizes standard phylogenetic models to reconstruct gene trees given multiple gene alignments conditional on the random walk process. I develop a well-mixing Markov chain Monte Carlo algorithm to fit the models in a Bayesian framework. I demonstrate the flexibility of these stochastic models to test competing ideas about HGT by examining the complexity hypothesis. Using 144 orthologous gene alignments from six prokaryotes previously collected and analyzed, Bayesian model selection finds support for (1) the SPR model over the alternative form, (2) the 16S rRNA reconstruction as the most likely species tree, and (3) increased HGT of operational genes compared to informational genes.

9.
The general Markov plus invariable sites (GM+I) model of biological sequence evolution is a two-class model in which an unknown proportion of sites are not allowed to change, while the remainder undergo substitutions according to a Markov process on a tree. For statistical use it is important to know if the model is identifiable; can both the tree topology and the numerical parameters be determined from a joint distribution describing sequences only at the leaves of the tree? We establish that for generic parameters both the tree and all numerical parameter values can be recovered, up to clearly understood issues of 'label swapping'. The method of analysis is algebraic, using phylogenetic invariants to study the variety defined by the model. Simple rational formulas, expressed in terms of determinantal ratios, are found for recovering numerical parameters describing the invariable sites.

10.
In phylogenetic inference, an evolutionary model describes the substitution processes along each edge of a phylogenetic tree. Misspecification of the model has important implications for the analysis of phylogenetic data. Conventionally, however, the selection of a suitable evolutionary model is based on heuristics or relies on the choice of an approximate input tree. We introduce a method for model Selection in Phylogenetics based on linear INvariants (SPIn), which uses recent insights on linear invariants to characterize a model of nucleotide evolution for phylogenetic mixtures on any number of components. Linear invariants are constraints among the joint probabilities of the bases in the operational taxonomic units that hold irrespective of the tree topologies appearing in the mixtures. SPIn therefore requires no input tree and is designed to deal with nonhomogeneous phylogenetic data consisting of multiple sequence alignments showing different patterns of evolution, for example, concatenated genes, exons, and/or introns. Here, we report on the results of the proposed method evaluated on multiple sequence alignments simulated under a variety of single-tree and mixture settings for both continuous- and discrete-time models. In the simulations, SPIn successfully recovers the underlying evolutionary model and is shown to perform better than existing approaches.

11.
Chis OT, Banga JR, Balsa-Canto E. PLoS ONE 2011, 6(11): e27755
Analysing the properties of a biological system through in silico experimentation requires a satisfactory mathematical representation of the system including accurate values of the model parameters. Fortunately, modern experimental techniques allow obtaining time-series data of appropriate quality which may then be used to estimate unknown parameters. However, in many cases, a subset of those parameters may not be uniquely estimated, independently of the experimental data available or the numerical techniques used for estimation. This lack of identifiability is related to the structure of the model, i.e. the system dynamics plus the observation function. Despite the interest in knowing a priori whether there is any chance of uniquely estimating all model unknown parameters, the structural identifiability analysis for general non-linear dynamic models is still an open question. There is no method amenable to every model, thus at some point we have to face the selection of one of the possibilities. This work presents a critical comparison of the currently available techniques. To this end, we perform the structural identifiability analysis of a collection of biological models. The results reveal that the generating series approach, in combination with identifiability tableaus, offers the most advantageous compromise among range of applicability, computational complexity and information provided.

12.
Holzmann H, Munk A, Zucchini W. Biometrics 2006, 62(3): 934-6; discussion 936-9
We study the issue of identifiability of mixture models in the context of capture-recapture abundance estimation for closed populations. Such models are used to take account of individual heterogeneity in capture probabilities, but their validity was recently questioned by Link (2003, Biometrics 59, 1123-1130) on the basis of their nonidentifiability. We give a general criterion for identifiability of the mixing distribution, and apply it to establish identifiability within families of mixing distributions that are commonly used in this context, including finite and beta mixtures. Our analysis covers binomial and geometrically distributed outcomes. In an example we highlight the difference between the identifiability issue considered here and that in classical binomial mixture models.

13.
Statistical tests of models of DNA substitution   Total citations: 32, self-citations: 0, citations by others: 32
Summary. Penny et al. have written that "The most fundamental criterion for a scientific method is that the data must, in principle, be able to reject the model. Hardly any [phylogenetic] tree-reconstruction methods meet this simple requirement." The ability to reject models is of such great importance because the results of all phylogenetic analyses depend on their underlying models; to have confidence in the inferences, it is necessary to have confidence in the models. In this paper, a test statistic suggested by Cox is employed to test the adequacy of some statistical models of DNA sequence evolution used in the phylogenetic inference method introduced by Felsenstein. Monte Carlo simulations are used to assess significance levels. The resulting statistical tests provide an objective and very general assessment of all the components of a DNA substitution model; more specific versions of the test are devised to test individual components of a model. In all cases, the new analyses have the additional advantage that values of phylogenetic parameters do not have to be assumed in order to perform the tests.
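
The general pattern of assessing significance by Monte Carlo simulation under the null model can be sketched as below. This is a deliberately simplified stand-in, not the test in the abstract: the null model here is "equal base frequencies" for a single sequence, the statistic is a likelihood-ratio (G) statistic against the unconstrained multinomial, and all function names and counts are hypothetical.

```python
import numpy as np


def g_statistic(counts):
    """2 * (unconstrained multinomial log-likelihood - log-likelihood under
    equal base frequencies) for a vector of four base counts."""
    counts = np.asarray(counts, dtype=float)
    expected = counts.sum() / 4.0
    nonzero = counts[counts > 0]
    return float(2.0 * np.sum(nonzero * np.log(nonzero / expected)))


def monte_carlo_p_value(observed_stat, simulate_stat, n_rep=999, seed=0):
    """Fraction of simulated null statistics at least as large as the observed
    one, with the usual add-one correction."""
    rng = np.random.default_rng(seed)
    null = np.array([simulate_stat(rng) for _ in range(n_rep)])
    return (1 + int(np.sum(null >= observed_stat))) / (n_rep + 1)


def simulate_null(rng, n_sites=100):
    """Draw base counts under the null model and return their statistic."""
    return g_statistic(rng.multinomial(n_sites, [0.25] * 4))


p_skewed = monte_carlo_p_value(g_statistic([70, 10, 10, 10]), simulate_null)
p_even = monte_carlo_p_value(g_statistic([25, 25, 25, 25]), simulate_null)
```

In a phylogenetic setting the simulation step would generate whole alignments under the fitted substitution model and tree, and the statistic would be recomputed on each replicate; only the resampling logic carries over from this toy version.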

14.
Covarion models of character evolution describe inhomogeneities in substitution processes through time. In phylogenetics, such models are used to describe changing functional constraints or selection regimes during the evolution of biological sequences. In this work the identifiability of such models for generic parameters on a known phylogenetic tree is established, provided the number of covarion classes does not exceed the size of the observable state space. 'Generic parameters' as used here means all parameters except possibly those in a set of measure zero within the parameter space. Combined with earlier results, this implies both the tree and generic numerical parameters are identifiable if the number of classes is strictly smaller than the number of observable states.

15.
Neutral macroevolutionary models, such as the Yule model, give rise to a probability distribution on the set of discrete rooted binary trees over a given leaf set. Such models can provide a signal as to the approximate location of the root when only the unrooted phylogenetic tree is known, and this signal becomes relatively more significant as the number of leaves grows. In this short note, we show that among models that treat all taxa equally, and are sampling consistent (i.e. the distribution on trees is not affected by taxa yet to be included), all such models, except one (the so-called PDA model), convey some information as to the location of the ancestral root in an unrooted tree.

16.
Phylogenetic comparative methods that incorporate intraspecific variability are relatively new and, so far, not especially widely used in empirical studies. In the present short article we will describe a new Bayesian method for fitting evolutionary models to comparative data that incorporates intraspecific variability. This method differs from an existing likelihood-based approach in that it requires no a priori inference about species means and variances; rather it takes phenotypic values from individuals and a phylogenetic tree as input, and then samples species means and variances, along with the parameters of the evolutionary model, from their joint posterior probability distribution. One of the most novel and intriguing attributes of this approach is that jointly sampling the species means with the evolutionary model parameters means that the model and tree can influence our estimates of species mean trait values, not just the reverse. In the present implementation, we first apply this method to the most widely used evolutionary model for continuously valued phenotypic trait data (Brownian motion). However, the general approach has broad applicability, which we illustrate by also fitting the λ model, another simple model for quantitative trait evolution on a phylogeny. We test our approach via simulation and by analyzing two empirical datasets obtained from the literature. Finally, we have implemented the methods described herein in a new function for the R statistical computing environment, and this function will be distributed as part of the 'phytools' R library.

17.
We investigate some discrete structural properties of evolutionary trees generated under simple null models of speciation, such as the Yule model. These models have been used as priors in Bayesian approaches to phylogenetic analysis, and also to test hypotheses concerning the speciation process. In this paper we describe new results for three properties of trees generated under such models. Firstly, for a rooted tree generated by the Yule model we describe the probability distribution on the depth (number of edges from the root) of the most recent common ancestor of a random subset of k species. Next we show that, for trees generated under the Yule model, the approximate position of the root can be estimated from the associated unrooted tree, even for trees with a large number of leaves. Finally, we analyse a biologically motivated extension of the Yule model and describe its distribution on tree shapes when speciation occurs in rapid bursts.
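
For readers unfamiliar with the Yule model, the following minimal sketch generates a rooted tree shape by repeatedly splitting a leaf chosen uniformly at random. It only illustrates the generating process; the data structure (a dictionary from internal node id to its two children) and the function names are choices made here, not anything taken from the paper.

```python
import random


def yule_tree(n_leaves, seed=None):
    """Grow a rooted binary tree shape under the Yule process: at each step,
    pick a current leaf uniformly at random and give it two children."""
    rng = random.Random(seed)
    children = {}      # internal node id -> (left child id, right child id)
    leaves = [0]       # node 0 is the root and starts out as the only leaf
    next_id = 1
    while len(leaves) < n_leaves:
        parent = leaves.pop(rng.randrange(len(leaves)))
        children[parent] = (next_id, next_id + 1)
        leaves.extend([next_id, next_id + 1])
        next_id += 2
    return children, leaves


def leaf_count(children, node):
    """Number of leaves in the subtree rooted at node."""
    if node not in children:
        return 1
    left, right = children[node]
    return leaf_count(children, left) + leaf_count(children, right)


children, leaves = yule_tree(10, seed=42)
left, right = children[0]
print(leaf_count(children, left), leaf_count(children, right))
```

Repeating the simulation many times and tabulating the two subtree sizes at the root gives an empirical view of how balanced Yule trees tend to be, which is the kind of shape information a root-location signal would rely on.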

18.
Tests of applicability of several substitution models for DNA sequence data   Total citations: 8, self-citations: 3, citations by others: 5
Using linear invariants for various models of nucleotide substitution, we developed test statistics for examining the applicability of a specific model to a given dataset in phylogenetic inference. The models examined are those developed by Jukes and Cantor (1969), Kimura (1980), Tajima and Nei (1984), Hasegawa et al. (1985), Tamura (1992), Tamura and Nei (1993), and a new model called the eight-parameter model. The first six models are special cases of the last model. The test statistics developed are independent of evolutionary time and phylogeny, although the variances of the statistics contain phylogenetic information. Therefore, these statistics can be used before a phylogenetic tree is estimated. Our objective is to find the simplest model that is applicable to a given dataset, keeping in mind that a simple model usually gives an estimate of evolutionary distance (number of nucleotide substitutions per site) with a smaller variance than a complicated model when the simple model is correct. We have also developed a statistical test of the homogeneity of nucleotide frequencies of a sample of several sequences that takes into account possible phylogenetic correlations. This test is used to examine the stationarity in time of the base frequencies in the sample. For Hasegawa et al.'s and the eight-parameter models, analytical formulas for estimating evolutionary distances are presented. Application of the above tests to several sets of real data has shown that the assumption of stationarity of base composition is usually acceptable when the sequences studied are closely related but otherwise it is rejected. Similarly, the simple models of nucleotide substitution are almost always rejected when actual genes are distantly related and/or the total number of nucleotides examined is large.

19.
In historical biogeography, phylogenetic trees have long been used as tools for addressing a wide range of inference problems, from explaining common distribution patterns of species to reconstructing ancestral geographic ranges on branches of the tree of life. However, the potential utility of phylogenies for this purpose has yet to be fully realized, due in part to a lack of explicit conceptual links between processes underlying the evolution of geographic ranges and processes of phylogenetic tree growth. We suggest that statistical approaches that use parametric models to forge such links will stimulate integration and propel hypothesis-driven biogeographical inquiry in new directions. We highlight here two such approaches and describe how they represent early steps towards a more general framework for model-based historical biogeography that is based on likelihood as an optimality criterion, rather than having the traditional reliance on parsimony. The development of this framework will not be without significant challenges, particularly in balancing model complexity with statistical power, and these will be most apparent in studies of regions with many component areas and complex geological histories, such as the Mediterranean Basin.

20.
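Exploring the pathogenesis of cancer using CGH data and tree models   Total citations: 1, self-citations: 0, citations by others: 1
Li Xiaobo, Chen Jian, Lü Bingjian, Lai Maode. Yi Chuan (遗传) 2008, 30(4): 407-412
Comparative genomic hybridization (CGH) is mainly used to detect chromosomal deletions and amplifications in tumors. A large body of experimental data has accumulated to date, making genome-wide analysis of the mechanisms of tumorigenesis possible. In bioinformatics, tree models are usually used to study the history of the formation and evolution of organisms, and evolutionary relationships among species are commonly represented by phylogenetic trees. Tree models can likewise serve as a powerful bioinformatics tool for analyzing CGH data and exploring the pathogenesis of cancer. This article introduces two common tree models, the branching tree and the distance tree, describes in detail the basic principles and methods of reconstructing tree models, analyzes several technical issues to keep in mind when building them, and reviews their application in tumor research. As a generalization of the single-path linear model, the tree model of tumors overcomes the shortcomings of earlier single-path linear models and can, in theory, capture more accurately the multi-gene, multi-pathway, multi-stage pattern of tumor initiation and progression, allowing the molecular mechanisms of tumorigenesis to be explored from different angles. Besides tumor CGH data, the model can also be used to analyze many other types of data, including high-resolution data produced by techniques such as array-based CGH (array-CGH).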
