Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
This study is a phylogenetic analysis of the avian family Ciconiidae, the storks, based on two molecular data sets: 1065 base pairs of sequence from the mitochondrial cytochrome b gene and a complete matrix of single-copy nuclear DNA–DNA hybridization distances. Sixteen of the nineteen stork species were included in the cytochrome b data matrix, and fifteen in the DNA–DNA hybridization matrix. Both matrices included outgroups from the families Cathartidae (New World vultures) and Threskiornithidae (ibises, spoonbills). Optimal trees based on the two data sets were congruent in those nodes with strong bootstrap support. In the best-fit tree based on DNA–DNA hybridization distances, nodes defining relationships among very recently diverged species had low bootstrap support, while nodes defining more distant relationships had strong bootstrap support. In the optimal trees based on the sequence data, nodes defining relationships among recently diverged species had strong bootstrap support, while nodes defining basal relationships in the family had weak support and were incongruent among analyses. A combinable-component consensus of the best-fit DNA–DNA hybridization tree and a consensus tree based on different analyses of the cytochrome b sequences provides the best estimate of relationships among stork species based on the two data sets.

2.
Phylogenetic studies incorporating multiple loci, and multiple genomes, are becoming increasingly common. Coincident with this trend in genetic sampling, model-based likelihood techniques including Bayesian phylogenetic methods continue to gain popularity. Few studies, however, have examined model fit and sensitivity to such potentially heterogeneous data partitions within combined data analyses using empirical data. Here we investigate the relative model fit and sensitivity of Bayesian phylogenetic methods when alternative site-specific partitions of among-site rate variation (with and without autocorrelated rates) are considered. Our primary goal in choosing a best-fit model was to employ the simplest model that was a good fit to the data while optimizing topology and/or Bayesian posterior probabilities. Thus, we were not interested in complex models that did not practically affect our interpretation of the topology under study. We applied these alternative models to a four-gene data set including one protein-coding nuclear gene (c-mos), one protein-coding mitochondrial gene (ND4), and two mitochondrial rRNA genes (12S and 16S) for the diverse yet poorly known lizard family Gymnophthalmidae. Our results suggest that the best-fit model partitioned among-site rate variation separately among the c-mos, ND4, and 12S + 16S gene regions. We found this model yielded identical topologies to those from analyses based on the GTR+I+G model, but significantly changed posterior probability estimates of clade support. This partitioned model also produced more precise (less variable) estimates of posterior probabilities across generations of long Bayesian runs, compared to runs employing a GTR+I+G model estimated for the combined data. We use this three-way gamma partitioning in Bayesian analyses to reconstruct a robust phylogenetic hypothesis for the relationships of genera within the lizard family Gymnophthalmidae. 
We then reevaluate the higher-level taxonomic arrangement of the Gymnophthalmidae. Based on our findings, we discuss the utility of nontraditional parameters for modeling among-site rate variation and the implications and future directions for complex model building and testing.

3.
Studies have been undertaken to explore the applicability of different kinetic models for the performance appraisal of upflow anaerobic sludge blanket (UASB) reactors treating wastewater in the range of 300-4000 mg COD/l. Three kinetic models, namely the Monod, Grau second-order, and Haldane models, are considered for the analysis. Both linear and nonlinear regressions have been performed to identify the best fit among the kinetic models. In this process, five error analysis methods have been used to analyze the data. Apart from optimization of kinetic coefficients with minimization of associated errors, prediction of effluent COD has also been undertaken to verify the applicability of the kinetic models. In both cases, the Grau second-order model is found to give the best fit for a wide range of data sets in UASB reactors.
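
For orientation, the three kinetic models named in this abstract can be sketched in their common textbook forms. This is only an illustration: the abstract does not give the equations, so the exact parameterization used in the study is an assumption, and all function and parameter names below (`mu_max`, `k_s`, `k_i`, `a`, `b`) are hypothetical.

```python
def monod_rate(s, mu_max, k_s):
    """Monod rate: mu_max * S / (K_s + S)."""
    return mu_max * s / (k_s + s)


def haldane_rate(s, mu_max, k_s, k_i):
    """Haldane (substrate-inhibition) rate: mu_max * S / (K_s + S + S^2 / K_i)."""
    return mu_max * s / (k_s + s + s * s / k_i)


def grau_effluent(s0, hrt, a, b):
    """Effluent concentration Se from the linearised Grau second-order form
    S0 * HRT / (S0 - Se) = a + b * HRT, solved for Se.

    s0 is the influent concentration (e.g. mg COD/l) and hrt the hydraulic
    retention time; a and b are the fitted Grau constants (assumed names).
    """
    return s0 * (1.0 - hrt / (a + b * hrt))
```

With an inhibition constant `k_i` much larger than the substrate concentration, the Haldane rate approaches the Monod rate, which is a quick sanity check when fitting both.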

4.
Despite the proliferation of increasingly sophisticated models of DNA sequence evolution, choosing among models remains a major problem in phylogenetic reconstruction. The choice of appropriate models is thought to be especially important when there is large variation among branch lengths. We evaluated the ability of nested models to reconstruct experimentally generated, known phylogenies of bacteriophage T7 as we varied the terminal branch lengths. Then, for each phylogeny we determined the best-fit model by progressively adding parameters to simpler models. We found that in several cases the choice of best-fit model was affected by the parameter addition sequence. In terms of phylogenetic performance, there was little difference between models when the ratio of short:long terminal branches was 1:3 or less. However, under conditions of extreme terminal branch-length variation, there were not only dramatic differences among models, but best-fit models were always among the best at overcoming long-branch attraction. The performance of minimum-evolution-distance methods was generally lower than that of discrete maximum-likelihood methods, even if maximum-likelihood methods were used to generate distance matrices. Correcting for among-site rate variation was especially important for overcoming long-branch attraction. The generality of our conclusions is supported by earlier simulation studies and by a preliminary analysis of mitochondrial and nuclear sequences from a well-supported four-taxon amniote phylogeny.

5.

Background

Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled.

Results

The most time consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is the alignment-splitting algorithm that splits alignments with many sequences into non-overlapping sub-alignments prior to estimating amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure was an order of magnitude faster than without splitting. Importantly, there was no apparent loss in matrix quality if an appropriate splitting procedure is used.

Conclusions

FastMG is a simple, fast and accurate procedure to estimate amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships for specific groups of proteins or taxa with optimized, data-specific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com.
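
The idea of splitting a large alignment into non-overlapping sub-alignments can be illustrated with a minimal sketch. The abstract does not say how FastMG chooses its splits, so the random partition below is an assumption made purely for illustration, and `split_alignment`, `max_size` and `seed` are hypothetical names rather than part of the FastMG programs.

```python
import random


def split_alignment(seq_ids, max_size, seed=0):
    """Partition sequence identifiers into non-overlapping groups.

    Each identifier lands in exactly one group, and no group holds more
    than max_size sequences. The shuffle makes this a random split, which
    is an illustrative stand-in and not the published splitting rule.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    ids = list(seq_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i:i + max_size] for i in range(0, len(ids), max_size)]
```

Each returned group would then be treated as its own sub-alignment, so that tree building runs on many small problems instead of one large one.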

6.
Species–climate ‘envelope’ models are widely used to evaluate potential climate change impacts upon species and biodiversity. Previous studies have used a variety of methods to fit models, making it difficult to assess relative model performance for different taxonomic groups, life forms or trophic levels. Here we use the same climatic data and modelling approach for 306 European species representing three major taxa (higher plants, insects and birds), and including species of different life form and from four trophic levels. Goodness‐of‐fit measures showed that useful models were fitted for >96% of species, and that model performance was related neither to major taxonomic group nor to trophic level. These results confirm that such climate envelope models provide the best approach currently available for reliably evaluating the potential impacts of future climate change upon biodiversity.

7.
The distribution of selection coefficients of new mutations is of key interest in population genetics. In this paper we explore how codon-based likelihood models can be used to estimate the distribution of selection coefficients of new amino acid replacement mutations from phylogenetic data. To obtain such estimates we assume that all mutations at the same site have the same selection coefficient. We first estimate the distribution of selection coefficients from two large viral data sets under the assumption that the viral population size is the same along all lineages of the phylogeny and that the selection coefficients vary among sites. We then implement several new models in which the lineages of the phylogeny may have different population sizes. We apply the new models to a data set consisting of the coding regions from eight primate mitochondrial genomes. The results suggest that there might be little power to determine the exact shape of the distribution of selection coefficients, but that the normal and gamma distributions fit the data significantly better than the exponential distribution.

8.
9.
Codon substitution models have traditionally been parametric Markov models, but recently, empirical and semiempirical models also have been proposed. Parametric codon models are typically based on 61×61 rate matrices that are derived from a small number of parameters. These parameters are rooted in experience and theoretical considerations and generally show good performance but are still relatively arbitrary. We have previously used principal component analysis (PCA) on data obtained from mammalian sequence alignments to empirically identify the most relevant parameters for codon substitution models, thereby confirming some commonly used parameters but also suggesting new ones. Here, we present a new semiempirical codon substitution model that is directly based on those PCA results. The substitution rate matrix is constructed from linear combinations of the first few (the most important) principal components with the coefficients being free model parameters. Thus, the model is not only based on empirical rates but also uses the empirically determined most relevant parameters for a codon model to adjust to the particularities of individual data sets. In comparisons against established parametric and semiempirical models, the new model consistently achieves the highest likelihood values when applied to sequences of vertebrates, which include the taxonomic class on which the model was trained.

10.
11.
The relationship between species abundance, the variance of the number of individuals, and species occupancy is a fundamental ecological characteristic of a community. Moreover, this relationship varies across scales, and any model for the variance-occupancy-abundance (VOA) relationship has to address its scale dependency in a consistent way. In this study, point-process theory was used to define a multiscale model that jointly predicts the VOA relationship across scales in a consistent way. This provides a tool to jointly analyze data sets collected at different scales and to give insights into the biological processes underlying the VOA relationship. This model can also account for different types of individual spatial pattern (clustered, random, or regular). Three stand-mapping data sets of tree species in tropical rain forests were used to assess the relevance of this model. When compared with four existing models, the model based on point-process theory provided the best fit to the data and was most often ranked as the model with the best predictive performance.

12.
A general comparison of relaxed molecular clock models (total citations: 4; self-citations: 0; citations by others: 4)
Several models have been proposed to relax the molecular clock in order to estimate divergence times. However, it is unclear which model has the best fit to real data and should therefore be used to perform molecular dating. In particular, we do not know whether rate autocorrelation should be considered or which prior on divergence times should be used. In this work, we propose a general benchmark of alternative relaxed clock models. We have reimplemented most of the already existing models, including the popular lognormal model, as well as various prior choices for divergence times (birth-death, Dirichlet, uniform), in a common Bayesian statistical framework. We also propose a new autocorrelated model, called the "CIR" process, with well-defined stationary properties. We assess the relative fitness of these models and priors, when applied to 3 different protein data sets from eukaryotes, vertebrates, and mammals, by computing Bayes factors using a numerical method called thermodynamic integration. We find that the 2 autocorrelated models, CIR and lognormal, have a similar fit and clearly outperform uncorrelated models on all 3 data sets. In contrast, the optimal choice for the divergence time prior is more dependent on the data investigated. Altogether, our results provide useful guidelines for model choice in the field of molecular dating while opening the way to more extensive model comparisons.
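
As a rough illustration of thermodynamic integration, the log marginal likelihood of a model can be written as the integral, over an inverse temperature beta from 0 to 1, of the expected log-likelihood under the corresponding power posterior. The sketch below approximates that integral with the trapezoidal rule. It assumes the per-temperature averages have already been obtained from MCMC runs, and it is a generic form of the technique, not necessarily the integration path or scheme used in this study; the function names are hypothetical.

```python
def log_marginal_likelihood(betas, mean_log_liks):
    """Trapezoidal estimate of ln Z = integral over beta in [0, 1] of E_beta[ln L].

    betas: increasing inverse temperatures, starting at 0 and ending at 1.
    mean_log_liks: average log-likelihood sampled at each temperature.
    """
    if len(betas) != len(mean_log_liks) or len(betas) < 2:
        raise ValueError("need matching sequences with at least two points")
    total = 0.0
    for i in range(1, len(betas)):
        width = betas[i] - betas[i - 1]
        total += 0.5 * width * (mean_log_liks[i] + mean_log_liks[i - 1])
    return total


def log_bayes_factor(log_z_a, log_z_b):
    """Log Bayes factor of model A over model B from their log marginal likelihoods."""
    return log_z_a - log_z_b
```

A positive log Bayes factor favours model A; in practice the accuracy depends heavily on how finely the temperature ladder is spaced near beta = 0.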

13.
Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These impact the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from TREEBASE. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where in estimations we use the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only. 
Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
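
In a mixture of replacement matrices, the likelihood of each site is a weighted sum of its likelihoods under the individual components, and the alignment log-likelihood is the sum of the per-site logs. The sketch below shows only that final combination step, assuming the per-site, per-component log-likelihoods have already been computed by a phylogenetic program; `mixture_log_likelihood` and its arguments are hypothetical names and not part of the PhyML implementation mentioned above.

```python
import math


def mixture_log_likelihood(site_log_liks, weights):
    """Combine per-component site log-likelihoods into a mixture log-likelihood.

    site_log_liks[i][k] is the log-likelihood of site i under component k.
    weights[k] is the mixture weight of component k; weights must be
    strictly positive and sum to 1. A log-sum-exp keeps the sum stable.
    """
    total = 0.0
    for per_component in site_log_liks:
        terms = [math.log(w) + ll for w, ll in zip(weights, per_component)]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total
```

With a single component of weight 1 this reduces to the ordinary sum of site log-likelihoods, which is the single-matrix case the abstract compares against.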

14.
This paper presents a pipeline, implemented in an open‐source program called GB→TNT (GenBank‐to‐TNT), for creating large molecular matrices, starting from GenBank files and finishing with TNT matrices which incorporate taxonomic information in the terminal names. GB→TNT is designed to retrieve a defined genomic region from a bulk of sequences included in a GenBank file. The user defines the genomic region to be retrieved and several filters (genome, length of the sequence, taxonomic group, etc.); each genomic region represents a different data block in the final TNT matrix. GB→TNT first generates Fasta files from the input GenBank files, then creates an alignment for each of those (by calling an alignment program), and finally merges all the aligned files into a single TNT matrix. The new version of TNT can make use of the taxonomic information contained in the terminal names, allowing easy diagnosis of results, evaluation of fit between the trees and the taxonomy, and automatic labelling or colouring of tree branches with the taxonomic groups they represent. © The Willi Hennig Society 2012.

15.
The accumulation of body mass, as growth, is fundamental to all organisms. Being able to understand which model(s) best describe this growth trajectory, both empirically and ultimately mechanistically, is an important challenge. A variety of equations have been proposed to describe growth during ontogeny. Recently, the West Brown Enquist (WBE) equation, formulated as part of the metabolic theory of ecology, has been proposed as a universal model of growth. This equation has the advantage of having a biological basis, but its ability to describe invertebrate growth patterns has not been well tested against other, simpler models. In this study, we collected data for 58 species of marine invertebrate from 15 different taxa. The data were fitted to three growth models (power, exponential and WBE), and their abilities were examined using an information theoretic approach. Using Akaike information criteria, we found changes in mass through time to fit an exponential equation form best (in approx. 73% of cases). The WBE model predominantly overestimates body size in early ontogeny and underestimates it in later ontogeny; it was the best fit in approximately 14% of cases. The exponential model described growth well in nine taxa, whereas the WBE described growth well in one of the 15 taxa, the Amphipoda. Although the WBE has the advantage of being developed with an underlying proximate mechanism, it provides a poor fit to the majority of marine invertebrates examined here, including species with determinate and indeterminate growth types. In the original formulation of the WBE model, it was tested almost exclusively against vertebrates, to which it fitted well; the model does not, however, appear to be universal given its poor ability to describe growth in benthic or pelagic marine invertebrates.
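
An information-theoretic comparison of this kind can be sketched as follows. The example assumes least-squares fits with Gaussian errors, for which AIC can be computed (up to an additive constant) from the residual sum of squares, and then converts AIC values to Akaike weights. Whether the study used this exact form or a small-sample correction is not stated in the abstract, and the function names are hypothetical.

```python
import math


def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit, up to an additive constant.

    rss: residual sum of squares; n: number of observations;
    k: number of estimated parameters (conventions differ on whether
    the error variance is counted, so use one convention throughout).
    """
    return n * math.log(rss / n) + 2 * k


def akaike_weights(aic_values):
    """Map {model name: AIC} to {model name: Akaike weight}.

    Weights are relative likelihoods exp(-delta / 2) normalised to sum to 1,
    where delta is each model's AIC minus the smallest AIC.
    """
    best = min(aic_values.values())
    rel = {name: math.exp(-0.5 * (a - best)) for name, a in aic_values.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}
```

For example, calling `akaike_weights({"power": 104.0, "exponential": 100.0, "WBE": 102.0})` with these made-up AIC values gives the largest weight to the exponential model, since it has the lowest AIC.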

16.
Hansen’s disease (leprosy) elimination has proven difficult in several countries, including Brazil, and there is a need for a mathematical model that can predict control program efficacy. This study applied the Approximate Bayesian Computation algorithm to fit 6 different proposed models to each of the 5 regions of Brazil, then fitted hierarchical models based on the best-fit regional models to the entire country. The best model proposed for most regions was a simple model. Posterior checks found that the model results were more similar to the observed incidence after fitting than before, and that parameters varied slightly by region. Current control programs were predicted to require additional measures to eliminate Hansen’s disease as a public health problem in Brazil.
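
Approximate Bayesian Computation comes in several variants, and the abstract does not say which one was used. The simplest, rejection ABC, is sketched below as an assumption-laden illustration: parameters are drawn from the prior, data are simulated, and a draw is kept only if the simulated data fall within a tolerance of the observations. The function name and the callables passed in are all hypothetical.

```python
import random


def abc_rejection(observed, simulate, sample_prior, distance, tolerance,
                  n_draws, seed=0):
    """Rejection ABC: return the prior draws whose simulations match the data.

    simulate(theta, rng) returns simulated data for parameter theta.
    sample_prior(rng) returns one draw from the prior.
    distance(sim, obs) returns a non-negative discrepancy.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        if distance(simulate(theta, rng), observed) <= tolerance:
            accepted.append(theta)
    return accepted
```

The accepted draws approximate the posterior; a smaller tolerance gives a closer approximation at the cost of fewer accepted samples.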

17.
Modeling plant growth using functional traits is important for understanding the mechanisms that underpin growth and for predicting new situations. We use three data sets on plant height over time and two validation methods (in‐sample model fit and leave‐one‐species‐out cross‐validation) to evaluate non‐linear growth model predictive performance based on functional traits. In‐sample measures of model fit differed substantially from out‐of‐sample model predictive performance; the best fitting models were rarely the best predictive models. Careful selection of predictor variables reduced the bias in parameter estimates, and there was no single best model across our three data sets. Testing and comparing multiple model forms is important. We developed an R package with a formula interface for straightforward fitting and validation of hierarchical, non‐linear growth models. Our intent is to encourage thorough testing of multiple growth model forms and an increased emphasis on assessing model fit relative to a model's purpose.

18.
Grote MN. Genetics 2007, 176(4): 2405-2420
I derive a covariance structure model for pairwise linkage disequilibrium (LD) between binary markers in a recently admixed population and use a generalized least-squares method to fit the model to two different data sets. Both linked and unlinked marker pairs are incorporated in the model. Under the model, a pairwise LD matrix is decomposed into two component matrices, one containing LD attributable to admixture, and another containing, in an aggregate form, LD specific to the populations forming the mixture. I use population genetics theory to show that the latter matrix has block-diagonal structure. For the data sets considered here, I show that the number of source populations can be determined by statistical inference on the canonical correlations of the sample LD matrix.

19.
Soil carbon saturation: concept, evidence and evaluation (total citations: 20; self-citations: 0; citations by others: 20)
Current estimates of soil C storage potential are based on models or factors that assume linearity between C input levels and C stocks at steady-state, implying that SOC stocks could increase without limit as C input levels increase. However, some soils show little or no increase in steady-state SOC stock with increasing C input levels suggesting that SOC can become saturated with respect to C input. We used long-term field experiment data to assess alternative hypotheses of soil carbon storage by three simple models: a linear model (no saturation), a one-pool whole-soil C saturation model, and a two-pool mixed model with C saturation of a single C pool, but not the whole soil. The one-pool C saturation model best fit the combined data from 14 sites, four individual sites were best-fit with the linear model, and no sites were best fit by the mixed model. These results indicate that existing agricultural field experiments generally have too small a range in C input levels to show saturation behavior, and verify the accepted linear relationship between soil C and C input used to model SOM dynamics. However, all sites combined and the site with the widest range in C input levels were best fit with the C-saturation model. Nevertheless, the same site produced distinct effective stabilization capacity curves rather than an absolute C saturation level. We conclude that the saturation of soil C does occur and therefore the greatest efficiency in soil C sequestration will be in soils further from C saturation.
Catherine E. Stewart

20.
Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x.


Copyright © 北京勤云科技发展有限公司. 京ICP备09084417号