Similar articles
 20 similar articles found (search time: 15 ms)
1.
KaKs_Calculator is a software package that calculates nonsynonymous (Ka) and synonymous (Ks) substitution rates through model selection and model averaging. Since existing methods for this estimation adopt their specific mutation (substitution) models that consider different evolutionary features, leading to diverse estimates, KaKs_Calculator implements a set of candidate models in a maximum likelihood framework and adopts the Akaike information criterion to measure fitness between models and data, aiming to include as many features as needed for accurately capturing evolutionary information in protein-coding sequences. In addition, several existing methods for calculating Ka and Ks are also incorporated into this software. KaKs_Calculator, including source codes, compiled executables, and documentation, is freely available for academic use at http://evolution.genomics.org.cn/software.htm.

2.
In phylogenetic analyses of molecular sequence data, partitioning involves estimating independent models of molecular evolution for different sets of sites in a sequence alignment. Choosing an appropriate partitioning scheme is an important step in most analyses because it can affect the accuracy of phylogenetic reconstruction. Despite this, partitioning schemes are often chosen without explicit statistical justification. Here, we describe two new objective methods for the combined selection of best-fit partitioning schemes and nucleotide substitution models. These methods allow millions of partitioning schemes to be compared in realistic time frames and so permit the objective selection of partitioning schemes even for large multilocus DNA data sets. We demonstrate that these methods significantly outperform previous approaches, including both the ad hoc selection of partitioning schemes (e.g., partitioning by gene or codon position) and a recently proposed hierarchical clustering method. We have implemented these methods in an open-source program, PartitionFinder. This program allows users to select partitioning schemes and substitution models using a range of information-theoretic metrics (e.g., the Bayesian information criterion, Akaike information criterion [AIC], and corrected AIC). We hope that PartitionFinder will encourage the objective selection of partitioning schemes and thus lead to improvements in phylogenetic analyses. PartitionFinder is written in Python and runs under Mac OSX 10.4 and above. The program, source code, and a detailed manual are freely available from www.robertlanfear.com/partitionfinder.
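
The three information-theoretic metrics named in this abstract can be computed from a fitted model's log-likelihood. The sketch below is a minimal illustration and is not taken from PartitionFinder: the function name `information_criteria` is hypothetical, and treating the number of alignment sites as the sample size is an assumption that is common in phylogenetics but not stated in the abstract.

```python
import math


def information_criteria(log_likelihood, num_params, sample_size):
    """Return AIC, AICc and BIC for one fitted model.

    log_likelihood: maximized log-likelihood (lnL) of the model
    num_params:     number of free parameters (k)
    sample_size:    number of observations (n); assumed here to be
                    the number of alignment sites
    """
    k, n = num_params, sample_size
    if n - k - 1 <= 0:
        # The small-sample correction is undefined in this case.
        raise ValueError("AICc requires sample_size > num_params + 1")
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Illustrative numbers only, not results from any real data set.
scores = information_criteria(log_likelihood=-1000.0, num_params=10, sample_size=500)
```

Under all three criteria a lower score indicates a better trade-off between fit and complexity. BIC penalizes extra parameters more heavily than AIC once n exceeds about 7, so it tends to favor simpler schemes.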

3.
Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or nonnested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
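
AIC-based model averaging, as mentioned in this abstract, is usually done with Akaike weights. The sketch below shows the standard weight formula and a weighted average of a parameter estimate. The function names, the model labels and all numeric values are hypothetical placeholders chosen for illustration; none of them come from the study.

```python
import math


def akaike_weights(aic_scores):
    """Convert a {model: AIC} mapping into Akaike weights that sum to 1."""
    best = min(aic_scores.values())
    # Relative likelihood of each model given its AIC difference to the best one.
    relative = {m: math.exp(-0.5 * (aic - best)) for m, aic in aic_scores.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}


def model_average(weights, estimates):
    """Weighted average of one parameter estimated under every model."""
    return sum(weights[m] * estimates[m] for m in weights)


# Made-up AIC scores and parameter estimates for three candidate models.
aic_scores = {"JC": 2040.0, "HKY": 2022.0, "GTR": 2020.0}
weights = akaike_weights(aic_scores)
averaged = model_average(weights, {"JC": 0.10, "HKY": 0.12, "GTR": 0.13})
```

The weights also express model selection uncertainty: when no single model holds most of the weight, an averaged estimate is easier to defend than one taken from the top-ranked model alone.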

4.
Selecting the best-fit model of nucleotide substitution   Total citations: 2, self-citations: 0, citations by others: 2
Despite the relevant role of models of nucleotide substitution in phylogenetics, choosing among different models remains a problem. Several statistical methods for selecting the model that best fits the data at hand have been proposed, but their absolute and relative performance has not yet been characterized. In this study, we compare under various conditions the performance of different hierarchical and dynamic likelihood ratio tests, and of Akaike and Bayesian information methods, for selecting best-fit models of nucleotide substitution. We specifically examine the role of the topology used to estimate the likelihood of the different models and the importance of the order in which hypotheses are tested. We do this by simulating DNA sequences under a known model of nucleotide substitution and recording how often this true model is recovered by the different methods. Our results suggest that model selection is reasonably accurate and indicate that some likelihood ratio test methods perform overall better than the Akaike or Bayesian information criteria. The tree used to estimate the likelihood scores does not influence model selection unless it is a randomly chosen tree. The order in which hypotheses are tested, and the complexity of the initial model in the sequence of tests, influence model selection in some cases. Model fitting in phylogenetics has been suggested for many years, yet many authors still arbitrarily choose their models, often using the default models implemented in standard computer programs for phylogenetic estimation. We show here that a best-fit model can be readily identified. Consequently, given the relevance of models, model fitting should be routine in any phylogenetic analysis that uses models of evolution.
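
A single step of a hierarchical likelihood ratio test can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name `lrt_one_df` and the log-likelihood values are hypothetical, and the sketch is restricted to nested models that differ by exactly one free parameter (for example JC69 versus K80) so that the chi-square tail probability can be computed with the standard library alone.

```python
import math


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the test statistic 2 * (lnL_alt - lnL_null) and its p-value
    under a chi-square distribution with one degree of freedom.
    """
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up log-likelihoods for a simpler (null) and a richer (alternative) model.
statistic, p_value = lrt_one_df(lnl_null=-1002.0, lnl_alt=-1000.0)
reject_null = p_value < 0.05
```

In a hierarchical scheme this test is repeated along a fixed sequence of model pairs, which is why the order of the tests and the starting model can influence the outcome, as the abstract notes.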

5.
Longitudinal data are common in clinical trials and observational studies, where missing outcomes due to dropouts are always encountered. Under such context with the assumption of missing at random, the weighted generalized estimating equation (WGEE) approach is widely adopted for marginal analysis. Model selection on marginal mean regression is a crucial aspect of data analysis, and identifying an appropriate correlation structure for model fitting may also be of interest and importance. However, the existing information criteria for model selection in WGEE have limitations, such as separate criteria for the selection of marginal mean and correlation structures, unsatisfactory selection performance in small‐sample setups, and so forth. In particular, there are few studies to develop joint information criteria for selection of both marginal mean and correlation structures. In this work, by embedding empirical likelihood into the WGEE framework, we propose two innovative information criteria named a joint empirical Akaike information criterion and a joint empirical Bayesian information criterion, which can simultaneously select the variables for marginal mean regression and also correlation structure. Through extensive simulation studies, these empirical‐likelihood‐based criteria exhibit robustness, flexibility, and outperformance compared to the other criteria including the weighted quasi‐likelihood under the independence model criterion, the missing longitudinal information criterion, and the joint longitudinal information criterion. In addition, we provide a theoretical justification of our proposed criteria, and present two real data examples in practice for further illustration.

6.
Steel demonstrated that the maximum-likelihood function for a phylogenetic tree may have multiple local maxima. If this phenomenon were general, it would compromise the applicability of maximum likelihood as an optimality criterion for phylogenetic trees. In several simulation studies reported on in this paper, the true tree, and other trees of very high likelihood, rarely had multiple maxima. Our results thus provide reassurance that the value of maximum likelihood as a tree selection criterion is not compromised by the presence of multiple local maxima--the best estimates of the true tree are not likely to have them. This result holds true even when an incorrect nucleotide substitution model is used for tree selection.

7.
A few worker rehabilitation programs have had outstanding success in improving ability to function for persons with occupational back pain. Local programs must show that they have similar success. Because the definitions of terms such as "back school," "work hardening," and "functional restoration" are blurred at a local level, the choice of a program for an individual patient must depend primarily on the program's demonstrated success rate with similar patients. The chances of returning to work decrease as a function of time after injury. Therefore, referring physicians, insurers, and employers must be provided with information regarding results in terms of acute (0 to 6 weeks), subacute (7 to 12 weeks), and chronic (more than 12 weeks) back pain. Other important variables include selection criteria, program cost, and dropout rate. We advocate standardized reporting of such data for all worker rehabilitation programs. A model "report to consumers," described here, is a minimal obligation. The validity of a number of important internal quality assurance issues is uncertain. Ethical and legal pressures must be recognized.

8.
Kinney SK, Dunson DB. Biometrics 2007, 63(3): 690-698
We address the problem of selecting which variables should be included in the fixed and random components of logistic mixed effects models for correlated data. A fully Bayesian variable selection is implemented using a stochastic search Gibbs sampler to estimate the exact model-averaged posterior distribution. This approach automatically identifies subsets of predictors having nonzero fixed effect coefficients or nonzero random effects variance, while allowing uncertainty in the model selection process. Default priors are proposed for the variance components and an efficient parameter expansion Gibbs sampler is developed for posterior computation. The approach is illustrated using simulated data and an epidemiologic example.

9.
Summary: TOPALi v2 simplifies and automates the use of several methods for the evolutionary analysis of multiple sequence alignments. Jobs are submitted from a Java graphical user interface as TOPALi web services to either run remotely on high-performance computing clusters or locally (with multiple cores supported). Methods available include model selection and phylogenetic tree estimation using the Bayesian inference and maximum likelihood (ML) approaches, in addition to recombination detection methods. The optimal substitution model can be selected for protein or nucleic acid (standard, or protein-coding using a codon position model) data using accurate statistical criteria derived from ML co-estimation of the tree and the substitution model. Phylogenetic software available includes PhyML, RAxML and MrBayes. Availability: Freely downloadable from http://www.topali.org for Windows, Mac OS X, Linux and Solaris. Contact: iain.milne{at}scri.ac.uk Associate Editor: Martin Bishop

10.
GCHap quickly finds maximum likelihood estimates (MLEs) of frequencies of haplotypes given genotype information on a random sample of individuals. It uses the gene counting method but by excluding haplotypes with zero MLE at an early stage, this implementation uses many orders of magnitude less space and time than naive implementations. A second program, ApproxGCHap, is provided to give alternate estimates for data sets with large numbers of loci or large amounts of missing genotypes. AVAILABILITY: The Java classes and Javadocs pages for GCHap can be obtained from bioinformatics.med.utah.edu/~alun
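
The gene counting method referred to in this abstract is an expectation-maximization (EM) iteration over the possible phasings of each genotype. The sketch below is the naive version that the abstract contrasts GCHap with: it enumerates every phasing and does not prune haplotypes with zero MLE. It is not GCHap's code; the function names, the genotype encoding (one two-character string per locus) and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def phase_pairs(genotype):
    """Yield every ordered haplotype pair compatible with an unphased genotype.

    genotype is a tuple with one two-character string per locus, e.g. ("Aa", "Bb").
    """
    for orientation in product((0, 1), repeat=len(genotype)):
        first = "".join(locus[o] for locus, o in zip(genotype, orientation))
        second = "".join(locus[1 - o] for locus, o in zip(genotype, orientation))
        yield first, second


def gene_counting(genotypes, iterations=100):
    """Estimate haplotype frequencies by naive gene counting (EM)."""
    haplotypes = sorted({h for g in genotypes for pair in phase_pairs(g) for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    for _ in range(iterations):
        counts = defaultdict(float)
        for g in genotypes:
            pairs = list(phase_pairs(g))
            # E-step: probability of each phasing under the current frequencies.
            weights = [freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq


# Toy sample: one unambiguous individual and one double heterozygote.
frequencies = gene_counting([("AA", "BB"), ("Aa", "Bb")])
```

Enumerating all phasings costs 2^L operations per genotype for L loci, which illustrates why a naive implementation becomes impractical as the number of loci grows.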

11.
MOTIVATION: TipDate is a program that will use sequences that have been isolated at different dates to estimate their rate of molecular evolution. The program provides a maximum likelihood estimate of the rate and also the associated date of the most recent common ancestor of the sequences, under a model which assumes a constant rate of substitution (molecular clock) but which accommodates the dates of isolation. Confidence intervals for these parameters are also estimated. RESULTS: The approach was applied to a sample of 17 dengue virus serotype 4 sequences, isolated at dates ranging from 1956 to 1994. The rate of substitution for this serotype was estimated to be 7.91 × 10^-4 substitutions per site per year (95% confidence intervals of 6.07 × 10^-4, 9.86 × 10^-4). This is compatible with a date of 1922 (95% confidence intervals of 1900-1936) for the most recent common ancestor of these sequences. AVAILABILITY: TipDate can be obtained by WWW from http://evolve.zoo.ox.ac.uk/software. The package includes the source code, manual and example files. Both UNIX and Apple Macintosh versions are available from the same site.

12.
An improved general amino acid replacement matrix   Total citations: 2, self-citations: 0, citations by others: 2
Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches and thus the likelihood of the data. They are also essential in protein alignment. A number of replacement matrices and methods to estimate these matrices from protein alignments have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001) and their WAG matrix, thanks to an efficient maximum likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation and using a much larger and diverse database than BRKALN, which was used to estimate WAG. To estimate our new matrix (called LG after the authors), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall. To evaluate the LG performance, we use an independent sample consisting of 59 alignments from TreeBase and randomly divide Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that 1) the average Akaike information criterion gain per site is 0.25 and 0.42, when compared with WAG and JTT, respectively; 2) LG is significantly better than WAG for 38 alignments (among 59), and significantly worse with 2 alignments only; and 3) tree topologies inferred with LG, WAG, and JTT frequently differ, indicating that using LG impacts not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous. LG and a PHYML implementation can be downloaded from http://atgc.lirmm.fr/LG.

13.

Key message

We propose a criterion to predict genomic selection efficiency for structured populations. This criterion is useful to define optimal calibration set and to estimate prediction reliability for multiparental populations.

Abstract

Genomic selection refers to the use of genotypic information for predicting the performance of selection candidates. It has been shown that prediction accuracy depends on various parameters including the composition of the calibration set (CS). Assessing the level of accuracy of a given prediction scenario is of highest importance because it can be used to optimize CS sampling before collecting phenotypes, and once the breeding values are predicted it informs the breeders about the reliability of these predictions. Different criteria were proposed to optimize CS sampling in highly diverse panels, which can be useful to screen collections of genotypes. But plant breeders often work on structured material such as biparental or multiparental populations, for which these criteria are less adapted. We derived from the generalized coefficient of determination (CD) theory different criteria to optimize CS sampling and to assess the reliability associated to predictions in structured populations. These criteria were evaluated on two nested association mapping (NAM) populations and two highly diverse panels of maize. They were efficient to sample optimized CS in most situations. They could also estimate at least partly the reliability associated to predictions between NAM families, but they could not estimate differences in the reliability associated to the predictions of NAM families using the highly diverse panels as calibration sets. We illustrated that the CD criteria could be adapted to various prediction scenarios including inter and intra-family predictions, resulting in higher prediction accuracies.

14.
Investigation of physiological mechanisms at a cellular level often requires production of high-quality antibodies, frequently using synthetic peptides as immunogens. Here we describe a new, web-based software tool called NHLBI-AbDesigner that allows the user to visualize the information needed to choose optimal peptide sequences for peptide-directed antibody production (http://helixweb.nih.gov/AbDesigner/). The choice of an immunizing peptide is generally based on a need to optimize immunogenicity, antibody specificity, multispecies conservation, and robustness in the face of posttranslational modifications (PTMs). AbDesigner displays information relevant to these criteria as follows: 1) "Immunogenicity Score," based on hydropathy and secondary structure prediction; 2) "Uniqueness Score," a predictor of specificity of an antibody against all proteins expressed in the same species; 3) "Conservation Score," a predictor of ability of the antibody to recognize orthologs in other animal species; and 4) "Protein Features" that show structural domains, variable regions, and annotated PTMs that may affect antibody performance. AbDesigner displays the information online in an interactive graphical user interface, which allows the user to recognize the trade-offs that exist for alternative synthetic peptide choices and to choose the one that is best for a proposed application. Several examples of the use of AbDesigner for the display of such trade-offs are presented, including production of a new antibody to Slc9a3. We also used the program in large-scale mode to create a database listing the 15-amino acid peptides with the highest Immunogenicity Scores for all known proteins in five animal species, one plant species (Arabidopsis thaliana), and Saccharomyces cerevisiae.

15.
The phylum Microsporidia comprises a species-rich group of minute, single-celled, and intra-cellular parasites. Lacking normal mitochondria and with unique cytology, microsporidians have sometimes been thought to be a lineage of ancient eukaryotes. Although phylogenetic analyses using small-subunit ribosomal RNA (SSU-rRNA) genes almost invariably place the Microsporidia among the earliest branches on the eukaryotic tree, many other molecules suggest instead a relationship with fungi. Using maximum likelihood methods and a diverse SSU-rRNA data set, we have re-evaluated the phylogenetic affiliations of Microsporidia. We demonstrate that tree topologies used to estimate likelihood model parameters can materially affect phylogenetic searches. We present a procedure for reducing this bias: "tree-based site partitioning," in which a comprehensive set of alternative topologies is used to estimate sequence data partitions based on inferred evolutionary rates. This hypothesis-driven approach appears to be capable of utilizing phylogenetic information that is not available to standard likelihood implementations (e.g., approximation to a gamma distribution); we have employed it in maximum likelihood and Bayesian analysis. Applying our method to a phylogenetically diverse SSU-rRNA data set revealed that the early diverging ("deep") placement of Microsporidia typically found in SSU-rRNA trees is no better than a fungal placement, and that the likeliest placement of Microsporidia among non-long-branch eukaryotic taxa is actually within fungi. These results illustrate the importance of hypothesis testing in parameter estimation, provide a way to address certain problems in difficult data sets, and support a fungal origin for the Microsporidia.

16.
Algorithmic details to obtain maximum likelihood estimates of parameters on a large phylogeny are discussed. On a large tree, an efficient approach is to optimize branch lengths one at a time while updating parameters in the substitution model simultaneously. Codon substitution models that allow for variable nonsynonymous/synonymous rate ratios (ω = dN/dS) among sites are used to analyze a data set of human influenza virus type A hemagglutinin (HA) genes. The data set has 349 sequences. Methods for obtaining approximate estimates of branch lengths for codon models are explored, and the estimates are used to test for positive selection and to identify sites under selection. Compared with results obtained from the exact method estimating all parameters by maximum likelihood, the approximate methods produced reliable results. The analysis identified a number of sites in the viral gene under diversifying Darwinian selection and demonstrated the importance of including many sequences in the data in detecting positive selection at individual sites. Received: 25 April 2000 / Accepted: 24 July 2000

17.
Miyazawa S. PLoS ONE 2011, 6(3): e17244

Background

Empirical substitution matrices represent the average tendencies of substitutions over various protein families by sacrificing gene-level resolution. We develop a codon-based model, in which mutational tendencies of codon, a genetic code, and the strength of selective constraints against amino acid replacements can be tailored to a given gene. First, selective constraints averaged over proteins are estimated by maximizing the likelihood of each 1-PAM matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution matrices. Then, selective constraints specific to given proteins are approximated as a linear function of those estimated from the empirical substitution matrices.

Results

Akaike information criterion (AIC) values indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices significantly better. Also, the ML estimates of transition-transversion bias obtained from these empirical matrices are not so large as previously estimated. The selective constraints are characteristic of proteins rather than species. However, their relative strengths among amino acid pairs can be approximated not to depend very much on protein families but amino acid pairs, because the present model, in which selective constraints are approximated to be a linear function of those estimated from the JTT/WAG/LG/KHG matrices, can provide a good fit to other empirical substitution matrices including cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial proteins.

Conclusions/Significance

The present codon-based model with the ML estimates of selective constraints and with adjustable mutation rates of nucleotide would be useful as a simple substitution model in ML and Bayesian inferences of molecular phylogenetic trees, and enables us to obtain biologically meaningful information at both nucleotide and amino acid levels from codon and protein sequences.

18.
aflpop is a population allocation and simulator program based on amplified fragment length polymorphism markers. The allocation method is an adaptation of Paetkau's method for co‐dominant alleles. Besides population allocation of specimens of unknown origin, re‐allocation of sample genotypes, as well as allocation of artificial (Monte Carlo) specimens, may be run to estimate expected rates of correct allocations. Thanks to its embodied simulator, aflpop can provide information on the rates and types of incorrect allocations and on empirical distributions of likelihood statistics. A filtering procedure within aflpop allows the selection of loci according to user‐defined criteria.

19.
D Watt, S Verma, L Flynn. CMAJ 1998, 158(2): 224-230
OBJECTIVE: To review studies that have examined an association between wellness programs and improvements in quality of life and to assess the strength of the scientific evidence. DATA SOURCES: A MEDLINE search was constructed with the following medical subject headings: "psychoneuroimmunology," "chronic disease" and "health promotion," "chronic disease" and "health behaviour," "relaxation techniques," "music therapy," "laughter," "anger," "mediation" and "behavioural medicine." Searches using the text words "wellness" and "wellness program" were also carried out. References from the primary articles identified in the search and contemporary writing on wellness were also considered. STUDY SELECTION: Selection was limited to randomized controlled trials or prospective studies published in English that involved human subjects and that took place between 1980 and 1996. All studies with an intervention aimed at promoting wellness and measuring outcomes were included, except studies of patients with cancer and HIV and studies of health promotion programs in the workplace. Of the 1082 references initially identified, 11 met the criteria for inclusion in the critical appraisal. DATA EXTRACTION: The following information was extracted from the 11 studies: characteristics of the study population, number of participants (and number followed to completion), length of follow-up, type of intervention, outcome measures and results. All 11 studies were assessed for the quality of their evidence. DATA SYNTHESIS: All studies reported some positive outcomes following the intervention in question, although many had limitations precluding applicability of the results to a wider population. CONCLUSIONS: Despite the suggested benefit associated with wellness programs, the evidence was inconclusive. Whether the composition of the target group or the type of intervention has a role in determining outcomes is unknown. Although trends suggest that wellness programs may be cost-effective, further research is needed for confirmation.

20.
MOTIVATION: It is well known that neighbouring nucleotides in DNA sequences do not mutate independently of each other. In this paper, we introduce a context-dependent substitution model and derive an algorithm to calculate the likelihood of sequences evolving under this model. We use this algorithm to estimate neighbour-dependent substitution rates, as well as rates for dinucleotide substitutions, using a Bayesian sampling procedure. The model is irreversible, giving an arrow to time, and allowing the position of the root between a pair of sequences to be inferred without using out-groups. RESULTS: We applied the model upon aligned human-mouse non-coding data. Clear neighbour dependencies were observed, including 17-18-fold increased CpG to TpG/CpA rates compared with other substitutions. Root inference positioned the root halfway the mouse and human tips, suggesting an approximately clock-like behaviour of the irreversible part of the substitution process.
