Similar literature
 20 similar articles found
1.
Codon-based substitution models are routinely used to measure selective pressures acting on protein-coding genes. To this effect, the nonsynonymous to synonymous rate ratio (dN/dS = omega) is estimated. The proportion of amino-acid sites potentially under positive selection, as indicated by omega > 1, is inferred by fitting a probability distribution where some sites are permitted to have omega > 1. These sites are then inferred by means of an empirical Bayes or a Bayes empirical Bayes approach that, respectively, ignores or accounts for sampling errors in maximum-likelihood estimates of the distribution used to infer the proportion of sites with omega > 1. Here, we extend a previous full-Bayes approach to include models with high power and low false-positive rates when inferring sites under positive selection. We propose some heuristics to alleviate the computational burden, and show that (i) full Bayes can be superior to empirical Bayes when analyzing a small data set or small simulated data, (ii) full Bayes has only a small advantage over Bayes empirical Bayes with our small test data, and (iii) Bayesian methods appear relatively insensitive to mild misspecifications of the random process generating adaptive evolution in our simulations, but in practice can prove extremely sensitive to model specification. We suggest that the codon model used to detect amino acids under selection should be carefully selected, for instance using the Akaike information criterion (AIC).
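
The AIC-based model choice suggested in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the model names, log-likelihoods, and parameter counts are hypothetical placeholders, and the only formula used is the standard definition AIC = 2k - 2 ln L.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion, 2k - 2 ln L; lower values are preferred."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(fits):
    """Pick the model with the lowest AIC.

    `fits` maps a model name to (maximized log-likelihood, number of free parameters).
    Returns the best model name and the dictionary of all AIC scores.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    return min(scores, key=scores.get), scores


# Hypothetical fits of three candidate codon models to the same alignment.
example_fits = {
    "model_A": (-1234.5, 2),
    "model_B": (-1230.0, 4),
    "model_C": (-1229.8, 6),
}

if __name__ == "__main__":
    best, scores = best_by_aic(example_fits)
    for name, score in sorted(scores.items(), key=lambda item: item[1]):
        print(f"{name}: AIC = {score:.1f}")
    print("preferred:", best)
```

In this made-up example the middle model wins: its extra parameters buy enough likelihood over the simplest model, while the most complex model's further gain is too small to offset the penalty.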

2.
Bayes prediction quantifies uncertainty by assigning posterior probabilities. It was used to identify amino acids in a protein under recurrent diversifying selection indicated by higher nonsynonymous (d(N)) than synonymous (d(S)) substitution rates or by omega = d(N)/d(S) > 1. Parameters were estimated by maximum likelihood under a codon substitution model that assumed several classes of sites with different omega ratios. Bayes' theorem was used to calculate the posterior probabilities of each site falling into these site classes. Here, we evaluate the performance of Bayes prediction of amino acids under positive selection by computer simulation. We measured the accuracy by the proportion of predicted sites that were truly under selection and the power by the proportion of true positively selected sites that were predicted by the method. The accuracy was slightly better for longer sequences, whereas the power was largely unaffected by the increase in sequence length. Both accuracy and power were higher for medium or highly diverged sequences than for similar sequences. We found that accuracy and power were unacceptably low when data contained only a few highly similar sequences. However, sampling a large number of lineages improved the performance substantially. Even for very similar sequences, accuracy and power can be high if over 100 taxa are used in the analysis. We make the following recommendations: (1) prediction of positive selection sites is not feasible for a few closely related sequences; (2) using a large number of lineages is the best way to improve the accuracy and power of the prediction; and (3) multiple models of heterogeneous selective pressures among sites should be applied in real data analysis.
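
The posterior calculation and the two performance measures described in this abstract can be sketched as below. This is a hedged illustration only: the class proportions, per-class site likelihoods, and site indices are invented numbers, and in a real analysis the per-class likelihoods would come from the fitted codon model rather than being typed in by hand.

```python
def site_class_posteriors(proportions, site_likelihoods):
    """Posterior probability of each site class for one site, by Bayes' theorem.

    P(class k | data) = p_k * f(data | k) / sum_j p_j * f(data | j),
    where p_k is the estimated proportion of sites in class k and
    f(data | k) is the likelihood of the site's data under class k.
    """
    joint = [p * f for p, f in zip(proportions, site_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def accuracy_and_power(predicted, truly_selected):
    """Accuracy and power as defined in the abstract.

    accuracy = fraction of predicted sites that are truly under selection
    power    = fraction of truly selected sites that were predicted
    """
    predicted, truly_selected = set(predicted), set(truly_selected)
    hits = len(predicted & truly_selected)
    accuracy = hits / len(predicted) if predicted else float("nan")
    power = hits / len(truly_selected) if truly_selected else float("nan")
    return accuracy, power


if __name__ == "__main__":
    # Hypothetical three-class model; the last class is the one with omega > 1.
    post = site_class_posteriors([0.7, 0.2, 0.1], [1e-6, 2e-6, 8e-6])
    print("posterior of the omega > 1 class:", round(post[2], 3))
    print(accuracy_and_power(predicted=[1, 2, 3, 4], truly_selected=[2, 3, 5]))
```

A site is typically reported as positively selected when the posterior for the omega > 1 class exceeds a chosen cutoff; the cutoff itself is a user decision and is not fixed by the abstract.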

3.
The nonsynonymous (amino acid-altering) to synonymous (silent) substitution rate ratio (omega = d(N)/d(S)) provides a measure of natural selection at the protein level, with omega = 1, <1, and >1 indicating neutral evolution, purifying selection, and positive selection, respectively. Previous studies that used this measure to detect positive selection have often taken an approach of pairwise comparison, estimating substitution rates by averaging over all sites in the protein. As most amino acids in a functional protein are under structural and functional constraints and adaptive evolution probably affects only a few sites at a few time points, this approach of averaging rates over sites and over time has little power. Previously, we developed codon-based substitution models that allow the omega ratio to vary either among lineages or among sites. In this paper we extend previous models to allow the omega ratio to vary both among sites and among lineages and implement the new models in the likelihood framework. These models may be useful for identifying positive selection along prespecified lineages that affects only a few sites in the protein. We apply those branch-site models as well as previous branch- and site-specific models to three data sets: the lysozyme genes from primates, the tumor suppressor BRCA1 genes from primates, and the phytochrome (PHY) gene family in angiosperms. Positive selection is detected in the lysozyme and BRCA genes by both the new and the old models. However, only the new models detected positive selection acting on lineages after gene duplication in the PHY gene family. Additional tests on several data sets suggest that the new models may be useful in detecting positive selection after gene duplication in gene family evolution.

4.
The selective pressure at the protein level is usually measured by the nonsynonymous/synonymous rate ratio (omega = dN/dS), with omega < 1, omega = 1, and omega > 1 indicating purifying (or negative) selection, neutral evolution, and diversifying (or positive) selection, respectively. The omega ratio is commonly calculated as an average over sites. As every functional protein has some amino acid sites under selective constraints, averaging rates across sites leads to low power to detect positive selection. Recently developed models of codon substitution allow the omega ratio to vary among sites and appear to be powerful in detecting positive selection in empirical data analysis. In this study, we used computer simulation to investigate the accuracy and power of the likelihood ratio test (LRT) in detecting positive selection at amino acid sites. The test compares two nested models: one that allows for sites under positive selection (with omega > 1), and another that does not, with the chi(2) distribution used for significance testing. We found that use of the chi(2) distribution makes the test conservative, especially when the data contain very short and highly similar sequences. Nevertheless, the LRT is powerful. Although the power can be low with only 5 or 6 sequences in the data, it was nearly 100% in data sets of 17 sequences. Sequence length, sequence divergence, and the strength of positive selection also were found to affect the power of the LRT. The exact distribution assumed for the omega ratio over sites was found not to affect the effectiveness of the LRT.
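
The likelihood ratio test described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions: the log-likelihood values are invented, and the sketch assumes the alternative model has two more free parameters than the null, so that the chi(2) reference distribution has 2 degrees of freedom. For that special case the upper-tail probability has the closed form exp(-x/2), which avoids any dependency beyond the standard library.

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Test statistic 2 * (lnL_alt - lnL_null) for two nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df2(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom.

    For df = 2 the survival function is exactly exp(-x / 2).
    """
    return math.exp(-x / 2.0) if x > 0 else 1.0


def lrt_pvalue_df2(lnl_null: float, lnl_alt: float) -> float:
    """P-value of the LRT, assuming (hypothetically) df = 2."""
    return chi2_sf_df2(lrt_statistic(lnl_null, lnl_alt))


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods: null model without an omega > 1
    # class, alternative model with one.
    stat = lrt_statistic(-1000.0, -995.0)
    print(f"2*delta lnL = {stat:.2f}, p = {lrt_pvalue_df2(-1000.0, -995.0):.4f}")
```

As the abstract notes, using the chi(2) distribution makes the test conservative, so a p-value computed this way tends to overstate rather than understate the evidence needed to reject the null.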

5.
The nonsynonymous to synonymous substitution rate ratio (omega = d(N)/d(S)) provides a sensitive measure of selective pressure at the protein level, with omega values <1, =1, and >1 indicating purifying selection, neutral evolution, and diversifying selection, respectively. Maximum likelihood models of codon substitution developed recently account for variable selective pressures among amino acid sites by employing a statistical distribution for the omega ratio among sites. Those models, called random-sites models, are suitable when we do not know a priori which sites are under what kind of selective pressure. Sometimes prior information (such as the tertiary structure of the protein) might be available to partition sites in the protein into different classes, which are expected to be under different selective pressures. It is then sensible to use such information in the model. In this paper, we implement maximum likelihood models for prepartitioned data sets, which account for the heterogeneity among site partitions by using different omega parameters for the partitions. The models, referred to as fixed-sites models, are also useful for combined analysis of multiple genes from the same set of species. We apply the models to data sets of the major histocompatibility complex (MHC) class I alleles from human populations and of the abalone sperm lysin genes. Structural information is used to partition sites in MHC into two classes: those in the antigen recognition site (ARS) and those outside. Positive selection is detected in the ARS by the fixed-sites models. Similarly, sites in lysin are classified into the buried and solvent-exposed classes according to the tertiary structure, and positive selection is detected at the solvent-exposed sites. The random-sites models identified a number of sites under positive selection in each data set, confirming and elaborating the results of the fixed-sites models. The analysis demonstrates the utility of the fixed-sites models, as well as the power of previous random-sites models, which do not use the prior information to partition sites.

6.
We consider three approaches for estimating the rates of nonsynonymous and synonymous changes at each site in a sequence alignment in order to identify sites under positive or negative selection: (1) a suite of fast likelihood-based "counting methods" that employ either a single most likely ancestral reconstruction, weighting across all possible ancestral reconstructions, or sampling from ancestral reconstructions; (2) a random effects likelihood (REL) approach, which models variation in nonsynonymous and synonymous rates across sites according to a predefined distribution, with the selection pressure at an individual site inferred using an empirical Bayes approach; and (3) a fixed effects likelihood (FEL) method that directly estimates nonsynonymous and synonymous substitution rates at each site. All three methods incorporate flexible models of nucleotide substitution bias and variation in both nonsynonymous and synonymous substitution rates across sites, facilitating the comparison between the methods. We demonstrate that the results obtained using these approaches show broad agreement in levels of Type I and Type II error and in estimates of substitution rates. Counting methods are well suited for large alignments, for which there is high power to detect positive and negative selection, but appear to underestimate the substitution rate. A REL approach, which is more computationally intensive than counting methods, has higher power than counting methods to detect selection in data sets of intermediate size but may suffer from higher rates of false positives for small data sets. A FEL approach appears to capture the pattern of rate variation better than counting methods or random effects models, does not suffer from as many false positives as random effects models for data sets comprising few sequences, and can be efficiently parallelized. Our results suggest that previously reported differences between results obtained by counting methods and random effects models arise due to a combination of the conservative nature of counting-based methods, the failure of current random effects models to allow for variation in synonymous substitution rates, and the naive application of random effects models to extremely sparse data sets. We demonstrate our methods on sequence data from the human immunodeficiency virus type 1 env and pol genes and simulated alignments.

7.
A popular approach to detecting positive selection is to estimate the parameters of a probabilistic model of codon evolution and perform inference based on its maximum likelihood parameter values. This approach has been evaluated intensively in a number of simulation studies and found to be robust when the available data set is large. However, uncertainties in the estimated parameter values can lead to errors in the inference, especially when the data set is small or there is insufficient divergence between the sequences. We introduce a Bayesian model comparison approach to infer whether the sequence as a whole contains sites at which the rate of nonsynonymous substitution is greater than the rate of synonymous substitution. We incorporated this probabilistic model comparison into a Bayesian approach to site-specific inference of positive selection. Using simulated sequences, we compared this approach to the commonly used empirical Bayes approach and investigated the effect of tree length on the performance of both methods. We found that the Bayesian approach outperforms the empirical Bayes method when the amount of sequence divergence is small and is less prone to false-positive inference when the sequences are saturated, while the results are indistinguishable for intermediate levels of sequence divergence.

8.
Anisimova M  Nielsen R  Yang Z 《Genetics》2003,164(3):1229-1236
Maximum-likelihood methods based on models of codon substitution accounting for heterogeneous selective pressures across sites have proved to be powerful in detecting positive selection in protein-coding DNA sequences. Those methods are phylogeny based and do not account for the effects of recombination. When recombination occurs, such as in population data, no unique tree topology can describe the evolutionary history of the whole sequence. This violation of assumptions raises serious concerns about the likelihood method for detecting positive selection. Here we use computer simulation to evaluate the reliability of the likelihood-ratio test (LRT) for positive selection in the presence of recombination. We examine three tests based on different models of variable selective pressures among sites. Sequences are simulated using a coalescent model with recombination and analyzed using codon-based likelihood models ignoring recombination. We find that the LRT is robust to low levels of recombination (with fewer than three recombination events in the history of a sample of 10 sequences). However, at higher levels of recombination, the type I error rate can be as high as 90%, especially when the null model in the LRT is unrealistic, and the test often mistakes recombination as evidence for positive selection. The test that compares the more realistic models M7 (beta) against M8 (beta and omega) is more robust to recombination, where the null model M7 allows the omega ratio to vary between 0 and 1 (and so does not account for positive selection), and the alternative model M8 allows an additional discrete class with omega = d(N)/d(S) that could be estimated to be >1 (and thus accounts for positive selection). Identification of sites under positive selection by the empirical Bayes method appears to be less affected than the LRT by recombination.

9.
While endothermy is ubiquitous in birds and mammals, it is not exclusive to these most recently arisen vertebrate classes. The ability to warm specific organs and/or tissues above ambient temperature (regional endothermy) has evolved at least three times in phylogenetically discrete fish lineages: lamnid sharks (Lamnidae), tunas (Scombridae) and billfishes (Istiophoridae and Xiphiidae). Given the links between endothermy and metabolic rate, we looked for evidence of convergent molecular evolution in mtDNA-encoded cytochrome c oxidase (COX) subunits in each of these discrete lineages. We found no evidence that the endothermic phenotype in fishes is driven or accompanied by molecular convergence. Though we found little evidence for positively selected sites in any of the lineages in any subunit, the conclusions were sensitive to the choice of maximum-likelihood model. Several sites identified by Naïve Empirical Bayes (NEB) were not found when Bayes Empirical Bayes (BEB) was employed. In addition, conclusions were profoundly influenced by taxon sampling. Several of the putative sites of positive selection in COX II were no longer apparent as we augmented taxon sampling. The lack of convergent molecular evolution in these remarkable taxa, combined with the profound influence of model choice and taxon sampling, provides a cautionary note on the use of rates of non-synonymous to synonymous mutations (dN/dS) to explore questions of the evolution of physiological function.

10.
Yang Z  Nielsen R  Goldman N  Pedersen AM 《Genetics》2000,155(1):431-449
Comparison of relative fixation rates of synonymous (silent) and nonsynonymous (amino acid-altering) mutations provides a means for understanding the mechanisms of molecular sequence evolution. The nonsynonymous/synonymous rate ratio (omega = d(N)/d(S)) is an important indicator of selective pressure at the protein level, with omega = 1 meaning neutral mutations, omega < 1 purifying selection, and omega > 1 diversifying positive selection. Amino acid sites in a protein are expected to be under different selective pressures and have different underlying omega ratios. We develop models that account for heterogeneous omega ratios among amino acid sites and apply them to phylogenetic analyses of protein-coding DNA sequences. These models are useful for testing for adaptive molecular evolution and identifying amino acid sites under diversifying selection. Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to estimate the distributions of omega among sites. In all data sets analyzed, the selective pressure indicated by the omega ratio is found to be highly heterogeneous among sites. Previously unsuspected Darwinian selection is detected in several genes in which the average omega ratio across sites is <1, but in which some sites are clearly under diversifying selection with omega > 1. Genes undergoing positive selection include the beta-globin gene from vertebrates, mitochondrial protein-coding genes from hominoids, the hemagglutinin (HA) gene from human influenza virus A, and HIV-1 env, vif, and pol genes. Tests for the presence of positively selected sites and their subsequent identification appear quite robust to the specific distributional form assumed for omega and can be achieved using any of several models we implement. However, we encountered difficulties in estimating the precise distribution of omega among sites from real data sets.

11.
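Peng Y  Su Y  Wang T 《植物学报》 (Chinese Bulletin of Botany) 2020,55(3):287-298
The rpoC1 gene encodes the β′ subunit of RNA polymerase, which binds the DNA template during transcription and, together with the β subunit, forms the β-β′ complex that constitutes the catalytic center of RNA synthesis. Taking the rpoC1 gene as the study object and requiring a Bayes factor greater than 20, the site models in HyPhy detected 3 positively selected sites and 541 negatively selected sites; the site models in PAML detected 10 positively selected sites, 3 of which had posterior probabilities above 99%. In addition, a phylogenetic tree of 64 fern species was constructed by maximum likelihood and combined with HyPhy to analyze the transition rate, transversion rate, transition/transversion ratio, synonymous substitution rate, nonsynonymous substitution rate, and synonymous/nonsynonymous substitution rate ratio of rpoC1, in order to explore the relationship between rpoC1 intron loss and the rate of molecular evolution. The results show that intron loss in rpoC1 has some effect on the transition rate, the transversion rate, and the nonsynonymous substitution rate.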

12.
The ratio of nonsynonymous (dN) to synonymous (dS) substitution rates, omega, provides a measure of selection at the protein level. Models have been developed that allow omega to vary among lineages. However, these models require the lineages in which differential selection has acted to be specified a priori. We propose a genetic algorithm approach to assign lineages in a phylogeny to a fixed number of different classes of omega, thus allowing variable selection pressure without a priori specification of particular lineages. This approach can identify models with a better fit than a single-ratio model, and with fits that are better than (in an information-theoretic sense) a fully local model, in which all lineages are assumed to evolve under different values of omega, but with far fewer parameters. By averaging over models which explain the data reasonably well, we can assess the robustness of our conclusions to uncertainty in model estimation. Our approach can also be used to compare results from models in which branch classes are specified a priori with a wide range of credible models. We illustrate our methods on primate lysozyme sequences and compare them with previous methods applied to the same data sets.

13.
Detecting positive Darwinian selection at the DNA sequence level has been a subject of considerable interest. However, positive selection is difficult to detect because it often operates episodically on a few amino acid sites, and the signal may be masked by negative selection. Several methods have been developed to test positive selection that acts on given branches (branch methods) or on a subset of sites (site methods). Recently, Yang, Z., and R. Nielsen (2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917) developed likelihood ratio tests (LRTs) based on branch-site models to detect positive selection that affects a small number of sites along prespecified lineages. However, computer simulations suggested that the tests were sensitive to the model assumptions and were unable to distinguish between relaxation of selective constraint and positive selection (Zhang, J. 2004. Frequent false detection of positive selection by the likelihood method with branch-site models. Mol. Biol. Evol. 21:1332-1339). Here, we describe a modified branch-site model and use it to construct two LRTs, called branch-site tests 1 and 2. We applied the new tests to reanalyze several real data sets and used computer simulation to examine the performance of the two tests by examining their false-positive rate, power, and robustness. We found that test 1 was unable to distinguish relaxed constraint from positive selection affecting the lineages of interest, while test 2 had acceptable false-positive rates and appeared robust against violations of model assumptions. As test 2 is a direct test of positive selection on the lineages of interest, it is referred to as the branch-site test of positive selection and is recommended for use in real data analysis. The test appeared conservative overall, but exhibited better power in detecting positive selection than the branch-based test. Bayes empirical Bayes identification of amino acid sites under positive selection along the foreground branches was found to be reliable, but lacked power.

14.
A class of nonparametric statistical methods, including a nonparametric empirical Bayes (EB) method, the Significance Analysis of Microarrays (SAM) and the mixture model method (MMM), has been proposed to detect differential gene expression for replicated microarray experiments. They all depend on constructing a test statistic, for example, a t-statistic, and then using permutation to draw inferences. However, due to special features of microarray data, using standard permutation scores may not estimate the null distribution of the test statistic well, leading to possibly too conservative inferences. We propose a new method of constructing weighted permutation scores to overcome the problem: posterior probabilities of having no differential expression from the EB method are used as weights for genes to better estimate the null distribution of the test statistic. We also propose a weighted method to estimate the false discovery rate (FDR) using the posterior probabilities. Using simulated data and real data for time-course microarray experiments, we show the improved performance of the proposed methods when implemented in MMM, EB and SAM.

15.
Recent work on Bayesian inference of disease mapping models discusses the advantages of the fully Bayesian (FB) approach over its empirical Bayes (EB) counterpart, suggesting that FB posterior standard deviations of small-area relative risks are more reflective of the uncertainty associated with the relative risk estimation than counterparts based on EB inference, since the latter fail to account for the variability in the estimation of the hyperparameters. In this article, an EB bootstrap methodology for relative risk inference with accurate parametric EB confidence intervals is developed, illustrated, and contrasted with the hyperprior Bayes. We elucidate the close connection between the EB bootstrap methodology and hyperprior Bayes, present a comparison between FB inference via hybrid Markov chain Monte Carlo and EB inference via penalized quasi-likelihood, and illustrate the ability of parametric bootstrap procedures to adjust for the undercoverage in the "naive" EB interval estimates. We discuss the important roles that FB and EB methods play in risk inference, map interpretation, and real-life applications. The work is motivated by a recent analysis of small-area infant mortality rates in the province of British Columbia in Canada.

16.
Ten Have TR  Localio AR 《Biometrics》1999,55(4):1022-1029
We extend an approach for estimating random effects parameters under a random intercept and slope logistic regression model to include standard errors, thereby including confidence intervals. The procedure entails numerical integration to yield posterior empirical Bayes (EB) estimates of random effects parameters and their corresponding posterior standard errors. We incorporate an adjustment of the standard error due to Kass and Steffey (KS; 1989, Journal of the American Statistical Association 84, 717-726) to account for the variability in estimating the variance component of the random effects distribution. In assessing health care providers with respect to adult pneumonia mortality, comparisons are made with the penalized quasi-likelihood (PQL) approximation approach of Breslow and Clayton (1993, Journal of the American Statistical Association 88, 9-25) and a Bayesian approach. To make comparisons with an EB method previously reported in the literature, we apply these approaches to crossover trials data previously analyzed with the estimating equations EB approach of Waclawiw and Liang (1994, Statistics in Medicine 13, 541-551). We also perform simulations to compare the proposed KS and PQL approaches. These two approaches lead to EB estimates of random effects parameters with similar asymptotic bias. However, for many clusters with small cluster size, the proposed KS approach does better than the PQL procedures in terms of coverage of nominal 95% confidence intervals for random effects estimates. For large cluster sizes and a few clusters, the PQL approach performs better than the KS adjustment. These simulation results agree somewhat with those of the data analyses.

17.
Most current models of sequence evolution assume that all sites of a protein evolve under the same substitution process, characterized by a 20 x 20 substitution matrix. Here, we propose to relax this assumption by developing a Bayesian mixture model that allows the amino-acid replacement pattern at different sites of a protein alignment to be described by distinct substitution processes. Our model, named CAT, assumes the existence of distinct processes (or classes) differing by their equilibrium frequencies over the 20 residues. Through the use of a Dirichlet process prior, the total number of classes and their respective amino-acid profiles, as well as the affiliations of each site to a given class, are all free variables of the model. In this way, the CAT model is able to adapt to the complexity actually present in the data, and it yields an estimate of the substitutional heterogeneity through the posterior mean number of classes. We show that a significant level of heterogeneity is present in the substitution patterns of proteins, and that the standard one-matrix model fails to account for this heterogeneity. By evaluating the Bayes factor, we demonstrate that the standard model is outperformed by CAT on all of the data sets which we analyzed. Altogether, these results suggest that the complexity of the pattern of substitution of real sequences is better captured by the CAT model, offering the possibility of studying its impact on phylogenetic reconstruction and its connections with structure-function determinants.

18.
The tissue-specific expression and differential function of the crustacean hyperglycemic hormone (CHH) in Carcinus maenas indicate an interesting evolutionary history. Previous studies have shown that CHH from the sinus gland X-organ (XO-type) has hyperglycemic activity, whereas the CHH from the pericardial organ (PO-type) neither shows hyperglycemic activity nor inhibits Y-organ ecdysteroid synthesis. Here we examined the types of selective pressures operating on the variants of CHH in Carcinus maenas. Maximum likelihood-based codon substitution analyses revealed that the variants of this neuropeptide in C. maenas have been subjected to positive Darwinian selection, indicating adaptive evolution and functional divergence among the CHH variants leading to two unique groups (PO and XO-type). Although the average ratio of nonsynonymous to synonymous substitution (omega) for the entire coding region is 0.5096, a few codon sites showed significantly higher omega (10.95). Comparison of models that incorporate positive selection (omega > 1) with models not incorporating positive selection (omega < 1) at certain codon sites failed to reject (p=0) evidence of positive Darwinian selection.

19.
Probabilistic tests of topology offer a powerful means of evaluating competing phylogenetic hypotheses. The performance of the nonparametric Shimodaira-Hasegawa (SH) test, the parametric Swofford-Olsen-Waddell-Hillis (SOWH) test, and Bayesian posterior probabilities was explored for five data sets for which all the phylogenetic relationships are known with a very high degree of certainty. These results are consistent with previous simulation studies that have indicated a tendency for the SOWH test to be prone to generating Type 1 errors because of model misspecification coupled with branch length heterogeneity. These results also suggest that the SOWH test may accord overconfidence in the true topology when the null hypothesis is in fact correct. In contrast, the SH test was observed to be much more conservative, even under high substitution rates and branch length heterogeneity. For some of those data sets where the SOWH test proved misleading, the Bayesian posterior probabilities were also misleading. The results of all tests were strongly influenced by the exact substitution model assumptions. Simple models, especially those that assume rate homogeneity among sites, had a higher Type 1 error rate and were more likely to generate misleading posterior probabilities. For some of these data sets, the commonly used substitution models appear to be inadequate for estimating appropriate levels of uncertainty with the SOWH test and Bayesian methods. Reasons for the differences in statistical power between the two maximum likelihood tests are discussed and are contrasted with the Bayesian approach.

20.
Differential selection of genes of cucumber mosaic virus subgroups
Cucumber mosaic virus (CMV) has an extremely broad plant-host range, a large number of vector species, and a wide geographical distribution. CMV is, therefore, a model by which to understand plant virus adaptation. The selective constraints exerted on the five proteins expressed from the CMV genome were evaluated by application of newly developed maximum-likelihood algorithms to analyze sequences available in data banks. The ratio between nonsynonymous and synonymous substitution rates (omega) was used to detect positive selection on particular codon sites. Amino acid sequences were conserved with omega ranging from 0.07 to 0.60 in different proteins. However, a small proportion of amino acids in proteins 1a, 2a, and 3b, the coat protein (CP), were positively selected (omega > 1). Moreover, the evolution of the CP in the three subgroups of CMV strains revealed different selection profiles along the sequence and significantly different speed of evolution at many positions. Constraints exerted by aphid transmission, rather than plant adaptation, seemed to be responsible for these patterns of evolution in the CP.
