Similar literature
20 similar documents found.
1.
Traditional resampling-based tests for homogeneity in covariance matrices across multiple groups resample residuals, that is, data centered by group means. These residuals do not share the same second moments when the null hypothesis is false, which makes them difficult to use in the setting of multiple testing. An alternative approach is to resample standardized residuals, that is, data centered by group sample means and standardized by group sample covariance matrices. This approach, however, has been observed to inflate type I error when the sample size is small or the data are generated from heavy-tailed distributions. We propose to improve this approach by using robust estimation for the first and second moments. We discuss two statistics: the Bartlett statistic and a statistic based on eigen-decomposition of sample covariance matrices. Both statistics can be expressed in terms of standardized errors under the null hypothesis. These methods are extended to test homogeneity in correlation matrices. Using simulation studies, we demonstrate that the robust resampling approach provides comparable or superior performance, relative to traditional approaches, for single testing and reasonable performance for multiple testing. The proposed methods are applied to data collected in an HIV vaccine trial to investigate possible determinants, including vaccine status, vaccine-induced immune response level and viral genotype, of an unusual correlation pattern between HIV viral load and CD4 count in newly infected patients.
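For orientation, the Bartlett statistic named in this abstract can be sketched in one common textbook form, M = (N - k) ln|S_pooled| - Σ (n_i - 1) ln|S_i|. The abstract does not give the formula, so this form is an assumption. The code is restricted to bivariate data to stay dependency-free, and the function names are hypothetical. It does not implement the robust estimation or the resampling of standardized residuals that the abstract proposes.

```python
import math

def covariance_2d(rows):
    """Unbiased 2x2 sample covariance matrix of a list of (x, y) pairs."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def det_2d(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def bartlett_statistic(groups):
    """M = (N - k) ln|S_pooled| - sum_i (n_i - 1) ln|S_i| (no correction factor)."""
    k = len(groups)
    total = sum(len(g) for g in groups)
    covs = [covariance_2d(g) for g in groups]
    # Pooled covariance: group covariances weighted by degrees of freedom.
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for g, s in zip(groups, covs):
        w = (len(g) - 1) / (total - k)
        for i in range(2):
            for j in range(2):
                pooled[i][j] += w * s[i][j]
    m = (total - k) * math.log(det_2d(pooled))
    m -= sum((len(g) - 1) * math.log(det_2d(s)) for g, s in zip(groups, covs))
    return m

# Made-up illustration data: group_b is group_a scaled by 3.
group_a = [(0, 0), (1, 2), (2, 1), (3, 5), (4, 3)]
group_b = [(3 * x, 3 * y) for x, y in group_a]
```

With these made-up groups the statistic is zero when a group is compared with itself and positive when the covariances differ. In a resampling test its reference distribution would come from recomputing it on resampled data.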

2.
Guan Y  Sherman M  Calvin JA 《Biometrics》2006,62(1):119-125
A common assumption while analyzing spatial point processes is direction invariance, i.e., isotropy. In this article, we propose a formal nonparametric approach to test for isotropy based on the asymptotic joint normality of the sample second-order intensity function. We derive an L2-consistent subsampling estimator for the asymptotic covariance matrix of the sample second-order intensity function and use this to construct a test statistic with a χ² limiting distribution. We demonstrate the efficacy of the approach through simulation studies and an application to a desert plant data set, where our approach confirms suspected directional effects in the spatial distribution of the desert plant species.

3.
Maximum likelihood (ML) phylogenies based on 9,957 amino acid (AA) sites of 45 proteins encoded in the plastid genomes of Cyanophora, a diatom, a rhodophyte (red alga), a euglenophyte, and five land plants are compared with respect to several properties of the data, including between-site rate variation and aberrant amino acid composition in individual species. Neighbor-joining trees from AA LogDet distances and ML analyses are seen to be congruent when site rate variability is taken into account. Four feasible trees are identified in these analyses, one of which is preferred, and one of which is almost excluded by statistical criteria. A transition probability matrix for the general reversible Markov model of amino acid substitutions is estimated from the data, assuming each of these four trees. In all cases, the tree with diatom and rhodophyte as sister taxa was clearly favored. The new transition matrix based on the best tree, called cpREV, takes into account distinct substitution patterns in plastid-encoded proteins and should be useful in future ML inferences using such data. A second rate matrix, called cpREV*, based on a weighted sum of rate matrices from different trees, is also considered. Received: 3 June 1999 / Accepted: 26 November 1999

4.
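黄蛟龙  曹致琦  张泽  朱大海 《遗传学报》(Acta Genetica Sinica) 2005,32(10):1027-1036
Because observable models of molecular sequence evolution differ, we propose a relatively simple method, a chi-square test, for testing the homogeneity of the substitution process between DNA sequences. The chi-square test is valid whether or not the following three conditions hold across sites: (1) heterogeneity of substitution rates; (2) correlation of evolutionary rates/models; (3) variation in the substitution model. Computer simulations also show that the chi-square test is highly effective for sequence evolution models under a variety of biological conditions. In real data, a comparison of the mitochondrial DNA of 11 arthropod species shows that the water flea (Daphnia) or the brine shrimp (Artemia), compared with the other nine arthropods, violates the assumption of a homogeneous evolutionary model in a high percentage of cases, apparently because of high AT content; in a comparison of the mitochondrial DNA of two mosquito species, the homogeneity assumption held in only 7.69% of cases. We also compare the power of the chi-square test with that of the ID test of Kumar and Gadagkar: under more complex models, the chi-square test is in many cases slightly more powerful than the ID test, and the type I error rates and power curves of the chi-square test clearly show that our method is conservative, whereas the method of Kumar et al. is not.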

5.
A Space-Time Process Model for the Evolution of DNA Sequences   Cited by: 20 (self-citations: 3, citations by others: 17)
Z. Yang 《Genetics》1995,139(2):993-1005
We describe a model for the evolution of DNA sequences by nucleotide substitution, whereby nucleotide sites in the sequence evolve over time, whereas the rates of substitution are variable and correlated over sites. The temporal process used to describe substitutions between nucleotides is a continuous-time Markov process, with the four nucleotides as the states. The spatial process used to describe variation and dependence of substitution rates over sites is based on a serially correlated gamma distribution, i.e., an auto-gamma model assuming Markov dependence of rates at adjacent sites. To achieve computational efficiency, we use several equal-probability categories to approximate the gamma distribution, and the result is an auto-discrete-gamma model for rates over sites. Correlation of rates at sites is then modeled by the Markov chain transition of rates at adjacent sites from one rate category to another, the states of the chain being the rate categories. Two versions of nonparametric models, which place no restrictions on the distributional forms of rates for sites, are also considered, assuming either independence or Markov dependence. The models are applied to data from a segment of the mitochondrial genome of nine primate species. Model parameters are estimated by the maximum likelihood method, and models are compared by the likelihood ratio test. Tremendous variation of rates among sites in the sequence is revealed by the analyses, and when rate differences for different codon positions are appropriately accounted for in the models, substitution rates at adjacent sites are found to be strongly (positively) correlated. Robustness of the results to uncertainty of the phylogenetic tree linking the species is examined.
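The discretization step described in this abstract, approximating a gamma distribution of site rates by a few equal-probability categories, can be sketched as follows. This is a Monte Carlo stand-in that uses only the Python standard library, where an exact implementation would work from gamma quantiles. The function name, the mean-one parameterization (shape alpha, scale 1/alpha), and the default sample size are assumptions made for illustration. The Markov dependence between adjacent sites is not modeled here.

```python
import random

def discrete_gamma_rates(alpha, k, draws=100_000, seed=0):
    """Approximate mean rate of each of k equal-probability gamma categories.

    Rates follow a gamma distribution with shape alpha and scale 1/alpha,
    so the overall mean rate is 1.
    """
    rng = random.Random(seed)
    sample = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(draws))
    size = draws // k  # any leftover draws at the top end are ignored
    # Each consecutive block of the sorted sample is one probability-1/k category.
    return [sum(sample[i * size:(i + 1) * size]) / size for i in range(k)]

rates = discrete_gamma_rates(alpha=0.5, k=4)
```

The returned category rates increase from the slowest to the fastest category and average to roughly one, so they can serve as relative rate multipliers in a likelihood calculation.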

6.
Tamhane AC  Logan BR 《Biometrics》2002,58(3):650-656
Tang, Gnecco, and Geller (1989, Biometrika 76, 577-583) proposed an approximate likelihood ratio (ALR) test of the null hypothesis that a normal mean vector equals a null vector against the alternative that all of its components are nonnegative with at least one strictly positive. This test is useful for comparing a treatment group with a control group on multiple endpoints, and the data from the two groups are assumed to follow multivariate normal distributions with different mean vectors and a common covariance matrix (the homoscedastic case). Tang et al. derived the test statistic and its null distribution assuming a known covariance matrix. In practice, when the covariance matrix is estimated, the critical constants tabulated by Tang et al. result in a highly liberal test. To deal with this problem, we derive an accurate small-sample approximation to the null distribution of the ALR test statistic by using the moment matching method. The proposed approximation is then extended to the heteroscedastic case. The accuracy of both approximations is verified by simulations. A real data example is given to illustrate the use of the approximations.

7.
Using the variance stabilizing technique, a product multinomial model is introduced to generate a new statistic to test observers' uncertainty in a weighted concordance analysis. Distance matrices which follow some specific rules are obtained by linear combinations of hierarchical distance matrices whose elements are equal to 0 or 1 and unit diagonal. The new statistic is compared with the kappa statistic interpreted by considering the covariance matrix generated by the data. By rewriting the test statistic in a barycentric form, we demonstrate how to modify the barycentric coefficients to derive an adequate measure of the interobserver agreement. The methods are illustrated using two examples.

8.
Several maximum likelihood and distance matrix methods for estimating phylogenetic trees from homologous DNA sequences were compared when substitution rates at sites were assumed to follow a gamma distribution. Computer simulations were performed to estimate the probabilities that various tree estimation methods recover the true tree topology. The case of four species was considered, and a few combinations of parameters were examined. Attention was paid to discriminating among different sources of error in tree reconstruction, i.e., the inconsistency of the tree estimation method, the sampling error in the estimated tree due to limited sequence length, and the sampling error in the estimated probability due to the number of simulations being limited. Compared to the least squares method based on pairwise distance estimates, the joint likelihood analysis is found to be more robust when rate variation over sites is present but ignored and an assumption is thus violated. With limited data, the likelihood method has a much higher probability of recovering the true tree and is therefore more efficient than the least squares method. The concept of statistical consistency of a tree estimation method and its implications were explored, and it is suggested that, while the efficiency (or sampling error) of a tree estimation method is a very important property, statistical consistency of the method over a wide range of, if not all, parameter values is a prerequisite.

9.
Models of nucleotide substitution were constructed for combined analyses of heterogeneous sequence data (such as those of multiple genes) from the same set of species. The models account for different aspects of the heterogeneity in the evolutionary process of different genes, such as differences in nucleotide frequencies, in substitution rate bias (for example, the transition/transversion rate bias), and in the extent of rate variation across sites. Model parameters were estimated by maximum likelihood and the likelihood ratio test was used to test hypotheses concerning sequence evolution, such as rate constancy among lineages (the assumption of a molecular clock) and proportionality of branch lengths for different genes. The example data from a segment of the mitochondrial genome of six hominoid species (human, common and pygmy chimpanzees, gorilla, orangutan, and siamang) were analyzed. Nucleotides at the three codon positions in the protein-coding regions and from the tRNA-coding regions were considered heterogeneous data sets. Statistical tests showed that the amount of evolution in the sequence data reflected in the estimated branch lengths can be explained by the codon-position effect and lineage effect of substitution rates. The assumption of a molecular clock could not be rejected when the data were analyzed separately or when the rate variation among sites was ignored. However, significant differences in substitution rate among lineages were found when the data sets were combined and when the rate variation among sites was accounted for in the models. Under the assumption that the orangutan and African apes diverged 13 million years ago, the combined analysis of the sequence data estimated the times for the human-chimpanzee separation and for the separation of the gorilla as 4.3 and 6.8 million years ago, respectively.

10.

Background

Estimation of genetic covariance matrices for multivariate problems comprising more than a few traits is inherently problematic, since sampling variation increases dramatically with the number of traits. This paper investigates the efficacy of regularized estimation of covariance components in a maximum likelihood framework, imposing a penalty on the likelihood designed to reduce sampling variation. In particular, penalties that "borrow strength" from the phenotypic covariance matrix are considered.

Methods

An extensive simulation study was carried out to investigate the reduction in average "loss", i.e. the deviation in estimated matrices from the population values, and the accompanying bias for a range of parameter values and sample sizes. A number of penalties are examined, penalizing either the canonical eigenvalues or the genetic covariance or correlation matrices. In addition, several strategies to determine the amount of penalization to be applied, i.e. to estimate the appropriate tuning factor, are explored.

Results

It is shown that substantial reductions in loss for estimates of genetic covariance can be achieved for small to moderate sample sizes. While no penalty performed best overall, penalizing the variance among the estimated canonical eigenvalues on the logarithmic scale or shrinking the genetic towards the phenotypic correlation matrix appeared most advantageous. Estimating the tuning factor using cross-validation resulted in a loss reduction 10 to 15% less than that obtained if population values were known. Applying a mild penalty, chosen so that the deviation in likelihood from the maximum was non-significant, performed as well as, if not better than, cross-validation and can be recommended as a pragmatic strategy.

Conclusions

Penalized maximum likelihood estimation provides the means to "make the most" of limited and precious data and facilitates more stable estimation for multi-dimensional analyses. It should become part of our everyday toolkit for multivariate estimation in quantitative genetics.

11.
The amino acid sequences of proteins provide rich information for inferring distant phylogenetic relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from amino acid sequences is also important because the rate matrix can be used to develop scoring matrices for sequence alignment. Here we use a continuous-time Markov process to model the substitution rates of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate our method using simulated artificial protein sequences. Because different local regions such as binding surfaces and the protein interior core experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces. In addition, residues on the binding surfaces also have very different substitution rates from residues in the rest of the protein. Based on these findings, we further develop a method for protein function prediction by surface matching using scoring matrices derived from estimated substitution rates for residues located on the binding surfaces. We show with examples that our method is effective in identifying functionally related proteins that have overall low sequence identity, a task known to be very challenging.
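To make the continuous-time Markov model concrete: given an instantaneous rate matrix Q, the substitution probabilities over a branch of length t are P(t) = exp(Qt). The sketch below evaluates this with a truncated Taylor series on a toy 4-state matrix in which all changes are equally likely. The toy matrix, the function names, and the series length are illustrative assumptions. An amino acid model would use a 20 x 20 matrix estimated from data, and nothing here reproduces the Bayesian estimation described in the abstract.

```python
import math

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transition_matrix(q, t, terms=30):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms):
        # After this step, term holds (Q t)^m / m!.
        term = [[v / m for v in row] for row in mat_mul(term, qt)]
        result = [[r + v for r, v in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

# Toy rate matrix: equal off-diagonal rates, rows sum to zero.
Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
P = transition_matrix(Q, 0.5)
```

For this symmetric toy matrix the diagonal entries have the closed form 1/4 + (3/4) exp(-4t/3), which gives a convenient check on the series approximation. The Taylor series is adequate only for small Qt; larger matrices or longer branches would call for an eigendecomposition or a scaling-and-squaring scheme.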

12.
We have investigated the effects of different among-site rate variation models on the estimation of substitution model parameters, branch lengths, topology, and bootstrap proportions under minimum evolution (ME) and maximum likelihood (ML). Specifically, we examined equal rates, invariable sites, gamma-distributed rates, and site-specific rates (SSR) models, using mitochondrial DNA sequence data from three protein-coding genes and one tRNA gene from species of the New Zealand cicada genus Maoricicada. Estimates of topology were relatively insensitive to the substitution model used; however, estimates of bootstrap support, branch lengths, and R-matrices (underlying relative substitution rate matrix) were strongly influenced by the assumptions of the substitution model. We identified one situation where ME and ML tree building became inaccurate when implemented with an inappropriate among-site rate variation model. Despite the fact that SSR models often have a better fit to the data than do invariable sites and gamma rates models, SSR models have some serious weaknesses. First, SSR rate parameters are not comparable across data sets, unlike the proportion of invariable sites or the alpha shape parameter of the gamma distribution. Second, the extreme among-site rate variation within codon positions is problematic for SSR models, which explicitly assume rate homogeneity within each rate class. Third, the SSR models appear to give severe underestimates of R-matrices and branch lengths relative to invariable sites and gamma rates models in this example. We recommend performing phylogenetic analyses under a range of substitution models to test the effects of model assumptions not only on estimates of topology but also on estimates of branch length and nodal support.

13.
Empirical models of substitution are often used in protein sequence analysis because the large alphabet of amino acids requires that many parameters be estimated in all but the simplest parametric models. When information about structure is used in the analysis of substitutions in structured RNA, a similar situation occurs. The number of parameters necessary to adequately describe the substitution process increases in order to model the substitution of paired bases. We have developed a method to obtain substitution rate matrices empirically from RNA alignments that include structural information in the form of base pairs. Our data consisted of alignments from the European Ribosomal RNA Database of Bacterial and Eukaryotic Small Subunit and Large Subunit Ribosomal RNA (Wuyts et al. 2001. Nucleic Acids Res. 29:175-177; Wuyts et al. 2002. Nucleic Acids Res. 30:183-185). Using secondary structural information, we converted each sequence in the alignments into a sequence over a 20-symbol code: one symbol for each of the four individual bases, and one symbol for each of the 16 ordered pairs. Substitutions in the coded sequences are defined in the natural way, as observed changes between two sequences at any particular site. For given ranges (windows) of sequence divergence, we obtained substitution frequency matrices for the coded sequences. Using a technique originally developed for modeling amino acid substitutions (Veerassamy, Smith, and Tillier. 2003. J. Comput. Biol. 10:997-1010), we were able to estimate the actual evolutionary distance for each window. The actual evolutionary distances were used to derive instantaneous rate matrices, and from these we selected a universal rate matrix. The universal rate matrices were incorporated into the Phylip Software package (Felsenstein 2002. http://evolution.genetics.washington.edu/phylip.html), and we analyzed the ribosomal RNA alignments using both distance and maximum likelihood methods.
The empirical substitution models performed well on simulated data, and produced reasonable evolutionary trees for 16S ribosomal RNA sequences from sequenced Bacterial genomes. Empirical models have the advantage of being easily implemented, and the fact that the code consists of 20 symbols makes the models easily incorporated into existing programs for protein sequence analysis. In addition, the models are useful for simulating the evolution of RNA sequence and structure simultaneously.
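The 20-symbol recoding described in this abstract (four symbols for unpaired bases plus sixteen for ordered base pairs) could look roughly like the sketch below. Several details are assumptions, not taken from the abstract: the secondary structure is supplied in dot-bracket notation, every paired position is coded by the ordered pair (its own base, its partner's base), and the integer symbol values are arbitrary.

```python
BASES = "ACGU"

# Symbols 0-3: unpaired A, C, G, U. Symbols 4-19: the 16 ordered base pairs.
UNPAIRED = {b: i for i, b in enumerate(BASES)}
PAIRED = {(a, b): 4 + 4 * i + j
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)}

def pair_table(structure):
    """Map each paired position to its partner, from a dot-bracket string."""
    stack, partner = [], {}
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            left = stack.pop()
            partner[left] = pos
            partner[pos] = left
    return partner

def encode(sequence, structure):
    """Recode an RNA sequence over the 20-symbol alphabet."""
    partner = pair_table(structure)
    codes = []
    for pos, base in enumerate(sequence):
        if pos in partner:
            codes.append(PAIRED[(base, sequence[partner[pos]])])
        else:
            codes.append(UNPAIRED[base])
    return codes

coded = encode("GAAAC", "(...)")  # a G-C pair closing a three-base loop
```

Because the alphabet has exactly 20 symbols, sequences recoded this way can in principle be passed to software written for amino acid data, which is the practical advantage the abstract points out.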

14.
Amino acid substitution matrices play an essential role in protein sequence alignment, a fundamental task in bioinformatics. Most widely used matrices, such as PAM matrices derived from homologous sequences and BLOSUM matrices derived from aligned segments of PROSITE, did not integrate conformation information in their construction. There are a few structure-based matrices, which are derived from limited structure-alignment data. Using the databases PDB_SELECT and DSSP, we create a database of sequence-conformation blocks which explicitly represent the sequence-structure relationship. Members in a block are identical in conformation and are highly similar in sequence. From this block database, we derive a conformation-specific amino acid substitution matrix, CBSM60. The matrix shows an improved performance in conformational segment search and homolog detection.

15.
Karin Meyer  Mark Kirkpatrick 《Genetics》2010,185(3):1097-1110
Obtaining accurate estimates of the genetic covariance matrix, $\Sigma_G$, for multivariate data is a fundamental task in quantitative genetics and important for both evolutionary biologists and plant or animal breeders. Classical methods for estimating $\Sigma_G$ are well known to suffer from substantial sampling errors; importantly, its leading eigenvalues are systematically overestimated. This article proposes a framework that exploits information in the phenotypic covariance matrix, $\Sigma_P$, in a new way to obtain more accurate estimates of $\Sigma_G$. The approach focuses on the “canonical heritabilities” (the eigenvalues of $\Sigma_P^{-1}\Sigma_G$), which may be estimated with more precision than those of $\Sigma_G$ because $\Sigma_P$ is estimated more accurately. Our method uses penalized maximum likelihood and shrinkage to reduce bias in estimates of the canonical heritabilities. This in turn can be exploited to get substantial reductions in bias for estimates of the eigenvalues of $\Sigma_G$ and a reduction in sampling errors for estimates of $\Sigma_G$. Simulations show that improvements are greatest when sample sizes are small and the canonical heritabilities are closely spaced. An application to data from beef cattle demonstrates the efficacy of this approach and the effect on estimates of heritabilities and correlations. Penalized estimation is recommended for multivariate analyses involving more than a few traits or problems with limited data.

Quantitative geneticists, including evolutionary biologists and plant and animal breeders, are increasingly dependent on multivariate analyses of genetic variation, for example, to understand evolutionary constraints and design efficient selection programs. New challenges arise when one moves from estimating the genetic variance of a single phenotype to the multivariate setting. An important but unresolved issue is how best to deal with sampling variation and the corresponding bias in the eigenvalues of estimates for the genetic covariance matrix, $\Sigma_G$.

It is well known that estimates for the largest eigenvalues of a covariance matrix are biased upward and those for the smallest eigenvalues are biased downward (Lawley 1956; Hayes and Hill 1981). For genetic problems, where we need to estimate at least two covariance matrices simultaneously, this tends to be exacerbated, especially for $\Sigma_G$. In turn, this can result in invalid estimates of $\Sigma_G$, i.e., estimates with negative eigenvalues, and can produce systematic errors in predictions for the response to selection.

There has been longstanding interest in “regularization” of covariance matrices, in particular for cases where the ratio between the number of observations and the number of variables is small. Various studies recently employed such techniques for the analysis of high-dimensional, genomic data. In general, this involves a compromise between additional bias and reduced sampling variation of “improved” estimators that have less statistical risk than standard methods (Bickel and Li 2006). For instance, various types of shrinkage estimators of covariance matrices have been suggested that counteract bias in estimates of eigenvalues by shrinking all sample eigenvalues toward their mean. Often this is equivalent to a weighted combination of the sample covariance matrix and a target matrix, assumed to have a simple structure. A common choice for the latter is an identity matrix. This yields a ridge regression type formulation (Hoerl and Kennard 1970). Numerous simulation studies in a variety of settings are available, which demonstrate that regularization can yield closer agreement between estimated and population covariance matrices, less variable estimates of model terms, or improved performance of statistical tests.

In quantitative genetic analyses, we attempt to partition observed, overall (phenotypic) covariances into their genetic and environmental components. Typically, this results in strong sampling correlations between them. Hence, while the partitioning into sources of variation and estimates of individual covariance matrices may be subject to substantial sampling variances, their sum, i.e., the phenotypic covariance matrix, can generally be estimated much more accurately. This has led to suggestions to “borrow strength” from estimates of phenotypic components to estimate the genetic covariances. In particular, Hayes and Hill (1981) proposed a method termed “bending” that involved regressing the eigenvalues of the product of the genetic and the inverse of the phenotypic covariance matrix toward their mean. One objective of this procedure was to ensure that estimates of the genetic covariance matrix from an analysis of variance were positive definite. In addition, the authors showed by simulation that shrinking eigenvalues even further than needed to make all values nonnegative could improve the achieved response to selection when using the resulting estimates to derive weights for a selection index, especially for estimation based on small samples. Subsequent work demonstrated that bending could also be advantageous in more general scenarios such as indexes that included information from relatives (Meyer and Hill 1983).

Modern, mixed model (“animal model”)-based analyses to estimate genetic parameters using maximum likelihood or Bayesian methods generally constrain estimates to the parameter space, so that, at the expense of introducing some bias, estimates of covariance matrices are positive semidefinite. However, the problems arising from substantial sampling variation in multivariate analyses remain. In spite of increasing applications of such analyses in scenarios where data sets are invariably small, e.g., the analysis of data from natural populations (e.g., Kruuk et al. 2008), there has been little interest in regularization and shrinkage techniques in genetic parameter estimation, other than through the use of informative priors in a Bayesian context. Instead, suggestions for improved estimation have focused on parsimonious modeling of covariance matrices, e.g., through reduced rank estimation or by imposing a known structure, such as a factor-analytic structure (Kirkpatrick and Meyer 2004; Meyer 2009), or by fitting covariance functions for longitudinal data (Kirkpatrick et al. 1990). While such methods can be highly advantageous when the underlying assumptions are at least approximately correct, data-driven methods of regularization may be preferable in other scenarios.

This article explores the scope for improved estimation of genetic covariance matrices by implementing the equivalent to bending within animal model-type analyses. We begin with a review of the underlying statistical principles (which the impatient reader might skip), examining the concept of improved estimation, its implementation via shrinkage estimators or penalized estimation, and selected applications. We then describe a penalized restricted maximum-likelihood (REML) procedure for the estimation of genetic covariance matrices that utilizes information from its phenotypic counterparts and present a simulation study demonstrating the effect of penalties on parameter estimates and their sampling properties. The article concludes with an application to a problem relevant in genetic improvement of beef cattle and a discussion.
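The generic shrinkage idea reviewed in this text, a weighted combination of the sample covariance matrix and a scaled identity target, can be illustrated with a short sketch. It shows only that simple estimator, not the penalized REML procedure the article develops. The function names, the weight `lam`, and the example matrix are illustrative assumptions.

```python
import math

def shrink_toward_identity(s, lam):
    """Return (1 - lam) * S + lam * m * I, where m is the mean eigenvalue of S."""
    p = len(s)
    mean_eig = sum(s[i][i] for i in range(p)) / p  # trace / p
    return [[(1.0 - lam) * s[i][j] + (lam * mean_eig if i == j else 0.0)
             for j in range(p)] for i in range(p)]

def eigenvalues_2x2(s):
    """Eigenvalues (largest first) of a symmetric 2x2 matrix."""
    half_trace = (s[0][0] + s[1][1]) / 2.0
    root = math.sqrt(((s[0][0] - s[1][1]) / 2.0) ** 2 + s[0][1] * s[1][0])
    return half_trace + root, half_trace - root

# Made-up sample covariance matrix for illustration.
S = [[4.0, 2.0], [2.0, 3.0]]
S_shrunk = shrink_toward_identity(S, 0.5)
hi, lo = eigenvalues_2x2(S)
hi_s, lo_s = eigenvalues_2x2(S_shrunk)
```

Each eigenvalue of the shrunken matrix equals (1 - lam) times the original eigenvalue plus lam times the mean eigenvalue, so the trace is unchanged while the spread between the largest and smallest eigenvalue contracts by the factor (1 - lam). This is the sense in which shrinkage counteracts the upward bias of large and downward bias of small sample eigenvalues.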

16.
The problem of testing the separability of a covariance matrix against an unstructured variance‐covariance matrix is studied in the context of multivariate repeated measures data using Rao's score test (RST). The RST statistic is developed with the first component of the separable structure as a first‐order autoregressive (AR(1)) correlation matrix or an unstructured (UN) covariance matrix under the assumption of multivariate normality. It is shown that the distribution of the RST statistic under the null hypothesis of any separability does not depend on the true values of the mean or the unstructured components of the separable structure. A significant advantage of the RST is that it can be performed for small samples, even smaller than the dimension of the data, where the likelihood ratio test (LRT) cannot be used, and it outperforms the standard LRT in a number of contexts. Monte Carlo simulations are then used to study the comparative behavior of the null distribution of the RST statistic, as well as that of the LRT statistic, in terms of sample size considerations, and for the estimation of the empirical percentiles. Our findings are compared with existing results where the first component of the separable structure is a compound symmetry (CS) correlation matrix. It is also shown by simulations that the empirical null distribution of the RST statistic converges faster than the empirical null distribution of the LRT statistic to the limiting χ² distribution. The tests are implemented on a real dataset from medical studies.

17.
Tests and model selection for the general growth curve model   Cited by: 1 (self-citations: 0, citations by others: 1)
J C Lee 《Biometrics》1991,47(1):147-159
The model considered here is a generalized multivariate analysis of variance model useful especially for many types of growth curve problems including biological growth and technology substitution. It is defined as $Y_{p \times N} = X_{p \times m}\,\tau_{m \times r}\,A_{r \times N} + \varepsilon_{p \times N}$, where $\tau$ is unknown, and $X$ and $A$ are known design matrices of ranks $m < p$ and $r < N$, respectively. Furthermore, the columns of $\varepsilon$ are independent p-variate normal with mean vector 0 and common covariance matrix $\Sigma$. In general, p is the number of time (or spatial) points observed on each of the N cases, (m - 1) is the degree of polynomial in time, and r is the number of groups. The main focus of this paper is the selection of models for the general growth curve model with regard to the covariance matrix $\Sigma$. Likelihood ratio tests and selection procedures based on sample reuse and predictions are proposed. Special emphasis is on the serial covariance structure for $\Sigma$, which has been shown to be quite important in the prediction of biological data and technology substitution data. One-population and K-population problems are considered. Some of the results are illustrated with two sets of biological data.
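The roles of the two design matrices in the growth curve model can be made concrete with a small sketch: X holds powers of the observation times (p rows, m columns) and A holds group indicators (r rows, N columns), so the mean of Y is the product X tau A. The function names, the observation times, the group labels, and the coefficient values in tau are all made up for illustration.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def polynomial_design(times, degree):
    """X (p x m): row for each time point, columns t^0 .. t^degree, m = degree + 1."""
    return [[t ** d for d in range(degree + 1)] for t in times]

def group_design(labels):
    """A (r x N): one indicator row per group, one column per individual."""
    groups = sorted(set(labels))
    return [[1 if lab == g else 0 for lab in labels] for g in groups]

times = [0, 1, 2, 3]                 # p = 4 time points
labels = ["a", "a", "b", "b", "b"]   # N = 5 individuals in r = 2 groups
X = polynomial_design(times, degree=1)
A = group_design(labels)
tau = [[1.0, 2.0],                   # intercepts for groups a and b (made up)
       [0.5, 1.0]]                   # slopes for groups a and b (made up)
mean_Y = mat_mul(mat_mul(X, tau), A)  # p x N matrix of expected values
```

Each column of the resulting mean matrix is the growth curve of the group to which that individual belongs, evaluated at the p time points. The covariance structure of the error term, which is the main subject of the paper, is not touched by this sketch.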

18.
An improved general amino acid replacement matrix   Cited by: 2 (self-citations: 0, citations by others: 2)
Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches and thus the likelihood of the data. They are also essential in protein alignment. A number of replacement matrices and methods to estimate these matrices from protein alignments have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001) and their WAG matrix, thanks to an efficient maximum likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation and using a much larger and more diverse database than BRKALN, which was used to estimate WAG. To estimate our new matrix (called LG after the authors), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall. To evaluate the LG performance, we use an independent sample consisting of 59 alignments from TreeBase and randomly divide Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that 1) the average Akaike information criterion gain per site is 0.25 and 0.42, when compared with WAG and JTT, respectively; 2) LG is significantly better than WAG for 38 alignments (among 59), and significantly worse for 2 alignments only; and 3) tree topologies inferred with LG, WAG, and JTT frequently differ, indicating that using LG impacts not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous. LG and a PHYML implementation can be downloaded from http://atgc.lirmm.fr/LG.
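One plausible reading of the "Akaike information criterion gain per site" reported here is the difference in AIC between the reference model and the new model, divided by the alignment length. The abstract does not spell out the calculation, so the sketch below is an assumption, and the log-likelihoods, parameter counts, and site count in the example are invented.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def aic_gain_per_site(loglik_new, loglik_ref, n_params_new, n_params_ref, n_sites):
    """Positive values mean the new model fits better, per alignment site."""
    return (aic(loglik_ref, n_params_ref) - aic(loglik_new, n_params_new)) / n_sites

# Hypothetical alignment of 400 sites with equal numbers of free parameters.
gain = aic_gain_per_site(loglik_new=-10000.0, loglik_ref=-10050.0,
                         n_params_new=10, n_params_ref=10, n_sites=400)
```

When both models have the same number of free parameters, as with two fixed replacement matrices applied to the same tree search, the gain reduces to twice the log-likelihood difference per site.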

19.
Using real sequence data, we evaluate the adequacy of assumptions made in evolutionary models of nucleotide substitution and the effects that these assumptions have on estimation of evolutionary trees. Two aspects of the assumptions are evaluated. The first concerns the pattern of nucleotide substitution, including equilibrium base frequencies and the transition/transversion-rate ratio. The second concerns the variation of substitution rates over sites. The maximum-likelihood estimate of tree topology appears quite robust to both these aspects of the assumptions of the models, but evaluation of the reliability of the estimated tree by using simpler, less realistic models can be misleading. Branch lengths are underestimated when simpler models of substitution are used, but the underestimation caused by ignoring rate variation over nucleotide sites is much more serious. The goodness of fit of a model is reduced by ignoring spatial rate variation, but unrealistic assumptions about the pattern of nucleotide substitution can lead to an extraordinary reduction in the likelihood. It seems that evolutionary biologists can obtain accurate estimates of certain evolutionary parameters even with an incorrect phylogeny, while systematists cannot get the right tree with confidence even when a realistic, and more complex, model of evolution is assumed.

20.
A phylogenetic comparative method is proposed for estimating historical effects on comparative data using the partitions that compose a cladogram, i.e., its monophyletic groups. Two basic matrices, Y and X, are defined in the context of an ordinary linear model. Y contains the comparative data measured over t taxa. X consists of an initial tree matrix that contains all the xj monophyletic groups (each coded separately as a binary indicator variable) of the phylogenetic tree available for those taxa. The method seeks to define the subset of groups, i.e., a reduced tree matrix, that best explains the patterns in Y. This definition is accomplished via regression or canonical ordination (depending on the dimensionality of Y) coupled with Monte Carlo permutations. It is argued here that unrestricted permutations (i.e., under an equiprobable model) are valid for testing this specific kind of groupwise hypothesis. Phylogeny is either partialled out or, more properly, incorporated into the analysis in the form of component variation. Direct extensions allow for testing ecomorphological data controlled by phylogeny in a variation partitioning approach. Currently available statistical techniques make this method applicable under most univariate/multivariate models and metrics; two-way phylogenetic effects can be estimated as well. The simplest case (univariate Y), tested with simulations, yielded acceptable type I error rates. Applications presented include examples from evolutionary ethology, ecology, and ecomorphology. Results showed that the new technique detected previously overlooked variation clearly associated with phylogeny and that many phylogenetic effects on comparative data may occur at particular groups rather than across the entire tree.
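The initial tree matrix X described in this abstract, with one binary indicator column per monophyletic group, can be built in a few lines. The sketch assumes the clades are already available as sets of taxon names. The function name, the taxa, and the example tree ((A,B),C),D are hypothetical. The regression and permutation steps of the method are not shown.

```python
def tree_matrix(taxa, clades):
    """Rows are taxa, columns are clades; entry is 1 if the taxon belongs to the clade."""
    return [[1 if taxon in clade else 0 for clade in clades] for taxon in taxa]

# Hypothetical tree ((A,B),C),D with its two non-trivial monophyletic groups.
taxa = ["A", "B", "C", "D"]
clades = [{"A", "B"}, {"A", "B", "C"}]
X = tree_matrix(taxa, clades)
```

Each column can then be treated as an ordinary explanatory variable, and selecting a subset of columns corresponds to the "reduced tree matrix" mentioned in the abstract.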
