1.
Background
Thanks to the large amount of signal contained in genome-wide sequence alignments, phylogenomic analyses are converging towards highly supported trees. However, high statistical support does not imply that the tree is accurate. Systematic errors, such as the Long Branch Attraction (LBA) artefact, can be misleading, in particular when the taxon sampling is poor, or the outgroup is distant. In an otherwise consistent probabilistic framework, systematic errors in genome-wide analyses can be traced back to model mis-specification problems, which suggests that better models of sequence evolution should be devised, that would be more robust to tree reconstruction artefacts, even under the most challenging conditions.
Methods
We focus on a well characterized LBA artefact analyzed in a previous phylogenomic study of the metazoan tree, in which two fast-evolving animal phyla, nematodes and platyhelminths, emerge either at the base of all other Bilateria, or within protostomes, depending on the outgroup. We use this artefactual result as a case study for comparing the robustness of two alternative models: a standard, site-homogeneous model, based on an empirical matrix of amino-acid replacement (WAG), and a site-heterogeneous mixture model (CAT). In parallel, we propose a posterior predictive test, allowing one to measure how well a model acknowledges sequence saturation.
Results
Adopting a Bayesian framework, we show that the LBA artefact observed under WAG disappears when the site-heterogeneous model CAT is used. Using cross-validation, we further demonstrate that CAT has a better statistical fit than WAG on this data set. Finally, using our statistical goodness-of-fit test, we show that CAT, but not WAG, correctly accounts for the overall level of saturation, and that this is due to a better estimation of site-specific amino-acid preferences.
Conclusion
The CAT model appears to be more robust than WAG against LBA artefacts, essentially because it correctly anticipates the high probability of convergences and reversions implied by the small effective size of the amino-acid alphabet at each site of the alignment. More generally, our results provide strong evidence that site-specificities in the substitution process need be accounted for in order to obtain more reliable phylogenetic trees.
2.
Background
Probabilistic methods have progressively supplanted the Maximum Parsimony (MP) method for inferring phylogenetic trees. One of the major reasons for this shift was that MP is much more sensitive to the Long Branch Attraction (LBA) artefact than is Maximum Likelihood (ML). However, recent work by Kolaczkowski and Thornton suggested, on the basis of simulations, that MP is less sensitive than ML to tree reconstruction artefacts generated by heterotachy, a phenomenon that corresponds to shifts in site-specific evolutionary rates over time. These results led these authors to recommend that the results of ML and MP analyses should be both reported and interpreted with the same caution. This specific conclusion revived the debate on the choice of the most accurate phylogenetic method for analysing real data in which various types of heterogeneities occur. However, variation of evolutionary rates across species was not explicitly incorporated in the original study of Kolaczkowski and Thornton, and in most of the subsequent heterotachous simulations published to date, where all terminal branch lengths were kept equal, an assumption that is biologically unrealistic.
3.
A review of long-branch attraction (Total citations: 24; self-citations: 1; citations by others: 24)
Johannes Bergsten. Cladistics: the international journal of the Willi Hennig Society, 2005, 21(2): 163-193
The history of long‐branch attraction, and in particular methods suggested to detect and avoid the artifact to date, is reviewed. Methods suggested to avoid LBA‐artifacts include excluding long‐branch taxa, excluding faster evolving third codon positions, using inference methods less sensitive to LBA such as likelihood, the Aguinaldo et al. approach, sampling more taxa to break up long branches and sampling more characters especially of another kind, and the pros and cons of these are discussed. Methods suggested to detect LBA are numerous and include methodological disconcordance, RASA, separate partition analyses, parametric simulation, random outgroup sequences, long‐branch extraction, split decomposition and spectral analysis. Less than 10 years ago it was doubted if LBA occurred in real datasets. Today, examples are numerous in the literature and it is argued that the development of methods to deal with the problem is warranted. A 16 kbp dataset of placental mammals and a morphological and molecular combined dataset of gall wasps are used to illustrate the particularly common problem of LBA of problematic ingroup taxa to outgroups. The preferred methods of separate partition analysis, methodological disconcordance, and long branch extraction are used to demonstrate detection methods. It is argued that since outgroup taxa almost always represent long branches and are as such a hazard towards misplacing long branched ingroup taxa, phylogenetic analyses should always be run with and without the outgroups included. This will detect whether only the outgroup roots the ingroup or if it simultaneously alters the ingroup topology, in which case previous studies have shown that the latter is most often the worse. Apart from that LBA to outgroups is the major and most common problem; scanning the literature also detected the ill advised comfort of high support values from thousands of characters, but very few taxa, in the age of genomics.
Taxon sampling is crucial for an accurate phylogenetic estimate and trust cannot be put on whole mitochondrial or chloroplast genome studies with only a few taxa, despite their high support values. The placental mammal example demonstrates that parsimony analysis will be prone to LBA by the attraction of the tenrec to the distant marsupial outgroups. In addition, the murid rodents, creating the classic “the guinea‐pig is not a rodent” hypothesis in 1996, are also shown to be attracted to the outgroup by nuclear genes, although including the morphological evidence for rodents and Glires overcomes the artifact. The gall wasp example illustrates that Bayesian analyses with a partition‐specific GTR + Γ + I model give a conflicting resolution of clades, with a posterior probability of 1.0 when comparing ingroup alone versus outgroup rooted topologies, and this is due to long‐branch attraction to the outgroup. © The Willi Hennig Society 2005.
4.
Bodilis J, Nsigue Meilo S, Cornelis P, De Vos P, Barray S. Molecular biology and evolution, 2011, 28(10): 2723-2726
A significant proportion of protein-encoding gene phylogenies in bacteria is inconsistent with the species phylogeny. It was usually argued that such inconsistencies resulted from lateral transfers. Here, by further studying the phylogeny of the oprF gene encoding the major surface protein in the bacterial Pseudomonas genus, we found that the incongruent tree topology observed results from a long-branch attraction (LBA) artifact and not from lateral transfers. LBA in the oprF phylogeny could be explained by the faster evolution in a lineage adapted to the rhizosphere, highlighting an unexpected adaptive radiation. We argue that analysis of such artifacts in other inconsistent bacterial phylogenies could be a valuable tool in molecular ecology to highlight cryptic adaptive radiations in microorganisms.
5.
6.
Recent studies based on different types of data (i.e., morphology, molecules) have found strongly conflicting phylogenies for the genera of iguanid lizards but have been unable to explain the basis for this incongruence. We reanalyze published data from morphology and from the mitochondrial ND4, cytochrome b, 12S, and 16S genes to explore the sources of incongruence and resolve these conflicts. Much of the incongruence centers on the genus Cyclura, which is the sister taxon of Iguana, according to parsimony analyses of the morphology and the ribosomal genes, but is the sister taxon of all other Iguanini, according to the protein-coding genes. Maximum likelihood analyses show that there has been an increase in the rate of nucleotide substitution in Cyclura in the two protein-coding genes (ND4 and cytochrome b), although this increase is not as clear when parsimony is used to estimate branch lengths. Parametric simulations suggest that Cyclura may be misplaced by the protein-coding genes as a result of long-branch attraction; even when Cyclura and Iguana are sister taxa in a simulated phylogeny, Cyclura is still placed as the basal member of the Iguanini by parsimony analysis in 55% of the replicates. A similar long-branch attraction problem may also exist in the morphological data with regard to the placement of Sauromalus with the Galápagos iguanas (Amblyrhynchus and Conolophus). The results have many implications for the analysis of diverse data sets, the impact of long branches on parsimony and likelihood methods, and the use of certain protein-coding genes in phylogeny reconstruction.
7.
Whole-genome duplication (WGD) produces sets of gene pairs that are all of the same age. We therefore expect that phylogenetic trees that relate these pairs to their orthologs in other species should show a single consistent topology. However, a previous study of gene pairs formed by WGD in the yeast Saccharomyces cerevisiae found conflicting topologies among neighbor-joining (NJ) trees drawn from different loci and suggested that this conflict was the result of "asynchronous functional divergence" of duplicated genes (Langkjaer, R. B., P. F. Cliften, M. Johnston, and J. Piskur. 2003. Yeast genome duplication was followed by asynchronous differentiation of duplicated genes. Nature 421:848-852). Here, we test whether the conflicting topologies might instead be due to asymmetrical rates of evolution leading to long-branch attraction (LBA) artifacts in phylogenetic trees. We constructed trees for 433 pairs of WGD paralogs in S. cerevisiae with their single orthologs in Saccharomyces kluyveri and Candida albicans. We find a strong correlation between the asymmetry of evolutionary rates of a pair of S. cerevisiae paralogs and the topology of the tree inferred for that pair. Saccharomyces cerevisiae gene pairs with approximately equal rates of evolution tend to give phylogenies in which the WGD postdates the speciation between S. cerevisiae and S. kluyveri (B-trees), whereas trees drawn from gene pairs with asymmetrical rates tend to show WGD pre-dating this speciation (A-trees). Gene order data from throughout the genome indicate that the "A-trees" are artifacts, even though more than 50% of gene pairs are inferred to have this topology when the NJ method as implemented in ClustalW (i.e., with Poisson correction of distances) is used to construct the trees. This LBA artifact can be ameliorated, but not eliminated, by using gamma-corrected distances or by using maximum likelihood trees with robustness estimated by the Shimodaira-Hasegawa test. 
Tests for adaptive evolution indicated that positive selection might be the cause of rate asymmetry in a substantial fraction (19%) of the paralog pairs.
8.
Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1alpha phylogenies (Total citations: 1; self-citations: 0; citations by others: 1)
Microsporidia branch at the base of eukaryotic phylogenies inferred from translation elongation factor 1alpha (EF-1alpha) sequences. Because these parasitic eukaryotes are fungi (or close relatives of fungi), it is widely accepted that fast-evolving microsporidian sequences are artifactually "attracted" to the long branch leading to the archaebacterial (outgroup) sequences ("long-branch attraction," or "LBA"). However, no previous studies have explicitly determined the reason(s) why the artifactual allegiance of microsporidia and archaebacteria ("M + A") is recovered by all phylogenetic methods, including maximum likelihood, a method that is supposed to be resistant to classical LBA. Here we show that the M + A affinity can be attributed to those alignment sites associated with large differences in evolutionary site rates between the eukaryotic and archaebacterial subtrees. Therefore, failure to model the significant evolutionary rate distribution differences (covarion shifts) between the ingroup and outgroup sequences is apparently responsible for the artifactual basal position of microsporidia in phylogenetic analyses of EF-1alpha sequences. Currently, no evolutionary model that accounts for discrete changes in the site rate distribution on particular branches is available for either protein or nucleotide level phylogenetic analysis, so the same artifacts may affect many other "deep" phylogenies. Furthermore, given the relative similarity of the site rate patterns of microsporidian and archaebacterial EF-1alpha proteins ("parallel site rate variation"), we suggest that the microsporidian orthologs may have lost some eukaryotic EF-1alpha-specific nontranslational functions, exemplifying the extreme degree of reduction in this parasitic lineage.
9.
Rate acceleration and long-branch attraction in a conserved gene of cryptic daphniid (Crustacea) species. (Total citations: 6; self-citations: 0; citations by others: 6)
The nuclear large subunit (LSU) rRNA gene is a rich source of phylogenetic characters because of its large size, mosaic of slowly and rapidly evolving regions, and complex secondary structure variation. Nevertheless, many studies have indicated that inconsistency, bias, and gene-specific error (e.g., within-individual gene family variation, cryptic sequence simplicity, and sequence coevolution) can complicate animal phylogenies based on LSU rDNA sequences. However, most of these studies sampled small gene fragments from expansion segments--among animals only five nonchordate complete LSU sequences are published. In this study, we sequenced near-complete nuclear LSU genes from 11 representative daphniids (Crustacea). The daphniid expansion segment V6 was larger and showed more length variation (90-351 bp) than is found in all other reported LSU V6 sequences. Daphniid LSU (without the V6 region) phylogenies generally agreed with the existing phylogenies based on morphology and mtDNA sequences. Nevertheless, a major disagreement between the LSU and the expected trees involved a positively misleading association between the two taxa with the longest branches, Daphnia laevis and D. occidentalis. Both maximum parsimony (MP) and maximum likelihood (ML) optimality criteria recovered this association, but parametric simulations indicated that MP was markedly more sensitive to this bias than ML. Examination of data partitions indicated that the inconsistency was caused by increased nucleotide substitution rates in the branches leading to D. laevis and D. occidentalis rather than among-taxon differences in base composition or distribution of sites that are free to vary. These results suggest that lineage-specific rate acceleration can lead to long-branch attraction even in the conserved genes of animal species that are almost morphologically indistinguishable.
10.
Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants (Total citations: 8; self-citations: 0; citations by others: 8)
Sanderson MJ, Wojciechowski MF, Hu JM, Khan TS, Brady SG. Molecular biology and evolution, 2000, 17(5): 782-797
Sequences of two chloroplast photosystem genes, psaA and psbB, together comprising about 3,500 bp, were obtained for all five major groups of extant seed plants and several outgroups among other vascular plants. Strongly supported, but significantly conflicting, phylogenetic signals were obtained in parsimony analyses from partitions of the data into first and second codon positions versus third positions. In the former, both genes agreed on a monophyletic gymnosperms, with Gnetales closely related to certain conifers. In the latter, Gnetales are inferred to be the sister group of all other seed plants, with gymnosperms paraphyletic. None of the data supported the modern "anthophyte hypothesis," which places Gnetales as the sister group of flowering plants. A series of simulation studies were undertaken to examine the error rate for parsimony inference. Three kinds of errors were examined: random error, systematic bias (both properties of finite data sets), and statistical inconsistency owing to long-branch attraction (an asymptotic property). Parsimony reconstructions were extremely biased for third-position data for psbB. Regardless of the true underlying tree, a tree in which Gnetales are sister to all other seed plants was likely to be reconstructed for these data. None of the combinations of genes or partitions permits the anthophyte tree to be reconstructed with high probability. Simulations of progressively larger data sets indicate the existence of long-branch attraction (statistical inconsistency) for third-position psbB data if either the anthophyte tree or the gymnosperm tree is correct. This is also true for the anthophyte tree using either psaA third positions or psbB first and second positions. A factor contributing to bias and inconsistency is extremely short branches at the base of the seed plant radiation, coupled with extremely high rates in Gnetales and nonseed plant outgroups.
11.
12.
Phylogenetic analyses of ribosomal RNA genes have become widely accepted as a framework for understanding broad-scale eukaryotic evolution. Nevertheless, conflicts exist between the phylogenetic placement of certain taxa in rDNA trees and their expected position based on fossils, cytology, or protein-encoding gene sequences. For example, pelobiont amoebae appear to be an ancient group based on cytologic features, but they are not among the early eukaryotic branches in rDNA analyses. In this report, the derived position of pelobionts in rDNA trees is shown to be unreliable and likely due to long-branch attraction among more deeply branching sequences. All sequences that branch near the base of the tree suffer from relatively high apparent substitution rates and exhibit greater variation in ssu rDNA sequence length. Moreover, the order of the branches leading from the root of the eukaryotic tree to the base of the so-called "crown taxa" is consistent with a sequential attachment, due to "long-branch" effects, of sequences with increasing rates of evolution. These results suggest that the basal eukaryotic topology drawn from rDNA analyses may be, in reality, an artifact of variation in the rate of molecular evolution among eukaryotic taxa.
13.
Proposed molecular classifiers may be overfit to idiosyncrasies of noisy genomic and proteomic data. Cross-validation methods are often used to obtain estimates of classification accuracy, but both simulations and case studies suggest that, when inappropriate methods are used, bias may ensue. Bias can be bypassed and generalizability can be tested by external (independent) validation. We evaluated 35 studies that have reported on external validation of a molecular classifier. We extracted information on study design and methodological features, and compared the performance of molecular classifiers in internal cross-validation versus external validation for 28 studies where both had been performed. We demonstrate that the majority of studies pursued cross-validation practices that are likely to overestimate classifier performance. Most studies were markedly underpowered to detect a 20% decrease in sensitivity or specificity between internal cross-validation and external validation [median power was 36% (IQR, 21-61%) and 29% (IQR, 15-65%), respectively]. The median reported classification performance for sensitivity and specificity was 94% and 98%, respectively, in cross-validation and 88% and 81% for independent validation. The relative diagnostic odds ratio was 3.26 (95% CI 2.04-5.21) for cross-validation versus independent validation. Finally, we reviewed all studies (n = 758) which cited those in our study sample, and identified only one instance of additional subsequent independent validation of these classifiers. In conclusion, these results document that many cross-validation practices employed in the literature are potentially biased and genuine progress in this field will require adoption of routine external validation of molecular classifiers, preferably in much larger studies than in current practice.
14.
Saddlepoint methods provide quick and easy approximations to significance levels for conditional tests of logistic regression parameters. We evaluate the accuracies of saddlepoint approximations for three well-known conditional tests: Bartlett's test for no three-factor interaction in a 2 x 2 x 2 table, the test for trend in a series of probabilities, and the exact test of no association in stratified 2 x 2 tables with a common odds ratio. General recommendations are suggested regarding the use of saddlepoint approximations for exact conditional significance levels.
15.
The Human Development Index (HDI), based on life expectancy, education and per-capita income, is one of the most used indicators of human development. However, undeniable problems in data collection limit between-countries comparisons, reducing the practical applicability of the HDI in official statistics. Elvidge et al. (2012) proposed an alternative index of human development (the so-called Night Light Development Index, NLDI) derived from nighttime satellite imagery and population density, with improved comparability over time and space. The NLDI assesses inequality in the spatial distribution of night light among resident inhabitants and has proven to correlate with the HDI at the country scale. However, the NLDI presents some drawbacks, since similar NLDI values may indicate very different levels of human development. A modified NLDI overcoming such a drawback is proposed and applied to assessment of human development at 3 spatial scales (the entire country, 5 geographical divisions and 20 administrative regions) in Italy, a country with relevant territorial disparities in various socioeconomic dimensions. The original and modified NLDI were correlated with 5 independent indicators of economic growth, sustainable development and environmental quality. The spatial distribution of the original and modified NLDI is not coherent with the level of human development in Italy being indeed associated with various indexes of environmental quality. Further investigation is required to identify in which socioeconomic context (and at which spatial scale) the NLDI approach correctly estimates the level of human development in affluent countries.
16.
A bottleneck in population size of a species is often correlated with a sharp reduction in genetic variation. The northern elephant seal (Mirounga angustirostris) has undergone at least one extreme bottleneck, having rebounded from 20-100 individuals a century ago to over 175,000 individuals today. The relative lack of molecular-genetic variation in contemporary populations has been documented, but the extent of variation before the late 19th century remains unknown. We have determined the nucleotide sequence of a 179 base-pair segment of the mitochondrial DNA (mtDNA) control region from seals that lived before, during and after a bottleneck low in 1892. A 'primerless' PCR was used to improve the recovery of information from older samples. Only two mtDNA genotypes were present in all 150+ seals from the 1892 bottleneck on, but we discovered four genotypes in five pre-bottleneck seals. This suggests a much greater amount of mtDNA genotypic variation before this bottleneck, and that the persistence of two genotypes today is a consequence of random lineage sampling. We cannot correlate the loss of mtDNA genotypes with a lowered mean fitness of individuals in the species today. However, we show that the species historically possessed additional genotypes to those present now, and that sampling of ancient DNA could elucidate the genetic consequences of severe reductions in population size.
17.
This study describes natal attraction and infant handling in wild ursine colobus (Colobus vellerosus). Focal animal samples were collected from five infants of 1-16 weeks of age (mean: 14.5 focal hours per infant). Group members may be attracted to an infant, but unable to handle it because of resistance from the mother. We thus measured natal attraction independently from infant handling by the number of interactive approaches received. The youngest infants were most attractive. Immature females were attracted to and handled infants more than other group members. Mothers were tolerant of most handling attempts and infant-directed aggression was rare. A sixth infant was attacked by members of an all-male band, which allowed us to record the expression of natal attraction and infant handling in the context of an acute threat of infanticide. This infant was carried by non-mothers less frequently than the other infants, and its mother resisted handling attempts more often.
18.
Lisa Patrick Bentley, James C. Stegen, Van M. Savage, Duncan D. Smith, Erica I. von Allmen, John S. Sperry, Peter B. Reich, Brian J. Enquist. Ecology letters, 2013, 16(8): 1069-1078
Several theories predict whole‐tree function on the basis of allometric scaling relationships assumed to emerge from traits of branching networks. To test this key assumption, and more generally, to explore patterns of external architecture within and across trees, we measure branch traits (radii/lengths) and calculate scaling exponents from five functionally divergent species. Consistent with leading theories, including metabolic scaling theory, branching is area preserving and statistically self‐similar within trees. However, differences among scaling exponents calculated at node‐ and whole‐tree levels challenge the assumption of an optimised, symmetrically branching tree. Furthermore, scaling exponents estimated for branch length change across branching orders, and exponents for scaling metabolic rate with plant size (or number of terminal tips) significantly differ from theoretical predictions. These findings, along with variability in the scaling of branch radii being less than for branch lengths, suggest extending current scaling theories to include asymmetrical branching and differential selective pressures in plant architectures.
19.
20.