Similar literature
1.
Restriction site-associated DNA sequencing or genotyping-by-sequencing (GBS) approaches allow for rapid and cost-effective discovery and genotyping of thousands of single-nucleotide polymorphisms (SNPs) in multiple individuals. However, rigorous quality control practices are needed to avoid high levels of error and bias with these reduced representation methods. We developed a formal statistical framework for filtering spurious loci, using Mendelian inheritance patterns in nuclear families, that accommodates variable-quality genotype calls and missing data—both rampant issues with GBS data—and for identifying sex-linked SNPs. Simulations predict excellent performance of both the Mendelian filter and the sex-linkage assignment under a variety of conditions. We further evaluate our method by applying it to real GBS data and validating a subset of high-quality SNPs. These results demonstrate that our metric of Mendelian inheritance is a powerful quality filter for GBS loci that is complementary to standard coverage and Hardy–Weinberg filters. The described method, implemented in the software MendelChecker, will improve quality control during SNP discovery in nonmodel as well as model organisms.

2.
The dramatic increase in heterogeneous types of biological data—in particular, the abundance of new protein sequences—requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity—GPCRs and kinases from humans, and the crotonase superfamily of enzymes—we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.

3.
Bayesian inference (BI) of phylogenetic relationships uses the same probabilistic models of evolution as its precursor maximum likelihood (ML), so BI has generally been assumed to share ML's desirable statistical properties, such as largely unbiased inference of topology given an accurate model and increasingly reliable inferences as the amount of data increases. Here we show that BI, unlike ML, is biased in favor of topologies that group long branches together, even when the true model and prior distributions of evolutionary parameters over a group of phylogenies are known. Using experimental simulation studies and numerical and mathematical analyses, we show that this bias becomes more severe as more data are analyzed, causing BI to infer an incorrect tree as the maximum a posteriori phylogeny with asymptotically high support as sequence length approaches infinity. BI's long branch attraction bias is relatively weak when the true model is simple but becomes pronounced when sequence sites evolve heterogeneously, even when this complexity is incorporated in the model. This bias—which is apparent under both controlled simulation conditions and in analyses of empirical sequence data—also makes BI less efficient and less robust to the use of an incorrect evolutionary model than ML. Surprisingly, BI's bias is caused by one of the method's stated advantages—that it incorporates uncertainty about branch lengths by integrating over a distribution of possible values instead of estimating them from the data, as ML does. Our findings suggest that trees inferred using BI should be interpreted with caution and that ML may be a more reliable framework for modern phylogenetic analysis.

4.
Understanding the tradeoffs faced by organisms is a major goal of evolutionary biology. One of the main approaches for identifying these tradeoffs is Pareto task inference (ParTI). Two recent papers claim that results obtained in ParTI studies are spurious due to phylogenetic dependence (Mikami T, Iwasaki W. 2021. The flipping t-ratio test: phylogenetically informed assessment of the Pareto theory for phenotypic evolution. Methods Ecol Evol. 12(4):696–706) or hypothetical p-hacking and population-structure concerns (Sun M, Zhang J. 2021. Rampant false detection of adaptive phenotypic optimization by ParTI-based Pareto front inference. Mol Biol Evol. 38(4):1653–1664). Here, we show that these claims are baseless. We present a new method to control for phylogenetic dependence, called SibSwap, and show that published ParTI inference is robust to phylogenetic dependence. We show how researchers avoided p-hacking by testing for the robustness of preprocessing choices. We also provide new methods to control for population structure and detail the experimental tests of ParTI in systems ranging from ammonites to cancer gene expression. The methods presented here may help to improve future ParTI studies.

5.
Bermingham E, Avise JC. Genetics. 1986, 113(4): 939-965
Restriction fragment length polymorphisms in mitochondrial DNA (mtDNA) were used to reconstruct evolutionary relationships of conspecific populations in four species of freshwater fish—Amia calva, Lepomis punctatus, L. gulosus, and L. microlophus. A suite of 14-17 endonucleases was employed to assay mtDNAs from 305 specimens collected from 14 river drainages extending from South Carolina to Louisiana. Extensive mtDNA polymorphism was observed within each assayed species. In both phenograms and Wagner parsimony networks, mtDNA clones that were closely related genetically were usually geographically contiguous. Within each species, major mtDNA phylogenetic breaks also distinguished populations from separate geographic regions, demonstrating that dispersal and gene flow have not been sufficient to override geographic influences on population subdivision. Importantly, there were strong patterns of congruence across species in the geographic placements of the mtDNA phylogenetic breaks. Three major boundary regions were characterized by concentrations of phylogenetic discontinuities, and these zones agree well with previously described zoogeographic boundaries identified by a different kind of data base—distributional limits of species—suggesting that a common set of historical factors may account for both phenomena. Repeated episodes of eustatic sea level change along a relatively static continental morphology are the likely causes of several patterns of drainage isolation and coalescence, and these are discussed in relation to the genetic data. Overall, results exemplify the positive role that intraspecific genetic analyses may play in historical zoogeographic reconstruction. They also point out the potential inadequacies of any interpretations of population genetic structure that fail to consider the influences of history in shaping that structure.

6.
Theories based on optimal sampling by the retina have been widely applied to visual ecology at the level of the optics of the eye, supported by visual behaviour. This leads to speculation about the additional processing that must lie in between—in the brain itself. But fewer studies have adopted a quantitative approach to evaluating the detectability of specific features in these neural pathways. We briefly review this approach with a focus on contrast sensitivity of two parallel pathways for motion processing in insects, one used for analysis of wide-field optic flow, the other for detection of small features. We further use a combination of optical modelling of image blur and physiological recording from both photoreceptors and higher-order small target motion detector neurons sensitive to small targets to show that such neurons operate right at the limits imposed by the optics of the eye and the noise level of single photoreceptors. Despite this, and the limitation of only being able to use information from adjacent receptors to detect target motion, they achieve a contrast sensitivity that rivals that of wide-field motion sensitive pathways in either insects or vertebrates—among the highest in absolute terms seen in any animal.

7.
Previous phylogenetic studies in oaks (Quercus, Fagaceae) have failed to resolve the backbone topology of the genus with strong support. Here, we utilize next-generation sequencing of restriction-site associated DNA (RAD-Seq) to resolve a framework phylogeny of a predominantly American clade of oaks whose crown age is estimated at 23–33 million years old. Using a recently developed analytical pipeline for RAD-Seq phylogenetics, we created a concatenated matrix of 1.40 × 10^6 aligned nucleotides, constituting 27,727 sequence clusters. RAD-Seq data were readily combined across runs, with no difference in phylogenetic placement between technical replicates, which overlapped by only 43–64% in locus coverage. 17% (4,715) of the loci we analyzed could be mapped with high confidence to one or more expressed sequence tags in NCBI Genbank. A concatenated matrix of the loci that BLAST to at least one EST sequence provides approximately half as many variable or parsimony-informative characters as equal-sized datasets from the non-EST loci. The EST-associated matrix is more complete (fewer missing loci) and has slightly lower homoplasy than non-EST subsampled matrices of the same size, but there is no difference in phylogenetic support or relative attribution of base substitutions to internal versus terminal branches of the phylogeny. We introduce a partitioned RAD visualization method (implemented in the R package RADami; http://cran.r-project.org/web/packages/RADami) to investigate the possibility that suboptimal topologies supported by large numbers of loci—due, for example, to reticulate evolution or lineage sorting—are masked by the globally optimal tree. We find no evidence for strongly-supported alternative topologies in our study, suggesting that the phylogeny we recover is a robust estimate of large-scale phylogenetic patterns in the American oak clade.
Our study is one of the first to demonstrate the utility of RAD-Seq data for inferring phylogeny in a 23–33 million year-old clade.

8.
Our goal is to match some dynamical aspects of biological systems with those of networks of coupled logistic maps. With these networks we generate sequences of iterates, convert them to symbol sequences by coarse-graining, and count the number of times combinations of symbols occur. Comparison of this with the number of times these combinations occur in experimental data—a sequence of interbeat intervals for example—is a measure of the fitness of each network to describe the target data. The most fit networks provide a cartoon that suggests a decomposition of the experimental data into a component that may be produced by a simple dynamical subsystem, and a residual component, the result of detailed, particular characteristics of the system that generated the target data. In the space of all network parameters, each point corresponds to a particular network. We construct a fitness landscape when we assign a fitness to each point. Because the parameters are distributed continuously over their ranges, and because fitnesses are estimated numerically, any plot of the landscape involves a finite sample of parameter values. We investigate how the local landscape geometry changes when the array of sample parameters is refined, and use the landscape geometry to explore complex relations between local fitness maxima.
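
The symbol-counting procedure in entry 8 can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the mean-field coupling rule, the binary partition at 0.5, and the fitness score (negative L1 distance between normalized word frequencies). The function names are hypothetical.

```python
from collections import Counter


def logistic(x, r):
    """One step of the logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)


def iterate_network(x0, r, coupling, adjacency, steps):
    """Iterate a network of coupled logistic maps (assumed coupling rule).

    Each node mixes its own mapped value with the mean mapped value of its
    neighbours. Returns the list of network states, one per step.
    """
    n = len(x0)
    x = list(x0)
    history = []
    for _ in range(steps):
        fx = [logistic(v, r) for v in x]
        new_x = []
        for i in range(n):
            nbrs = [j for j in range(n) if adjacency[i][j]]
            if nbrs:
                mean_nbr = sum(fx[j] for j in nbrs) / len(nbrs)
                new_x.append((1.0 - coupling) * fx[i] + coupling * mean_nbr)
            else:
                new_x.append(fx[i])
        x = new_x
        history.append(x)
    return history


def symbolize(series, threshold=0.5):
    """Coarse-grain a real-valued series into a string of '0'/'1' symbols."""
    return "".join("1" if v >= threshold else "0" for v in series)


def word_counts(symbols, length):
    """Count every overlapping word of the given length in the symbol string."""
    return Counter(symbols[i:i + length] for i in range(len(symbols) - length + 1))


def fitness(model_counts, target_counts):
    """Assumed score: negative L1 distance between normalized word frequencies."""
    words = set(model_counts) | set(target_counts)
    m_total = sum(model_counts.values()) or 1
    t_total = sum(target_counts.values()) or 1
    return -sum(
        abs(model_counts[w] / m_total - target_counts[w] / t_total) for w in words
    )
```

Under these assumptions, a fitness landscape is obtained by evaluating `fitness` on a grid of `(r, coupling)` values, with one node's symbolized trajectory compared against the symbolized target data.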

9.
Estimation of epidemiological and population parameters from molecular sequence data has become central to the understanding of infectious disease dynamics. Various models have been proposed to infer details of the dynamics that describe epidemic progression. These include inference approaches derived from Kingman’s coalescent theory. Here, we use recently described coalescent theory for epidemic dynamics to develop stochastic and deterministic coalescent susceptible–infected–removed (SIR) tree priors. We implement these in a Bayesian phylogenetic inference framework to permit joint estimation of SIR epidemic parameters and the sample genealogy. We assess the performance of the two coalescent models and also juxtapose results obtained with a recently published birth–death-sampling model for epidemic inference. Comparisons are made by analyzing sets of genealogies simulated under precisely known epidemiological parameters. Additionally, we analyze influenza A (H1N1) sequence data sampled in the Canterbury region of New Zealand and HIV-1 sequence data obtained from known United Kingdom infection clusters. We show that both coalescent SIR models are effective at estimating epidemiological parameters from data with large fundamental reproductive number R0 and large population size S0. Furthermore, we find that the stochastic variant generally outperforms its deterministic counterpart in terms of error, bias, and highest posterior density coverage, particularly for smaller R0 and S0. However, each of these inference models is shown to have undesirable properties in certain circumstances, especially for epidemic outbreaks with R0 close to one or with small effective susceptible populations.

10.
Genome-scale data have greatly facilitated the resolution of recalcitrant nodes that Sanger-based datasets have been unable to resolve. However, phylogenomic studies continue to use traditional methods such as bootstrapping to estimate branch support, and high bootstrap values are still interpreted as providing strong support for the correct topology. Furthermore, relatively little attention has been given to assessing discordances between gene and species trees, and the underlying processes that produce phylogenetic conflict. We generated novel genomic datasets to characterize and determine the causes of discordance in Old World treefrogs (Family: Rhacophoridae)—a group that is fraught with conflicting and poorly supported topologies among major clades. Additionally, a suite of data filtering strategies and analytical methods were applied to assess their impact on phylogenetic inference. We showed that incomplete lineage sorting was detected at all nodes that exhibited high levels of discordance. Those nodes were also associated with extremely short internal branches. We also clearly demonstrate that bootstrap values do not reflect uncertainty or confidence for the correct topology and, hence, should not be used as a measure of branch support in phylogenomic datasets. Overall, we showed that phylogenetic discordances in Old World treefrogs resulted from incomplete lineage sorting and that species tree inference can be improved using a multi-faceted, total-evidence approach, which uses the largest amount of data and considers results from different analytical methods and datasets.

11.
While the proposal that large-scale genome expansions occurred early in vertebrate evolution is widely accepted, the exact mechanisms of the expansion—such as a single or multiple rounds of whole genome duplication, bloc chromosome duplications, large-scale individual gene duplications, or some combination of these—are unclear. Gene families with a single invertebrate member but four vertebrate members, such as the Hox clusters, provided early support for Ohno's hypothesis that two rounds of genome duplication (the 2R-model) occurred in the stem lineage of extant vertebrates. However, despite extensive study, the duplication history of the Hox clusters has remained unclear, calling into question its usefulness in resolving the role of large-scale gene or genome duplications in early vertebrates. Here, we present a phylogenetic analysis of the vertebrate Hox clusters and several linked genes (the Hox “paralogon”) and show that different phylogenies are obtained for Dlx and Col genes than for Hox and ErbB genes. We show that these results are robust to errors in phylogenetic inference and suggest that these competing phylogenies can be resolved if two chromosomal crossover events occurred in the ancestral vertebrate. These results resolve conflicting data on the order of Hox gene duplications and the role of genome duplication in vertebrate evolution and suggest that a period of genome reorganization occurred after genome duplications in early vertebrates.

12.
DNA barcoding was proposed in 2003, the Consortium for the Barcode of Life was established in 2004, and the movement has since attracted more than $80 million funding. Here we investigate how many species of multicellular animals have been barcoded. We compare the numbers in a public database (GenBank as of January 2012) with those in the Barcode of Life Database (BOLD) and find that GenBank contains COI (cytochrome c oxidase subunit 1) sequences for ca. 60 000 species while BOLD reports barcodes for ca. 150 000 species. The discrepancy is likely due to a large amount of unpublished data in BOLD. Overall, the species coverage remains sparse, growth rates are low, and the barcode accumulation curve for Metazoa is linear with only 4788 species having been added in 2011. In addition, the vast majority of species in the public database (73%) were barcoded by projects that are unlikely to be related to the DNA barcoding movement. Particularly surprising was the large number of DNA barcodes in GenBank that were not identified to species (Jan 2012: 74%), with insect barcodes often being identified only to order. Of these, several hundred thousand have since been suppressed by NCBI because they did not satisfy the iBOL/GenBank early release agreement. Species coverage is considerably better for target taxa of DNA barcoding campaigns (e.g. birds, fishes, Lepidoptera), although it also falls short of published campaign targets. © The Willi Hennig Society 2012

13.
One-third of the world's reef-building corals are facing heightened extinction risk from climate change and other anthropogenic impacts. Previous studies have shown that such threats are not distributed randomly across the coral tree of life, and future extinctions have the potential to disproportionately reduce the phylogenetic diversity of this group on a global scale. However, the impact of such losses on a regional scale remains poorly known. In this study, we use phylogenetic metrics in conjunction with geographical distributions of living reef coral species to model how extinctions are likely to affect evolutionary diversity across different ecoregions. Based on two measures—phylogenetic diversity and phylogenetic species variability—we highlight regions with the largest losses of evolutionary diversity and hence of potential conservation interest. Notably, the projected loss of evolutionary diversity is relatively low in the most species-rich areas such as the Coral Triangle, while many regions with fewer species stand to lose much larger shares of their diversity. We also suggest that for complex ecosystems like coral reefs it is important to consider changes in phylogenetic species variability; areas with disproportionate declines in this measure should be of concern even if phylogenetic diversity is not as impacted. These findings underscore the importance of integrating evolutionary history into conservation planning for safeguarding the future diversity of coral reefs.

14.
Detailed studies of individual genes have shown that gene expression divergence often results from adaptive evolution of regulatory sequence. Genome-wide analyses, however, have yet to unite patterns of gene expression with polymorphism and divergence to infer population genetic mechanisms underlying expression evolution. Here, we combined genomic expression data—analyzed in a phylogenetic context—with whole genome light-shotgun sequence data from six Drosophila simulans lines and reference sequences from D. melanogaster and D. yakuba. These data allowed us to use molecular population genetics to test for neutral versus adaptive gene expression divergence on a genomic scale. We identified recent and recurrent adaptive evolution along the D. simulans lineage by contrasting sequence polymorphism within D. simulans to divergence from D. melanogaster and D. yakuba. Genes that evolved higher levels of expression in D. simulans have experienced adaptive evolution of the associated 3′ flanking and amino acid sequence. Concomitantly, these genes are also decelerating in their rates of protein evolution, which is in agreement with the finding that highly expressed genes evolve slowly. Interestingly, adaptive evolution in 5′ cis-regulatory regions did not correspond strongly with expression evolution. Our results provide a genomic view of the intimate link between selection acting on a phenotype and associated genic evolution.

15.
Cancer occurs via an accumulation of somatic genomic alterations in a process of clonal evolution. There has been intensive study of potential causal mutations driving cancer development and progression. However, much recent evidence suggests that tumor evolution is normally driven by a variety of mechanisms of somatic hypermutability, which act in different combinations or degrees in different cancers. These variations in mutability phenotypes are predictive of progression outcomes independent of the specific mutations they have produced to date. Here we explore the question of how and to what degree these differences in mutational phenotypes act in a cancer to predict its future progression. We develop a computational paradigm using evolutionary tree inference (tumor phylogeny) algorithms to derive features quantifying single-tumor mutational phenotypes, followed by a machine learning framework to identify key features predictive of progression. Analyses of breast invasive carcinoma and lung carcinoma demonstrate that a large fraction of the risk of future clinical outcomes of cancer progression—overall survival and disease-free survival—can be explained solely from mutational phenotype features derived from the phylogenetic analysis. We further show that mutational phenotypes have additional predictive power even after accounting for traditional clinical and driver gene-centric genomic predictors of progression. These results confirm the importance of mutational phenotypes in contributing to cancer progression risk and suggest strategies for enhancing the predictive power of conventional clinical data or driver-centric biomarkers.

16.
Replicability, the ability to replicate scientific findings, is a prerequisite for scientific discovery and clinical utility. Troublingly, we are in the midst of a replicability crisis. A key to replicability is that multiple measurements of the same item (e.g., experimental sample or clinical participant) under fixed experimental constraints are relatively similar to one another. Thus, statistics that quantify the relative contributions of accidental deviations—such as measurement error—as compared to systematic deviations—such as individual differences—are critical. We demonstrate that existing replicability statistics, such as intra-class correlation coefficient and fingerprinting, fail to adequately differentiate between accidental and systematic deviations in very simple settings. We therefore propose a novel statistic, discriminability, which quantifies the degree to which an individual’s samples are relatively similar to one another, without restricting the data to be univariate, Gaussian, or even Euclidean. Using this statistic, we introduce the possibility of optimizing experimental design via increasing discriminability and prove that optimizing discriminability improves performance bounds in subsequent inference tasks. In extensive simulated and real datasets (focusing on brain imaging and demonstrating on genomics), only optimizing data discriminability improves performance on all subsequent inference tasks for each dataset. We therefore suggest that designing experiments and analyses to optimize discriminability may be a crucial step in solving the replicability crisis, and more generally, mitigating accidental measurement error.

17.
Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a (1−α)%, 0≤α≤1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing (1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.
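
The credibility limit defined in entry 17 can be approximated from posterior samples. The sketch below is illustrative only and rests on several assumptions: solutions are encoded as equal-length sequences, the samples are equally weighted draws from the posterior, and α is treated as a fraction between 0 and 1. The function names are hypothetical and not taken from the paper.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def credibility_limit(estimate, samples, alpha=0.05):
    """Smallest Hamming radius around `estimate` covering a (1 - alpha) share
    of the posterior samples (equal sample weights are an assumption)."""
    if not samples:
        raise ValueError("need at least one posterior sample")
    dists = sorted(hamming(estimate, s) for s in samples)
    # Number of samples that must fall inside the sphere; at least one.
    k = max(1, math.ceil((1.0 - alpha) * len(dists)))
    return dists[k - 1]
```

A smaller radius for one estimator than for another on the same samples corresponds to the "tighter credibility limits" that the abstract reports for centroid alignments.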

18.
Phylogenetic profiling, a network inference method based on gene inheritance profiles, has been widely used to construct functional gene networks in microbes. However, its utility for network inference in higher eukaryotes has been limited. An improved algorithm with an in-depth understanding of pathway evolution may overcome this limitation. In this study, we investigated the effects of taxonomic structures on co-inheritance analysis using 2,144 reference species in four query species: Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, and Homo sapiens. We observed three clusters of reference species based on a principal component analysis of the phylogenetic profiles, which correspond to the three domains of life—Archaea, Bacteria, and Eukaryota—suggesting that pathways inherit primarily within specific domains or lower-ranked taxonomic groups during speciation. Hence, the co-inheritance pattern within a taxonomic group may be eroded by confounding inheritance patterns from irrelevant taxonomic groups. We demonstrated that co-inheritance analysis within domains substantially improved network inference not only in microbe species but also in the higher eukaryotes, including humans. Although we observed two sub-domain clusters of reference species within Eukaryota, co-inheritance analysis within these sub-domain taxonomic groups only marginally improved network inference. Therefore, we conclude that co-inheritance analysis within domains is the optimal approach to network inference with the given reference species. The construction of a series of human gene networks with increasing sample sizes of the reference species for each domain revealed that the size of the high-accuracy networks increased as additional reference species genomes were included, suggesting that within-domain co-inheritance analysis will continue to expand human gene networks as genomes of additional species are sequenced.
Taken together, we propose that co-inheritance analysis within the domains of life will greatly potentiate the use of the expected onslaught of sequenced genomes in the study of molecular pathways in higher eukaryotes.
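
The idea of scoring co-inheritance within a single domain of life, as described in entry 18, can be shown with a toy Python sketch. It assumes binary presence/absence profiles and uses the Jaccard index as a stand-in similarity measure. The abstract does not state which measure the authors used, and the function names and data layout are hypothetical.

```python
def restrict(profile, species_domains, domain):
    """Keep only the profile entries for reference species in `domain`."""
    return [p for p, d in zip(profile, species_domains) if d == domain]


def jaccard(a, b):
    """Jaccard index of two binary presence/absence profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def within_domain_score(profile_a, profile_b, species_domains, domain):
    """Co-inheritance score computed only over species of one domain."""
    return jaccard(
        restrict(profile_a, species_domains, domain),
        restrict(profile_b, species_domains, domain),
    )
```

In this toy setting, two genes that co-occur perfectly across bacterial reference species but diverge across eukaryotic ones score higher within Bacteria than over all species pooled, which is the dilution effect the abstract attributes to irrelevant taxonomic groups.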

19.
We present a rigorous statistical model that infers the structure of P. falciparum mixtures—including the number of strains present, their proportion within the samples, and the amount of unexplained mixture—using whole genome sequence (WGS) data. Applied to simulation data, artificial laboratory mixtures, and field samples, the model provides reasonable inference with as few as 10 reads or 50 SNPs and works efficiently even with much larger data sets. Source code and example data for the model are provided in an open source fashion. We discuss the possible uses of this model as a window into within-host selection for clinical and epidemiological studies.

20.
Chaix R, Somel M, Kreil DP, Khaitovich P, Lunter GA. Genetics. 2008, 180(3): 1379-1389
Changes in gene expression play an important role in species' evolution. Earlier studies uncovered evidence that the effect of mutations on expression levels within the primate order is skewed, with many small downregulations balanced by fewer but larger upregulations. In addition, brain-expressed genes appeared to show an increased rate of evolution on the branch leading to human. However, the lack of a mathematical model adequately describing the evolution of gene expression precluded the rigorous establishment of these observations. Here, we develop mathematical tools that allow us to revisit these earlier observations in a model-testing and inference framework. We introduce a model for skewed gene-expression evolution within a phylogenetic tree and use a separate model to account for biological or experimental outliers. A Bayesian Markov chain Monte Carlo inference procedure allows us to infer the phylogeny and other evolutionary parameters, while quantifying the confidence in these inferences. Our results support previous observations; in particular, we find strong evidence for a sustained positive skew in the distribution of gene-expression changes in primate evolution. We propose a “corrective sweep” scenario to explain this phenomenon.
