Similar literature
20 similar articles found (search time: 15 ms)
1.
We prove that it is impossible to reconstruct ancestral data at the root of "deep" phylogenetic trees with high mutation rates. Moreover, we prove that it is impossible to reconstruct the topology of "deep" trees with high mutation rates from a number of characters smaller than a low-degree polynomial in the number of leaves. Our impossibility results hold for all reconstruction methods. The proofs apply tools from information theory and percolation theory.

2.
Cancer has long been understood as a somatic evolutionary process, but many details of tumor progression remain elusive. Here, we present BitPhylogeny, a probabilistic framework to reconstruct intra-tumor evolutionary pathways. Using a full Bayesian approach, we jointly estimate the number and composition of clones in the sample as well as the most likely tree connecting them. We validate our approach in the controlled setting of a simulation study and compare it against several competing methods. In two case studies, we demonstrate how BitPhylogeny reconstructs tumor phylogenies from methylation patterns in colon cancer and from single-cell exomes in myeloproliferative neoplasm.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-015-0592-6) contains supplementary material, which is available to authorized users.

3.
Recent developments in DNA microarray technologies have made the reconstruction of gene regulatory networks (GRNs) feasible. To infer the overall structure of a GRN, one needs to find out how the expression of each gene can be affected by the others. Many existing approaches to reconstructing GRNs are designed to generate hypotheses about the presence or absence of interactions between genes so that laboratory experiments can be performed afterwards for verification. Since they are not intended to predict whether a gene in an unseen sample has any interactions with other genes, statistical verification of the reliability of the discovered interactions can be difficult. Furthermore, since the temporal ordering of the data is not taken into consideration, the directionality of regulation cannot be established using these existing techniques. To tackle these problems, we propose a data mining technique here. This technique makes use of a probabilistic inference approach to uncover interesting dependency relationships in noisy, high-dimensional time series expression data. It is able to determine not only whether a gene is dependent on another but also whether it is activated or inhibited. In addition, it can predict how a gene would be affected by other genes even in unseen samples. For performance evaluation, the proposed technique has been tested with real expression data. Experimental results show that it can be very effective. The discovered dependency relationships can reveal gene regulatory relationships that could be used to infer the structures of GRNs.

4.

Background  

In recent years, gene order data has attracted increasing attention from both biologists and computer scientists as a new type of data for phylogenetic analysis. If gene orders are viewed as one character with a large number of states, traditional bootstrap procedures cannot be applied. Researchers began to use a jackknife resampling method to assess the quality of gene order phylogenies.

5.
Rubin BE, Ree RH, Moreau CS. PLoS ONE, 2012, 7(4): e33394
Reduced-representation genome sequencing represents a new source of data for systematics, and its potential utility in interspecific phylogeny reconstruction has not yet been explored. One approach that seems especially promising is the use of inexpensive short-read technologies (e.g., Illumina, SOLiD) to sequence restriction-site associated DNA (RAD)--the regions of the genome that flank the recognition sites of restriction enzymes. In this study, we simulated the collection of RAD sequences from sequenced genomes of different taxa (Drosophila, mammals, and yeasts) and developed a proof-of-concept workflow to test whether informative data could be extracted and used to accurately reconstruct "known" phylogenies of species within each group. The workflow consists of three basic steps: first, sequences are clustered by similarity to estimate orthology; second, clusters are filtered by taxonomic coverage; and third, they are aligned and concatenated for "total evidence" phylogenetic analysis. We evaluated the performance of clustering and filtering parameters by comparing the resulting topologies with well-supported reference trees and we were able to identify conditions under which the reference tree was inferred with high support. For Drosophila, whole genome alignments allowed us to directly evaluate which parameters most consistently recovered orthologous sequences. For the parameter ranges explored, we recovered the best results at the low ends of sequence similarity and taxonomic representation of loci; these generated the largest supermatrices with the highest proportion of missing data. Applications of the method to mammals and yeasts were less successful, which we suggest may be due partly to their much deeper evolutionary divergence times compared to Drosophila (crown ages of approximately 100 and 300 versus 60 Mya, respectively). 
RAD sequences thus appear to hold promise for reconstructing phylogenetic relationships in younger clades in which sufficient numbers of orthologous restriction sites are retained across species.
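The second and third steps of the workflow described above (filtering clusters by taxonomic coverage, then concatenating them into a supermatrix) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Cluster`, `filter_by_coverage`, `concatenate` and `missing_fraction` are hypothetical, each cluster is assumed to be already aligned with at most one sequence per taxon, and nothing here reproduces the authors' actual pipeline.

```python
from typing import Dict, List

# Hypothetical representation: one cluster maps taxon name -> aligned sequence
# for a single putative orthologous locus.
Cluster = Dict[str, str]


def filter_by_coverage(clusters: List[Cluster], min_taxa: int) -> List[Cluster]:
    """Keep only clusters that contain sequences from at least `min_taxa` taxa."""
    return [c for c in clusters if len(c) >= min_taxa]


def concatenate(clusters: List[Cluster], taxa: List[str], gap: str = "-") -> Dict[str, str]:
    """Build a supermatrix: one concatenated row per taxon, padded with gaps
    wherever a taxon has no sequence for a locus."""
    rows: Dict[str, List[str]] = {t: [] for t in taxa}
    for cluster in clusters:
        # Assumption: all sequences within a cluster share the same aligned length.
        length = len(next(iter(cluster.values())))
        for t in taxa:
            rows[t].append(cluster.get(t, gap * length))
    return {t: "".join(parts) for t, parts in rows.items()}


def missing_fraction(matrix: Dict[str, str], gap: str = "-") -> float:
    """Fraction of supermatrix cells that are gap characters
    (this also counts gaps that were already inside an alignment)."""
    total = sum(len(s) for s in matrix.values())
    return sum(s.count(gap) for s in matrix.values()) / total if total else 0.0
```

Lowering `min_taxa` keeps more loci and therefore yields a larger supermatrix with more missing data, which is the trade-off the abstract reports exploring.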

6.
ABSTRACT: Large-scale sequencing of genomes has enabled the inference of phylogenies based on the evolution of genomic architecture, under such events as rearrangements, duplications, and losses. Many evolutionary models and associated algorithms have been designed over the last few years and have found use in comparative genomics and phylogenetic inference. However, the assessment of phylogenies built from such data has not been properly addressed to date. The standard method used in sequence-based phylogenetic inference is the bootstrap, but it relies on a large number of homologous characters that can be resampled; yet in the case of rearrangements, the entire genome is a single character. Alternatives such as the jackknife suffer from the same problem, while likelihood tests cannot be applied in the absence of well established probabilistic models. We present a new approach to the assessment of distance-based phylogenetic inference from whole-genome data; our approach combines features of the jackknife and the bootstrap and remains nonparametric. For each feature of our method, we give an equivalent feature in the sequence-based framework; we also present the results of extensive experimental testing, in both sequence-based and genome-based frameworks. Through the feature-by-feature comparison and the experimental results, we show that our bootstrapping approach is on par with the classic phylogenetic bootstrap used in sequence-based reconstruction, and we establish the clear superiority of the classic bootstrap for sequence data and of our corresponding new approach for rearrangement data over proposed variants. Finally, we test our approach on a small dataset of mammalian genomes, verifying that the support values match current thinking about the respective branches. Our method is the first to provide a standard of assessment to match that of the classic phylogenetic bootstrap for aligned sequences. 
Its support values follow a similar scale and its receiver-operating characteristics are nearly identical, indicating that it provides similar levels of sensitivity and specificity. Thus our assessment method makes it possible to conduct phylogenetic analyses on whole genomes with the same degree of confidence as for analyses on aligned sequences. Extensions to search-based inference methods such as maximum parsimony and maximum likelihood are possible, but remain to be thoroughly tested.
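For orientation, the classic sequence-based resampling schemes that the abstract uses as its baseline can be sketched as below. This shows only the standard column bootstrap and a delete-fraction jackknife on an aligned set of sequences; it is not the authors' whole-genome method, and the function names and the dictionary representation of an alignment are assumptions made for illustration.

```python
import random
from typing import Dict


def bootstrap_columns(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Classic bootstrap replicate: sample alignment columns with replacement,
    keeping the number of columns equal to the original."""
    n = len(next(iter(alignment.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in alignment.items()}


def jackknife_columns(alignment: Dict[str, str], keep_fraction: float,
                      rng: random.Random) -> Dict[str, str]:
    """Jackknife replicate: keep a random subset of columns (no replacement),
    preserving their original order."""
    n = len(next(iter(alignment.values())))
    k = max(1, int(keep_fraction * n))
    cols = sorted(rng.sample(range(n), k))
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in alignment.items()}
```

In the usual procedure a tree is inferred from each replicate, and the support of a branch is the fraction of replicate trees that contain it. Both functions depend on having many columns to resample, which is exactly what is missing when the whole genome is a single character.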

7.

Background  

Gene trees that arise in the context of reconstructing the evolutionary history of polyploid species are often multiply-labeled, that is, the same leaf label can occur several times in a single tree. This property considerably complicates the task of forming a consensus of a collection of such trees compared to usual phylogenetic trees.

8.

Background  

In eukaryotic genomes, most genes are members of gene families. When comparing genes from two species, therefore, most genes in one species will be homologous to multiple genes in the second. This often makes it difficult to distinguish orthologs (separated through speciation) from paralogs (separated by other types of gene duplication). Combining phylogenetic relationships and genomic position in both genomes helps to distinguish between these scenarios. This kind of comparison can also help to describe how gene families have evolved within a single genome that has undergone polyploidy or other large-scale duplications, as in the case of Arabidopsis thaliana – and probably most plant genomes.

9.

Background  

Understanding the evolutionary relationships among species based on their genetic information is one of the primary objectives in phylogenetic analysis. Reconstructing phylogenies for large data sets is still a challenging task in Bioinformatics.

10.
Analysis of recursive gene selection approaches from microarray data (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Finding a small subset of the most predictive genes from microarray data for disease prediction is a challenging problem. Support vector machines (SVMs) have been found to be successful with a recursive procedure in selecting important genes for cancer prediction. However, it is not well understood how much of the success depends on the choice of the specific classifier and how much on the recursive procedure. We answer this question by examining multiple classifiers [SVM, ridge regression (RR) and Rocchio] with feature selection in recursive and non-recursive settings on three DNA microarray datasets (ALL-AML Leukemia data, Breast Cancer data and GCM data). RESULTS: We found recursive RR most effective. On the ALL-AML dataset, it achieved zero error rate on the test set using only three genes (selected from over 7000), which is more encouraging than the best published result (zero error rate using 8 genes by recursive SVM). On the Breast Cancer dataset and the two largest categories of the GCM dataset, the results achieved by recursive RR are also very encouraging. A further analysis of the experimental results shows that different classifiers penalize redundant features to different extents and that this property plays an important role in the recursive feature selection process. The RR classifier tends to penalize redundant features to a much larger extent than the SVM does. This may be the reason why recursive RR performs better in selecting genes.
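A recursive selection loop with ridge regression might look like the sketch below. The abstract does not state the ranking criterion or how many genes are dropped per round, so this assumes the common choice of removing, one at a time, the feature with the smallest absolute weight and refitting on the remainder. The helper names (`solve`, `ridge_weights`, `recursive_elimination`) and the penalty `lam` are hypothetical, and the pure-Python linear solver is meant for small illustrative inputs only.

```python
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_weights(X: List[List[float]], y: List[float], lam: float) -> List[float]:
    """Ridge regression weights: solve (X^T X + lam * I) w = X^T y, with lam > 0."""
    d = len(X[0])
    a = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(a, b)


def recursive_elimination(X: List[List[float]], y: List[float],
                          n_keep: int, lam: float = 1.0) -> List[int]:
    """Refit on the surviving features each round and drop the one with the
    smallest absolute weight, until `n_keep` feature indices remain."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        sub = [[row[j] for j in active] for row in X]
        w = ridge_weights(sub, y, lam)
        worst = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[worst]
    return active
```

Because the weights are re-estimated after every removal, a feature that looked unimportant only because a redundant partner shared its weight can rise in rank later. A non-recursive ranking computes the weights once and cannot adjust in this way.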

11.
In systematics, parsimony methods construct phylogenies, or evolutionary trees, in which characters evolve with the least evolutionary change. The chromosome inversion, or polymorphism, parsimony criterion is used when each character of a population may exhibit homozygous or heterozygous states, but when the heterozygous state must evolve uniquely. Variations of the criterion concern whether or not the ancestral states of characters are specified. We establish that problems of inferring phylogenies by these criteria are NP-complete and thus are so difficult computationally that efficient optimal algorithms for them are unlikely to exist.

12.
A statistical test of phylogenies estimated from sequence data (total citations: 4; self-citations: 0; citations by others: 4)
A simple approach to testing the significance of the branching order, estimated from protein or DNA sequence data, of three taxa is proposed. The branching order is inferred by the transformed-distance method, under the assumption that one or two outgroups are available, and the branch lengths are estimated by the least-squares method. The inferred branching order is considered significant if the estimated internodal distance is significantly greater than zero. To test this, a formula for the variance of the internodal distance has been developed. The statistical test proposed has been checked by computer simulation. The same test also applies to the case of four taxa with no outgroup, if one considers an unrooted tree. Formulas for the variances of internodal distances have also been developed for the case of five taxa. Conditions are given under which it is more efficient to add the sequence of a fifth taxon than to do 25% more nucleotide sequencing in each of the original four. A method is presented for combining analyses of disparate data to get a single P value. Finally, the test, applied to the human-chimpanzee-gorilla problem, shows that the issue is not yet resolved.
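The decision rule described above (is the estimated internodal distance significantly greater than zero?) can be illustrated with a one-sided test. The sketch assumes that the estimate is approximately normally distributed and takes its variance as an input, since the variance formula itself is not given in the abstract; the function name `internodal_z_test` is hypothetical.

```python
import math
from typing import Tuple


def internodal_z_test(distance: float, variance: float) -> Tuple[float, float]:
    """One-sided test of H0: internodal distance <= 0.

    Assumes the estimated distance is approximately normal with the given
    variance. Returns (z, p), where p is the upper-tail probability.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = distance / math.sqrt(variance)
    # Upper tail of the standard normal: P(Z >= z) = erfc(z / sqrt(2)) / 2.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p
```

A small `p` would suggest that the inferred branching order is supported, while a value near 0.5 means the internodal distance is indistinguishable from zero.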

13.
A simple graphic method is proposed for reconstructing phylogenetic trees from molecular data. This method is similar to the unweighted pair-group method with arithmetic mean, but the process of computation of average distances and reconstruction of new matrices, required in the latter method, is eliminated from this new method, so that one can reconstruct a phylogenetic tree without using a computer, unless the number of operational taxonomic units is very large. Furthermore, this method allows a phylogenetic tree to have multifurcating branches whenever there is ambiguity with bifurcation.
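For comparison, the standard unweighted pair-group method with arithmetic mean (UPGMA) mentioned above can be sketched as follows. This is the reference procedure, including the averaging and matrix rebuilding that the graphic method is said to avoid, and not the graphic method itself. The function name and the nested-tuple tree representation are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def upgma(labels: List[str], matrix: List[List[float]]):
    """Standard UPGMA on a symmetric distance matrix.

    Returns the tree as nested tuples (left, right, height); leaves are labels.
    """
    n = len(labels)
    # cluster id -> (subtree, number of leaves in it)
    clusters = {i: (labels[i], 1) for i in range(n)}
    # distances keyed by (smaller id, larger id)
    dist: Dict[Tuple[int, int], float] = {
        (i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)
    }
    next_id = n
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        height = dist[(i, j)] / 2
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances.
        new_dist = {}
        for k in clusters:
            dik = dist[(min(i, k), max(i, k))]
            djk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        # Rebuild the matrix without the two merged clusters.
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = ((ti, tj, height), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

Ties between equally close pairs are resolved here by insertion order, which forces a bifurcation; the abstract notes that the graphic method instead permits multifurcating branches in such ambiguous cases.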

14.
Knowledge of the potential impacts of climate change on terrestrial vegetation is crucial to understanding long-term global carbon cycle development. Discrepancies have long existed between reconstructions of past carbon storage since the Last Glacial Maximum based on pollen, carbon isotopes, and general circulation model (GCM) analysis. This may be because these methods do not synthetically take into account significant differences in climate distribution between modern and past conditions, or the effects of atmospheric CO2 concentrations on vegetation. In this study, a new method to estimate past biospheric carbon stocks is reported, utilizing a new integrated ecosystem model (PCM) built on a physiological process vegetation model (BIOME4) coupled with a process-based biospheric carbon model (DEMETER). The PCM was constrained to fit pollen data to obtain realistic estimates. It was estimated that the probability distribution of climatic parameters, as simulated by BIOME4 in an inverse process, was compatible with pollen data, while DEMETER successfully simulated carbon storage values with corresponding outputs of BIOME4. The carbon model was validated with present-day observations of vegetation biomes and soil carbon, and the inversion scheme was tested against 1491 surface pollen spectra sample sites procured in Africa and Eurasia. Results show that this method can successfully simulate biomes and related climates at most selected pollen sites, providing a coefficient of determination (R) of 0.83–0.97 between the observed and reconstructed climates, while also showing a consensus with an R-value of 0.90–0.96 between the simulated biome average terrestrial carbon variables and the available observations. The results demonstrate the reliability and feasibility of the climate reconstruction method and its potential efficiency in reconstructing past terrestrial carbon storage.

15.
16.
Since the first animal genomes were completely sequenced ten years ago, evolutionary biologists have attempted to use the encoded information to reconstruct different aspects of the earliest stages of animal evolution. One of the most important uses of genome sequences is to understand relationships between animal phyla. Despite the wealth of data available, ranging from primary sequence data to gene and genome structures, our lack of understanding of the modes of evolution of genomic characters means that using these data is fraught with potential difficulties, leading to errors in phylogeny reconstruction. Improved understanding of how different character types evolve, the use of this knowledge to develop more accurate models of evolution, and denser taxonomic sampling, are now minimizing the sources of error. The wealth of genomic data now being produced promises that a well-resolved tree of the animal phyla will be available in the near future.

17.
In recent years, the problem of reconstructing the connectivity in large neural circuits ("connectomics") has re-emerged as one of the main objectives of neuroscience. Classically, reconstructions of neural connectivity have been approached anatomically, using electron or light microscopy and histological tracing methods. This paper describes a statistical approach for connectivity reconstruction that relies on relatively easy-to-obtain measurements using fluorescent probes such as synaptic markers, cytoplasmic dyes, transsynaptic tracers, or activity-dependent dyes. We describe the possible design of these experiments and develop a Bayesian framework for extracting synaptic neural connectivity from such data. We show that the statistical reconstruction problem can be formulated naturally as a tractable L1-regularized quadratic optimization. As a concrete example, we consider a realistic hypothetical connectivity reconstruction experiment in C. elegans, a popular neuroscience model where a complete wiring diagram has been previously obtained based on long-term electron microscopy work. We show that the new statistical approach could lead to an orders-of-magnitude reduction in experimental effort in reconstructing the connectivity in this circuit. We further demonstrate that the spatial heterogeneity and biological variability in the connectivity matrix, not just the "average" connectivity, can also be estimated using the same method.
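To make "L1-regularized quadratic optimization" concrete, the sketch below minimizes a generic objective of the form 0.5 * x^T Q x - b^T x + lam * ||x||_1 by cyclic coordinate descent with soft-thresholding. The abstract does not give the paper's actual objective, so Q, b, lam and both function names are placeholders, and Q is assumed to be symmetric with a strictly positive diagonal.

```python
from typing import List


def soft_threshold(v: float, lam: float) -> float:
    """Shrink v toward zero by lam; values within [-lam, lam] become 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def l1_quadratic(Q: List[List[float]], b: List[float], lam: float,
                 sweeps: int = 100) -> List[float]:
    """Cyclic coordinate descent for 0.5 * x^T Q x - b^T x + lam * ||x||_1.

    Assumes Q is symmetric with Q[i][i] > 0 for all i.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Residual for coordinate i with every other coordinate held fixed.
            r = b[i] - sum(Q[i][j] * x[j] for j in range(n) if j != i)
            x[i] = soft_threshold(r, lam) / Q[i][i]
    return x
```

The L1 penalty drives small coefficients to exactly zero. In a connectivity setting that would correspond to a sparse estimate in which most candidate connections are absent.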

18.
19.
SUMMARY: ORIOGEN is a user-friendly Java-based software package for selecting and clustering genes according to their profiles across various treatment groups. In particular, ORIOGEN is useful for analyzing data obtained from time-course or dose-response type experiments. AVAILABILITY: The ORIOGEN software can be downloaded freely from http://dir.niehs.nih.gov/dirbb/oriogen/index.cfm CONTACT: peddada@niehs.nih.gov (for statistical questions) and oriogen@constellagroup.com (for software support) SUPPLEMENTARY INFORMATION: ORIOGEN has a full set of help files. Also, an example input file is provided with the download.

20.
Allozyme data are widely used to infer the phylogenies of populations and closely-related species. Numerous parsimony, distance, and likelihood methods have been proposed for phylogenetic analysis of these data; the relative merits of these methods have been debated vigorously, but their accuracy has not been well explored. In this study, I compare the performance of 13 phylogenetic methods (six parsimony, six distance, and continuous maximum likelihood) by applying a congruence approach to eight allozyme data sets from the literature. Clades are identified that are supported by multiple data sets other than allozymes (e.g. morphology, DNA sequences), and the ability of different methods to recover these 'known' clades is compared. The results suggest that (1) distance and likelihood methods generally outperform parsimony methods, (2) methods that utilize frequency data tend to perform well, and (3) continuous maximum likelihood is among the most accurate methods, and appears to be robust to violations of its assumptions. These results are in agreement with those from recent simulation studies, and help provide a basis for empirical workers to choose among the many methods available for analysing allozyme characters.
