首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 8 毫秒
1.
The problem of resolving genotypes into haplotypes, under the perfect phylogeny model, has been under intensive study recently. All studies so far handled missing data entries in a heuristic manner. We prove that the perfect phylogeny haplotype problem is NP-complete when some of the data entries are missing, even when the phylogeny is rooted. We define a biologically motivated probabilistic model for genotype generation and for the way missing data occur. Under this model, we provide an algorithm, which takes an expected polynomial time. In tests on simulated data, our algorithm quickly resolves the genotypes under high rates of missing entries.  相似文献   

2.
Since the introduction of the Perfect Phylogeny Haplotyping (PPH) Problem in RECOMB 2002 (Gusfield, 2002), the problem of finding a linear-time (deterministic, worst-case) solution for it has remained open, despite broad interest in the PPH problem and a series of papers on various aspects of it. In this paper, we solve the open problem, giving a practical, deterministic linear-time algorithm based on a simple data structure and simple operations on it. The method is straightforward to program and has been fully implemented. Simulations show that it is much faster in practice than prior nonlinear methods. The value of a linear-time solution to the PPH problem is partly conceptual and partly for use in the inner loop of algorithms for more complex problems, where the PPH problem must be solved repeatedly.  相似文献   

3.
The multistate perfect phylogeny problem is a classic problem in computational biology. When no perfect phylogeny exists, it is of interest to find a set of characters to remove in order to obtain a perfect phylogeny in the remaining data. This is known as the character removal problem. We show how to use chordal graphs and triangulations to solve the character removal problem for an arbitrary number of states, which was previously unsolved. We outline a preprocessing technique that speeds up the computation of the minimal separators of a graph. Minimal separators are used in our solution to the missing data character removal problem and to Gusfield's solution of the perfect phylogeny problem with missing data.  相似文献   

4.
It is now well known that incomplete lineage sorting can cause serious difficulties for phylogenetic inference, but little attention has been paid to methods that attempt to overcome these difficulties by explicitly considering the processes that produce them. Here we explore approaches to phylogenetic inference designed to consider retention and sorting of ancestral polymorphism. We examine how the reconstructability of a species (or population) phylogeny is affected by (a) the number of loci used to estimate the phylogeny and (b) the number of individuals sampled per species. Even in difficult cases with considerable incomplete lineage sorting (times between divergences less than 1 N(e) generations), we found the reconstructed species trees matched the "true" species trees in at least three out of five partitions, as long as a reasonable number of individuals per species were sampled. We also studied the tradeoff between sampling more loci versus more individuals. Although increasing the number of loci gives more accurate trees for a given sampling effort with deeper species trees (e.g., total depth of 10 N(e) generations), sampling more individuals often gives better results than sampling more loci with shallower species trees (e.g., depth = 1 N(e)). Taken together, these results demonstrate that gene sequences retain enough signal to achieve an accurate estimate of phylogeny despite widespread incomplete lineage sorting. Continued improvement in our methods to reconstruct phylogeny near the species level will require a shift to a compound model that considers not only nucleotide or character state substitutions, but also the population genetics processes of lineage sorting. [Coalescence; divergence; population; speciation.].  相似文献   

5.
Bayesian logistic regression using a perfect phylogeny   总被引:1,自引:0,他引:1  
Haplotype data capture the genetic variation among individuals in a population and among populations. An understanding of this variation and the ancestral history of haplotypes is important in genetic association studies of complex disease. We introduce a method for detecting associations between disease and haplotypes in a candidate gene region or candidate block with little or no recombination. A perfect phylogeny demonstrates the evolutionary relationship between single-nucleotide polymorphisms (SNPs) in the haplotype blocks. Our approach extends the logic regression technique of Ruczinski and others (2003) to a Bayesian framework, and constrains the model space to that of a perfect phylogeny. Environmental factors, as well as their interactions with SNPs, may be incorporated into the regression framework. We demonstrate our method on simulated data from a coalescent model, as well as data from a candidate gene study of sarcoidosis.  相似文献   

6.
Inferring haplotype data from genotype data is a crucial step in linking SNPs to human diseases. Given n genotypes over m SNP sites, the haplotype inference (HI) problem deals with finding a set of haplotypes so that each given genotype can be formed by a combining a pair of haplotypes from the set. The perfect phylogeny haplotyping (PPH) problem is one of the many computational approaches to the HI problem. Though it was conjectured that the complexity of the PPH problem was O(nm), the complexity of all the solutions presented until recently was O(nm (2)). In this paper, we make complete use of the column-ordering that was presented earlier and show that there must be some interdependencies among the pairwise relationships between SNP sites in order for the given genotypes to allow a perfect phylogeny. Based on these interdependencies, we introduce the FlexTree (flexible tree) data structure that represents all the pairwise relationships in O(m) space. The FlexTree data structure provides a compact representation of all the perfect phylogenies for the given set of genotypes. We also introduce an ordering of the genotypes that allows the genotypes to be added to the FlexTree sequentially. The column ordering, the FlexTree data structure, and the row ordering we introduce make the O(nm) OPPH algorithm possible. We present some results on simulated data which demonstrate that the OPPH algorithm performs quiet impressively when compared to the previous algorithms. The OPPH algorithm is one of the first O(nm) algorithms presented for the PPH problem.  相似文献   

7.
Each person's genome contains two copies of each chromosome, one inherited from the father and the other from the mother. A person's genotype specifies the pair of bases at each site, but does not specify which base occurs on which chromosome. The sequence of each chromosome separately is called a haplotype. The determination of the haplotypes within a population is essential for understanding genetic variation and the inheritance of complex diseases. The haplotype mapping project, a successor to the human genome project, seeks to determine the common haplotypes in the human population. Since experimental determination of a person's genotype is less expensive than determining its component haplotypes, algorithms are required for computing haplotypes from genotypes. Two observations aid in this process: first, the human genome contains short blocks within which only a few different haplotypes occur; second, as suggested by Gusfield, it is reasonable to assume that the haplotypes observed within a block have evolved according to a perfect phylogeny, in which at most one mutation event has occurred at any site, and no recombination occurred at the given region. We present a simple and efficient polynomial-time algorithm for inferring haplotypes from the genotypes of a set of individuals assuming a perfect phylogeny. Using a reduction to 2-SAT we extend this algorithm to handle constraints that apply when we have genotypes from both parents and child. We also present a hardness result for the problem of removing the minimum number of individuals from a population to ensure that the genotypes of the remaining individuals are consistent with a perfect phylogeny. Our algorithms have been tested on real data and give biologically meaningful results. Our webserver (http://www.cs.columbia.edu/compbio/hap/) is publicly available for predicting haplotypes from genotype data and partitioning genotype data into blocks.  相似文献   

8.
We considered the contribution of two mitochondrial and two nuclear data sets for the phylogenetic reconstruction of 22 species of seed beetles in the genus Curculio (Coleoptera: Cuculionidae). A phylogenetic tree from representatives found on various hosts was inferred from a combined data set of mitochondrial DNA cytochrome oxidase subunit I, mitochondrial cytochrome b, nuclear elongation factor 1alpha, and nuclear phosphoglycerate mutase, used for the first time as a molecular marker. Separate parsimony analyses of each data set showed that individual gene trees were mainly congruent and often complementary in the support of clades but the analysis was complicated by failure of PCR amplification of nuclear genes for many taxa and hence missing data entries. When the four gene partitions were combined in a simultaneous analysis despite the missing data, this increased the resolution and taxonomic coverage compared to the individual source trees. Alternative approaches of combining the information via supertree methodology produced a comparatively less resolved tree, and hence seem inferior to combining data matrices even in cases where numerous taxa are missing. The molecular data suggest a classification of the European species into two species groups that are in accordance with morphological characteristics but the data do no support any of the previously recognised American species groups.  相似文献   

9.
Haplotyping as perfect phylogeny: a direct approach.   总被引:4,自引:0,他引:4  
A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). In this paper we explore the algorithmic implications of the no-recombination in long blocks observation, for the problem of inferring haplotypes in populations. This assumption, together with the standard population-genetic assumption of infinite sites, motivates a model of haplotype evolution where the haplotypes in a population are assumed to evolve along a coalescent, which as a rooted tree is a perfect phylogeny. We consider the following algorithmic problem, called the perfect phylogeny haplotyping problem (PPH), which was introduced by Gusfield (2002) - given n genotypes of length m each, does there exist a set of at most 2n haplotypes such that each genotype is generated by a pair of haplotypes from this set, and such that this set can be derived on a perfect phylogeny? The approach taken by Gusfield (2002) to solve this problem reduces it to established, deep results and algorithms from matroid and graph theory. Although that reduction is quite simple and the resulting algorithm nearly optimal in speed, taken as a whole that approach is quite involved, and in particular, challenging to program. Moreover, anyone wishing to fully establish, by reading existing literature, the correctness of the entire algorithm would need to read several deep and difficult papers in graph and matroid theory. However, as stated by Gusfield (2002), many simplifications are possible and the list of "future work" in Gusfield (2002) began with the task of developing a simpler, more direct, yet still efficient algorithm. This paper accomplishes that goal, for both the rooted and unrooted PPH problems. It establishes a simple, easy-to-program, O(nm(2))-time algorithm that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all of the solutions. The approach allows complete, self-contained proofs. In addition to algorithmic simplicity, the approach here makes the representation of all solutions more intuitive than in Gusfield (2002), and solves another goal from that paper, namely, to prove a nontrivial upper bound on the number of PPH solutions, showing that that number is vastly smaller than the number of haplotype solutions (each solution being a set of n pairs of haplotypes that can generate the genotypes) when the perfect phylogeny requirement is not imposed.  相似文献   

10.
The problem of inferring haplotype phase from a population of genotypes has received a lot of attention recently. This is partly due to the observation that there are many regions on human genomic DNA where genetic recombination is rare (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). A Haplotype Map project has been announced by NIH to identify and characterize populations in terms of these haplotypes. Recently, Gusfield introduced the perfect phylogeny haplotyping problem, as an algorithmic implication of the no-recombination in long blocks observation, together with the standard population-genetic assumption of infinite sites. Gusfield's solution based on matroid theory was followed by direct theta(nm2) solutions that use simpler techniques (Bafna et al., 2003; Eskin et al., 2003), and also bound the number of solutions to the PPH problem. In this short note, we address two questions that were left open. First, can the algorithms of Bafna et al. (2003) and Eskin et al. (2003) be sped-up to O(nm + m2) time, which would imply an O(nm) time-bound for the PPH problem? Second, if there are multiple solutions, can we find one that is most parsimonious in terms of the number of distinct haplotypes. We give reductions that suggests that the answer to both questions is "no." For the first problem, we show that computing the output of the first step (in either method) is equivalent to Boolean matrix multiplication. Therefore, the best bound we can presently achieve is O(nm(omega-1)), where omega < or = 2.52 is the exponent of matrix multiplication. Thus, any linear time solution to the PPH problem likely requires a different approach. For the second problem of computing a PPH solution that minimizes the number of distinct haplotypes, we show that the problem is NP-hard using a reduction from Vertex Cover (Garey and Johnson, 1979).  相似文献   

11.
He Y  Li C  Amos CI  Xiong M  Ling H  Jin L 《PloS one》2011,6(7):e22097
The genome-wide association study (GWAS) has become a routine approach for mapping disease risk loci with the advent of large-scale genotyping technologies. Multi-allelic haplotype markers can provide superior power compared with single-SNP markers in mapping disease loci. However, the application of haplotype-based analysis to GWAS is usually bottlenecked by prohibitive time cost for haplotype inference, also known as phasing. In this study, we developed an efficient approach to haplotype-based analysis in GWAS. By using a reference panel, our method accelerated the phasing process and reduced the potential bias generated by unrealistic assumptions in phasing process. The haplotype-based approach delivers great power and no type I error inflation for association studies. With only a medium-size reference panel, phasing error in our method is comparable to the genotyping error afforded by commercial genotyping solutions.  相似文献   

12.
A haplotype is an m-long binary vector. The XOR-genotype of two haplotypes is the m-vector of their coordinate-wise XOR. We study the following problem: Given a set of XOR-genotypes, reconstruct their haplotypes so that the set of resulting haplotypes can be mapped onto a perfect phylogeny (PP) tree. The question is motivated by studying population evolution in human genetics, and is a variant of the perfect phylogeny haplotyping problem that has received intensive attention recently. Unlike the latter problem, in which the input is "full" genotypes, here we assume less informative input, and so may be more economical to obtain experimentally. Building on ideas of Gusfield, we show how to solve the problem in polynomial time, by a reduction to the graph realization problem. The actual haplotypes are not uniquely determined by that tree they map onto, and the tree itself may or may not be unique. We show that tree uniqueness implies uniquely determined haplotypes, up to inherent degrees of freedom, and give a sufficient condition for the uniqueness. To actually determine the haplotypes given the tree, additional information is necessary. We show that two or three full genotypes suffice to reconstruct all the haplotypes, and present a linear algorithm for identifying those genotypes.  相似文献   

13.
As a “base line” of memorization performance, the behavior of a “perfect learner” is considered. He is characterized by a perfect memory and by the ability to choose the best search procedure in problems where the correct response from a given repertoire is to be found to each of several stimuli under the condition of “right” and “wroing” promptings by the experimenter. Expected learning curves are derived for the case of disjoint response repertoires associated with the stimuli under cyclic and random presentation of the stimuli and for the case of a single response repertoire (a one-to-one matching problem) under cyclic presentation.  相似文献   

14.
Data incongruence and the problem of avian louse phylogeny   总被引:2,自引:0,他引:2  
Smith, V. S., Page, R. D. M. & Johnson, K. P. (2004). Data incongruence and the problem of avian louse phylogeny. — Zoologica Scripta, 33 , 239 –259.
Recent studies based on different types of data (i.e. morphological and molecular) have supported conflicting phylogenies for the genera of avian feather lice (Ischnocera: Phthiraptera). We analyse new and published data from morphology and from mitochondrial (12S rRNA and COI) and nuclear (EF1-α) genes to explore the sources of this incongruence and explain these conflicts. Character convergence, multiple substitutions at high divergences, and ancient radiation over a short period of time have contributed to the problem of resolving louse phylogeny with the data currently available. We show that apparent incongruence between the molecular datasets is largely attributable to rate variation and nonstationarity of base composition. In contrast, highly significant character incongruence leads to topological incongruence between the molecular and morphological data. We consider ways in which biases in the sequence data could be misleading, using several maximum likelihood models and LogDet corrections. The hierarchical structure of the data is explored using likelihood mapping and SplitsTree methods. Ultimately, we concede there is strong discordance between the molecular and morphological data and apply the conditional combination approach in this case. We conclude that higher level phylogenetic relationships within avian Ischnocera remain extremely problematic. However, consensus between datasets is beginning to converge on a stable phylogeny for avian lice, at and below the familial rank.  相似文献   

15.
The problem Parsimony Haplotyping (PH) asks for the smallest set of haplotypes which can explain a given set of genotypes, and the problem Minimum Perfect Phylogeny Haplotyping (MPPH) asks for the smallest such set which also allows the haplotypes to be embedded in a perfect phylogeny, an evolutionary tree with biologically-motivated restrictions. For PH, we extend recent work by further mapping the interface between ;;easy' and ;;hard' instances, within the framework of (k,l)-bounded instances where the number of 2's per column and row of the input matrix is restricted. By exploring, in the same way, the tractability frontier of MPPH we provide the first concrete, positive results for this problem. In addition, we construct for both PH and MPPH polynomial time approximation algorithms, based on properties of the columns of the input matrix.  相似文献   

16.
17.
18.
Estimating phylogenetic relationships among closely related species can be extremely difficult when there is incongruence among gene trees and between the gene trees and the species tree. Here we show that incorporating a model of the stochastic loss of gene lineages by genetic drift into the phylogenetic estimation procedure can provide a robust estimate of species relationships, despite widespread incomplete sorting of ancestral polymorphism. This approach is applied to a group of montane Melanoplus grasshoppers for which genealogical discordance among loci and incomplete lineage sorting obscures any obvious phylogenetic relationships among species. Unlike traditional treatments where gene trees estimated using standard phylogenetic methods are implicitly equated with the species tree, with the coalescent-based approach the species tree is modeled probabilistically from the estimated gene trees. The estimated species phylogeny (the ESP) is calculated for the grasshoppers from multiple gene trees reconstructed for nuclear loci and a mitochondrial gene. This empirical application is coupled with a simulation study to explore the performance of the coalescent-based approach. Specifically, we test the accuracy of the ESP given the data based on analyses of simulated data matching the multilocus data collected in Melanoplus (i.e., data were simulated for each locus with the same number of base pairs and locus-specific mutational models). The results of the study show that ESPs can be computed using the coalescent-based approach long before reciprocal monophyly has been achieved, and that these statistical estimates are accurate. This contrasts with analyses of the empirical data collected in Melanoplus and simulated data based on concatenation of multiple loci, for which the incomplete lineage sorting of recently diverged species posed significant problems. The strengths and potential challenges associated with incorporating an explicit model of gene-lineage coalescence into the phylogenetic procedure to obtain an ESP, as illustrated by application to Melanoplus, versus concatenation and consensus approaches are discussed. This study represents a fundamental shift in how species relationships are estimated - the relationship between the gene trees and the species phylogeny is modeled probabilistically rather than equating gene trees with a species tree.  相似文献   

19.
Molecular data are widely used to reconstruct phylogenetic relationships among species, and these phylogenies are often used as the basis for inferences about the history of evolutionary change in other nonmolecular characters. This approach is an appropriate and powerful one in many circumstances. But when several lineages diverge over a relatively short period of time, the assumption that a molecular (gene) tree will always be a valid basis for such inferences may not hold. Empirical evidence from humans, nonhuman primates, and other mammals indicates that the relationships among molecular divergence, morphological differentiation, and the origin of reproductive isolation between diverging lineages are complex. The simple dichotomously branching trees that result from molecular systematic studies of Homo, Gorilla, and Pan may be a misleading basis for reconstructions of evolutionary change in nonmolecular characters. © 1994 Wiley-Liss, Inc.  相似文献   

20.
How much horizontal gene transfer (HGT) between species influences bacterial phylogenomics is a controversial issue. This debate, however, lacks any quantitative assessment of the impact of HGT on phylogenies and of the ability of tree-building methods to cope with such events. I introduce a Markov model of genome evolution with HGT, accounting for the constraints on time -- an HGT event can only occur between concomitantly living species. This model is used to simulate multigene sequence data sets with or without HGT. The consequences of HGT on phylogenomic inference are analyzed and compared to other well-known phylogenetic artefacts. It is found that supertree methods are quite robust to HGT, keeping high levels of performance even when gene trees are largely incongruent with each other. Gene tree incongruence per se is not indicative of HGT. HGT, however, removes the (otherwise observed) positive relationship between sequence length and gene tree congruence to the estimated species tree. Surprisingly, when applied to a bacterial and a eukaryotic multigene data set, this criterion rejects the HGT hypothesis for the former, but not the latter data set.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号