首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
Inferring haplotype data from genotype data is a crucial step in linking SNPs to human diseases. Given n genotypes over m SNP sites, the haplotype inference (HI) problem deals with finding a set of haplotypes so that each given genotype can be formed by a combining a pair of haplotypes from the set. The perfect phylogeny haplotyping (PPH) problem is one of the many computational approaches to the HI problem. Though it was conjectured that the complexity of the PPH problem was O(nm), the complexity of all the solutions presented until recently was O(nm (2)). In this paper, we make complete use of the column-ordering that was presented earlier and show that there must be some interdependencies among the pairwise relationships between SNP sites in order for the given genotypes to allow a perfect phylogeny. Based on these interdependencies, we introduce the FlexTree (flexible tree) data structure that represents all the pairwise relationships in O(m) space. The FlexTree data structure provides a compact representation of all the perfect phylogenies for the given set of genotypes. We also introduce an ordering of the genotypes that allows the genotypes to be added to the FlexTree sequentially. The column ordering, the FlexTree data structure, and the row ordering we introduce make the O(nm) OPPH algorithm possible. We present some results on simulated data which demonstrate that the OPPH algorithm performs quiet impressively when compared to the previous algorithms. The OPPH algorithm is one of the first O(nm) algorithms presented for the PPH problem.  相似文献   

2.
A haplotype is an m-long binary vector. The XOR-genotype of two haplotypes is the m-vector of their coordinate-wise XOR. We study the following problem: Given a set of XOR-genotypes, reconstruct their haplotypes so that the set of resulting haplotypes can be mapped onto a perfect phylogeny (PP) tree. The question is motivated by studying population evolution in human genetics, and is a variant of the perfect phylogeny haplotyping problem that has received intensive attention recently. Unlike the latter problem, in which the input is "full" genotypes, here we assume less informative input, and so may be more economical to obtain experimentally. Building on ideas of Gusfield, we show how to solve the problem in polynomial time, by a reduction to the graph realization problem. The actual haplotypes are not uniquely determined by that tree they map onto, and the tree itself may or may not be unique. We show that tree uniqueness implies uniquely determined haplotypes, up to inherent degrees of freedom, and give a sufficient condition for the uniqueness. To actually determine the haplotypes given the tree, additional information is necessary. We show that two or three full genotypes suffice to reconstruct all the haplotypes, and present a linear algorithm for identifying those genotypes.  相似文献   

3.
The incomplete perfect phylogeny (IPP) problem and the incomplete perfect phylogeny haplotyping (IPPH) problem deal with constructing a phylogeny for a given set of haplotypes or genotypes with missing entries. The earlier approaches for both of these problems dealt with restricted versions of the problems, where the root is either available or can be trivially re-constructed from the data, or certain assumptions were made about the data. In this paper, we deal with the unrestricted versions of the problems, where the root of the phylogeny is neither available nor trivially recoverable from the data. Both IPP and IPPH problems have previously been proven to be NPcomplete. Here, we present efficient enumerative algorithms that can handle practical instances of the problem. Empirical analysis on simulated data shows that the algorithms perform very well both in terms of speed and in terms accuracy of the recovered data.  相似文献   

4.
The problem of resolving genotypes into haplotypes, under the perfect phylogeny model, has been under intensive study recently. All studies so far handled missing data entries in a heuristic manner. We prove that the perfect phylogeny haplotype problem is NP-complete when some of the data entries are missing, even when the phylogeny is rooted. We define a biologically motivated probabilistic model for genotype generation and for the way missing data occur. Under this model, we provide an algorithm, which takes an expected polynomial time. In tests on simulated data, our algorithm quickly resolves the genotypes under high rates of missing entries.  相似文献   

5.
The problem Parsimony Haplotyping (PH) asks for the smallest set of haplotypes which can explain a given set of genotypes, and the problem Minimum Perfect Phylogeny Haplotyping (MPPH) asks for the smallest such set which also allows the haplotypes to be embedded in a perfect phylogeny, an evolutionary tree with biologically-motivated restrictions. For PH, we extend recent work by further mapping the interface between ;;easy' and ;;hard' instances, within the framework of (k,l)-bounded instances where the number of 2's per column and row of the input matrix is restricted. By exploring, in the same way, the tractability frontier of MPPH we provide the first concrete, positive results for this problem. In addition, we construct for both PH and MPPH polynomial time approximation algorithms, based on properties of the columns of the input matrix.  相似文献   

6.
The multistate perfect phylogeny problem is a classic problem in computational biology. When no perfect phylogeny exists, it is of interest to find a set of characters to remove in order to obtain a perfect phylogeny in the remaining data. This is known as the character removal problem. We show how to use chordal graphs and triangulations to solve the character removal problem for an arbitrary number of states, which was previously unsolved. We outline a preprocessing technique that speeds up the computation of the minimal separators of a graph. Minimal separators are used in our solution to the missing data character removal problem and to Gusfield's solution of the perfect phylogeny problem with missing data.  相似文献   

7.
Hannenhalli and Pevzner gave the first polynomial-time algorithm for computing the inversion distance between two signed permutations, as part of the larger task of determining the shortest sequence of inversions needed to transform one permutation into the other. Their algorithm (restricted to distance calculation) proceeds in two stages: in the first stage, the overlap graph induced by the permutation is decomposed into connected components; then, in the second stage, certain graph structures (hurdles and others) are identified. Berman and Hannenhalli avoided the explicit computation of the overlap graph and gave an O(nalpha(n)) algorithm, based on a Union-Find structure, to find its connected components, where alpha is the inverse Ackerman function. Since for all practical purposes alpha(n) is a constant no larger than four, this algorithm has been the fastest practical algorithm to date. In this paper, we present a new linear-time algorithm for computing the connected components, which is more efficient than that of Berman and Hannenhalli in both theory and practice. Our algorithm uses only a stack and is very easy to implement. We give the results of computational experiments over a large range of permutation pairs produced through simulated evolution; our experiments show a speed-up by a factor of 2 to 5 in the computation of the connected components and by a factor of 1.3 to 2 in the overall distance computation.  相似文献   

8.
The problem of inferring haplotype phase from a population of genotypes has received a lot of attention recently. This is partly due to the observation that there are many regions on human genomic DNA where genetic recombination is rare (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). A Haplotype Map project has been announced by NIH to identify and characterize populations in terms of these haplotypes. Recently, Gusfield introduced the perfect phylogeny haplotyping problem, as an algorithmic implication of the no-recombination in long blocks observation, together with the standard population-genetic assumption of infinite sites. Gusfield's solution based on matroid theory was followed by direct theta(nm2) solutions that use simpler techniques (Bafna et al., 2003; Eskin et al., 2003), and also bound the number of solutions to the PPH problem. In this short note, we address two questions that were left open. First, can the algorithms of Bafna et al. (2003) and Eskin et al. (2003) be sped-up to O(nm + m2) time, which would imply an O(nm) time-bound for the PPH problem? Second, if there are multiple solutions, can we find one that is most parsimonious in terms of the number of distinct haplotypes. We give reductions that suggests that the answer to both questions is "no." For the first problem, we show that computing the output of the first step (in either method) is equivalent to Boolean matrix multiplication. Therefore, the best bound we can presently achieve is O(nm(omega-1)), where omega < or = 2.52 is the exponent of matrix multiplication. Thus, any linear time solution to the PPH problem likely requires a different approach. For the second problem of computing a PPH solution that minimizes the number of distinct haplotypes, we show that the problem is NP-hard using a reduction from Vertex Cover (Garey and Johnson, 1979).  相似文献   

9.

Background

Haplotype assembly, reconstructing haplotypes from sequence data, is one of the major computational problems in bioinformatics. Most of the current methodologies for haplotype assembly are designed for diploid individuals. In recent years, genomes having more than two sets of homologous chromosomes have attracted many research groups that are interested in the genomics of disease, phylogenetics, botany and evolution. However, there is still a lack of methods for reconstructing polyploid haplotypes.

Results

In this work, the minimum error correction with genotype information (MEC/GI) model, an important combinatorial model for haplotyping a single individual, is used to study the triploid individual haplotype reconstruction problem. A fast and accurate enumeration-based algorithm enumeration haplotyping triploid with least difference (EHTLD) is proposed for solving the MEC/GI model. The EHTLD algorithm tries to reconstruct the three haplotypes according to the order of single nucleotide polymorphism (SNP) loci along them. When reconstructing a given SNP site, the EHTLD algorithm enumerates three kinds of SNP values in terms of the corresponding site’s genotype value, and chooses the one, which leads to the minimum difference between the reconstructed haplotypes and the sequenced fragments covering that SNP site, to fill the SNP loci being reconstructed.

Conclusion

Extensive experimental comparisons were performed between the EHTLD algorithm and the well known HapCompass and HapTree. Compared with algorithms HapCompass and HapTree, the EHTLD algorithm can reconstruct more accurate haplotypes, which were proven by a number of experiments.
  相似文献   

10.
Haplotyping as perfect phylogeny: a direct approach.   总被引:4,自引:0,他引:4  
A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). In this paper we explore the algorithmic implications of the no-recombination in long blocks observation, for the problem of inferring haplotypes in populations. This assumption, together with the standard population-genetic assumption of infinite sites, motivates a model of haplotype evolution where the haplotypes in a population are assumed to evolve along a coalescent, which as a rooted tree is a perfect phylogeny. We consider the following algorithmic problem, called the perfect phylogeny haplotyping problem (PPH), which was introduced by Gusfield (2002) - given n genotypes of length m each, does there exist a set of at most 2n haplotypes such that each genotype is generated by a pair of haplotypes from this set, and such that this set can be derived on a perfect phylogeny? The approach taken by Gusfield (2002) to solve this problem reduces it to established, deep results and algorithms from matroid and graph theory. Although that reduction is quite simple and the resulting algorithm nearly optimal in speed, taken as a whole that approach is quite involved, and in particular, challenging to program. Moreover, anyone wishing to fully establish, by reading existing literature, the correctness of the entire algorithm would need to read several deep and difficult papers in graph and matroid theory. However, as stated by Gusfield (2002), many simplifications are possible and the list of "future work" in Gusfield (2002) began with the task of developing a simpler, more direct, yet still efficient algorithm. This paper accomplishes that goal, for both the rooted and unrooted PPH problems. It establishes a simple, easy-to-program, O(nm(2))-time algorithm that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all of the solutions. The approach allows complete, self-contained proofs. In addition to algorithmic simplicity, the approach here makes the representation of all solutions more intuitive than in Gusfield (2002), and solves another goal from that paper, namely, to prove a nontrivial upper bound on the number of PPH solutions, showing that that number is vastly smaller than the number of haplotype solutions (each solution being a set of n pairs of haplotypes that can generate the genotypes) when the perfect phylogeny requirement is not imposed.  相似文献   

11.
The individual haplotyping problem is a computing problem of reconstructing two haplotypes for an individual based on several optimal criteria from one's fragments sequencing data. This paper is based on the fact that the length of a fragment and the number of the fragments covering a SNP (single nucleotide polymorphism) site are both very small compared with the length of a sequenced region and the total number of the fragments and introduces the parameterized haplotyping problems. With m fragments whose maximum length is k(1), n SNP sites and the number of the fragments covering a SNP site no more than k(2), our algorithms can solve the gapless MSR (Minimum SNP Removal) and MFR (Minimum Fragment Removal) problems in the time complexity O(nk(1)k(2) + m log m + nk(2) + mk(1)) and O(mk(2)(2) + mk(1) k(2) + m log m + nk(2) + mk(1))respectively. Since, the value of k(1) and k(2) are both small (about 10) in practice, our algorithms are more efficient and applicable compared with the algorithms of V. Bafna et al. of time complexity O(mn(2)) and O(m(2)n + m(3)), respectively.  相似文献   

12.
A reduction-based exact algorithm for the contact map overlap problem.   总被引:1,自引:0,他引:1  
Aligning proteins based on their structural similarity is a fundamental problem in molecular biology with applications in many settings, including structure classification, database search, function prediction, and assessment of folding prediction methods. Structural alignment can be done via several methods, including contact map overlap (CMO) maximization that aligns proteins in a way that maximizes the number of common residue contacts. In this paper, we develop a reduction-based exact algorithm for the CMO problem. Our approach solves CMO directly rather than after transformation to other combinatorial optimization problems. We exploit the mathematical structure of the problem in order to develop a number of efficient lower bounding, upper bounding, and reduction schemes. Computational experiments demonstrate that our algorithm runs significantly faster than existing exact algorithms and solves some hard CMO instances that were not solved in the past. In addition, the algorithm produces protein clusters that are in excellent agreement with the SCOP classification. An implementation of our algorithm is accessible as an on-line server at http://eudoxus.scs.uiuc.edu/cmos/cmos.html.  相似文献   

13.
The restriction scaffold assignment problem takes as input two finite point sets S and T (with S containing more points than T ) and establishes a correspondence between points in S and points in T , such that each point in S maps to exactly one point in T and each point in T maps to at least one point in S. An algorithm is presented that finds a minimum-cost solution for this problem in O(n log n) time, provided that the points in S and T are restricted to lie on a line and the cost function delta is the L(1) metric. This algorithm runs in linear time, if S and T are presorted. This improves the previously best-known O(n (2))-time algorithm for this problem.  相似文献   

14.
Chalcidoidea (Hymenoptera) are extremely diverse with more than 23,000 species described and over 500,000 species estimated to exist. This is the first comprehensive phylogenetic analysis of the superfamily based on a molecular analysis of 18S and 28S ribosomal gene regions for 19 families, 72 subfamilies, 343 genera and 649 species. The 56 outgroups are comprised of Ceraphronoidea and most proctotrupomorph families, including Mymarommatidae. Data alignment and the impact of ambiguous regions are explored using a secondary structure analysis and automated (MAFFT) alignments of the core and pairing regions and regions of ambiguous alignment. Both likelihood and parsimony approaches are used to analyze the data. Overall there is no impact of alignment method, and few but substantial differences between likelihood and parsimony approaches. Monophyly of Chalcidoidea and a sister group relationship between Mymaridae and the remaining Chalcidoidea is strongly supported in all analyses. Either Mymarommatoidea or Diaprioidea are the sister group of Chalcidoidea depending on the analysis. Likelihood analyses place Rotoitidae as the sister group of the remaining Chalcidoidea after Mymaridae, whereas parsimony nests them within Chalcidoidea. Some traditional family groups are supported as monophyletic (Agaonidae, Eucharitidae, Encyrtidae, Eulophidae, Leucospidae, Mymaridae, Ormyridae, Signiphoridae, Tanaostigmatidae and Trichogrammatidae). Several other families are paraphyletic (Perilampidae) or polyphyletic (Aphelinidae, Chalcididae, Eupelmidae, Eurytomidae, Pteromalidae, Tetracampidae and Torymidae). Evolutionary scenarios discussed for Chalcidoidea include the evolution of phytophagy, egg parasitism, sternorrhynchan parasitism, hypermetamorphic development and heteronomy.  相似文献   

15.
Members of the genus Limnodynastes are a prominent and widespread feature of the Australian frog fauna. Yet despite their potential to be informative about biogeographic history and mechanisms of speciation, the relationships among these taxa are not well known. We investigated phylogenetic relationships within the genus Limnodynastes via sequencing of mitochondrial (mt)DNA from current members of the genus Limnodynastes and the monotypic genus Megistolotis. a 450-bp fragment of the 16S rRNA gene and a 370-bp fragment of the protein-coding gene ND4 were used to infer a molecular phylogeny. We revise traditional species groupings and now recognize four species groups within Limnodynastes: the L. ornatus group (L. ornatus and L. spenceri), the L. peronii group (L. peronii, L. tasmaniensis, L. fletcheri, the L. depressus), the L. salmini group (L. salmini, L. convexiusculus, and L. lignarius), and the L. dorsalis group (L. dorsalis, L. terraereginae, L. dumerilii and L. interioris). The L. ornatus species group forms a highly distinctive clade that is a sister group to the other Limnodynastes groups. Pending broader phylogenetic studies it could be removed from the genus Limnodynastes. Our results concur with previous suggestions that Megistolotis lignarius is nested within Limnodynastes, and we therefore reclassify this species as Limnodynastes lignarius. Furthermore, specimens identified as L. depressus form a mtDNA lineage distinct from other species in the genus, confirming the validity of the species. Specimens of species from the L. dorsalis group (L. dorsalis, L. dumerilii, L. interioris, and L. terraereginae) are closely related such that L. dumerilii is paraphyletic with two other species. Finally, our study provides broad support for previous phylogenies based on microcomplement fixation.  相似文献   

16.
Phylogeny reconstruction is a difficult computational problem, because the number of possible solutions increases with the number of included taxa. For example, for only 14 taxa, there are more than seven trillion possible unrooted phylogenetic trees. For this reason, phylogenetic inference methods commonly use clustering algorithms (e.g., the neighbor-joining method) or heuristic search strategies to minimize the amount of time spent evaluating nonoptimal trees. Even heuristic searches can be painfully slow, especially when computationally intensive optimality criteria such as maximum likelihood are used. I describe here a different approach to heuristic searching (using a genetic algorithm) that can tremendously reduce the time required for maximum-likelihood phylogenetic inference, especially for data sets involving large numbers of taxa. Genetic algorithms are simulations of natural selection in which individuals are encoded solutions to the problem of interest. Here, labeled phylogenetic trees are the individuals, and differential reproduction is effected by allowing the number of offspring produced by each individual to be proportional to that individual's rank likelihood score. Natural selection increases the average likelihood in the evolving population of phylogenetic trees, and the genetic algorithm is allowed to proceed until the likelihood of the best individual ceases to improve over time. An example is presented involving rbcL sequence data for 55 taxa of green plants. The genetic algorithm described here required only 6% of the computational effort required by a conventional heuristic search using tree bisection/reconnection (TBR) branch swapping to obtain the same maximum-likelihood topology.   相似文献   

17.
MOTIVATION: The problem of reconstructing full sibling groups from DNA marker data remains a significant challenge for computational biology. A recently published heuristic algorithm based on Mendelian exclusion rules and the Simpson index was successfully applied to the full sibship reconstruction (FSR) problem. However, the so-called SIMPSON algorithm has an unknown complexity measure, questioning its applicability range. RESULTS: We present a modified version of the SIMPSON (MS) algorithm that behaves as O(n(3)) and achieves the same or better accuracy when compared with the original algorithm. Performance of the MS algorithm was tested on a variety of simulated diploid population samples to verify its complexity measure and the significant improvement in efficiency (e.g. 100 times faster than SIMPSON in some cases). It has been shown that, in theory, the SIMPSON algorithm runs in non-polynomial time, significantly limiting its usefulness. It has been also verified via simulation experiments that SIMPSON could run in O(n(a)), where a > 3. AVAILABILITY: Computer code written in Java is available upon request from the first author. CONTACT: Dmitry.Konovalov@jcu.edu.au.  相似文献   

18.
Abstract. A study, based on examination of thirteen scarabaeoid families, was made of 134 adult and larval characters from the following character suites: 105 adult characters of the antennae, eye, epipharynx, mandible, maxillae, labium, tentorium, trochantin, procoxae, mesocoxae, mesothoracic spiracles, hind wing articulation, hind wing base, hind wing venation, hind wing folding, abdominal sternites, abdominal spiracles, male genitalia, ovarioles and karyotype; twenty larval characters of the antennae, fronto-clypeal suture, stemmata, labial palpi, maxillae, mandibles, legs, stridulatory apparatus, spiracles and ecdysial process; and nine adult and larval biological characters. In order to assess the reliability of different characters in resolving scarabaeoid family relationships, six data sets were subjected to cladistic analysis: the total evidence character set (134 characters), restricted adult character set (thirty-two characters, not including those of the wings), wing character set (seventy-three characters), larval character set (twenty characters), biological character set (nine characters) and re-coded Howden (1982) character set (thirty-nine characters). The complete character set and wing character set both produced phylograms with all nodes resolved; the restricted adult data set, larval data set, Howden (1982) data set and biological data set produced phylograms with diminishing levels of node resolution. The reconstructed phylogeny, from the preferred phylogram of the total evidence character set, shows that the Scarabaeoidea comprises three major lineages; a glaresid, passalid and scarabaeid lineage. The glaresid lineage consists only of the Glaresidae. The passalid lineage comprises two major lines; a glaphyrid line (containing Glaphyridae, Passalidae, Lucanidae, Diphyllostomatidae, Trogidae, Bolboceratidae and Pleocomidae) and a geotrupid line (containing Geotrupidae, Ochodaeidae, Ceratocanthidae and Hybosoridae). The scarabaeid lineage contains those taxa traditionally included within the Scarabaeidae (Aphodiinae, Scarabaeinae, Orphninae, Melolonthinae, Acoma, Chasmatopterinae, Hopliinae, Oncerinae, Rutelinae, Dynastinae, Trichiinae, Cetoniinae and Valginae).  相似文献   

19.
Drosophila melanogaster belongs to a closely related group of eight species collectively known as the melanogaster subgroup; all are native to sub-Saharan Africa and islands off the east coast of Africa. The phylogenetic relationships of most species in this subgroup have been well documented; however, the three most closely related species, D. simulans, D. sechellia, and D. mauritiana, have remained problematic from a phylogenetic standpoint as no data set has unambiguously resolved them. We present new DNA sequence data on the nullo and Serendipity-alpha genes and combine them with all available nuclear DNA sequence data; the total data encompass 12 genes and the ITS of rDNA. A methodological problem arose because nine of the genes had information on intraspecific polymorphisms in at least one species. We explored the effect of inclusion/exclusion of polymorphic sites and found that it had very little effect on phylogenetic inferences, due largely to the fact that 82% of polymorphisms are autapomorphies (unique to one species). We have also reanalyzed our previous DNA-DNA hybridization data with a bootstrap procedure. The combined sequence data set and the DNA-DNA hybridization data strongly support the sister status of the two island species, D. sechellia and D. mauritiana. This at least partially resolves what had been a paradox of parallel evolution in these two species.   相似文献   

20.
Litvaitis  M. K.  Newman  L. J. 《Hydrobiologia》2001,444(1-3):177-182
Systematic relationships within the cotylean family Pseudocerotidae were examined using nucleotide sequences of the D3 expansion segment of the 28S rDNA gene. A previously suggested separation of Pseudoceros and Pseudobiceros based on the number of male reproductive systems was confirmed. Regardless of the algorithm employed, Pseudoceros always formed a monophyletic clade. Pseudobiceros appeared to be paraphyletic; however, a constrained maximum parsimony tree was not significantly longer (2 steps, = 0.05). Additionally, the genera Maiazoon, Phrikoceros and Tytthosoceros were validated as taxonomic entities, and their relationships to other genera within the family were determined. Molecular data also supported species separations based on colour patterns. An intraspecific genetic distance of 1.14% was found for Pseudoceros bifurcus, whereas the intrageneric distance was 3.58%. Genetic distances among genera varied, with the closest distance being 2.048% between Pseudobiceros and Maiazoon, and the largest distance (8.345%) between Pseudoceros and Tytthosoceros.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号