Similar Documents
20 similar documents found (search time: 15 ms).
1.
Haplotype inference by maximum parsimony (cited 5 times in total: 0 self-citations, 5 by others)
MOTIVATION: Haplotypes have been attracting increasing attention because of their importance in the analysis of fine-scale molecular-genetics data. Since direct experimental sequencing of haplotypes is both time-consuming and expensive, methods that infer haplotypes from genotype samples are attractive alternatives. RESULTS: (1) We design and implement an algorithm for an important computational model of haplotype inference that has been suggested in several previous works: the model finds a minimum-size set of haplotypes that explains the genotype samples. (2) Strong support for this computational model is provided by results on both real and simulated data. (3) We also performed a comparative study with our program to show the strengths and weaknesses of this computational model. AVAILABILITY: The software HAPAR is free for non-commercial use and available upon request (lwang@cs.cityu.edu.hk).
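
To make the parsimony objective concrete, the sketch below uses a common encoding that is an assumption, not taken from the paper: each genotype site is 0 or 1 (homozygous) or 2 (heterozygous), and a pair of binary haplotypes explains a genotype if the pair matches it at homozygous sites and differs at heterozygous sites. The brute-force search only illustrates the quantity being minimized; it is not the HAPAR algorithm.

from itertools import combinations, product

def explains(h1, h2, g):
    # True if the haplotype pair (h1, h2) resolves genotype g.
    # Sites coded 0/1 are homozygous (both haplotypes must equal the site value);
    # a site coded 2 is heterozygous (the haplotypes must differ). This 0/1/2
    # coding is an assumption used for illustration.
    for a, b, s in zip(h1, h2, g):
        if s == 2:
            if a == b:
                return False
        elif not (a == b == s):
            return False
    return True

def min_haplotype_set(genotypes):
    # Brute-force pure parsimony: smallest set of haplotypes that explains
    # every genotype. Exponential; for toy instances only.
    m = len(genotypes[0])
    candidates = [tuple(h) for h in product((0, 1), repeat=m)]
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(any(explains(h1, h2, g)
                       for h1 in subset for h2 in subset)
                   for g in genotypes):
                return subset
    return None

# Toy instance: two genotypes over three SNP sites; three haplotypes suffice.
print(min_haplotype_set([(2, 1, 2), (0, 1, 2)]))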

2.
Haplotype data are especially important in the study of complex diseases since they contain more information than genotype data. However, obtaining haplotype data is technically difficult and costly. Computational methods have proved to be an effective way of inferring haplotype data from genotype data. One of these methods, the haplotype inference by pure parsimony approach (HIPP), casts the problem as an optimization problem, which has been proved to be NP-hard. We have designed and developed a new preprocessing procedure for this problem. Our algorithm works with groups of haplotypes rather than individual haplotypes, iteratively searching for and deleting haplotypes that are not helpful for finding the optimal solution. This preprocessing can be coupled with any current HIPP solver that needs to preprocess the genotype data. To test it, we used two state-of-the-art solvers, RTIP and GAHAP, together with simulated and real HapMap data. Thanks to the reduction in computation time and memory achieved by our preprocessing, problem instances that were previously unaffordable can now be solved efficiently.

3.
In 2003, Gusfield introduced the haplotype inference by pure parsimony (HIPP) problem and presented an integer program (IP) that quickly solved many simulated instances of the problem. Although it performs well on small instances, Gusfield's IP can be of exponential size in the worst case. Several authors have since presented polynomial-sized IPs for the problem. In this paper, we further the work on IP approaches to HIPP. We extend the existing polynomial-sized IPs by introducing several classes of valid cuts, and we present a new polynomial-sized IP formulation that is a hybrid between two existing formulations and inherits many of the strengths of both. Many problems that are too complex for the exponential-sized formulations can still be solved with our new formulation in a reasonable amount of time. We provide a detailed empirical comparison of these IP formulations on both simulated and real genotype sequences. Our formulation can also be extended in a variety of ways to allow errors in the input or to model the structure of the population under consideration.
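
For orientation, a minimal pure-parsimony IP in the exponential-sized style Gusfield introduced can be sketched as follows (the valid cuts and the hybrid polynomial-sized formulation contributed by this paper differ and are not reproduced here); x_h selects candidate haplotype h, and y_{i,p} assigns genotype g_i to an explaining pair p = {h, h'}:

\[
\min \sum_h x_h
\quad\text{s.t.}\quad
\sum_{p \in P(g_i)} y_{i,p} = 1 \;\;\forall i,
\qquad
y_{i,p} \le x_h \;\;\forall i,\ \forall p \in P(g_i),\ \forall h \in p,
\qquad
x_h,\, y_{i,p} \in \{0,1\},
\]

where P(g_i) is the set of haplotype pairs consistent with g_i. A genotype with k heterozygous sites has 2^{k-1} such pairs (for k >= 1), which is what makes this style of formulation exponential in the worst case.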

4.
Haplotype information plays an important role in many genetic analyses. However, the identification of haplotypes by sequencing is both expensive and time consuming; current sequencing methods can only efficiently determine the conflated form of haplotypes, that is, genotypes. This raises the need for computational methods to infer haplotypes from genotypes. Haplotype inference by pure parsimony is NP-hard and remains a challenging task in bioinformatics. In this paper, we propose an efficient ant colony optimization (ACO) heuristic, named ACOHAP, to solve the problem. The main idea is based on the construction of a binary tree structure through which ants travel and resolve the conflated data of all haplotypes site by site. Experiments with both small and large data sets show that ACOHAP outperforms other state-of-the-art heuristic methods. ACOHAP is as good as the currently best exact method, RPoly, on small data sets, and much better than RPoly on large data sets. These results demonstrate the efficiency of ACOHAP for haplotype inference by pure parsimony on both small and large data sets.
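
For readers unfamiliar with ACO, the generic component-selection and pheromone-update rules that such heuristics are built on are shown below in their standard textbook form; the ACOHAP-specific binary-tree construction and parameter choices are described in the paper and are not reproduced here:

\[
p_j = \frac{\tau_j^{\alpha}\, \eta_j^{\beta}}{\sum_{k \in \mathcal{N}} \tau_k^{\alpha}\, \eta_k^{\beta}},
\qquad
\tau_j \leftarrow (1 - \rho)\,\tau_j + \Delta\tau_j,
\]

where \tau_j is the pheromone on candidate component j, \eta_j its heuristic desirability, \mathcal{N} the set of components allowed at the current step, \alpha and \beta weighting exponents, \rho the evaporation rate, and \Delta\tau_j the reinforcement deposited on components of good solutions.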

5.
Haplotypes carry essential SNP information used for a variety of purposes, such as investigating potential links between certain diseases and genetic variation. Given a set of genotypes, the haplotype inference problem under pure parsimony is to find a minimum set of haplotypes that explains all the given genotypes. The problem is especially important because, while obtaining genotypes is fairly inexpensive, other approaches to obtaining haplotypes are significantly more expensive. Two types of methods have been proposed for the problem, namely exact and inexact methods. Existing exact methods guarantee purely parsimonious solutions but have exponential time complexity and are impractical for large numbers of genotypes or long genotypes. Inexact methods, in contrast, are relatively fast but do not always obtain optimal solutions. In this paper, an improved heuristic is proposed, and new inexact and exact methods are built on it. Experimental results indicate that the proposed methods outperform the state-of-the-art inexact and exact methods for the problem.

6.
S Ramadhani, SR Mousavi, M Talebi. Gene, 2012, 498(2): 177-182.
We cloned a gene, kexD, that confers a multidrug-resistance phenotype, from multidrug-resistant Klebsiella pneumoniae MGH78578. The deduced amino acid sequence of KexD is similar to that of the inner membrane protein of an RND-type multidrug efflux pump. Introduction of the kexD gene into Escherichia coli KAM32 resulted in higher MICs for erythromycin, novobiocin, rhodamine 6G, tetraphenylphosphonium chloride, and ethidium bromide than in the control. Intracellular ethidium bromide levels in E. coli cells carrying the kexD gene were lower than those in control cells under energized conditions, suggesting that KexD is a component of an energy-dependent efflux pump. RND-type pumps typically consist of three components: an inner membrane protein, a periplasmic protein, and an outer membrane protein. We found that KexD functions with the periplasmic protein AcrA from E. coli and K. pneumoniae, but not with the periplasmic proteins KexA and KexG from K. pneumoniae. KexD was able to utilize either TolC of E. coli or KocC of K. pneumoniae as an outer membrane component. kexD mRNA was not detected in K. pneumoniae MGH78578 or ATCC10031. We isolated erythromycin-resistant mutants from K. pneumoniae ATCC10031, some of which showed a multidrug-resistance phenotype similar to the drug-resistance pattern of KexD. Two of these multidrug-resistant mutant strains were examined for kexD expression, and kexD mRNA levels were increased in both. We conclude that changes in kexD expression can contribute to the emergence of multidrug-resistant K. pneumoniae.

7.
8.
Inferring phylogeny is a difficult computational problem. For example, for only 13 taxa, there are more than 13 billion possible unrooted phylogenetic trees. Heuristics are necessary to minimize the time spent evaluating non-optimal trees. We describe here an approach to heuristic searching, using a genetic algorithm, that can reduce the time required for weighted maximum parsimony phylogenetic inference, especially for data sets involving a large number of taxa. It is the first implementation of a weighted maximum parsimony criterion using amino acid sequences. To validate the weighted criterion, we used an artificial data set and compared the approach to a number of other phylogenetic methods. Genetic algorithms mimic natural selection's ability to solve complex problems. We have identified several parameters affecting the genetic algorithm and developed methods to validate these parameters and ensure optimal performance. This approach allows the construction of phylogenetic trees with over 200 taxa in practical time on a regular PC.
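
As a rough illustration of the search loop such GA-based methods rely on, here is a generic sketch, not the authors' implementation: the placeholder fitness, crossover, and mutation operators below act on bit strings, whereas the paper's operators act on weighted-parsimony trees and its parameters were tuned specifically.

import random

def genetic_algorithm(init_population, fitness, crossover, mutate,
                      generations=500, mutation_rate=0.05, elite=2):
    # Generic GA loop: truncation selection, crossover, occasional mutation,
    # and elitism. Lower fitness is treated as better (as with parsimony scores).
    population = list(init_population)
    for _ in range(generations):
        scored = sorted(population, key=fitness)
        parents = scored[:max(elite, len(scored) // 2)]
        children = []
        while len(children) < len(population) - elite:
            a, b = random.sample(parents, 2)
            child = crossover(a, b)
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = scored[:elite] + children
    return min(population, key=fitness)

# Toy usage: minimize the number of 1-bits in a length-10 bit string.
if __name__ == "__main__":
    rand_bits = lambda: [random.randint(0, 1) for _ in range(10)]
    pop = [rand_bits() for _ in range(20)]
    best = genetic_algorithm(
        pop,
        fitness=sum,
        crossover=lambda a, b: a[:5] + b[5:],
        mutate=lambda x: [1 - b if random.random() < 0.2 else b for b in x],
        generations=100)
    print(best)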

9.

Background

The three-dimensional shape of grain, measured as grain length, width, and thickness (GL, GW, and GT), is one of the most important components of grain appearance in rice. Determining the genetic basis of variation in grain shape could facilitate efficient improvement of grain appearance. In this study, an F7:8 recombinant inbred line (RIL) population derived from a cross between indica and japonica cultivars (Nanyangzhan and Chuan7) contrasting in grain size was used for quantitative trait locus (QTL) mapping. A genetic linkage map was constructed with 164 simple sequence repeat (SSR) markers. The major aims of this study were to detect QTLs for grain shape and to fine-map a minor QTL, qGL7.

Results

Four QTLs for GL were detected on chromosomes 3 and 7, and 10 QTLs for GW and 9 QTLs for GT were identified on chromosomes 2, 3, 5, 7, 9, and 10. A total of 28 QTLs were identified, several of which are reported for the first time; four major QTLs and six minor QTLs for grain shape were commonly detected in both years. The minor QTL qGL7 exhibited pleiotropic effects on GL, GW, GT, 1000-grain weight (TGW), and spikelets per panicle (SPP) and was further validated in a near-isogenic F2 population (NIL-F2). Finally, qGL7 was narrowed down to an interval between the InDel marker RID711 and the SSR marker RM6389, covering a 258-kb region in the Nipponbare genome, and cosegregated with the InDel markers RID710 and RID76.

Conclusion

Because of their complex genetic backgrounds, materials with very different phenotypes were used to develop the mapping populations for QTL detection. Progeny tests proved that the minor QTL qGL7 behaves as a single Mendelian factor. Therefore, we suggest that minor QTLs for traits with high heritability can be isolated using a map-based cloning strategy in a large NIL-F2 population. In addition, combinations of different QTLs produced diverse grain shapes, which provides the ability to breed more varieties of rice to satisfy consumer preferences.

10.
Hapi is a new dynamic-programming algorithm that ignores uninformative states and state transitions in order to efficiently compute minimum-recombinant and maximum-likelihood haplotypes. When applied to a dataset containing 103 families, Hapi runs 3.8 and 320 times faster than state-of-the-art algorithms. Because Hapi infers both minimum-recombinant and maximum-likelihood haplotypes and applies to related individuals, the haplotypes it infers are highly accurate over extended genomic distances.

11.
Summary: Rates of evolution for cytochrome c over the past one billion years were calculated from a maximum parsimony dendrogram that approximates the phylogeny of 87 lineages. Two periods of evolutionary acceleration and deceleration apparently occurred for the cytochrome c molecule. The tempo of evolutionary change indicated by this analysis was compared to the patterns of acceleration and deceleration in the ancestry of several other proteins. The synchrony of these tempos of molecular change supports the notion that rapid genetic evolution accompanied periods of major adaptive radiations. Rates of change at different times in several structural-functional areas of cytochrome c were also investigated in order to test the Darwinian hypothesis that during periods of rapid evolution, functional sites accumulate proportionately more substitutions than areas with no known function. Rates of change in four proposed functional groupings of sites were therefore compared to rates in areas of unknown function for several different time periods. This analysis revealed a significant increase in the rate of evolution for sites associated with the regions of cytochrome c oxidase and reductase interaction during the period between the emergence of the eutherian ancestor and the emergence of the anthropoid ancestor.

12.
The haplotype block structure of SNP variation in human DNA has been demonstrated by several recent studies. The presence of haplotype blocks can be used to dramatically increase the statistical power of genetic mapping. Several criteria have already been proposed for identifying these blocks, all of which require haplotypes as input. We propose a comprehensive statistical model of haplotype block variation and show how the parameters of this model can be learned from haplotypes and/or unphased genotype data. Using real-world SNP data, we demonstrate that our approach can be used to resolve genotypes into their constituent haplotypes with greater accuracy than previously known methods.

13.
MOTIVATION: Haplotype information has become increasingly important in analyzing fine-scale molecular-genetics data, for applications such as disease-gene mapping and drug design. Parsimony haplotyping is one of the haplotyping problems belonging to the NP-hard class. RESULTS: In this paper, we develop a novel algorithm for the haplotype inference problem under the parsimony criterion, based on a parsimonious tree-grow method (PTG). PTG is a heuristic algorithm that can find the minimum number of distinct haplotypes based on the criterion of keeping all genotypes resolved during the tree-grow process. In addition, a block-partitioning method is proposed to improve computational efficiency. We show that the proposed approach is not only effective, with high accuracy, but also very efficient, with computational complexity on the order of O(m²n) time for n single nucleotide polymorphism sites in m individual genotypes. AVAILABILITY: The software is available upon request from the authors, or from http://zhangroup.aporc.org/bioinfo/ptg/ CONTACT: chen@elec.osaka-sandai.ac.jp SUPPLEMENTARY INFORMATION: Supporting material is available from http://zhangroup.aporc.org/bioinfo/ptg/bti572supplementary.pdf

14.
The existence of haplotype blocks transmitted from parents to offspring has been suggested recently. This has created interest in inferring the block structure and length. The motivation is that well-characterized haplotype blocks would make it considerably easier to rapidly map the genes involved in human diseases. To study the inference of haplotype blocks systematically, we propose a statistical framework. In this framework, optimal haplotype block partitioning is formulated as a problem of statistical model selection; missing data can be handled in a standard statistical way; population strata can be incorporated; block-structure inference and hypothesis testing can be performed; and prior knowledge, if present, can be incorporated to perform a Bayesian inference. The algorithm is linear in the number of loci, whereas many such algorithms are NP-hard. We illustrate the applications of our method to both simulated and real data sets.

15.
In this report, we examine the validity of the haplotype block concept by comparing block decompositions derived from public data sets by variants of several leading methods of block detection. We first develop a statistical method for assessing the concordance of two block decompositions. We then assess the robustness of inferred haplotype blocks to the specific detection method chosen, to arbitrary choices made in the block-detection algorithms, and to the sample analyzed. Although the block decompositions show levels of concordance that are very unlikely by chance, the absolute magnitude of the concordance may be low enough to limit the utility of the inference. For purposes of SNP selection, it seems likely that methods that do not arbitrarily impose block boundaries among correlated SNPs might perform better than block-based methods.

16.
The problem of inferring haplotypes from genotypes of single nucleotide polymorphisms (SNPs) is essential for the understanding of genetic variation within and among populations, with important applications to the genetic analysis of disease propensities and other complex traits. The problem can be formulated as a mixture model, where the mixture components correspond to the pool of haplotypes in the population. The size of this pool is unknown; indeed, knowing the size of the pool would correspond to knowing something significant about the genome and its history. Thus methods for fitting the genotype mixture must crucially address the problem of estimating a mixture with an unknown number of mixture components. In this paper we present a Bayesian approach to this problem based on a nonparametric prior known as the Dirichlet process. The model also incorporates a likelihood that captures statistical errors in the haplotype/genotype relationship, trading off these errors against the size of the pool of haplotypes. We describe an algorithm based on Markov chain Monte Carlo for posterior inference in our model. The overall result is a flexible Bayesian method, referred to as DP-Haplotyper, that is reminiscent of parsimony methods in its preference for small haplotype pools. We further generalize the model to treat pedigree relationships (e.g., trios) among the population's genotypes. We apply DP-Haplotyper to the analysis of both simulated and real genotype data, and compare it to extant methods.
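
The parsimony-like preference for small haplotype pools comes from the "rich get richer" clustering behavior of the Dirichlet process prior: each new draw reuses an existing haplotype with probability proportional to how often it has already been used, and creates a new haplotype with probability proportional to the concentration parameter. Below is a minimal Chinese-restaurant-process sketch of that prior alone, not the DP-Haplotyper likelihood or its MCMC sampler; the names and defaults are illustrative.

import random

def sample_haplotype_pool(n_draws, alpha=1.0, new_haplotype=lambda: object()):
    # Chinese restaurant process: shows how a Dirichlet process prior keeps the
    # pool of distinct haplotypes small. `new_haplotype` stands in for drawing a
    # fresh haplotype from the base distribution (a placeholder).
    pool, counts = [], []
    for i in range(n_draws):
        r = random.uniform(0, i + alpha)
        acc = 0.0
        for j, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[j] += 1       # reuse haplotype j with prob c / (i + alpha)
                break
        else:
            pool.append(new_haplotype())  # new haplotype with prob alpha / (i + alpha)
            counts.append(1)
    return pool, counts

pool, counts = sample_haplotype_pool(200, alpha=2.0)
print(len(pool), "distinct haplotypes used by 200 draws; first counts:", counts[:5])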

17.
18.
Haplotype inference from phase-ambiguous multilocus genotype data is an important task for both disease-gene mapping and studies of human evolution. We report a novel haplotype-inference method based on a coalescence-guided hierarchical Bayes model. In this model, a hierarchical structure is imposed on the prior haplotype frequency distributions to capture the similarities among modern-day haplotypes attributable to their common ancestry. As a consequence, the model both allows distinct haplotypes to have different a priori probabilities according to the inferred hierarchical ancestral structure and results in a proper joint posterior distribution for all the parameters of interest. A Markov chain Monte Carlo scheme is designed to draw from this posterior distribution. Using coalescence-based simulation and empirically generated data sets (the Whitehead Institute's inflammatory bowel disease data sets and HapMap data sets), we demonstrate the merits of the new method in comparison with HAPLOTYPER and PHASE, with or without the presence of recombination hotspots and missing genotypes.

19.
Summary: Phylogenetic trees requiring the lowest sum of nucleotide replacements and gene duplicative events were constructed from the amino acid sequence data on ten gnathostome parvalbumins (PAR) and two related myofibrillar proteins, troponin-C (TNC) and myosin alkali light chain (ALC). The origin and differentiation of the structural domains within these proteins were also investigated by the maximum parsimony method and by an alignment statistic for identifying evolutionarily related protein sequences. The results suggest, in agreement with the Weeds-McLachlan model, that tandem duplications in a precursor gene caused a primordial one-domain polypeptide (consisting of two helices with a calcium-binding region in between) to double and then quadruple in size. Duplications of the gene coding for this four-domain (I–II–III–IV) protein in an early metazoan, pre-gnathostome lineage gave rise to the separate loci for TNC, ALC, and PAR. TNC, which alone retained the Ca-binding function in each of its four domains, evolved much more slowly than either the ALC or PAR lineages. In the PAR lineage the I–II–III–IV structure was degraded, presumably by a partial gene deletion, to the II–III–IV structure during descent to the gnathostome ancestor of parvalbumins. Also during this period the mid region in domain II lost its Ca-binding function and, as it did so, evolved at an accelerated rate relative to other regions, a pattern indicative of positive selection for a change in function. In turn, from the gnathostome ancestor to the present, the mid regions of domains III and IV, which each retained Ca-binding function, evolved much more slowly than other regions, a pattern indicative of stabilizing selection for preservation of function. Between the gnathostome ancestor and the teleost-tetrapod ancestor, a gene duplication separated the parvalbumins into an α-lineage and a β-lineage. During this early vertebrate period PAR genes evolved at the extremely fast rate of 89 nucleotide replacements per 100 codons per 10^8 years (i.e., 89 NR%), but from the teleost-tetrapod ancestor to the present, both α- and β-PAR lineages evolved at a much slower rate, about 8 NR%. The use of parvalbumins as phylogenetic markers was complicated by presumptive evidence that paralogous (i.e., duplication-dependent) gene lineages occur within this group. As a final point, in the genealogy of the TNC, ALC, and PAR lineages, a non-random pattern of nucleotide replacements was observed between the reconstructed ancestral and descendant mRNA sequences. The pattern was similar to that observed for other protein genealogies and seems to reflect a bias in the genetic code for guanine-to-adenine and adenine-to-guanine transitions (especially at the first nucleotide position of the RNA codons) to produce amino acid substitutions that are compatible with the preservation of protein three-dimensional structure.
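
For readers unfamiliar with the rate unit, NR% as used above normalizes the observed replacements to a 100-codon protein and a 10^8-year interval. With illustrative numbers that are not taken from the paper, a 108-codon protein accumulating 10 replacements over 5 x 10^7 years evolves at

\[
10 \times \frac{100}{108} \times \frac{10^{8}}{5 \times 10^{7}} \approx 18.5\ \text{NR\%}.
\]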

20.
Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.
