Similar literature
 20 similar articles found; search time 718 ms
1.
Estimating haplotype frequencies becomes increasingly important in the mapping of complex disease genes, as millions of single nucleotide polymorphisms (SNPs) are being identified and genotyped. When genotypes at multiple SNP loci are gathered from unrelated individuals, haplotype frequencies can be accurately estimated using expectation-maximization (EM) algorithms (Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995), with standard errors estimated using bootstraps. However, because the number of possible haplotypes increases exponentially with the number of SNPs, handling data with a large number of SNPs poses a computational challenge for the EM methods and for other haplotype inference methods. To solve this problem, Niu and colleagues, in their Bayesian haplotype inference paper (Niu et al., 2002), introduced a computational algorithm called progressive ligation (PL). But their Bayesian method has a limitation on the number of subjects (no more than 100 subjects in the current implementation of the method). In this paper, we propose a new method in which we use the same likelihood formulation as in Excoffier and Slatkin's EM algorithm and apply the estimating equation idea and the PL computational algorithm with some modifications. Our proposed method can handle data sets with a large number of SNPs as well as a large number of subjects. Simultaneously, our method estimates standard errors efficiently, using the sandwich estimate from the estimating equation, rather than the bootstrap method. Additionally, our method admits missing data and produces valid estimates of parameters and their standard errors under the assumption that the missing genotypes are missing at random in the sense defined by Rubin (1976).
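
As a rough illustration of the EM idea this abstract builds on, the sketch below estimates haplotype frequencies for only two biallelic SNPs from unphased genotypes. It is not the estimating-equation or progressive-ligation method proposed in the paper. The function name `em_two_snp`, the 0/1/2 genotype coding, and the uniform starting values are all assumptions made for this example. With two SNPs, only the double heterozygote has ambiguous phase, so the E-step reduces to a single weight.

```python
from collections import Counter

HAPLOTYPES = ("00", "01", "10", "11")
ALLELES = {0: (0, 0), 1: (0, 1), 2: (1, 1)}  # genotype -> the two alleles carried


def em_two_snp(genotypes, n_iter=200, tol=1e-12):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: iterable of (g1, g2), each g = copies of allele 1 (0, 1 or 2).
    """
    counts = Counter(genotypes)
    n_chrom = 2 * sum(counts.values())
    n_double_het = counts.get((1, 1), 0)

    # Haplotype counts that are fixed by unambiguous genotypes.
    base = dict.fromkeys(HAPLOTYPES, 0.0)
    for (g1, g2), n in counts.items():
        if (g1, g2) == (1, 1):
            continue  # phase unknown: handled in the E-step
        a1, a2 = ALLELES[g1], ALLELES[g2]
        for k in (0, 1):
            base[f"{a1[k]}{a2[k]}"] += n

    freqs = dict.fromkeys(HAPLOTYPES, 0.25)  # uniform starting values
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is 00/11 rather than 01/10.
        cis = freqs["00"] * freqs["11"]
        trans = freqs["01"] * freqs["10"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected = {"00": w, "11": w, "01": 1 - w, "10": 1 - w}
        # M-step: renormalise the expected haplotype counts.
        new = {h: (base[h] + n_double_het * expected[h]) / n_chrom for h in HAPLOTYPES}
        converged = max(abs(new[h] - freqs[h]) for h in HAPLOTYPES) < tol
        freqs = new
        if converged:
            break
    return freqs
```

With more SNPs the number of candidate haplotype pairs per genotype grows exponentially, which is the computational problem the abstract describes.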

2.
OBJECTIVES: Linkage disequilibrium (LD) between closely spaced SNPs can be accommodated in linkage analysis by specifying the multi-SNP haplotype frequencies, if known. Phased haplotypes in candidate regions can provide gold standard haplotype frequency estimates, and may be of inherent interest as markers. We evaluated the effects of different methods of haplotype frequency estimation, and the use of marker phase information, on linkage analysis of a multi-SNP cluster in a candidate region for Alzheimer's disease (AD). METHODS: We performed parametric linkage analysis of a five-SNP cluster in extended pedigrees to compare the use of: (1) haplotype frequencies estimated by molecular phase determination, maximum likelihood estimation, or by assuming linkage equilibrium (LE); (2) AD families or controls as the frequency source; and (3) unphased or molecularly phased SNP data. RESULTS: There was moderate to strong pairwise LD among the five SNPs. Falsely assuming LE substantially inflated the LOD score, but the method of haplotype frequency estimation and particular sample used made little difference provided that LD was accommodated. Use of phased haplotypes produced a modest increase in the LOD score over unphased SNPs. CONCLUSIONS: Ignoring LD between markers can lead to substantially inflated evidence for linkage in LOD score analysis of extended pedigrees with missing data. Use of marker phase information in linkage analysis may be important in disease studies where the costs of family recruitment and phenotyping greatly exceed the costs of phase determination.
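
To make the contrast between LD and the linkage-equilibrium assumption concrete, here is a minimal sketch for a pair of biallelic SNPs. It computes the standard pairwise measures D and r² from two-SNP haplotype frequencies, together with the frequencies that LE would imply (products of allele frequencies). The function names and the dictionary layout keyed by "00", "01", "10", "11" are assumptions for illustration and are not taken from the study.

```python
def ld_stats(freqs):
    """Return (D, r2) for haplotype frequencies keyed by '00', '01', '10', '11'."""
    p_a = freqs["10"] + freqs["11"]  # frequency of allele 1 at the first SNP
    p_b = freqs["01"] + freqs["11"]  # frequency of allele 1 at the second SNP
    d = freqs["11"] - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2


def le_frequencies(freqs):
    """Haplotype frequencies implied by linkage equilibrium (D = 0)."""
    p_a = freqs["10"] + freqs["11"]
    p_b = freqs["01"] + freqs["11"]
    return {
        "00": (1 - p_a) * (1 - p_b),
        "01": (1 - p_a) * p_b,
        "10": p_a * (1 - p_b),
        "11": p_a * p_b,
    }
```

When r² is far from zero, the LE frequencies differ markedly from the observed ones, which is the situation in which the abstract reports inflated LOD scores.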

3.
MOTIVATION: Haplotype reconstruction is an essential step in genetic linkage and association studies. Although many methods have been developed to estimate haplotype frequencies and reconstruct haplotypes for a sample of unrelated individuals, haplotype reconstruction in large pedigrees with a large number of genetic markers remains a challenging problem. METHODS: We have developed an efficient computer program, HAPLORE (HAPLOtype REconstruction), to identify all haplotype sets that are compatible with the observed genotypes in a pedigree for tightly linked genetic markers. HAPLORE consists of three steps that can serve different needs in applications. In the first step, a set of logic rules is used to reduce the number of compatible haplotypes of each individual in the pedigree as much as possible. After this step, the haplotypes of all individuals in the pedigree can be completely or partially determined. These logic rules are applicable to completely linked markers and they can be used to impute missing data and check genotyping errors. In the second step, a haplotype-elimination algorithm similar to the genotype-elimination algorithms used in linkage analysis is applied to delete incompatible haplotypes derived from the first step. All superfluous haplotypes of the pedigree members will be excluded after this step. In the third step, the expectation-maximization (EM) algorithm combined with the partition and ligation technique is used to estimate haplotype frequencies based on the inferred haplotype configurations through the first two steps. Only compatible haplotype configurations with haplotypes having frequencies greater than a threshold are retained. RESULTS: We test the effectiveness and the efficiency of HAPLORE using both simulated and real datasets. Our results show that the rule-based algorithm is very efficient for completely genotyped pedigrees. In this case, almost all of the families have one unique haplotype configuration. In the presence of missing data, the number of compatible haplotypes can be substantially reduced by HAPLORE, and the program will provide all possible haplotype configurations of a pedigree under different circumstances, if such multiple configurations exist. These inferred haplotype configurations, as well as the haplotype frequencies estimated by the EM algorithm, can be used in genetic linkage and association studies. AVAILABILITY: The program can be downloaded from http://bioinformatics.med.yale.edu.

4.
Current routine genotyping methods typically do not provide haplotype information, which is essential for many analyses of fine-scale molecular-genetics data. Haplotypes can be obtained, at considerable cost, experimentally or (partially) through genotyping of additional family members. Alternatively, a statistical method can be used to infer phase and to reconstruct haplotypes. We present a new statistical method, applicable to genotype data at linked loci from a population sample, that improves substantially on current algorithms; often, error rates are reduced by > 50%, relative to its nearest competitor. Furthermore, our algorithm performs well in absolute terms, suggesting that reconstructing haplotypes experimentally or by genotyping additional family members may be an inefficient use of resources.

5.
Browning SR. Genetics, 2008, 178(4): 2123-2132
I present a new approach for calculating probabilities of identity by descent for pairs of haplotypes. The approach is based on a joint hidden Markov model for haplotype frequencies and identity by descent (IBD). This model allows for linkage disequilibrium, and the method can be applied to very dense marker data. The method has high power for detecting IBD tracts of genetic length of 1 cM, with the use of sufficiently dense markers. This enables detection of pairwise IBD between haplotypes from individuals whose most recent common ancestor lived up to 50 generations ago.

6.
A new method for haplotype inference including full-sib information (total citations: 1; self-citations: 0; citations by others: 1)
Ding XD, Simianer H, Zhang Q. Genetics, 2007, 177(3): 1929-1940
Recent literature has suggested that haplotype inference through close relatives, especially from nuclear families, can be an alternative strategy in determining linkage phase and estimating haplotype frequencies. For the case in which parental genotypes cannot be obtained and only full-sib information is used, a new approach is suggested to infer phase and to reconstruct haplotypes. We present a maximum-likelihood method via an expectation-maximization algorithm, called FSHAP, using only full-sib information when parent information is not available. FSHAP can deal with families with an arbitrary number of children, and missing parents or missing genotypes can be handled as well. In a simulation study we compare FSHAP with another existing expectation-maximization (EM)-based approach (FAMHAP), the conditioning approach implemented in FBAT and GENEHUNTER, which is only pedigree based and assumes linkage equilibrium. In most situations, FSHAP has the smallest discrepancy of haplotype frequency estimation and the lowest error rate in haplotype reconstruction; only in some cases does FAMHAP yield comparable results. GENEHUNTER produces the largest discrepancy, and FBAT produces the highest error rate in offspring in most situations. Among the methods compared, FSHAP has the highest accuracy in reconstructing the diplotypes of the unavailable parents. Potential limitations of the method, e.g., in analyzing very large haplotypes, are indicated and possible solutions are discussed.

7.
Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows–Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale data sets with millions of samples. Furthermore, we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computes against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis, exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for noncommercial use in the code repository (https://github.com/23andMe/phasedibd, last accessed January 11, 2021).

8.
Data mining applied to linkage disequilibrium mapping (total citations: 11; self-citations: 11; citations by others: 11)
We introduce a new method for linkage disequilibrium mapping: haplotype pattern mining (HPM). The method, inspired by data mining methods, is based on discovery of recurrent patterns. We define a class of useful haplotype patterns in genetic case-control data and use the algorithm for finding disease-associated haplotypes. The haplotypes are ordered by their strength of association with the phenotype, and all haplotypes exceeding a given threshold level are used for prediction of disease susceptibility-gene location. The method is model-free, in the sense that it does not require (and is unable to utilize) any assumptions about the inheritance model of the disease. The statistical model is nonparametric. The haplotypes are allowed to contain gaps, which improves the method's robustness to mutations and to missing and erroneous data. Experimental studies with simulated microsatellite and SNP data show that the method has good localization power in data sets with large degrees of phenocopies and with lots of missing and erroneous data. The power of HPM is roughly identical for marker maps at a density of 3 single-nucleotide polymorphisms/cM or 1 microsatellite/cM. The capacity to handle high proportions of phenocopies makes the method promising for complex disease mapping. An example of correct disease susceptibility-gene localization with HPM is given with real marker data from families from the United Kingdom affected by type 1 diabetes. The method is extendable to include environmental covariates or phenotype measurements or to find several genes simultaneously.

9.
Many existing cohorts contain a range of relatedness between genotyped individuals, either by design or by chance. Haplotype estimation in such cohorts is a central step in many downstream analyses. Using genotypes from six cohorts from isolated populations and two cohorts from non-isolated populations, we have investigated the performance of different phasing methods designed for nominally ‘unrelated’ individuals. We find that SHAPEIT2 produces much lower switch error rates in all cohorts compared to other methods, including those designed specifically for isolated populations. In particular, when large amounts of IBD sharing are present, SHAPEIT2 infers close to perfect haplotypes. Based on these results we have developed a general strategy for phasing cohorts with any level of implicit or explicit relatedness between individuals. First SHAPEIT2 is run ignoring all explicit family information. We then apply a novel HMM method (duoHMM) to combine the SHAPEIT2 haplotypes with any family information to infer the inheritance pattern of each meiosis at all sites across each chromosome. This allows the correction of switch errors, detection of recombination events and genotyping errors. We show that the method detects numbers of recombination events that align very well with expectations based on genetic maps, and that it infers far fewer spurious recombination events than Merlin. The method can also detect genotyping errors and infer recombination events in otherwise uninformative families, such as trios and duos. The detected recombination events can be used in association scans for recombination phenotypes. The method provides a simple and unified approach to haplotype estimation that will be of interest to researchers in the fields of human, animal and plant genetics.
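
The switch error rate used as the accuracy measure in this abstract can be sketched as follows. This is a generic illustration of the usual definition: the fraction of consecutive heterozygous-site pairs whose relative phase differs from the truth. It is not code from SHAPEIT2 or duoHMM, and the function name and the string encoding of haplotypes are assumptions. The sketch also assumes the inferred haplotype is consistent with the true genotype at every site.

```python
def switch_error_rate(true_h1, true_h2, inferred_h1):
    """Switch error rate of one inferred haplotype against the true pair.

    All arguments are equal-length sequences of alleles (e.g. strings of '0'/'1').
    """
    # Only heterozygous sites carry phase information.
    het = [i for i, (a, b) in enumerate(zip(true_h1, true_h2)) if a != b]
    if len(het) < 2:
        return 0.0  # no pair of het sites, so no switch is possible
    # At each het site: does the inferred haplotype follow true_h1 (True) or true_h2?
    follows_h1 = [inferred_h1[i] == true_h1[i] for i in het]
    # A switch occurs whenever that assignment flips between neighbouring het sites.
    switches = sum(1 for x, y in zip(follows_h1, follows_h1[1:]) if x != y)
    return switches / (len(het) - 1)
```

Because only the relative phase between neighbouring sites is scored, a haplotype that matches the second true haplotype throughout has a switch error rate of zero.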

10.
In genetic studies the haplotype structure of the regarded population is expected to carry important information. Experimental methods to derive haplotypes, however, are expensive and none of them has yet become standard methodology. On the other hand, maximum likelihood haplotype estimation from unphased individual genotypes may incur inaccuracies. We therefore investigated the relative efficiency of haplotype frequency estimation when nuclear family information is included compared to estimation from experimentally derived haplotypes. Efficiency was measured in terms of variance ratios of the estimates. The variances were derived from the binomial distribution for experimentally derived haplotypes, and from the Fisher information matrix corresponding to the general likelihood function of the haplotype frequency parameters, including family information. We subsequently compared these variance ratios to the variance ratios for the case of estimation from individual genotypes. We found that the information gained from a single child compensates for missing phase information to a high degree, resulting in estimates almost as reliable as those derived from observed haplotypes. Thus, if children have already been genotyped for other reasons, it is highly recommended to include them in the estimation. If child information is not already present, whether it is useful to genotype a single child just to reduce phase ambiguity depends on the number of loci and the haplotype diversity. In general, if the number of loci is less than or equal to three or if the number of haplotypes with a frequency >5% is less than or equal to four, haplotype estimation from individuals is already quite good and the improvement gained from a single child cannot compensate for the genotyping effort it requires. On the other hand, under scenarios with many loci and high haplotype diversity, haplotype frequency estimation from trios can be more efficient than haplotype frequency estimation from individuals, even on a per-genotype basis.
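
The variance-ratio comparison described in this abstract can be written out as below. This is a sketch under stated assumptions rather than the authors' notation: $n$ denotes the number of individuals, $p_h$ the frequency of haplotype $h$, and $I(\mathbf{p})$ the Fisher information matrix of the genotype-based likelihood; the direction of the ratio is also an assumption.

```latex
% Directly observed haplotypes: 2n chromosomes, binomial sampling
\operatorname{Var}\bigl(\hat{p}_h^{\text{obs}}\bigr) = \frac{p_h\,(1 - p_h)}{2n}

% Estimation from unphased genotypes: asymptotic variance from the
% inverse Fisher information of the haplotype-frequency likelihood
\operatorname{Var}\bigl(\hat{p}_h^{\text{est}}\bigr) \approx \bigl[I(\mathbf{p})^{-1}\bigr]_{hh}

% Relative efficiency of the genotype-based estimate (1 = no loss)
\operatorname{RE}_h = \frac{\operatorname{Var}\bigl(\hat{p}_h^{\text{obs}}\bigr)}{\operatorname{Var}\bigl(\hat{p}_h^{\text{est}}\bigr)}
```

Under this convention, a ratio close to 1 corresponds to the abstract's finding that estimates including a single child are almost as reliable as those from observed haplotypes.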

11.
We present a method to perform fine mapping by placing haplotypes into clusters on the basis of risk. Each cluster has a haplotype "center." Cluster allocation is defined according to haplotype centers, with each haplotype assigned to the cluster with the "closest" center. The closeness of two haplotypes is determined by a similarity metric that measures the length of the shared segment around the location of a putative functional mutation for the particular cluster. Our method allows for missing marker information but still estimates the risks of complete haplotypes without resorting to a one-marker-at-a-time analysis. The dimensionality issues that can occur in haplotype analyses are removed by sampling over the haplotype space, allowing for estimation of haplotype risks without explicitly assigning a parameter to each haplotype to be estimated. In this way, we are able to handle haplotypes of arbitrary size. Furthermore, our clustering approach has the potential to allow us to detect the presence of multiple functional mutations.

12.

Background  

Genomewide association studies have resulted in a great many genomic regions that are likely to harbor disease genes. Thorough interrogation of these specific regions is the logical next step, including regional haplotype studies to identify risk haplotypes upon which the underlying critical variants lie. Pedigrees ascertained for disease can be powerful for genetic analysis due to the cases being enriched for genetic disease. Here we present a Monte Carlo based method to perform haplotype association analysis. Our method, hapMC, allows for the analysis of full-length and sub-haplotypes, including imputation of missing data, in resources of nuclear families, general pedigrees, case-control data or mixtures thereof. Both traditional association statistics and transmission/disequilibrium statistics can be performed. The method includes a phasing algorithm that can be used in large pedigrees and optional use of pseudocontrols.

13.
14.
Gene mapping and genetic epidemiology require large-scale computation of likelihoods based on human pedigree data. Although computation of such likelihoods has become increasingly sophisticated, fast calculations are still impeded by complex pedigree structures, by models with many underlying loci and by missing observations on key family members. The current paper introduces a new method of array factorization that substantially accelerates linkage calculations with large numbers of markers. This method is not limited to nuclear families or to families with complete phenotyping. Vectorization and parallelization are two general-purpose hardware techniques for accelerating computations. These techniques can assist in the rapid calculation of genetic likelihoods. We describe our experience using both of these methods with the existing program MENDEL. A vectorized version of MENDEL was run on an IBM 3090 supercomputer. A parallelized version of MENDEL was run on parallel machines of different architectures and on a network of workstations. Applying these revised versions of MENDEL to two challenging linkage problems yields substantial improvements in computational speed.

15.
Segments of identity-by-descent (IBD) detected from high-density genetic data are useful for many applications, including long-range phase determination, phasing family data, imputation, IBD mapping, and heritability analysis in founder populations. We present Refined IBD, a new method for IBD segment detection. Refined IBD achieves both computational efficiency and highly accurate IBD segment reporting by searching for IBD in two steps. The first step (identification) uses the GERMLINE algorithm to find shared haplotypes exceeding a length threshold. The second step (refinement) evaluates candidate segments with a probabilistic approach to assess the evidence for IBD. Like GERMLINE, Refined IBD allows for IBD reporting on a haplotype level, which facilitates determination of multi-individual IBD and allows for haplotype-based downstream analyses. To investigate the properties of Refined IBD, we simulate SNP data from a model with recent superexponential population growth that is designed to match United Kingdom data. The simulation results show that Refined IBD achieves a better power/accuracy profile than fastIBD or GERMLINE. We find that a single run of Refined IBD achieves greater power than 10 runs of fastIBD. We also apply Refined IBD to SNP data for samples from the United Kingdom and from Northern Finland and describe the IBD sharing in these data sets. Refined IBD is powerful, highly accurate, and easy to use and is implemented in Beagle version 4.
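
The identification step described in this abstract, finding shared haplotype stretches that exceed a length threshold, can be illustrated with a deliberately naive pairwise scan. This is not the GERMLINE algorithm and not the Refined IBD implementation in Beagle. It measures length in number of markers rather than genetic distance, tolerates no mismatches, and the function name and half-open interval convention are assumptions made for this example.

```python
def shared_segments(hap_a, hap_b, min_len):
    """Return half-open (start, end) marker intervals where two haplotypes are
    identical for at least min_len consecutive markers."""
    segments = []
    start = None  # index where the current run of identical alleles began
    n = min(len(hap_a), len(hap_b))
    for i in range(n):
        if hap_a[i] == hap_b[i]:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the last marker.
    if start is not None and n - start >= min_len:
        segments.append((start, n))
    return segments
```

A candidate list of this kind would then be passed to a probabilistic refinement step, as the abstract describes, since identity by state over a stretch does not by itself establish IBD.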

16.
Dense genotype data can be used to detect chromosome fragments inherited from a common ancestor in apparently unrelated individuals. A disease-causing mutation inherited from a common founder may thus be detected by searching for a common haplotype signature in a sample population of patients. We present here FounderTracker, a computational method for the genome-wide detection of founder mutations in cancer using dense tumor SNP profiles. Our method is based on two assumptions. First, the wild-type allele frequently undergoes loss of heterozygosity (LOH) in the tumors of germline mutation carriers. Second, the overlap between the ancestral chromosome fragments inherited from a common founder will define a minimal haplotype conserved in each patient carrying the founder mutation. Our approach thus relies on the detection of haplotypes with significant identity by descent (IBD) sharing within recurrent regions of LOH to highlight genomic loci likely to harbor a founder mutation. We validated this approach by analyzing two real cancer data sets in which we successfully identified founder mutations of well-characterized tumor suppressor genes. We then used simulated data to evaluate the ability of our method to detect IBD tracts as a function of their size and frequency. We show that FounderTracker can detect haplotypes of low prevalence with high power and specificity, significantly outperforming existing methods. FounderTracker is thus a powerful tool for discovering unknown founder mutations that may explain part of the "missing" heritability in cancer. This method is freely available and can be used online at the FounderTracker website.

17.
OBJECTIVE: The Association in the Presence of Linkage test (APL) is a powerful statistical method that allows for missing parental genotypes in nuclear families. However, in its original form, the statistic does not easily extend to mixed nuclear family structures nor to multiple-marker haplotypes. Furthermore, the robustness of APL in practice has not been examined. Here we present a generalization of the APL model and examination of its robustness under a variety of non-standard scenarios. METHODS: The generalization is made possible by incorporating a bootstrap variance estimator instead of the original robust variance estimator. This allows for use of more than two affected siblings. Haplotype analysis was accomplished by combining estimation of haplotype phase into the EM algorithm. Computer simulation was used to examine robustness of the APL to departures from test assumptions. RESULTS: The extended APL tests both single-marker and multiple-marker haplotypes and shows more power than other association methods. Simulation results showed that the single-marker APL test is robust to the departure from HWE. For the haplotype test, violation of the HWE assumption can inflate type I error. We also evaluated general guidelines for the validity of APL with rare alleles and rare haplotypes. Software for the APL test is available from http://www.chg.duke.edu/research/apl.html.

18.
To map quantitative trait loci (QTL) for growth and carcass traits in a purebred Japanese Black cattle population, we conducted multiple QTL analyses using 15 paternal half-sib families comprising 7860 offspring. We identified 40 QTL with significant linkages at false discovery rates of less than 0.1, which included 12 for intramuscular fat deposition called marbling and 12 for cold carcass weight or body weight. The QTL each explained 2%–13% of the phenotypic variance. These QTL included many replications and shared hypothetical identical-by-descent (IBD) alleles. The QTL for CW on BTA14 was replicated in five families with significant linkages and in two families with a 1% chromosome-wise significance level. The seven sires shared a 1.1-Mb superior Q haplotype as a hypothetical IBD allele that corresponds to the critical region previously refined by linkage disequilibrium mapping. The QTL for marbling on BTA4 was replicated in two families with significant linkages. The QTL for marbling on BTA6, 7, 9, 10, 20, and 21 and the QTL for body weight on BTA6 were replicated with 1% and/or 5% chromosome-wise significance levels. There were shared IBD Q or q haplotypes in the marbling QTL on BTA4, 6, and 10. The allele substitution effect of these haplotypes ranged from 0.7 to 1.2, and an additive effect between the marbling QTL on BTA6 and 10 was observed in the family examined. The abundant and replicated QTL information will enhance the opportunities for positional cloning of causative genes for the quantitative traits and efficient breeding using marker-assisted selection. Electronic Supplementary Material: Electronic supplementary material is available for this article at and accessible for authorised users.

19.
A deductive method of haplotype analysis in pedigrees. (total citations: 13; self-citations: 4; citations by others: 9)
Derivation of haplotypes from pedigree data by means of likelihood techniques requires large computational resources and is thus highly limited in terms of the complexity of problems that can be analyzed. The present paper presents 20 rules of logic that are both necessary and sufficient for deriving haplotypes by means of nonstatistical techniques. As a result, automated haplotype analysis that uses these rules is fast and efficient, requiring computer memory that increases only linearly (rather than exponentially) with family size and the number of factors under analysis. Some error analysis is also possible. The rules are completely general with regard to any system of completely linked, discrete genetic markers that are autosomally inherited. There are no limitations on pedigree structure or the amount of missing data, although the existence of incomplete data usually reduces the fraction of haplotypes that can be completely determined.

20.
BACKGROUND: A common genetic basis for IgA deficiency (IgAD) and common variable immunodeficiency (CVID) is suggested by their occurrence in members of the same family and the similarity of the underlying B cell differentiation defects. An association between IgAD/CVID and HLA alleles DR3, B8, and A1 has also been documented. In a search for the gene(s) in the major histocompatibility complex (MHC) that predispose to IgAD/CVID, we analyzed the extended MHC haplotypes present in a large family with 8 affected members. MATERIALS AND METHODS: We examined the CVID proband, 72 immediate relatives, and 21 spouses, and determined their serum immunoglobulin concentrations. The MHC haplotype analysis of individual family members employed 21 allelic DNA and protein markers, including seven newly available microsatellite markers. RESULTS: Forty-one (56%) of the 73 relatives by common descent were heterozygous and nine (12%) were homozygous for a fragment or the entire extended MHC haplotype designated haplotype 1 that included HLA-DR3, -C4A-0, -B8, and -A1. The remarkable prevalence of haplotype 1 was due in part to marital introduction into the family of 11 different copies of the haplotype, eight sharing 20 identical genotype markers between HLA-DR3 and HLA-B8, and three that contained fragments of haplotype 1. CONCLUSION: Crossover events within the MHC indicated a susceptibility locus for IgAD/CVID between the class III markers D821/D823 and HLA-B8, a region populated by 21 genes that include tumor necrosis factor alpha and lymphotoxins alpha and beta. Inheritance of at least this fragment of haplotype 1 appears to be necessary for the development of IgAD/CVID in this family.
