Similar literature (20 results)
1.
Existing methods for identity by descent (IBD) segment detection were designed for SNP array data, not sequence data. Sequence data have a much higher density of genetic variants and a different allele frequency distribution, and can have higher genotype error rates. Consequently, best practices for IBD detection in SNP array data do not necessarily carry over to sequence data. We present a method, IBDseq, for detecting IBD segments in sequence data and a method, SEQERR, for estimating genotype error rates at low-frequency variants by using detected IBD. The IBDseq method estimates probabilities of genotypes observed with error for each pair of individuals under IBD and non-IBD models. The ratio of estimated probabilities under the two models gives a LOD score for IBD. We evaluate several IBD detection methods that are fast enough for application to sequence data (IBDseq, Beagle Refined IBD, PLINK, and GERMLINE) under multiple parameter settings, and we show that IBDseq achieves high power and accuracy for IBD detection in sequence data. The SEQERR method estimates genotype error rates by comparing observed and expected rates of pairs of homozygote and heterozygote genotypes at low-frequency variants in IBD segments. We demonstrate the accuracy of SEQERR in simulated data, and we apply the method to estimate genotype error rates in sequence data from the UK10K and 1000 Genomes projects.
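
The LOD score mentioned in this abstract can be illustrated with a minimal sketch. It is not IBDseq's actual model: the per-marker probabilities are hypothetical inputs, the function name `lod_score` is invented here, and markers are assumed to be independent so that the log ratios can simply be summed.

```python
import math


def lod_score(p_ibd, p_non_ibd):
    """Sum of per-marker log10 likelihood ratios (IBD vs. non-IBD).

    p_ibd and p_non_ibd are hypothetical per-marker probabilities of the
    observed genotype pair under each model. Treating markers as
    independent is a simplifying assumption of this sketch.
    """
    if len(p_ibd) != len(p_non_ibd):
        raise ValueError("probability lists must have equal length")
    return sum(math.log10(a / b) for a, b in zip(p_ibd, p_non_ibd))
```

A positive score favors the IBD model over the non-IBD model for the markers considered, and a negative score favors the non-IBD model.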

2.
The determination of the relationship between a pair of individuals is a fundamental application of genetics. Previously, we and others have demonstrated that identity-by-descent (IBD) information generated from high-density single-nucleotide polymorphism (SNP) data can greatly improve the power and accuracy of genetic relationship detection. Whole-genome sequencing (WGS) marks the final step in increasing genetic marker density by assaying all single-nucleotide variants (SNVs), and thus has the potential to further improve relationship detection by enabling more accurate detection of IBD segments and more precise resolution of IBD segment boundaries. However, WGS introduces new complexities that must be addressed in order to achieve these improvements in relationship detection. To evaluate these complexities, we estimated genetic relationships from WGS data for 1490 known pairwise relationships among 258 individuals in 30 families along with 46 population samples as controls. We identified several genomic regions with excess pairwise IBD in both the pedigree and control datasets using three established IBD methods: GERMLINE, fastIBD, and ISCA. These spurious IBD segments produced a 10-fold increase in the rate of detected false-positive relationships among controls compared to high-density microarray datasets. To address this issue, we developed a new method to identify and mask genomic regions with excess IBD. This method, implemented in ERSA 2.0, fully resolved the inflated cryptic relationship detection rates while improving relationship estimation accuracy. ERSA 2.0 detected all 1st through 6th degree relationships, and 55% of 9th through 11th degree relationships in the 30 families. We estimate that WGS data provides a 5% to 15% increase in relationship detection power relative to high-density microarray data for distant relationships. 
Our results identify regions of the genome that are highly problematic for IBD mapping and introduce new software to accurately detect 1st through 9th degree relationships from whole-genome sequence data.

3.
We present a method, fastIBD, for finding tracts of identity by descent (IBD) between pairs of individuals. FastIBD can be applied to thousands of samples across genome-wide SNP data and is significantly more powerful for finding short tracts of IBD than existing methods for finding IBD tracts in such data. We show that fastIBD can detect facets of population structure that are not revealed by other methods. In the Wellcome Trust Case Control Consortium bipolar disorder case-control data, we find a genome-wide excess of IBD in case-case pairs of individuals compared to control-control pairs. We show that this excess can be explained by the geographical clustering of cases. We also show that it is possible to use fastIBD to generate highly accurate estimates of genome-wide IBD sharing between pairs of distant relatives. This is useful for estimation of relationship and for adjusting for relatedness in association studies. FastIBD is incorporated in the freely available Beagle software package.

4.

Background

Identity by descent (IBD) has played a fundamental role in the discovery of genetic loci underlying human diseases. Both pedigree-based and population-based linkage analyses rely on estimating recent IBD, and evidence of ancient IBD can be used to detect population structure in genetic association studies. Various methods for detecting IBD, including those implemented in the software programs fastIBD and GERMLINE, have been developed in the past several years using population genotype data from microarray platforms. Now, next-generation DNA sequencing data are becoming increasingly available, enabling the comprehensive analysis of genomes, including identifying rare variants. These sequencing data may provide an opportunity to detect IBD with higher resolution than previously possible, potentially enabling the detection of disease-causing loci that were previously undetectable with sparser genetic data.

Results

Here, we investigate how different levels of variant coverage in sequencing and microarray genotype data influence the resolution at which IBD can be detected. This includes microarray genotype data from the WTCCC study, denser genotype data from the HapMap Project, low coverage sequencing data from the 1000 Genomes Project, and deep coverage complete genome data from our own projects. With high power (78%), we can detect segments of length 0.4 cM or larger using fastIBD and GERMLINE in sequencing data. This compares to similar power to detect segments of length 1.0 cM or higher with microarray genotype data. We find that GERMLINE has slightly higher power than fastIBD for detecting IBD segments using sequencing data, but also has a much higher false positive rate.

Conclusion

We further quantify the effect of variant density, conditional on genetic map length, on the power to resolve IBD segments. These investigations into IBD resolution may help guide the design of future next-generation sequencing studies that utilize IBD, including family-based association studies, association studies in admixed populations, and homozygosity mapping studies.

5.
The detection of genetic segments that are identical by descent (IBD) in genome-wide association studies has proven successful in pinpointing genetic relatedness between reportedly unrelated individuals and leveraging such regions to shortlist candidate genes. These techniques depend on high-density genotyping arrays and their effectiveness in diverse sequence data is largely unknown. Due to decreasing costs and increasing effectiveness of high throughput techniques for whole-exome sequencing, an influx of exome sequencing data has become available. Studies using exomes and IBD-detection methods within known pedigrees have shown that IBD can be useful in finding hidden genetic candidates where known relatives are available. We set out to examine the viability of using IBD-detection in whole exome sequencing data in population-wide studies. In doing so, we extend GERMLINE, a method to detect IBD from exome sequencing data by finding small slices of matching alleles between pairs of individuals and extending them into full IBD segments. This algorithm allows for efficient population-wide detection in dense data. We apply this algorithm to a cohort of Crohn's disease cases where whole-exome and GWAS array data are available. We confirm that GWAS-based detected segments are highly accurate and predictive of underlying shared variation. Where segments inferred from GWAS are expected to be of high accuracy, we compare exome-based detection accuracy of multiple detection strategies. We find detection accuracy to be prohibitively low in all assessments, both in terms of segment sensitivity and specificity. Even after isolating relatively long segments beyond 10 cM, exome-based detection continued to offer poor specificity/sensitivity tradeoffs. We hypothesize that the variable coverage and platform biases of exome capture account for this decreased accuracy and look toward whole genome sequencing data as a higher quality source for detecting population-wide IBD.
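
The seed-and-extend idea described in this abstract (find small slices of matching alleles, then extend them into segments) might be sketched as follows. This is a simplified illustration, not GERMLINE's implementation: the function names, the fixed non-overlapping windows, the requirement of exact matches, and the `min_windows` threshold are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, window):
    """Map each pair of sample names to the window indices where they match exactly."""
    n_sites = len(next(iter(haplotypes.values())))
    matches = defaultdict(list)
    for w, start in enumerate(range(0, n_sites - window + 1, window)):
        buckets = defaultdict(list)  # slice content -> samples carrying it
        for name, hap in haplotypes.items():
            buckets[tuple(hap[start:start + window])].append(name)
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                matches[(a, b)].append(w)
    return matches


def extend_segments(window_ids, window, min_windows=2):
    """Merge runs of consecutive matching windows into (start_site, end_site) segments."""
    segments, run = [], []
    for w in window_ids:
        if run and w != run[-1] + 1:
            # The run is broken: keep it only if it is long enough.
            if len(run) >= min_windows:
                segments.append((run[0] * window, (run[-1] + 1) * window))
            run = []
        run.append(w)
    if len(run) >= min_windows:
        segments.append((run[0] * window, (run[-1] + 1) * window))
    return segments
```

Hashing each slice into buckets means that only samples sharing a slice are ever compared, which is what keeps this kind of search tractable across a whole cohort. Segment ends are reported as half-open site indices.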

6.
Browning SR, Thompson EA. Genetics, 2012, 190(4): 1521-1531
Identity-by-descent (IBD) mapping tests whether cases share more segments of IBD around a putative causal variant than do controls. These segments of IBD can be accurately detected from genome-wide SNP data. We investigate the power of IBD mapping relative to that of SNP association testing for genome-wide case-control SNP data. Our focus is particularly on rare variants, as these tend to be more recent and hence more likely to have recent shared ancestry. We simulate data from both large and small populations and find that the relative performance of IBD mapping and SNP association testing depends on population demographic history and the strength of selection against causal variants. We also present an IBD mapping analysis of a type 1 diabetes data set. In those data we find that we can detect association only with the HLA region using IBD mapping. Overall, our results suggest that IBD mapping may have higher power than association analysis of SNP data when multiple rare causal variants are clustered within a gene. However, for outbred populations, very large sample sizes may be required for genome-wide significance unless the causal variants have strong effects.

7.
MOTIVATION: The technology to genotype single nucleotide polymorphisms (SNPs) at extremely high densities provides for hypothesis-free genome-wide scans for common polymorphisms associated with complex disease. However, we find that some errors introduced by commonly employed genotyping algorithms may lead to inflation of false associations between markers and phenotype. RESULTS: We have developed a novel SNP genotype calling program, SNiPer-High Density (SNiPer-HD), for highly accurate genotype calling across hundreds of thousands of SNPs. The program employs an expectation-maximization (EM) algorithm with parameters based on a training sample set. The algorithm choice allows for highly accurate genotyping for most SNPs. Also, we introduce a quality control metric for each assayed SNP, such that poorly behaving SNPs can be filtered using a metric correlating to genotype class separation in the calling algorithm. SNiPer-HD is superior to the standard dynamic modeling algorithm and is complementary and non-redundant to other algorithms, such as BRLMM. Implementing multiple algorithms together may provide highly accurate genotyping calls, without inflation of false positives due to systematically miscalled SNPs. A reliable and accurate set of SNP genotypes for increasingly dense panels will eliminate some false association signals and false negative signals, allowing for rapid identification of disease susceptibility loci for complex traits. AVAILABILITY: SNiPer-HD is available at TGen's website: http://www.tgen.org/neurogenomics/data.

8.
Fu & Curnow (1990) derive recursive equations to find the level of significance and power of a likelihood ratio test for a changed segment of specified length, based on the scan statistic, the maximum number of successes within the specified length. Their method is computationally feasible for segment lengths of 20 or less. We present and evaluate highly accurate approximations as well as bounds for the power function of this test that are computationally feasible even for very large segment lengths. We also evaluate power when the duration of the increased length used in the test statistic does not correspond to the actual length.
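
The scan statistic defined in this abstract, the maximum number of successes within a window of specified length, can be computed with a sliding window. The sketch below covers only the statistic itself, not the significance or power calculations the abstract discusses. The function name and the 0/1 encoding of trials are assumptions.

```python
def scan_statistic(trials, length):
    """Maximum number of successes in any window of `length` consecutive trials.

    `trials` is a sequence of 0/1 outcomes (1 = success).
    """
    if not 0 < length <= len(trials):
        raise ValueError("length must be between 1 and len(trials)")
    count = sum(trials[:length])  # successes in the first window
    best = count
    for i in range(length, len(trials)):
        # Slide the window one step: add the new trial, drop the oldest.
        count += trials[i] - trials[i - length]
        best = max(best, count)
    return best
```

Updating the running count rather than re-summing each window keeps the cost linear in the number of trials, which matters for the very large segment lengths the abstract mentions.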

9.
10.
Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows–Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale data sets with millions of samples. Furthermore, we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computes against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis, exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for noncommercial use in the code repository (https://github.com/23andMe/phasedibd, last accessed January 11, 2021).

11.
The availability of very dense genetic maps is changing in a fundamental way the methods used to identify the genetic basis of both rare and common inherited traits. The ability to directly compare the genomes of two related individuals and quickly identify those regions that are inherited identical-by-descent (IBD) from a recent common ancestor would be of utility in a wide range of genetic mapping methods. Here, we describe a simple method for using dense SNP maps to identify regions of the genome likely to be inherited IBD by family members. This method is based on identifying obligate recombination events and examining the pattern of distribution of such events along the genetic map. Specifically, we use the length of a consecutive set of biallelic markers that have a high probability of having avoided such obligate recombination events. This "SNP streak" is derived from subsets of samples within a pedigree and allows us to make statistical inferences about the ancestry of the region(s) containing stretches of markers with these properties. We show that the use of subsets of more than two samples has the advantage of identifying shorter shared subsegments as significant. This mitigates the effects of errors in SNP calls. We provide specific examples of microarray-based SNP data, using a family with a complex pedigree and with a rare form of inherited kidney disease, to illustrate this approach.

12.
GWAS have emerged as popular tools for identifying genetic variants that are associated with disease risk. Standard analysis of a case-control GWAS involves assessing the association between each individual genotyped SNP and disease risk. However, this approach suffers from limited reproducibility and difficulties in detecting multi-SNP and epistatic effects. As an alternative analytical strategy, we propose grouping SNPs together into SNP sets on the basis of proximity to genomic features such as genes or haplotype blocks, then testing the joint effect of each SNP set. Testing of each SNP set proceeds via the logistic kernel-machine-based test, which is based on a statistical framework that allows for flexible modeling of epistatic and nonlinear SNP effects. This flexibility and the ability to naturally adjust for covariate effects are important features of our test that make it appealing in comparison to individual SNP tests and existing multimarker tests. Using simulated data based on the International HapMap Project, we show that SNP-set testing can have improved power over standard individual-SNP analysis under a wide range of settings. In particular, we find that our approach has higher power than individual-SNP analysis when the median correlation between the disease-susceptibility variant and the genotyped SNPs is moderate to high. When the correlation is low, both individual-SNP analysis and the SNP-set analysis tend to have low power. We apply SNP-set analysis to analyze the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer GWAS discovery-phase data.

13.
Most molecular measures of inbreeding do not measure inbreeding at the scale that is most relevant for understanding inbreeding depression, namely the proportion of the genome that is identical-by-descent (IBD). The inbreeding coefficient FPed obtained from pedigrees is a valuable estimator of IBD, but pedigrees are not always available, and cannot capture inbreeding loops that reach back in time further than the pedigree. We here propose a molecular approach to quantify the realized proportion of the genome that is IBD (propIBD), and we apply this method to a wild and a captive population of zebra finches (Taeniopygia guttata). In each of 948 wild and 1057 captive individuals we analyzed available single-nucleotide polymorphism (SNP) data (260 SNPs) spread over four different genomic regions in each population. This allowed us to determine whether any of these four regions was completely homozygous within an individual, which indicates IBD with high confidence. In the highly nomadic wild population, we did not find a single case of IBD, implying that inbreeding must be extremely rare (propIBD=0–0.00094, 95% CI). In the captive population, a five-generation pedigree strongly underestimated the average amount of realized inbreeding (FPed=0.013<propIBD=0.064), as expected given that pedigree founders were already related. We suggest that this SNP-based technique is generally useful for quantifying inbreeding at the individual or population level, and we show analytically that it can capture inbreeding loops that reach back up to a few hundred generations.
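
The core check described in this abstract, whether a genomic region is completely homozygous within an individual, can be sketched as below. This is an illustration only: the data layout and function names are assumptions, and the simple fraction returned is a crude stand-in, not the authors' propIBD estimator or its confidence interval.

```python
def region_is_homozygous(genotypes):
    """True if every SNP genotype (a pair of alleles) in the region is homozygous."""
    return all(a == b for a, b in genotypes)


def proportion_homozygous(individuals):
    """Fraction of (individual, region) observations that are fully homozygous.

    `individuals` is a list of individuals, each a list of regions, each a
    list of (allele, allele) tuples. This layout is an assumption.
    """
    flags = [region_is_homozygous(region)
             for regions in individuals
             for region in regions]
    if not flags:
        raise ValueError("no regions supplied")
    return sum(flags) / len(flags)
```

With many polymorphic SNPs per region, a fully homozygous region is unlikely to arise by chance, which is why the abstract treats it as strong evidence of IBD.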

14.
Brown MD, Glazner CG, Zheng C, Thompson EA. Genetics, 2012, 190(4): 1447-1460
In both pedigree linkage studies and in population-based association studies there has been much interest in the use of modern dense genetic marker data to infer segments of gene identity by descent (ibd) among individuals not known to be related, to increase power and resolution in localizing genes affecting complex traits. In this article, we present a hidden Markov model (HMM) for ibd among a set of chromosomes and describe methods and software for inference of ibd among the four chromosomes of pairs of individuals, using either phased (haplotypic) or unphased (genotypic) data. The model allows for missing data and typing error, but does not model linkage disequilibrium (LD), because fitting an accurate LD model requires large samples from well-studied populations. However, LD remains a major confounding factor, since LD is itself a reflection of coancestry at the population level. To study the impact of LD, we have developed a novel simulation approach to generate realistic dense marker data for the same set of markers but at varying levels of LD. Using this approach, we present results of a study of the impact of LD on the sensitivity and specificity of our HMM in estimating segments of ibd among sets of four chromosomes and between genotype pairs. We show that, despite not incorporating LD, our model has been quite successful in detecting segments as small as 10^6 bp (1 Mbp); we also present comparisons with fastIBD, which uses an LD model in estimating ibd.

15.
The genus Vitis (the grapevine) is a group of highly diverse, diploid woody perennial vines consisting of approximately 60 species from across the northern hemisphere. It is the world’s most valuable horticultural crop with ~8 million hectares planted, most of which is processed into wine. To gain insights into the use of wild Vitis species during the past century of interspecific grape breeding and to provide a foundation for marker-assisted breeding programmes, we present a principal components analysis (PCA) based ancestry estimation method to calculate admixture proportions of hybrid grapes in the United States Department of Agriculture grape germplasm collection using genome-wide polymorphism data. We find that grape breeders have backcrossed to both the domesticated V. vinifera and wild Vitis species and that reasonably accurate genome-wide ancestry estimation can be performed on interspecific Vitis hybrids using a panel of fewer than 50 ancestry informative markers (AIMs). We compare measures of ancestry informativeness used in selecting SNP panels for two-way admixture estimation, and verify the accuracy of our method on simulated populations of admixed offspring. Our method of ancestry deconvolution provides a first step towards selection at the seed or seedling stage for desirable admixture profiles, which will facilitate marker-assisted breeding that aims to introgress traits from wild Vitis species while retaining the desirable characteristics of elite V. vinifera cultivars.

16.
Yuan A, Chen G, Chen Y, Rotimi C, Bonney GE. Genetics, 2004, 167(3): 1445-1459
There are generally three steps to isolate a disease linkage-susceptibility gene: genome-wide scan, fine mapping, and, last, positional cloning. The last step is time consuming and involves intensive laboratory work. In some cases, fine mapping cannot proceed further on a set of markers because they are tightly linked. For years, genetic statisticians have been trying different ways to narrow the fine-mapping results to provide some guidance for the next step of laboratory work. Although these methods are practical and efficient, most of them are based on IBD data, which usually can be inferred only from the genotype data with some uncertainty. The corresponding methods thus have no greater power than one using genotype data directly. Also, IBD-based methods apply only to relative pair data. Here, using genotype data, we have developed a statistical hypothesis-testing method to pinpoint a SNP, or SNPs, suspected of responsibility for a disease trait linkage among a set of SNPs tightly linked in a region. Our method uses genotype data of affected individuals or case-control studies, which are widely available in the laboratory. The testing statistic can be constructed using any genotype-based disease-marker disequilibrium measure and is asymptotically distributed as a chi-square mixture. This method can be used for singleton data, relative pair data, or general pedigree data. We have applied the method to simulated data as well as a real data set; it gives satisfactory results.

17.
Genomic measures of inbreeding based on identical-by-descent (IBD) segments are increasingly used and are mostly estimated from SNP array and whole-genome sequencing (WGS) data. However, some software programs commonly used for their estimation assume that genomic positions that have not been genotyped are nonvariant. This might be true for WGS data, but not for reduced genomic representations, and can lead to spurious IBD segment estimates. In this project, we simulated the outputs of WGS, two SNP arrays of different sizes, and RAD-sequencing for three populations with different sizes and histories. We compare the results of IBD segment estimation with two software programs: runs of homozygosity (ROHs) estimated with PLINK and homozygous-by-descent (HBD) segments estimated with RZooRoH. We demonstrate that to obtain meaningful estimates of inbreeding, RZooRoH requires a SNP density 11 times smaller than PLINK: ranks of inbreeding coefficients were conserved among individuals above 22 SNPs/Mb for PLINK and 2 SNPs/Mb for RZooRoH. We also show that in populations with simple demographic histories, the distributions of ROHs and HBD segments are correctly estimated with both SNP arrays and WGS. PLINK correctly estimated the distribution of ROHs with SNP densities above 22 SNPs/Mb, while RZooRoH correctly estimated the distribution of HBD segments with SNP densities above 11 SNPs/Mb. However, in a population with a more complex demographic history, RZooRoH estimated the distribution of IBD segments better than PLINK, even with WGS data. Consequently, we advise researchers to use either methods relying on excess homozygosity averaged across SNPs or model-based HBD segment calling methods for inbreeding estimation.

18.
Traditional fine-mapping approaches in mouse genetics that go from a linkage region to a candidate gene are very costly and time consuming. Shared ancestry regions, along with the combination of genetics and genomics approaches, provide a powerful tool to shorten the time and effort required to identify a causative gene. In this article we present a novel methodology that predicts IBD (identical by descent) regions between pairs of inbred strains using single nucleotide polymorphism (SNP) maps. We have validated this approach by comparing the IBD regions, estimated using different algorithms, to the results derived using the sequence information in the strains present in the Celera Mouse Database. We showed that based on the current publicly available SNP genotypes, large IBD regions (>1 Mb) can be identified successfully. By assembling a list of 21,514 SNPs in 61 common inbred strains, we inferred IBD regions between all pairs of strains and confirmed, for the first time, that existing quantitative trait genes (QTG) and susceptibility genes all lie outside of IBD regions. We also illustrated how knowledge of IBD structures can be applied to strain selection for future crosses. We have made our results available for data mining and download through a public website.

19.
We have created a high-density SNP resource encompassing 7.87 million polymorphic loci across 49 inbred strains of the laboratory mouse by combining data available from public databases and training a hidden Markov model to impute missing genotypes in the combined data. The strong linkage disequilibrium found in dense sets of SNP markers in the laboratory mouse provides the basis for accurate imputation. Using genotypes from eight independent SNP resources, we empirically validated the quality of the imputed genotypes and demonstrated that they are highly reliable for most inbred strains. The imputed SNP resource will be useful for studies of natural variation and complex traits. It will facilitate association study designs by providing high-density SNP genotypes for large numbers of mouse strains. We anticipate that this resource will continue to evolve as new genotype data become available for laboratory mouse strains. The data are available for bulk download or query.

20.
Information on genetic relationships among individuals is essential to many studies of the behaviour and ecology of wild organisms. Parentage and relatedness assays based on large numbers of single nucleotide polymorphism (SNP) loci hold substantial advantages over the microsatellite markers traditionally used for these purposes. We present a double-digest restriction site-associated DNA sequencing (ddRAD-seq) analysis pipeline that simultaneously achieves the SNP discovery and genotyping steps and is optimized to return a statistically powerful set of SNP markers (typically 150–600 after stringent filtering) from large numbers of individuals (up to 240 per run). We explore the trade-offs inherent in this approach through a set of experiments in a species with a complex social system, the variegated fairy-wren (Malurus lamberti), and further validate it in a phylogenetically broad set of other bird species. Through direct comparisons with a parallel data set from a robust panel of highly variable microsatellite markers, we show that this ddRAD-seq approach results in substantially improved power to discriminate among potential relatives and considerably more precise estimates of relatedness coefficients. The pipeline is designed to be universally applicable to all bird species (and with minor modifications to many other taxa), to be cost- and time-efficient, and to be replicable across independent runs such that genotype data from different study periods can be combined and analysed as field samples are accumulated.
