Similar literature
 20 similar articles found; search took 31 ms
1.
Molecular markers produced by next‐generation sequencing (NGS) technologies are revolutionizing genetic research. However, the costs of analysing large numbers of individual genomes remain prohibitive for most population genetics studies. Here, we present results based on mathematical derivations showing that, under many realistic experimental designs, NGS of DNA pools from diploid individuals allows allele frequencies at single nucleotide polymorphisms (SNPs) to be estimated with at least the same accuracy as individual‐based analyses, for considerably lower library construction and sequencing efforts. These findings remain true when taking into account the possibility of substantially unequal contributions of each individual to the final pool of sequence reads. We propose the intuitive notion of effective pool size to account for unequal pooling and derive a Bayesian hierarchical model to estimate this parameter directly from the data. We provide a user‐friendly application assessing the accuracy of allele frequency estimation from both pool‐ and individual‐based NGS population data under various sampling, sequencing depth and experimental error designs. We illustrate our findings with theoretical examples and real data sets corresponding to SNP loci obtained using restriction site–associated DNA (RAD) sequencing in pool‐ and individual‐based experiments carried out on the same population of the pine processionary moth (Thaumetopoea pityocampa). NGS of DNA pools might not be optimal for all types of studies but provides a cost‐effective approach for estimating allele frequencies for very large numbers of SNPs. It thus allows comparison of genome‐wide patterns of genetic variation for large numbers of individuals in multiple populations.

2.
Pooling of DNA samples can significantly reduce the effort of population studies with DNA markers. I present a statistical model and numerical method for estimating gene frequency when pooled DNA is assayed for the presence/absence of alleles. Analytical and Monte‐Carlo methods examined estimation variance and bias, and hence optimal pool size, under a triangular allele frequency distribution. For gene frequency of rarer alleles, the optimal number of pooled individuals is approximately the inverse of the gene frequency. For heterozygosity, the optimal pool is approximately half the allele number; this results in pools containing, on average, 60% of possible alleles.

3.
Sequencing pooled DNA of multiple individuals from a population instead of sequencing individuals separately has become popular due to its cost-effectiveness and simple wet-lab protocol, although some criticism of this approach remains. Here we validated a protocol for pooled whole-genome re-sequencing (Pool-seq) of Arabidopsis lyrata libraries prepared with low amounts of DNA (1.6 ng per individual). The validation was based on comparing single nucleotide polymorphism (SNP) frequencies obtained by pooling with those obtained by individual-based Genotyping By Sequencing (GBS). Furthermore, we investigated the effect of sample number, sequencing depth per individual and variant caller on population SNP frequency estimates. For Pool-seq data, we compared frequency estimates from two SNP callers, VarScan and Snape; the former employs a frequentist SNP calling approach while the latter uses a Bayesian approach. Results revealed concordance correlation coefficients well above 0.8, confirming that Pool-seq is a valid method for acquiring population-level SNP frequency data. Higher accuracy was achieved by pooling more samples (25 compared to 14) and working with higher sequencing depth (4.1× per individual compared to 1.4× per individual), which increased the concordance correlation coefficient to 0.955. The Bayesian-based SNP caller produced somewhat higher concordance correlation coefficients, particularly at low sequencing depth. We recommend pooling at least 25 individuals combined with sequencing at a depth of 100× to produce satisfactory frequency estimates for common SNPs (minor allele frequency above 0.05).

4.
Using striped bass (Morone saxatilis) and six multiplexed microsatellite markers, we evaluated procedures for estimating allele frequencies by pooling DNA from multiple individuals, a method suggested as cost-effective relative to individual genotyping. Using moment-based estimators, we estimated allele frequencies in experimental DNA pools and found that the three primary laboratory steps, DNA quantitation and pooling, PCR amplification, and electrophoresis, accounted for 23, 48, and 29%, respectively, of the technical variance of estimates in pools containing DNA from 2-24 individuals. Exact allele-frequency estimates could be made for pools of sizes 2-8, depending on the locus, by using an integer-valued estimator. Larger pools of size 12 and 24 tended to yield biased estimates; however, replicates of these estimates detected allele frequency differences among pools with different allelic compositions. We also derive an unbiased estimator of Hardy-Weinberg disequilibrium coefficients that uses multiple DNA pools and analyze the cost-efficiency of DNA pooling. DNA pooling yields the most potential cost savings when a large number of loci are employed using a large number of individuals, a situation becoming increasingly common as microsatellite loci are developed in increasing numbers of taxa.

5.
The sequencing of pooled non-barcoded individuals is an inexpensive and efficient means of assessing genome-wide population allele frequencies, yet its accuracy has not been thoroughly tested. We assessed the accuracy of this approach on whole, complex eukaryotic genomes by resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. We called SNPs in the pooled data and estimated false positive and false negative rates using the SNPs called in individual strains as a reference. We also estimated allele frequencies of the SNPs using "pooled" data and compared them with "true" frequencies taken from the estimates in the individual strains. We demonstrate that pooled sequencing provides a faithful estimate of population allele frequency with the error well approximated by binomial sampling, and is a reliable means of novel SNP discovery with low false positive rates. However, a sufficient number of strains should be used in the pooling because variation in the amount of DNA derived from individual strains is a substantial source of noise when the number of pooled strains is low. Our results and analysis confirm that pooled sequencing is a very powerful and cost-effective technique for assessing patterns of sequence variation in populations on genome-wide scales, and is applicable to any dataset where sequencing individuals or individual cells is impossible, difficult, time consuming, or expensive.
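
The claim above that the error is "well approximated by binomial sampling" can be illustrated with a minimal sketch. It assumes reads are drawn independently from a pool whose true allele frequency is known, so the only error source is read sampling; the function name `binomial_se` and the example depths are hypothetical and do not come from the paper.

```python
import math


def binomial_se(freq: float, depth: int) -> float:
    """Standard error of an allele-frequency estimate from `depth` reads,
    assuming pure binomial read sampling (an illustrative assumption)."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie in [0, 1]")
    return math.sqrt(freq * (1.0 - freq) / depth)


if __name__ == "__main__":
    # Quadrupling the read depth halves the expected sampling error.
    for depth in (25, 100, 400):
        print(depth, round(binomial_se(0.2, depth), 4))
```

Unequal DNA contributions from individual strains, which the abstract identifies as a noise source in small pools, are deliberately left out of this sketch.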

6.
Pooled sequencing can be a cost-effective approach to disease variant discovery, but its applicability in association studies remains unclear. We compare sequence enrichment methods coupled to next-generation sequencing in non-indexed pools of 1, 2, 10, 20 and 50 individuals and assess their ability to discover variants and to estimate their allele frequencies. We find that pooled resequencing is most usefully applied as a variant discovery tool due to limitations in estimating allele frequency with high enough accuracy for association studies, and that in-solution hybrid-capture performs best among the enrichment methods examined regardless of pool size.

7.
We describe a method for pooling and sequencing DNA from a large number of individual samples while preserving information regarding sample identity. DNA from 576 individuals was arranged into four 12 row by 12 column matrices and then pooled by row and by column resulting in 96 total pools with 12 individuals in each pool. Pooling of DNA was carried out in a two-dimensional fashion, such that DNA from each individual is present in exactly one row pool and exactly one column pool. By considering the variants observed in the rows and columns of a matrix we are able to trace rare variants back to the specific individuals that carry them. The pooled DNA samples were enriched over a 250 kb region previously identified by GWAS to significantly predispose individuals to lung cancer. All 96 pools (12 row and 12 column pools from 4 matrices) were barcoded and sequenced on an Illumina HiSeq 2000 instrument with an average depth of coverage greater than 4,000×. Verification based on Ion PGM sequencing confirmed the presence of 91.4% of confidently classified SNVs assayed. In this way, each individual sample is sequenced in multiple pools providing more accurate variant calling than a single pool or a multiplexed approach. This provides a powerful method for rare variant detection in regions of interest at a reduced cost to the researcher.
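
The row-and-column decoding idea described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes error-free variant calls per pool, identifies each sample by a hypothetical `(row, column)` index within one matrix, and uses the made-up helper names `pools_with_variant` and `candidate_carriers`.

```python
from itertools import product


def pools_with_variant(carriers):
    """Return the row pools and column pools in which a variant would be
    observed, given the (row, col) positions of its carriers."""
    rows = {r for r, _ in carriers}
    cols = {c for _, c in carriers}
    return rows, cols


def candidate_carriers(rows, cols):
    """Every (row, col) sample consistent with the positive row and column pools."""
    return sorted(product(rows, cols))


if __name__ == "__main__":
    # One carrier: a single positive row and column pinpoint the sample.
    print(candidate_carriers(*pools_with_variant([(3, 7)])))
    # Two carriers in different rows and columns leave four candidates,
    # so decoding is unambiguous only for sufficiently rare variants.
    print(candidate_carriers(*pools_with_variant([(1, 2), (4, 5)])))
```

The second call shows why the design targets rare variants: with more than one carrier per matrix, the intersection of positive pools over-reports candidates and follow-up genotyping would be needed.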

8.
Biallelic markers, most commonly single nucleotide polymorphisms (SNPs), are widely used in genetic association analysis, which can be sped up by estimating allele frequencies in pooled DNA instead of genotyping individuals. Several methods have shown high accuracy and precision for allele frequency estimation in pools. Here, we explored PCR restriction fragment length polymorphism (PCR–RFLP) combined with microchip electrophoresis as a possible strategy for allele frequency estimation in DNA pools. We used the commercially available Agilent 2100 microchip electrophoresis analysis system to quantify the enzymatically digested DNA fragments and the fluorescence intensities in order to estimate the allele frequencies in the DNA pools. In this study, we estimated the allele frequencies of five SNPs in a DNA pool composed of 141 previously genotyped healthy controls and a DNA pool composed of 96 previously genotyped gastric cancer patients with a frequency representation of 10–90% for the variant allele. Our studies show that accurate, quantitative data on allele frequencies, suitable for investigating the association of SNPs with complex disorders, can be estimated from pooled DNA samples by using this assay. This approach, being independent of the number of samples, promises to drastically reduce the labor and cost of genotyping in the initial association analysis.

9.
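Application of DNA pooling combined with DHPLC and direct sequencing to SNP detection in the finless porpoise   Total citations: 6 (self-citations: 0; citations by others: 6)
Two known single nucleotide polymorphism (SNP) sites in the finless porpoise genome were selected. After PCR amplification, the PCR products were combined into 11 DNA pools with allele frequencies ranging from 0 to 50%, which were analysed by denaturing high performance liquid chromatography (DHPLC) and direct sequencing to determine the minimum allele frequency required in a DNA pool. The results showed that the rare allele could be clearly resolved by DHPLC when its frequency was at least 5%, whereas direct sequencing of DNA pools required a frequency of at least 10%. This suggests that, to ensure accurate and reliable DHPLC analysis, DNA pools should preferably be prepared from equimolar DNA of no more than 10 individuals. The efficiency and accuracy of DNA pooling combined with DHPLC can be useful in large-scale SNP screening.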

10.
One of the critical steps in the positional cloning of a complex disease gene involves association analysis between a phenotype and a set of densely spaced diallelic markers, typically single nucleotide polymorphisms (SNPs), covering the region of interest. However, the effort and cost of detecting sufficient numbers of SNPs across relatively large physical distances represents a significant rate-limiting step. We have explored DNA pooling, in conjunction with denaturing high performance liquid chromatography (DHPLC), as a possible strategy for augmenting the efficiency, economy, and throughput of SNP detection. DHPLC is traditionally used to detect variants in polymerase chain reaction products containing both allelic forms of a polymorphism (e.g., heterozygotes or a 1:1 mix of both alleles) via heteroduplex separation and thereby requires separate analyses of multiple individual test samples. We have adapted this technology to identify variants in pooled DNA. To evaluate the utility and sensitivity of this approach, we constructed DNA pools composed of 20 previously genotyped individuals with a frequency representation of 0%-50% for the variant allele. Mutation detection was performed by using temperature-modulated heteroduplex formation/DHPLC and dye-terminator sequencing. Using DHPLC, we could consistently detect SNPs at lower than 5% frequency, corresponding to the detection of one variant allele in a pool of 20 alleles. In contrast, fluorescent sequencing detected variants in the same pools only if the frequency of the less common allele was at least 10%. We conclude that DNA pooling of samples for DHPLC analysis is an effective way to increase throughput efficiency of SNP detection.

11.
Single-nucleotide polymorphisms (SNPs) are considered useful polymorphic markers for genetic studies of polygenic traits. A new practical approach to high-throughput genotyping of SNPs in a large number of individuals is needed in association studies and other studies on relationships between genes and diseases. We have developed an accurate and high-throughput method for determining allele frequencies by pooling the DNA samples and applying a DNA microarray hybridization analysis. In this method, the combination of the microarray, DNA pooling, probe pair hybridization, and fluorescent ratio analysis solves the dual problems of parallel multiple sample analysis and parallel multiplex SNP genotyping for association studies. Multiple DNA samples are immobilized on a slide and a single hybridization is performed with a pool of allele-specific oligonucleotide probes. The results of this study show that microarray hybridization of pooled DNA samples can yield accurate estimates of absolute allele frequencies in a sample pool. This method can also be used to identify differences in allele frequencies between distinct populations. It is amenable to automation and is suitable for immediate use in high-throughput genotyping of SNPs.

12.
While genome-wide association studies (GWAS) have been successful in identifying a large number of variants associated with disease, the challenge of locating the underlying causal loci remains. Sequencing of case and control DNA pools provides an inexpensive method for assessing all variation in a genomic region surrounding a significant GWAS result. However, individual variants need to be ranked in terms of the strength of their association to disease in order to prioritise follow-up by individual genotyping. A simple method for testing for case-control association in sequence data from DNA pools is presented that allows the partitioning of the variance in allele frequency estimates into components due to the sampling of chromosomes from the pool during sequencing, sampling individuals from the population and unequal contribution from individuals during pool construction. The utility of this method is demonstrated on a sequence from the alcohol dehydrogenase (ADH) gene cluster on a case-control sample for heavy alcohol consumption.

13.
Next Generation Sequencing (NGS) has revolutionized biomedical research in recent years. It is now commonly used to identify rare variants through resequencing individual genomes. Due to the cost of NGS, researchers have considered pooling samples as a cost-effective alternative to individual sequencing. In this article, we consider the estimation of allele frequencies of rare variants through the NGS technologies with pooled DNA samples with or without barcodes. We consider three methods for estimating allele frequencies from such data, including raw sequencing counts, inferred genotypes, and expected minor allele counts, and compare their performance. Our simulation results suggest that the estimator based on inferred genotypes overall performs better than or as well as the other two estimators. When the sequencing coverage is low, biases and MSEs can be sensitive to the choice of the prior probabilities of genotypes for the estimators based on inferred genotypes and expected minor allele counts so that more accurate specification of prior probabilities is critical to lower biases and MSEs. Our study shows that the optimal number of barcodes in a pool is relatively robust to the frequencies of rare variants at a specific coverage depth. We provide general guidelines on using DNA pooling with barcoding for the estimation of allele frequencies of rare variants.

14.
DNA pooling is a potential methodology for detecting genetic loci with small effects that contribute to complex diseases and quantitative traits. This is accomplished by rapid preliminary screening of the genome for allelic association using the most common class of polymorphic short tandem repeat markers. The methodology assumes a common founder for the linked disease locus of interest and searches for a chromosomal region shared between affected individuals. The general theory of DNA pooling relies on the observed differences in allelic distribution between pools from affected and unaffected individuals, including a reduction in the number of alleles in the affected pool, which indicates the sharing of a chromosomal region. The statistical power for associated linkage mapping can be determined using two recently developed strategies: first, by measuring the differences in allelic image patterns produced by two DNA pools of extreme character, and second, by measuring differences in total allele content between two pools containing large numbers of DNA samples. These strategies have been used effectively to identify shared chromosomal regions for linkage studies and to investigate candidate disease loci for fine-structure gene mapping using allelic association. This paper outlines the use of DNA pooling as a potential tool for locating complex disease loci, statistical methods for accurately estimating allele frequencies from DNA pools, and the advantages, drawbacks and significance of pooled DNA samples in associated linkage mapping.

15.
Owing to rapid advances in next-generation sequencing technology, the cost of DNA sequencing has been reduced by several orders of magnitude. However, genomic sequencing of individuals at the population scale is still restricted to a few model species due to the huge challenge of constructing libraries for thousands of samples. Meanwhile, pooled sequencing provides a cost-effective alternative to sequencing individuals separately, which could vastly reduce the time and cost of DNA library preparation. Technological improvements, together with the broad range of biological research questions that require large sample sizes, mean that pooled sequencing will continue to complement the sequencing of individual genomes and become increasingly important in the foreseeable future. However, simply mixing samples together for sequencing makes it impossible to identify the reads that belong to each sample. Barcoding technology could help to solve this problem; currently, however, barcoding every sample is costly, especially for large sample sets. An alternative to barcoding is combinatorial pooled sequencing, which employs pooling patterns rather than short DNA barcodes to encode each sample. In combinatorial pooled sequencing, samples are mixed into a few pools according to a carefully designed pooling strategy, which allows the sequencing data to be decoded to identify the reads that belong to samples that are unique or rare in the population. In this review, we mainly survey the experimental design and decoding procedures for combinatorial pooled sequencing as applied to screening for carriers of rare variants and rare haplotypes, complex genome assembly and single-individual haplotyping.

16.
Next generation sequencing (NGS) is about to revolutionize genetic analysis. Currently NGS techniques are mainly used to sequence individual genomes. Due to the high sequence coverage required, the costs for population-scale analyses are still too high to allow an extension to nonmodel organisms. Here, we show that NGS of pools of individuals is often more effective in SNP discovery and provides more accurate allele frequency estimates, even when taking sequencing errors into account. We modify the population genetic estimators Tajima's π and Watterson's θ to obtain unbiased estimates from NGS pooling data. Given the same sequencing effort, the resulting estimators often show a better performance than those obtained from individual sequencing. Although our analysis also shows that NGS of pools of individuals will not be preferable under all circumstances, it provides a cost-effective approach to estimate allele frequencies on a genome-wide scale.
Next generation sequencing (NGS) is about to revolutionize biology. Through a massive parallelization, NGS provides an enormous number of reads, which permits sequencing of entire genomes at a fraction of the costs for Sanger sequencing. Hence, for the first time it has become feasible to obtain the complete genomic sequence for a large number of individuals. For several organisms, including humans, Drosophila melanogaster, and Arabidopsis thaliana, large resequencing projects are well on their way. Nevertheless, despite the enormous cost reduction, genome sequencing on a population scale is still out of reach for the budget of most laboratories. The extraction of as much statistical information as possible at cost as low as possible has therefore already attracted considerable interest. See, for instance, Jiang et al. (2009) for the modeling of sequencing errors and Erlich et al. (2009) for the efficient tagging of sequences.
Current genome-wide resequencing projects collect the sequences individual by individual. To obtain full coverage of the entire genome and to have high confidence that all heterozygous sites were discovered, it is required that genomes are sequenced at a sufficiently high coverage. As many of the reads provide only redundant information, cost could be reduced by a more effective sampling strategy.
In this report, we explore the potential of DNA pooling to provide a more cost-effective approach for SNP discovery and genome-wide population genetics. Sequencing a large pool of individuals simultaneously keeps the number of redundant DNA reads low and provides thus an economic alternative to the sequencing of individual genomes. On the other hand, more care has to be taken to establish an appropriate control of sequencing errors. Obviously haplotype information is not available from pooling experiments, but this will often be outweighed by the increased accuracy in population genetic inference.
Focusing on biallelic loci, our analysis shows that with sufficiently large pool sizes, pooling usually outperforms the separate sequencing of individuals, both for estimating allele frequencies and for inference of population genetic parameters. When sequencing errors are not too common, pooling seems also to be a good choice for SNP detection experiments. To avoid the additional challenges encountered with individual sequencing of diploid individuals, we compare pooling with individual sequencing of haploid individuals. See Lynch (2008, 2009) for a discussion of next generation sequencing of diploid individuals. Our results for the pooling experiments should be also applicable to a diploid setting, as we are just merging pools of size 2 to a larger pool in this case, leading to a pool size of n = 2n_d for n_d diploid individuals. In the methods section, we derive several mathematical expressions that permit us to compare pooling with separate sequencing of individuals. These formulas are then applied in the results section to illustrate the differences in accuracy between the approaches. A reader who is interested only in the actual differences under several scenarios might therefore want to move directly to the results section.

17.
Recombination is the exchange of genetic material between homologous chromosomes via physical crossovers. High-throughput sequencing approaches detect crossovers genome wide to produce recombination rate maps but are difficult to scale as they require large numbers of recombinants individually sequenced. We present a simple and scalable pooled-sequencing approach to experimentally infer near chromosome-wide recombination rates by taking advantage of non-Mendelian allele frequency generated from a fitness differential at a locus under selection. As more crossovers decouple the selected locus from distal loci, the distorted allele frequency attenuates distally toward Mendelian and can be used to estimate the genetic distance. Here, we use marker selection to generate distorted allele frequency and theoretically derive the mathematical relationships between allele frequency attenuation, genetic distance, and recombination rate in marker-selected pools. We implemented nonlinear curve-fitting methods that robustly estimate the allele frequency decay from batch sequencing of pooled individuals and derive chromosome-wide genetic distance and recombination rates. Empirically, we show that marker-selected pools closely recapitulate genetic distances inferred from scoring recombinants. Using this method, we generated novel recombination rate maps of three wild-derived strains of Drosophila melanogaster, which strongly correlate with previous measurements. Moreover, we show that this approach can be extended to estimate chromosome-wide crossover interference with reciprocal marker selection and discuss how it can be applied in the absence of visible markers. Altogether, we find that our method is a simple and cost-effective approach to generate chromosome-wide recombination rate maps requiring only one or two libraries.

18.
Brookmeyer R. Biometrics, 1999, 55(2): 608-612
The testing of pooled samples of biological specimens for the purpose of estimating disease prevalence may be more cost effective than testing individual samples, particularly if the prevalence of disease is low. Multistage pooling studies involve testing pools and then sequentially subdividing and testing the positive pools. A simple estimator of disease prevalence and its variance are derived for general multistage pooling studies and are shown to be natural generalizations of Thompson's (1962) original estimators for single-stage pooling studies. The reduction in variance associated with each additional stage is calibrated. The results are extended to estimating disease incidence rates. The methods are used to estimate HIV incidence rates from a prevalence study of early HIV infection using a PCR assay for HIV RNA.
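
The abstract above does not state the single-stage estimator it generalizes. As a hedged sketch, the standard maximum-likelihood estimator for equal-sized pools under a perfect assay is p = 1 - (1 - x/m)^(1/k), where x of m pools of size k test positive. The function name `single_stage_prevalence` is hypothetical, and the multistage estimator derived in the paper is not reproduced here.

```python
def single_stage_prevalence(positive_pools: int, total_pools: int, pool_size: int) -> float:
    """Maximum-likelihood prevalence estimate from single-stage pooled testing.

    Assumes equal pool sizes, independent individuals, and an error-free
    assay (all illustrative assumptions). A pool is negative only if all
    `pool_size` members are negative, so P(pool negative) = (1 - p) ** pool_size.
    """
    if total_pools <= 0 or pool_size <= 0:
        raise ValueError("total_pools and pool_size must be positive")
    if not 0 <= positive_pools <= total_pools:
        raise ValueError("positive_pools must lie between 0 and total_pools")
    negative_fraction = 1.0 - positive_pools / total_pools
    return 1.0 - negative_fraction ** (1.0 / pool_size)


if __name__ == "__main__":
    # 12 positive pools out of 200 pools of 10 specimens each.
    print(single_stage_prevalence(12, 200, 10))
```

With a pool size of 1 the estimate reduces to the plain proportion of positives, which is a quick sanity check on the formula.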

19.
We assume that allele frequency data have been extracted from several large DNA pools, each containing genetic material of up to hundreds of sampled individuals. Our goal is to estimate the haplotype frequencies among the sampled individuals by combining the pooled allele frequency data with prior knowledge about the set of possible haplotypes. Such prior information can be obtained, for example, from a database such as HapMap. We present a Bayesian haplotyping method for pooled DNA based on a continuous approximation of the multinomial distribution. The proposed method is applicable when the sizes of the DNA pools and/or the number of considered loci exceed the limits of several earlier methods. In the example analyses, the proposed model clearly outperforms a deterministic greedy algorithm on real data from the HapMap database. With a small number of loci, the performance of the proposed method is similar to that of an EM-algorithm, which uses a multinormal approximation for the pooled allele frequencies, but which does not utilize prior information about the haplotypes. The method has been implemented using Matlab and the code is available upon request from the authors.

20.
To evaluate the ability to use DNA pools with the Illumina Infinium genotyping platform, two sets of gradient pools were created using two pairs of highly inbred chicken lines. Replicate pools containing 0%, 10%, 20%, 40%, 60%, 80%, 90% and 100% of DNA from line A vs. B or line C vs. D were created, for a total of 28 pools. All pools were genotyped for 12 046 SNPs. Three frequency estimation methods proposed in the literature (standard, heterozygote‐corrected and normalized) were compared with three alternate methods proposed herein based on mean square error (MSE), bias and variance of estimated vs. true allele frequencies and the fit of regression of estimated on true frequencies. The three new methods had average square root MSE of 4.6%, 4.6% and 4.7% compared to 5.2%, 5.5% and 11.2% for the three literature methods. Average absolute biases of the literature methods were 2.4%, 2.7% and 8.2% compared to 2.4% for all new methods. Standard deviations of estimates were also smaller for the new methods, at 3.1%, 3.2% and 3.2% compared to 3.5%, 4.0% and 5.0% for previously reported methods. In conclusion, intensity data from the Illumina Infinium Assay can be efficiently used to estimate allele frequencies in pools, in particular using any of the new methods proposed herein.
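
The comparison criteria named above (MSE, bias and variance of estimated versus true allele frequencies) are linked by the identity MSE = bias² + variance. The sketch below shows that decomposition on made-up numbers; the helper `error_summary` is hypothetical, and the paper's exact averaging over SNPs and pools (for example its "average absolute bias") may differ from this simple version.

```python
import math


def error_summary(estimated, true):
    """Bias, variance, MSE and root MSE of estimated vs. true frequencies.

    Uses the population (divide-by-n) variance of the errors so that
    mse == bias ** 2 + variance holds exactly up to rounding.
    """
    if len(estimated) != len(true) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [e - t for e, t in zip(estimated, true)]
    n = len(errors)
    bias = sum(errors) / n
    mse = sum(err * err for err in errors) / n
    variance = sum((err - bias) ** 2 for err in errors) / n
    return {"bias": bias, "variance": variance, "mse": mse, "rmse": math.sqrt(mse)}


if __name__ == "__main__":
    # Illustrative values only; these are not data from the study.
    summary = error_summary([0.12, 0.22, 0.38], [0.10, 0.20, 0.40])
    print(summary)
```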
