Similar Documents
20 similar documents found (search time: 31 ms)
1.
The feasibility of sequencing entire genomes of virtually any organism provides unprecedented insights into the evolutionary history of populations and species. Nevertheless, many population genomic inferences – including the quantification and dating of admixture, introgression and demographic events, and inference of selective sweeps – are still limited by the lack of high‐quality haplotype information. The newest generation of sequencing technology now promises significant progress. To establish the feasibility of haplotype‐resolved genome resequencing at population scale, we investigated properties of linked‐read sequencing data of songbirds of the genus Oenanthe across a range of sequencing depths. Our results, based on the comparison of downsampled (25×, 20×, 15×, 10×, 7×, and 5×) with high‐coverage data (46–68×) of seven bird genomes mapped to a reference, suggest that phasing contiguities and accuracies adequate for most population genomic analyses can already be reached with moderate sequencing effort. At 15× coverage, phased haplotypes span about 90% of the genome assembly, with 50% and 90% of phased sequences located in phase blocks longer than 1.25–4.6 Mb (N50) and 0.27–0.72 Mb (N90), respectively. Phasing accuracy exceeds 99% starting from 15× coverage. Higher coverages yielded higher contiguities (up to about 7 Mb/1 Mb [N50/N90] at 25× coverage) but only marginally improved phasing accuracy. Phase block contiguity improved with input DNA molecule length; thus, higher‐quality DNA may help keep sequencing costs at bay. In conclusion, even for organisms with gigabase‐sized genomes like birds, linked‐read sequencing at moderate depth opens an affordable avenue towards haplotype‐resolved genome resequencing at population scale.
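The N50 and N90 contiguity metrics quoted above can be computed directly from a list of phase-block lengths; a minimal sketch (the block lengths below are illustrative values, not data from the study):

```python
def nx(block_lengths, fraction):
    """Length L such that blocks >= L contain `fraction` of the total phased sequence."""
    lengths = sorted(block_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= fraction * total:
            return length
    return 0

# Hypothetical phase-block lengths (bp) for one downsampled genome
blocks = [4_600_000, 2_100_000, 1_250_000, 720_000, 430_000, 270_000, 90_000]
print("N50:", nx(blocks, 0.5))   # half of the phased sequence is in blocks at least this long
print("N90:", nx(blocks, 0.9))   # 90% of the phased sequence is in blocks at least this long
```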

2.
• Premise of the study: Next-generation sequencing (NGS) technologies are frequently used for resequencing and mining of single nucleotide polymorphisms (SNPs) by comparison to a reference genome. In crop species such as chickpea (Cicer arietinum) that lack a reference genome sequence, NGS-based SNP discovery is a challenge. Therefore, instead of probability-based statistical approaches for consensus calling against a reference sequence, a coverage-based consensus calling (CbCC) approach was applied and two genotypes were compared for SNP identification. • Methods: A CbCC approach was used in this study with four commonly used short read alignment tools (Maq, Bowtie, Novoalign, and SOAP2) and 15.7 and 22.1 million Illumina reads for chickpea genotypes ICC4958 and ICC1882, together with the chickpea transcriptome assembly (CaTA). • Key results: A nonredundant set of 4543 SNPs was identified between the two chickpea genotypes. Experimental validation of 224 randomly selected SNPs showed the superiority of Maq among individual tools, as 50.0% of SNPs predicted by Maq were true SNPs. For combinations of two tools, the greatest accuracy (55.7%) was reported for Maq and Bowtie, with a combination of Bowtie, Maq, and Novoalign identifying 61.5% true SNPs. SNP prediction accuracy generally increased with increasing read depth. • Conclusions: This study provides a benchmark comparison of tools as well as read depths for four commonly used tools for NGS SNP discovery in a crop species without a reference genome sequence. In addition, a large number of SNPs have been identified in chickpea that will be useful for molecular breeding.
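The abstract does not spell out the exact CbCC rule, but the general idea, calling a consensus base from read coverage and agreement rather than from a probabilistic model and then comparing two genotypes, can be sketched as follows. The depth and agreement thresholds and the toy pileups are assumptions for illustration, not the published settings:

```python
from collections import Counter

def cbcc_consensus(bases, min_depth=5, min_fraction=0.8):
    """Coverage-based consensus: call a base only if depth and agreement are sufficient."""
    if len(bases) < min_depth:
        return None                      # too few reads to call
    base, count = Counter(bases).most_common(1)[0]
    return base if count / len(bases) >= min_fraction else None

def call_snps(pileup_a, pileup_b):
    """Compare consensus calls of two genotypes position by position."""
    snps = []
    for pos in sorted(set(pileup_a) & set(pileup_b)):
        a = cbcc_consensus(pileup_a[pos])
        b = cbcc_consensus(pileup_b[pos])
        if a and b and a != b:
            snps.append((pos, a, b))
    return snps

# Toy pileups (position -> observed bases) for the two genotypes
icc4958 = {101: list("AAAAAA"), 102: list("CCCCCC")}
icc1882 = {101: list("AAAAAA"), 102: list("TTTTTC")}
print(call_snps(icc4958, icc1882))       # -> [(102, 'C', 'T')]
```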

3.
Restriction‐site associated DNA sequencing (RADSeq) facilitates rapid generation of thousands of genetic markers at relatively low cost; however, several sources of error specific to RADSeq methods often lead to biased estimates of allele frequencies and thereby to erroneous population genetic inference. Estimating the distribution of sample allele frequencies without calling genotypes was shown to improve population inference from whole genome sequencing data, but the ability of this approach to account for RADSeq‐specific biases remains unexplored. Here we assess to what extent genotype‐free methods of allele frequency estimation affect demographic inference from empirical RADSeq data. Using the well‐studied pied flycatcher (Ficedula hypoleuca) as a study system, we compare allele frequency estimation and demographic inference from whole genome sequencing data with that from RADSeq data matched for samples, using both genotype‐based and genotype‐free methods. The demographic history of pied flycatchers as inferred from RADSeq data was highly congruent with that inferred from whole genome resequencing (WGS) data when allele frequencies were estimated directly from the read data. In contrast, when allele frequencies were derived from called genotypes, RADSeq‐based estimates of most model parameters fell outside the 95% confidence interval of estimates derived from WGS data. Notably, more stringent filtering of the genotype calls tended to increase the discrepancy between parameter estimates from WGS and RADSeq data. The results from this study demonstrate the ability of genotype‐free methods to improve allele frequency spectrum (AFS)‐based demographic inference from empirical RADSeq data and highlight the need to account for uncertainty in NGS data regardless of sequencing method.
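A minimal sketch of the genotype-free idea: the allele frequency at a site is estimated by summing over genotype uncertainty for each individual instead of fixing hard genotype calls. The sequencing-error rate, Hardy–Weinberg prior and grid search below are simplifying assumptions for illustration, not the estimator used in the study:

```python
import math

def site_allele_freq(read_data, error=0.01):
    """ML allele frequency at one biallelic site directly from read counts,
    summing over the three diploid genotypes instead of fixing a genotype call."""
    def geno_lik(ref, alt, g):                      # P(reads | genotype with g alt alleles)
        p_alt = {0: error, 1: 0.5, 2: 1 - error}[g]
        return (p_alt ** alt) * ((1 - p_alt) ** ref)
    best_f, best_ll = 0.0, -math.inf
    for f in [i / 100 for i in range(101)]:         # grid search over allele frequency
        prior = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]   # Hardy-Weinberg genotype prior
        ll = sum(math.log(sum(prior[g] * geno_lik(r, a, g) for g in range(3)))
                 for r, a in read_data)
        if ll > best_ll:
            best_f, best_ll = f, ll
    return best_f

# (ref_reads, alt_reads) per individual at one site -- toy low-coverage data
print(site_allele_freq([(3, 0), (1, 1), (2, 0), (0, 2), (1, 0)]))
```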

4.
5.
Genome-wide single-nucleotide polymorphisms (SNPs) are highly useful in unraveling genetic insights and are essential to accelerate selection for genetic improvement in tobacco. The discovery of genome-wide SNPs in tobacco is very complex due to its highly repetitive, polyploid genome. At present, publicly available genomic data on SNPs are very limited, which warrants the generation of high-throughput SNPs for application in tobacco breeding. In this paper, we describe our efforts on SNP discovery by whole genome resequencing of 18 flue-cured Virginia (FCV) tobacco genotypes and annotation of SNPs in the tobacco genome. A large amount of data, about 225 GB per genotype, was generated, with an average read depth of 50×, using paired-end next-generation sequencing (NGS) on the HiSeq 2500 platform. The discovery of a large number of SNPs and indels was undertaken to assist mapping and, in turn, the selection processes for developing superior tobacco breeding lines. The discovered SNPs, their functional annotation, mapping to the reference genome, and their relative positioning in the linkage groups are discussed in this paper.

6.
7.
Whole genome sequences (WGS) greatly increase our ability to precisely infer population genetic parameters, demographic processes, and selection signatures. However, WGS may still not be affordable for a representative number of individuals/populations. In this context, our goal was to assess the efficiency of several SNP genotyping strategies by testing their ability to accurately estimate parameters describing neutral diversity and to detect signatures of selection. We analysed 110 WGS at 12× coverage for four different species, i.e., sheep, goats and their wild counterparts. From these data we generated 946 data sets corresponding to random panels of 1K to 5M variants, commercial SNP chips and exome capture, for sample sizes of five to 48 individuals. We also generated low‐coverage genome resequencing data at 1×, 2× and 5× by randomly subsampling reads from the 12× resequencing data. Globally, 5K to 10K random variants were enough for an accurate estimation of genome diversity. Conversely, commercial panels and exome capture displayed strong ascertainment biases. Beyond the characterization of neutral diversity, the detection of signatures of selection and the accurate estimation of linkage disequilibrium (LD) required high‐density panels of at least 1M variants. Finally, genotype likelihoods increased the quality of variant calling from low‐coverage resequencing, but proportions of incorrect genotypes remained substantial, especially at heterozygous sites. Whole genome resequencing coverage of at least 5× appeared to be necessary for accurate assessment of genomic variation. These results have implications for studies seeking to deploy low‐density SNP collections or genome scans across genetically diverse populations/species showing similar genetic characteristics and patterns of LD decay for a wide variety of purposes.
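Downsampling a 12× library to 1×, 2× or 5×, as described above, amounts to keeping a random fraction of reads. A stdlib-only sketch (file names are hypothetical; for paired-end data both mates would need to be subsampled with the same seed):

```python
import random

def subsample_fastq(in_path, out_path, fraction, seed=1):
    """Keep each read with probability `fraction`, emulating a lower-coverage run."""
    random.seed(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]   # FASTQ: 4 lines per read
            if not record[0]:
                break
            if random.random() < fraction:
                fout.writelines(record)

# e.g. from a 12x library, keep ~1/12, ~2/12 and ~5/12 of the reads
# for target in (1, 2, 5):
#     subsample_fastq("sample_12x.fastq", f"sample_{target}x.fastq", target / 12)
```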

8.
The advent of high‐throughput sequencing (HTS) has made genomic‐level analyses feasible for nonmodel organisms. A critical step of many HTS pipelines involves aligning reads to a reference genome to identify variants. Despite recent initiatives, only a fraction of species has publicly available reference genomes. Therefore, a common practice is to align reads to the genome of an organism related to the target species; however, this could affect read alignment and bias genotyping. In this study, I conducted an experiment using empirical RADseq datasets generated for two species of salmonids (Actinopterygii; Teleostei; Salmonidae) to address these issues. There are currently reference genomes for six salmonids of varying phylogenetic distance. I aligned the RADseq data to all six genomes and identified variants with several different genotypers, which were then fed into population genetic analyses. Increasing phylogenetic distance between target species and reference genome reduced the proportion of reads that successfully aligned and their mapping quality. The reference genome also influenced the number of SNPs that were generated and the depth at those SNPs, although the effect varied by genotyper. Inferences of population structure were mixed: increasing reference genome divergence reduced estimates of differentiation, but similar patterns of population relationships were found across scenarios. These findings reveal how the choice of reference genome can influence the output of bioinformatic pipelines. They also emphasize the need to identify best practices and guidelines for the burgeoning field of biodiversity genomics.
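Reduced alignment rate and mapping quality against divergent references can be quantified straight from the alignments. A sketch that tallies the proportion of mapped reads and their mean MAPQ from a SAM file (the file name is hypothetical, and this is not the pipeline used in the study):

```python
def mapping_summary(sam_path):
    """Proportion of mapped reads and their mean MAPQ from a SAM file (stdlib only)."""
    total = mapped = 0
    mapq_sum = 0
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):               # skip header lines
                continue
            fields = line.split("\t")
            flag, mapq = int(fields[1]), int(fields[4])
            total += 1
            if not flag & 4:                       # FLAG bit 4 set means the read is unmapped
                mapped += 1
                mapq_sum += mapq
    return mapped / total, (mapq_sum / mapped if mapped else 0.0)

# prop_mapped, mean_mapq = mapping_summary("radseq_vs_distant_reference.sam")
```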

9.
Over the next few years, the efficient use of next-generation sequencing (NGS) in human genetics research will depend heavily upon effective mechanisms for the selective enrichment of genomic regions of interest. Recently, comprehensive exome capture arrays have become available for targeting approximately 33 Mb or ∼180,000 coding exons across the human genome. Selective genomic enrichment of the human exome offers an attractive option for new experimental designs aiming to quickly identify potential disease-associated genetic variants, especially in family-based studies. We have evaluated a 2.1 M feature human exome capture array on eight individuals from a three-generation family pedigree. We were able to cover up to 98% of the targeted bases at a long-read sequence read depth of ≥3, 86% at a read depth of ≥10, and over 50% of all targets were covered with ≥20 reads. We identified up to 14,284 SNPs and small indels per individual exome, with up to 1,679 of these representing putative novel polymorphisms. Applying the conservative genotype calling approach HCDiff, the average rate of detection of a variant allele based on Illumina 1 M BeadChip genotypes was 95.2% at ≥10× sequence depth. Further, we propose an advantageous genotype calling strategy for low-coverage targets that empirically determines cut-off thresholds at a given coverage depth based on existing genotype data. Application of this method was able to detect >99% of SNPs covered at ≥8×. Our results offer guidance for "real-world" applications in human genetics and provide further evidence that microarray-based exome capture is an efficient and reliable method to enrich for chromosomal regions of interest in next-generation sequencing experiments.
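A simplified sketch of the kind of empirically calibrated cut-off described above: given sites where both array genotypes and sequencing-based calls are available, find the lowest depth at which concordance reaches a target. This illustrates the idea only, not the published HCDiff procedure, and the tuples below are toy data:

```python
def empirical_depth_cutoff(calls, target_concordance=0.99):
    """Smallest read depth at which sequencing genotypes agree with array genotypes
    at least `target_concordance` of the time.
    `calls` is a list of (depth, seq_genotype, array_genotype) tuples."""
    by_depth = {}
    for depth, seq_gt, array_gt in calls:
        hits, total = by_depth.get(depth, (0, 0))
        by_depth[depth] = (hits + (seq_gt == array_gt), total + 1)
    for depth in sorted(by_depth):
        hits, total = by_depth[depth]
        if hits / total >= target_concordance:
            return depth
    return None

calls = [(4, "AG", "AA"), (4, "AA", "AA"), (8, "AG", "AG"), (8, "GG", "GG"), (12, "AG", "AG")]
print(empirical_depth_cutoff(calls))   # -> 8 on this toy data
```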

10.
In diploid species, many multiparental populations have been developed to increase genetic diversity and quantitative trait loci (QTL) mapping resolution. In these populations, haplotype reconstruction has been used as a standard practice to increase the power of QTL detection in comparison with marker-based association analysis. However, such software tools for polyploid species are few and limited to a single biparental F1 population. In this study, a statistical framework for haplotype reconstruction has been developed and implemented in the software PolyOrigin for connected tetraploid F1 populations with shared parents, regardless of the number of parents or mating design. Given a genetic or physical map of markers, PolyOrigin first phases parental genotypes, then refines the input marker map, and finally reconstructs offspring haplotypes. PolyOrigin can utilize single nucleotide polymorphism (SNP) data coming from arrays or from sequence-based genotyping; in the latter case, bi-allelic read counts can be used (and are preferred) as input data to minimize the influence of genotype calling errors at low depth. With extensive simulation we show that PolyOrigin is robust to errors in the input genotypic data and marker map. It works well for various population designs with 30 offspring per parent and for sequencing read depths as low as 10×. PolyOrigin was further evaluated using an autotetraploid potato dataset with a 3 × 3 half-diallel mating design. In conclusion, PolyOrigin opens up exciting new possibilities for haplotype analysis in tetraploid breeding populations.
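A minimal illustration of why bi-allelic read counts are preferred over hard genotype calls at low depth in an autotetraploid: under a simple binomial read-sampling model (an assumption made here for illustration, not PolyOrigin's actual model), neighbouring allele dosages remain hard to separate at 10× depth:

```python
from math import comb, log

def tetraploid_dosage_loglik(ref_reads, alt_reads, error=0.005):
    """Log-likelihood of each allele dosage (0-4 alt copies) for an autotetraploid
    site, assuming reads sample the four chromosome copies binomially."""
    n = ref_reads + alt_reads
    lls = []
    for dosage in range(5):
        p_alt = dosage / 4
        p_alt = min(max(p_alt, error), 1 - error)     # fold in a small sequencing error rate
        ll = (log(comb(n, alt_reads))
              + alt_reads * log(p_alt)
              + ref_reads * log(1 - p_alt))
        lls.append(ll)
    return lls

# At 10x depth, counts like 7 ref / 3 alt leave simplex (1) and duplex (2) dosages
# close in likelihood -- the reason read counts, not hard calls, are the preferred input.
print(tetraploid_dosage_loglik(7, 3))
```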

11.
Dou J, Zhao X, Fu X, Jiao W, Wang N, Zhang L, Hu X, Wang S, Bao Z. Biology Direct 2012, 7(1):17-9
BACKGROUND: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype a large number of SNPs in non-model organisms with no or limited genomic resources. Most NGS-based genotyping methods require a reference genome to perform accurate SNP calling. Little effort, however, has yet been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome. RESULTS: Here we describe an improved maximum likelihood (ML) algorithm called iML, which can achieve high genotyping accuracy for SNP calling in non-model organisms without a reference genome. The iML algorithm incorporates a mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. Through analysis of simulated and real sequencing datasets, we demonstrate that, in comparison with ML or a threshold approach, iML can markedly improve the accuracy of de novo SNP genotyping and is especially powerful for reference-free genotyping in diploid genomes with high repeat content. CONCLUSIONS: The iML algorithm efficiently prevents incorrect SNP calls resulting from repetitive genomic regions and thus outperforms the original ML algorithm by achieving much higher genotyping accuracy. Our algorithm is therefore very useful for accurate de novo SNP genotyping in non-model organisms without a reference genome.
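The mixed Poisson/normal model of iML is not reproduced here, but a stripped-down, Poisson-only sketch conveys the idea of flagging composite read clusters: a cluster whose depth is better explained by twice the genome-wide mean than by the mean itself likely collapses two repeat copies. The thresholds and depths are illustrative assumptions:

```python
from math import factorial, log

def poisson_logpmf(k, lam):
    """Log probability of observing k reads under a Poisson(lam) depth model."""
    return k * log(lam) - lam - log(factorial(k))

def is_composite_cluster(depth, mean_depth):
    """Flag a read cluster as a likely collapsed repeat if its depth is better
    explained by twice the genome-wide mean depth than by the mean itself."""
    single = poisson_logpmf(depth, mean_depth)
    double = poisson_logpmf(depth, 2 * mean_depth)
    return double > single

mean_depth = 20
for depth in (18, 27, 41):
    print(depth, "composite" if is_composite_cluster(depth, mean_depth) else "single-copy")
```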

12.
With the rapid decline of sequencing costs, researchers today are embracing whole genome sequencing (WGS) and whole exome sequencing (WES) as powerful tools for relating genetic variants to human diseases and phenotypes. A fundamental step in analyzing WGS and WES data is mapping short sequencing reads back to the reference genome. This is an important issue because incorrectly mapped reads affect downstream variant discovery, genotype calling and association analysis. Although many read mapping algorithms have been developed, the majority of them use the universal reference genome and do not take sequence variants into consideration. Given that genetic variants are ubiquitous, it is highly desirable to factor them into the read mapping procedure. In this work, we developed a novel strategy that utilizes genotypes obtained a priori to customize the universal haploid reference genome into a personalized diploid reference genome. The new strategy is implemented in a program named RefEditor. When applying RefEditor to real data, we achieved encouraging improvements in read mapping, variant discovery and genotype calling. Compared to standard approaches, RefEditor can significantly increase genotype calling consistency (from 43% to 61% at 4X coverage; from 82% to 92% at 20X coverage) and reduce Mendelian inconsistency across various sequencing depths. Because many WGS and WES studies are conducted on cohorts that have been genotyped on array-based platforms previously or concurrently, we believe the proposed strategy will be of high value in practice; it can also be applied where multiple NGS experiments are conducted on the same cohort. The RefEditor sources are available at https://github.com/superyuan/refeditor.
This is a PLOS Computational Biology Software Article.
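A toy sketch of the personalization idea (not RefEditor's implementation): a priori genotypes are substituted into the haploid reference to yield two haplotype sequences, with heterozygous alleles assigned to the two haplotypes arbitrarily here:

```python
def personalize_reference(ref_seq, genotypes):
    """Substitute a priori genotypes into a reference chromosome, producing the
    two haploid sequences of a personalized diploid reference.
    `genotypes` maps 0-based position -> (allele_hap1, allele_hap2)."""
    hap1, hap2 = list(ref_seq), list(ref_seq)
    for pos, (a1, a2) in genotypes.items():
        hap1[pos], hap2[pos] = a1, a2
    return "".join(hap1), "".join(hap2)

ref = "ACGTACGTAC"
known = {2: ("A", "A"),       # homozygous non-reference site
         7: ("T", "C")}       # heterozygous site: the two haplotypes differ
print(personalize_reference(ref, known))   # ('ACATACGTAC', 'ACATACGCAC')
```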

13.
14.
Reduced representation genome sequencing such as restriction‐site‐associated DNA (RAD) sequencing is finding increased use to identify and genotype large numbers of single‐nucleotide polymorphisms (SNPs) in model and nonmodel species. We generated a unique resource of novel SNP markers for the European eel using the RAD sequencing approach; the markers were simultaneously identified and scored in a genome‐wide scan of 30 individuals. Whereas genomic resources are increasingly becoming available for this species, including the recent release of a draft genome, no genome‐wide set of SNP markers was available until now. The generated SNPs were widely distributed across the eel genome, aligning to 4779 different contigs and 19,703 different scaffolds. Significant variation was identified, with an average nucleotide diversity of 0.00529 across individuals. Results varied widely across the genome, ranging from 0.00048 to 0.00737 per locus. Based on the average nucleotide diversity across all loci, long‐term effective population size was estimated to range between 132,000 and 1,320,000, which is much higher than previous estimates based on microsatellite loci. The generated SNP resource, consisting of 82,425 loci and 376,918 associated SNPs, provides a valuable tool for future population genetics and genomics studies and allows for targeting specific genes and particularly interesting regions of the eel genome.
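The quoted long-term effective population size range follows from the neutral expectation π ≈ 4Neμ. Assuming per-site, per-generation mutation rates between 1e-9 and 1e-8 (a range inferred here because it reproduces the reported bounds; the abstract does not state the rates used), the arithmetic is:

```python
pi = 0.00529                      # average nucleotide diversity reported above
for mu in (1e-8, 1e-9):           # assumed per-site, per-generation mutation rates
    ne = pi / (4 * mu)            # neutral expectation: pi = 4 * Ne * mu
    print(f"mu = {mu:.0e}  ->  Ne ~ {ne:,.0f}")
# mu = 1e-08  ->  Ne ~ 132,250    (reported as ~132,000)
# mu = 1e-09  ->  Ne ~ 1,322,500  (reported as ~1,320,000)
```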

15.
The emergence of third‐generation sequencing (3GS; long reads) is bringing closer the goal of chromosome‐size fragments in de novo genome assemblies. This allows the exploration of new and broader questions on genome evolution for a number of nonmodel organisms. However, long‐read technologies suffer from higher sequencing error rates and therefore impose an elevated cost to reach the coverage needed for sufficient quality. In this context, hybrid assemblies, combining short reads and long reads, provide an efficient and cost‐effective alternative for generating de novo, chromosome‐level genome assemblies. The array of available software programs for hybrid genome assembly, sequence correction and manipulation is constantly being expanded and improved. This makes it difficult for nonexperts to find efficient, fast and tractable computational solutions for genome assembly, especially in the case of nonmodel organisms lacking a reference genome or one from a closely related species. In this study, we review and test the most recent pipelines for hybrid assemblies, comparing the model organism Drosophila melanogaster to a nonmodel cactophilic Drosophila, D. mojavensis. We show that it is possible to achieve excellent contiguity for this nonmodel organism using the dbg2olc pipeline.

16.
17.
DNA quantity can be a hindrance in ecological and evolutionary research programmes due to a range of factors, including the endangered status of target organisms, available tissue type, and the impact of field conditions on preservation methods. A potential solution to low‐quantity DNA lies in whole genome amplification (WGA) techniques that can substantially increase DNA yield. To date, few studies have rigorously examined the sequence bias that might result from WGA and next‐generation sequencing of nonmodel taxa. To address this knowledge deficit, we use multiple displacement amplification (MDA) and double‐digest RAD sequencing on the grey mouse lemur (Microcebus murinus) to quantify bias in genome coverage and SNP calls when compared to raw genomic DNA (gDNA). We focus our efforts on providing baseline estimates of potential bias by following the manufacturer's recommendations for starting DNA quantities (>100 ng). Our results strongly suggest that MDA enrichment does not introduce systematic bias to genome characterization. SNP calls between samples, whether genotyping de novo or with a reference genome, are highly congruent (>98%) when specifying a minimum threshold of 20× stack depth to call genotypes. Relative genome coverage is also similar between MDA and gDNA, and allelic dropout is not observed. SNP concordance varies with the coverage threshold, with 95% concordance reached at ~12× coverage when genotyping de novo and ~7× coverage when genotyping with the reference genome. These results suggest that MDA may be a suitable solution for next‐generation molecular ecological studies when DNA quantity would otherwise be a limiting factor.
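Concordance as a function of the stack-depth threshold, as reported above, can be tabulated from paired MDA/gDNA genotype calls. A sketch with toy data (the thresholds and calls are illustrative, and the genotyping itself is assumed to have been done upstream):

```python
def concordance_by_depth(paired_calls, thresholds=(5, 7, 12, 20)):
    """Genotype concordance between MDA and gDNA calls, keeping only sites where
    both samples reach a given stack depth.
    `paired_calls` is a list of (depth_mda, gt_mda, depth_gdna, gt_gdna) tuples."""
    out = {}
    for t in thresholds:
        kept = [(g1, g2) for d1, g1, d2, g2 in paired_calls if d1 >= t and d2 >= t]
        if kept:
            out[t] = sum(g1 == g2 for g1, g2 in kept) / len(kept)
    return out

toy = [(6, "AG", 8, "AG"), (15, "AA", 14, "AA"), (25, "GG", 22, "AG"), (30, "CT", 28, "CT")]
print(concordance_by_depth(toy))   # concordance rises as the depth threshold increases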

18.
Population genetic studies in nonmodel organisms are often hampered by a lack of reference genomes, which are essential for whole‐genome resequencing. In light of this, genotyping methods have been developed that effectively eliminate the need for a reference genome, such as genotyping by sequencing and restriction site‐associated DNA sequencing (RAD‐seq). However, it remains relatively poorly studied how accurately these methods capture both the average and the variation in genetic diversity across an organism's genome. In this issue of Molecular Ecology Resources, Dutoit et al. (2016) use whole‐genome resequencing data from the collared flycatcher to assess what factors drive heterogeneity in nucleotide diversity across the genome. Using these data, they then simulate how well different sequencing designs, including RAD sequencing, could capture most of the variation in genetic diversity. They conclude that for evolutionary and conservation‐related studies focused on estimating genomic diversity, researchers should emphasize the number of loci analysed over the number of individuals sequenced.

19.
As sequencing technology improves, an increasing number of projects aim to generate full genome sequences, even for nonmodel taxa. These projects may feasibly be conducted at lower read depths if the alignment can be aided by previously developed genomic resources from a closely related species. We investigated the feasibility of constructing a complete mitochondrial (mt) genome without preamplification or other targeting of the sequence. Here we present a full mt genome sequence (16,463 nucleotides) for the bighorn sheep (Ovis canadensis) generated through alignment of SOLiD short-read sequences to a reference genome. Average read depth was 1240, and each base was covered by at least 36 reads. We then conducted a phylogenomic analysis with 27 other bovid mitogenomes, which placed bighorn sheep firmly in the Ovis clade. These results show that it is possible to generate a complete mitogenome by skimming a low-coverage genomic sequencing library. This technique will become increasingly applicable as the number of taxa with some level of genome sequence rises.

20.
Qu Zhang, Niclas Backström. Chromosoma 2014, 123(1-2):165-168
The complexity of eukaryote genomes makes assembly errors inevitable in the process of constructing reference genomes. Next-generation sequencing (NGS) can provide an efficient way to validate previously assembled genomes. Here, we exploited NGS data to interrogate the chicken reference genome and identified 35 pairs of nearly identical regions with >99.5% sequence similarity and a median size of 109 kb. Several lines of evidence, including read depth, the composition of junction sequences, and sequence similarity, suggest that these regions represent genome assembly errors and should be excluded from forthcoming genomic studies.
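Read depth is one of the lines of evidence mentioned above: a region erroneously assembled as two nearly identical copies splits its reads between them and shows roughly half the expected depth. A simple sketch of such a depth screen (window names, depths and thresholds are illustrative, not the study's criteria):

```python
from statistics import median

def flag_suspect_windows(window_depths, low=0.6, high=1.5):
    """Flag windows whose read depth departs strongly from the genome-wide median;
    an erroneously duplicated region splits its reads between two copies and
    shows roughly half the expected depth."""
    expected = median(window_depths.values())
    return {w: d for w, d in window_depths.items()
            if d < low * expected or d > high * expected}

depths = {"chr1:0-100kb": 31, "chr1:100-200kb": 16, "chr1:200-300kb": 29, "chr1:300-400kb": 30}
print(flag_suspect_windows(depths))   # the ~half-depth window stands out
```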
