Similar Articles
20 similar articles retrieved (search time: 31 ms)
1.
High‐throughput sequencing (HTS) is central to the study of population genomics and has an increasingly important role in constructing phylogenies. Choices in research design for sequencing projects can include a wide range of factors, such as sequencing platform, depth of coverage and bioinformatic tools. Simulating HTS data better informs these decisions, as users can validate software by comparing output to the known simulation parameters. However, current standalone HTS simulators cannot generate variant haplotypes under even somewhat complex evolutionary scenarios, such as recombination or demographic change. This greatly reduces their usefulness for fields such as population genomics and phylogenomics. Here I present the R package jackalope that simply and efficiently simulates (i) sets of variant haplotypes from a reference genome and (ii) reads from both Illumina and Pacific Biosciences platforms. Haplotypes can be simulated using phylogenies, gene trees, coalescent‐simulation output, population‐genomic summary statistics, and Variant Call Format (VCF) files. jackalope can simulate single, paired‐end or mate‐pair Illumina reads, as well as reads from Pacific Biosciences. These simulations include sequencing errors, mapping qualities, multiplexing and optical/PCR duplicates. It can read reference genomes from FASTA files and can simulate new ones, and all outputs can be written to standard file formats. jackalope is available for Mac, Windows and Linux systems.
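For readers unfamiliar with read simulators, the core idea can be illustrated with a minimal Python sketch (not jackalope's actual R API): sample fixed-length reads from a reference and inject substitution errors at a chosen per-base rate. All function and parameter names here are hypothetical.

```python
import random

def simulate_reads(reference, n_reads, read_len=100, error_rate=0.001, seed=42):
    """Sample fixed-length reads from a reference sequence and add random
    substitution errors, mimicking a very simple Illumina-style error model."""
    rng = random.Random(seed)
    bases = "ACGT"
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for i, b in enumerate(read):
            if rng.random() < error_rate:
                # substitute with one of the three other bases
                read[i] = rng.choice([x for x in bases if x != b])
        reads.append("".join(read))
    return reads

# toy example: a 1 kb random "reference" and 10 simulated reads
rng_ref = random.Random(1)
ref = "".join(rng_ref.choice("ACGT") for _ in range(1000))
print(simulate_reads(ref, n_reads=10)[0])
```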

2.
The advent of high‐throughput sequencing (HTS) has made genomic‐level analyses feasible for nonmodel organisms. A critical step of many HTS pipelines involves aligning reads to a reference genome to identify variants. Despite recent initiatives, only a fraction of species have publicly available reference genomes. Therefore, a common practice is to align reads to the genome of an organism related to the target species; however, this could affect read alignment and bias genotyping. In this study, I conducted an experiment using empirical RADseq datasets generated for two species of salmonids (Actinopterygii; Teleostei; Salmonidae) to quantify these effects. There are currently reference genomes for six salmonids of varying phylogenetic distance. I aligned the RADseq data to all six genomes and identified variants with several different genotypers, which were then fed into population genetic analyses. Increasing phylogenetic distance between target species and reference genome reduced the proportion of reads that successfully aligned and their mapping quality. Reference genome also influenced the number of SNPs that were generated and the depth at those SNPs, although the effect varied by genotyper. Inferences of population structure were mixed: increasing reference genome divergence reduced estimates of differentiation, but similar patterns of population relationships were found across scenarios. These findings reveal how the choice of reference genome can influence the output of bioinformatic pipelines. They also emphasize the need to identify best practices and guidelines for the burgeoning field of biodiversity genomics.
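The mapping-rate and mapping-quality comparisons described above can be reproduced in outline with pysam. The sketch below is a hedged illustration only; the BAM file names and genome labels are placeholders, not the study's datasets.

```python
import pysam  # assumes pysam is installed; BAM paths below are hypothetical

def alignment_summary(bam_path):
    """Return (proportion of reads aligned, mean MAPQ of aligned reads)."""
    total = aligned = 0
    mapq_sum = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue  # count each read once
            total += 1
            if not read.is_unmapped:
                aligned += 1
                mapq_sum += read.mapping_quality
    return aligned / total, mapq_sum / aligned

# hypothetical BAMs: the same RADseq reads aligned to reference genomes
# of increasing phylogenetic distance
for genome in ["close_relative", "same_genus", "distant_salmonid"]:
    prop, mapq = alignment_summary(f"aln_to_{genome}.bam")
    print(f"{genome}: {prop:.1%} aligned, mean MAPQ {mapq:.1f}")
```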

3.
We are writing in response to the population and phylogenomics meeting review by Andrews & Luikart ( 2014 ) entitled ‘Recent novel approaches for population genomics data analysis’. Restriction‐site‐associated DNA (RAD) sequencing has become a powerful and useful approach in molecular ecology, with several different published methods now available to molecular ecologists, none of which can be considered the best option in all situations. A&L report that the original RAD protocol of Miller et al. ( 2007 ) and Baird et al. ( 2008 ) is superior to all other RAD variants because putative PCR duplicates can be identified (see Baxter et al. 2011 ), thereby reducing the impact of PCR artefacts on allele frequency estimates (Andrews & Luikart 2014 ). In response, we (i) challenge the assertion that the original RAD protocol minimizes the impact of PCR artefacts relative to that of other RAD protocols, (ii) present additional biases in RADseq that are at least as important as PCR artefacts in selecting a RAD protocol and (iii) highlight the strengths and weaknesses of four different approaches to RADseq which are a representative sample of all RAD variants: the original RAD protocol (mbRAD, Miller et al. 2007 ; Baird et al. 2008 ), double digest RAD (ddRAD, Peterson et al. 2012 ), ezRAD (Toonen et al. 2013 ) and 2bRAD (Wang et al. 2012 ). With an understanding of the strengths and weaknesses of different RAD protocols, researchers can make a more informed decision when selecting a RAD protocol.

4.
Summary: Second‐generation sequencing (sec‐gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. The prospect of comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads—strings of A, C, G or T between 30 and 100 characters long—which are the result of complex processing of noisy continuous fluorescence intensity measurements known as base‐calling. The complexity of the base‐calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across‐sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec‐gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base‐calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base‐calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base‐calling performance.
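Per-base uncertainty of the kind modelled here is conventionally reported as a Phred quality score, where Q = -10 log10(p) and p is the probability that the called base is wrong. A small, generic conversion example (not the authors' model):

```python
import math

def phred_to_error_prob(q):
    """Phred quality Q encodes the base-calling error probability p = 10**(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse mapping: error probability back to a Phred score."""
    return -10 * math.log10(p)

# e.g. Q30 means a 1-in-1000 chance that the called base is wrong
print(phred_to_error_prob(30))    # 0.001
print(error_prob_to_phred(0.01))  # 20.0
```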

5.
6.
High-throughput screening (HTS) is used in modern drug discovery to screen hundreds of thousands to millions of compounds on selected protein targets. It is an industrial-scale process relying on sophisticated automation and state-of-the-art detection technologies. Quality control (QC) is an integral part of the process and is used to ensure good quality data and minimize assay variability while maintaining assay sensitivity. The authors describe new QC methods and show numerous real examples from their biologist-friendly Stat Server HTS application, a custom-developed software tool built from the commercially available S-PLUS and Stat Server statistical analysis and server software. This system remotely processes HTS data using powerful and sophisticated statistical methodology but insulates users from the technical details by outputting results in a variety of readily interpretable graphs and tables. It allows users to visualize HTS data and examine assay performance during the HTS campaign to quickly react to or avoid quality problems.
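The abstract does not name specific statistics, but a widely used plate-level QC metric in screening is the Z′-factor (Zhang et al., 1999). The sketch below shows how it is computed from positive- and negative-control wells and is offered as a generic illustration, not as the Stat Server implementation.

```python
import statistics

def z_prime(positive_controls, negative_controls):
    """Z'-factor plate-quality statistic:
    1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|.
    Values above ~0.5 are commonly taken to indicate a robust assay window."""
    mu_p = statistics.mean(positive_controls)
    mu_n = statistics.mean(negative_controls)
    sd_p = statistics.stdev(positive_controls)
    sd_n = statistics.stdev(negative_controls)
    return 1 - 3 * (sd_p + sd_n) / abs(mu_p - mu_n)

# toy plate controls (made-up signal values)
print(round(z_prime([95, 98, 102, 101], [5, 7, 6, 4]), 2))
```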

7.
8.
Next-generation sequencing (NGS) technology has revolutionized and significantly impacted metagenomic research. However, NGS data usually contain sequencing artifacts such as low-quality reads and contaminating reads, which will significantly compromise downstream analysis. Many quality control (QC) tools have been proposed; however, few of them have been verified to be suitable or efficient for metagenomic data, which are composed of multiple genomes and are more complex than other kinds of NGS data. Here we present a metagenomic data QC method named Meta-QC-Chain. Meta-QC-Chain combines multiple QC functions: technical tests describe input data status and identify potential errors, quality trimming filters out bases and reads of poor sequencing quality, and contamination screening identifies higher eukaryotic species, which are considered contamination in metagenomic data. Most computing processes are optimized based on parallel programming. Testing on an 8-GB real dataset showed that Meta-QC-Chain trimmed low sequencing-quality reads and contaminating reads, and the whole quality control procedure was completed within 20 min. Therefore, Meta-QC-Chain provides a comprehensive, useful and high-performance QC tool for metagenomic data. Meta-QC-Chain is publicly available for free at http://computationalbioenergy.org/meta-qc-chain.html.
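One of the QC steps mentioned, quality trimming, can be sketched with a simple sliding-window rule. This is a generic illustration, not Meta-QC-Chain's exact algorithm, and the threshold values are arbitrary.

```python
def trim_3prime(seq, quals, min_q=20, window=4):
    """Trim bases from the 3' end until a window of `window` bases
    has mean Phred quality >= min_q. Returns the trimmed sequence."""
    end = len(seq)
    while end >= window:
        win = quals[end - window:end]
        if sum(win) / window >= min_q:
            break
        end -= 1
    return seq[:end]

# toy read with a low-quality tail
print(trim_3prime("ACGTACGTAC", [35, 36, 34, 33, 30, 28, 12, 8, 5, 2]))
# -> "ACGTACG" (low-quality tail removed)
```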

9.
Characterization of highly duplicated genes, such as genes of the major histocompatibility complex (MHC), where multiple loci often co‐amplify, has until recently been hindered by insufficient read depths per amplicon. Here, we used ultra‐deep Illumina sequencing to resolve genotypes at exon 3 of MHC class I genes in the sedge warbler (Acrocephalus schoenobaenus). We sequenced 24 individuals in two replicates and used these data, as well as a simulated data set, to test the effect of amplicon coverage (range: 500–20 000 reads per amplicon) on the repeatability of genotyping using four different genotyping approaches. A third replicate employed unique barcoding to assess the extent of tag jumping, that is, swapping of individual tag identifiers, which may confound genotyping. The reliability of MHC genotyping increased with coverage and approached or exceeded 90% within‐method repeatability of allele calling at coverages of >5000 reads per amplicon. We found generally high agreement between genotyping methods, especially at high coverages. High reliability of the tested genotyping approaches was further supported by our analysis of the simulated data set, although the genotyping approach relying primarily on replication of variants in independent amplicons proved sensitive to repeatable errors. According to the most repeatable genotyping method, the number of co‐amplifying variants per individual ranged from 19 to 42. Tag jumping was detectable, but at such low frequencies that it did not affect the reliability of genotyping. We thus demonstrate that gene families with many co‐amplifying genes can be reliably genotyped using HTS, provided that there is sufficient per‐amplicon coverage.
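Within-method repeatability of allele calling can be scored in several ways; one simple version measures the overlap between the allele sets called in two replicates of the same individual. The sketch below uses hypothetical allele names and a Sørensen-style overlap, which may differ from the exact formula used in the study.

```python
def repeatability(rep1, rep2):
    """Simple allele-calling repeatability between two replicate genotypes:
    twice the number of shared alleles divided by the total alleles called."""
    a, b = set(rep1), set(rep2)
    return 2 * len(a & b) / (len(a) + len(b))

# toy example: two replicate MHC genotypes for one individual (allele names made up)
print(round(repeatability({"DAB*01", "DAB*03", "DAB*07"},
                          {"DAB*01", "DAB*03", "DAB*09"}), 3))  # 0.667
```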

10.
DNA barcodes are useful for species discovery and species identification, but obtaining barcodes currently requires a well‐equipped molecular laboratory and is time‐consuming and/or expensive. We here address these issues by developing a barcoding pipeline for the Oxford Nanopore MinION and demonstrating that one flow cell can generate barcodes for ~500 specimens despite the high basecall error rates of MinION reads. The pipeline overcomes these errors by first summarizing all reads for the same tagged amplicon as a consensus barcode. Consensus barcodes are overall mismatch‐free but retain indel errors that are concentrated in homopolymeric regions. They are addressed with an optional error correction pipeline that is based on conserved amino acid motifs from publicly available barcodes. The effectiveness of this pipeline is documented by analysing reads from three MinION runs that represent three different stages of MinION development. They generated data for (i) 511 specimens of a mixed Diptera sample, (ii) 575 specimens of ants and (iii) 50 specimens of Chironomidae. The run based on the latest chemistry yielded MinION barcodes for 490 of the 511 specimens, which were assessed against reference Sanger barcodes (N = 471). Overall, the MinION barcodes have an accuracy of 99.3%–100%, with the number of ambiguous bases after correction ranging from <0.01% to 1.5% depending on which correction pipeline is used. We demonstrate that it requires ~2 hr of sequencing to gather all information needed for obtaining reliable barcodes for most specimens (>90%). We estimate that up to 1,000 barcodes can be generated in one flow cell and that the cost per barcode can be
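The first step of the pipeline, collapsing all reads of one tagged amplicon into a consensus barcode, can be illustrated with a naive per-column majority call. The actual pipeline handles alignment and indel-rich homopolymers more carefully, so treat this only as the core idea.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Per-column majority call across reads of one tagged amplicon.
    Assumes the reads are already aligned to the same length (gaps as '-')."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

reads = ["ACGTTGCA",
         "ACGTTGCA",
         "ACGATGCA",   # one read with an error at position 3
         "ACGTTGCA"]
print(majority_consensus(reads))  # ACGTTGCA
```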

11.
12.
A goal of many environmental DNA barcoding studies is to infer quantitative information about relative abundances of different taxa based on sequence read proportions generated by high‐throughput sequencing. However, potential biases associated with this approach are only beginning to be examined. We sequenced DNA amplified from faeces (scats) of captive harbour seals (Phoca vitulina) to investigate whether sequence counts could be used to quantify the seals’ diet. Seals were fed fish in fixed proportions, a chordate‐specific mitochondrial 16S marker was amplified from scat DNA and amplicons were sequenced using an Ion Torrent PGM. For a given set of bioinformatic parameters, there was generally low variability between scat samples in the proportions of prey species sequences recovered. However, proportions varied substantially depending on sequencing direction, level of quality filtering (due to differences in sequence quality between species) and minimum read length considered. Short primer tags used to identify individual samples also influenced species proportions. In addition, there were complex interactions between factors; for example, the effect of quality filtering was influenced by the primer tag and sequencing direction. Resequencing of a subset of samples revealed that some, but not all, biases were consistent between runs. Less stringent data filtering (based on quality scores or read length) generally produced more consistent proportional data, but overall proportions of sequences were very different from dietary mass proportions, indicating additional technical or biological biases are present. Our findings highlight that quantitative interpretations of sequence proportions generated via high‐throughput sequencing will require careful experimental design and thoughtful data analysis.

13.
Recent advances in environmental DNA (eDNA) analysis using high‐throughput sequencing (HTS) enable evaluation of intraspecific genetic diversity in a population. As intraspecific genetic diversity provides invaluable information for wildlife conservation and management, there is an increasing demand to apply eDNA analysis to population genetics and phylogeography through quantitative evaluation of intraspecific diversity. However, quantitative evaluation of intraspecific genetic diversity using eDNA is not straightforward, because the number of eDNA sequence reads obtained by HTS may not be an index of the quantity of eDNA. In this study, to quantitatively evaluate genetic diversity using eDNA analysis, we applied a quantitative eDNA metabarcoding method using internal standard DNAs. We targeted Ayu (Plecoglossus altivelis altivelis) and added internal standard DNAs with known copy numbers to each eDNA sample obtained from three rivers during the library preparation process. The sequence reads of each Ayu haplotype were successfully converted to DNA copy numbers based on the relationship between the copy numbers and sequence reads of the internal standard DNAs. In all rivers, the calculated copy number of each haplotype showed a significant positive correlation with the haplotype frequency estimated by a capture‐based survey. Furthermore, estimates of genetic indicators such as nucleotide diversity based on the eDNA copy numbers were comparable with those estimated from a capture‐based study. Our results demonstrate that eDNA analysis with internal standard DNAs enables reasonable quantification of intraspecific genetic diversity, and this method could thus be a promising tool in the fields of population genetics and phylogeography.
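The conversion from sequence reads to copy numbers rests on the calibration provided by the internal standards. Below is a minimal sketch, assuming a linear relationship through the origin between spiked-in copies and observed reads (the calibration actually fitted in the study may differ), with made-up numbers.

```python
import numpy as np

def reads_to_copies(standard_reads, standard_copies, haplotype_reads):
    """Convert haplotype read counts to DNA copy numbers using internal
    standards of known copy number. A regression through the origin
    (copies ~ slope * reads) is assumed here purely for illustration."""
    x = np.array(standard_reads, dtype=float)
    y = np.array(standard_copies, dtype=float)
    slope = np.sum(x * y) / np.sum(x ** 2)   # least-squares slope through origin
    return {hap: slope * reads for hap, reads in haplotype_reads.items()}

standards_reads  = [120, 600, 2900]   # reads observed for each spiked-in standard
standards_copies = [50, 250, 1250]    # known copies added per standard
haplotypes = {"Hap_A": 4100, "Hap_B": 800}  # hypothetical haplotype read counts
print(reads_to_copies(standards_reads, standards_copies, haplotypes))
```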

14.
Protists, the most diverse eukaryotes, are largely considered to be free‐living bacterivores, but vast numbers of taxa are known to parasitize plants or animals. High‐throughput sequencing (HTS) approaches now commonly replace cultivation‐based approaches in studying soil protists, but insights into common biases associated with this method are limited to aquatic taxa and samples. We created a mock community of common free‐living soil protists (amoebae, flagellates, ciliates), extracted DNA and amplified it in the presence of metazoan DNA using 454 HTS. We aimed to evaluate whether HTS quantitatively reveals the true relative abundances of soil protists and to investigate whether the expected protist community structure is altered by the co‐amplification of metazoan‐associated protist taxa. Indeed, HTS revealed fundamentally different protist communities from those expected. Ciliate sequences were highly over‐represented, while those of most amoebae and flagellates were under‐represented or totally absent. These results underscore the biases introduced by HTS that prevent reliable quantitative estimation of free‐living protist communities. Furthermore, we detected a wide range of nonadded protist taxa probably introduced along with metazoan DNA, which altered the protist community structure. Among those, 20 taxa most closely resembled parasitic, often pathogenic taxa. We thereby provide the first HTS data in support of classical observational studies showing that potential protist parasites are hosted by soil metazoa. Taken together, profound differences in amplification success between protist taxa and an inevitable co‐extraction of protist taxa parasitizing soil metazoa obscure the true diversity of free‐living soil protist communities.

15.
DNA analysis of predator faeces using high‐throughput amplicon sequencing (HTS) enhances our understanding of predator–prey interactions. However, conclusions drawn from this technique are constrained by biases that occur in multiple steps of the HTS workflow. To better characterize insectivorous animal diets, we used DNA from a diverse set of arthropods to assess PCR biases of commonly used and novel primer pairs for the mitochondrial gene cytochrome c oxidase subunit I (COI). We compared diversity recovered from HTS of bat guano samples using a commonly used primer pair, “ZBJ,” to results using the novel primer pair “ANML.” To parameterize our bioinformatics pipeline, we created an arthropod mock community consisting of single‐copy (cloned) COI sequences. To examine biases associated with both PCR and HTS, mock community members were combined in equimolar amounts both pre‐ and post‐PCR. We validated our system using guano from bats fed known diets and using composite samples of morphologically identified insects collected in pitfall traps. In PCR tests, the ANML primer pair amplified 58 of 59 arthropod taxa (98%), whereas ZBJ amplified 24–40 of 59 taxa (41%–68%). Furthermore, in an HTS comparison of field‐collected samples, the ANML primers detected nearly fourfold more arthropod taxa than the ZBJ primers. The additional arthropods detected include medically and economically relevant insect groups such as mosquitoes. Results revealed biases at both the PCR and sequencing levels, demonstrating the pitfalls associated with using HTS read numbers as proxies for abundance. The use of an arthropod mock community allowed for improved bioinformatics pipeline parameterization.

16.
High‐throughput sequencing (HTS) of PCR amplicons is becoming the method of choice to sequence one or several targeted loci for phylogenetic and DNA barcoding studies. Although the development of HTS has allowed rapid generation of massive amounts of DNA sequence data, preparing amplicons for HTS remains a rate‐limiting step. For example, HTS platforms require platform‐specific adapter sequences to be present at the 5′ and 3′ ends of the DNA fragment to be sequenced. In addition, short multiplex identifier (MID) tags are typically added to allow multiple samples to be pooled in a single HTS run. Existing methods to incorporate HTS adapters and MID tags into PCR amplicons are either inefficient, requiring multiple enzymatic reactions and clean‐up steps, or costly when applied to multiple samples or loci (fusion primers). We describe a method to amplify a target locus and add HTS adapters and MID tags via a linker sequence using a single PCR. We demonstrate our approach by generating reference sequence data for two mitochondrial loci (COI and 16S) for a diverse suite of insect taxa. Our approach provides a flexible, cost‐effective and efficient method to prepare amplicons for HTS.
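As a rough illustration of one plausible reading of a linker-based design (not the authors' exact oligos): the locus-specific primer carries a short universal linker tail, and a second primer carrying the platform adapter plus MID tag anneals to that linker, so both elements are added in a single PCR. All sequences below are placeholders.

```python
def tailed_locus_primer(linker, locus_primer):
    """Locus-specific primer carrying a universal linker tail at its 5' end."""
    return linker + locus_primer

def adapter_mid_primer(adapter, mid_tag, linker):
    """Second primer: platform adapter + MID tag, annealing to the linker tail."""
    return adapter + mid_tag + linker

# hypothetical component sequences, for illustration only
LINKER = "GACTACGT"
print(tailed_locus_primer(LINKER, "GGTCAACAAATCATAAAGATATTGG"))          # locus primer with tail
print(adapter_mid_primer("AATGATACGGCGACCACCGAGATCT", "ACGTAC", LINKER))  # adapter + MID primer
```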

17.
We offer a guide to de novo genome assembly using sequence data generated by the Illumina platform for biologists working with fungi or other organisms whose genomes are less than 100 Mb in size. The guide requires no familiarity with sequencing assembly technology or associated computer programs. It defines commonly used terms in genome sequencing and assembly; provides examples of assembling short-read genome sequence data for four strains of the fungus Grosmannia clavigera using four assembly programs; gives examples of protocols and software; and presents a commented flowchart that extends from DNA preparation for submission to a sequencing center, through to processing and assembly of the raw sequence reads using freely available operating systems and software.
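When comparing the assemblies produced by different programs, contiguity is usually summarized with the N50 statistic. A short, generic implementation is shown below; the guide itself may recommend dedicated tools for this.

```python
def n50(contig_lengths):
    """N50: the contig length L such that contigs of length >= L together
    contain at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# toy assembly of five contigs
print(n50([100, 200, 300, 400, 500]))  # 400
```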

18.
19.
The genotyping of highly polymorphic multigene families across many individuals used to be a particularly challenging task because of methodological limitations associated with traditional approaches. Next‐generation sequencing (NGS) can overcome most of these limitations, and it is increasingly being applied in population genetic studies of multigene families. Here, we critically review NGS bioinformatic approaches that have been used to genotype the major histocompatibility complex (MHC) immune genes, and we discuss how the significant advances made in this field are applicable to population genetic studies of gene families. Increasingly, approaches are introduced that apply thresholds of sequencing depth and sequence similarity to separate alleles from methodological artefacts. We explain why these approaches are particularly sensitive to methodological biases by violating fundamental genotyping assumptions. An alternative strategy that utilizes ultra‐deep sequencing (hundreds to thousands of sequences per amplicon) to reconstruct genotypes and applies statistical methods on the sequencing depth to separate alleles from artefacts appears to be more robust. Importantly, the ‘degree of change’ (DOC) method avoids using arbitrary cut‐off thresholds by looking for statistical boundaries between the sequencing depth for alleles and artefacts, and hence, it is entirely repeatable across studies. Although the advances made in generating NGS data are still far ahead of our ability to perform reliable processing, analysis and interpretation, the community is developing statistically rigorous protocols that will allow us to address novel questions in evolution, ecology and genetics of multigene families. Future developments in third‐generation single molecule sequencing may potentially help overcome problems that still persist in de novo multigene amplicon genotyping when using current second‐generation sequencing approaches.
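The 'degree of change' idea, finding a statistical break between the sequencing depths of true alleles and artefacts, can be caricatured as a change-point search on ranked depths. The sketch below is a deliberately simplified stand-in for the published DOC method, using the largest relative drop between consecutive ranks as the boundary; the real method is more elaborate.

```python
def depth_breakpoint(variant_depths):
    """Simplified break-point heuristic for one amplicon: rank variants by read
    depth and place the allele/artefact boundary at the largest relative drop
    between consecutive ranks. Illustration only, not the published DOC method."""
    depths = sorted(variant_depths, reverse=True)
    drops = [(depths[i] - depths[i + 1]) / depths[i] for i in range(len(depths) - 1)]
    cut = drops.index(max(drops)) + 1   # number of variants kept as putative alleles
    return depths[:cut], depths[cut:]

# made-up per-variant read depths for one amplicon
alleles, artefacts = depth_breakpoint([940, 902, 885, 60, 41, 12, 7])
print(alleles)    # [940, 902, 885]
print(artefacts)  # [60, 41, 12, 7]
```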

20.

Background

Second-generation sequencers generate millions of relatively short, but error-prone, reads. These errors make sequence assembly and other downstream projects more challenging. Correcting these errors improves the quality of assemblies and of other analyses that benefit from error-free reads.

Results

We have developed a general-purpose error corrector that corrects errors introduced by Illumina, Ion Torrent, and Roche 454 sequencing technologies and can be applied to single- or mixed-genome data. In addition to correcting substitution errors, we locate and correct insertion, deletion, and homopolymer errors while remaining sensitive to low-coverage areas of sequencing projects. Using published data sets, we correct 94% of Illumina MiSeq errors, 88% of Ion Torrent PGM errors, and 85% of Roche 454 GS Junior errors. Introduced errors are 20 to 70 times rarer than successfully corrected errors. Furthermore, we show that the quality of assemblies improves when reads are corrected by our software.

Conclusions

Pollux is highly effective at correcting errors across platforms and consistently performs as well as or better than currently available error correction software. Pollux provides general-purpose error correction and may be used in applications with or without assembly.
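The abstract does not spell out Pollux's algorithm, but a common family of error-correction methods works from k-mer frequencies: a position covered only by rare k-mers is likely a sequencing error. The sketch below illustrates that generic heuristic (detection only, no correction) and should not be read as Pollux's actual method; k and the rarity threshold are arbitrary.

```python
from collections import Counter

def flag_error_positions(read, kmer_counts, k=7, weak=2):
    """Flag positions of a read as likely errors when every k-mer covering
    them is rare across the whole data set (generic k-mer heuristic)."""
    suspects = []
    last_start = len(read) - k
    for pos in range(len(read)):
        covering = [read[s:s + k]
                    for s in range(max(0, pos - k + 1), min(pos, last_start) + 1)]
        if covering and all(kmer_counts[km] <= weak for km in covering):
            suspects.append(pos)
    return suspects

# toy data: 20 identical error-free reads plus one read with a substitution
good = "ACGTTGCAAGGCTTACCGGATCAATCGTGG"
bad  = "ACGTTGCAAGGCTTAACGGATCAATCGTGG"   # error at position 15 (C -> A)
reads = [good] * 20 + [bad]
k = 7
counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
print(flag_error_positions(bad, counts, k=k))   # [15]
```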
