Similar Documents (20 results)
1.
Massively parallel sequencing is rapidly emerging as an efficient way to quantify biodiversity at all levels, from genetic variation and expression to ecological community assemblage. However, the number of reads produced per sequencing run far exceeds the number required per sample for many applications, compelling researchers to sequence multiple samples per run in order to maximize efficiency. For studies that include a PCR step, this can be accomplished using primers that include an index sequence, allowing sample origin to be determined after sequencing. The use of indexed primers assumes they behave no differently from standard primers; however, we found that indexed primers cause substantial template sequence-specific bias, resulting in radically different profiles of the same environmental sample. Likely owing to differential amplification efficiency caused by primer-template mismatch, two indexed primer sets spuriously changed the inferred sequence abundance from the same DNA extraction by up to 77.1%. We demonstrate that a double PCR approach alleviates these effects in applications where indexed primers are necessary.
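The index-based demultiplexing described above is straightforward to sketch: each read is binned by its leading index sequence, after which per-sample profiles can be compared. A minimal illustration (the function name, index sequences, and reads below are hypothetical, not from the study):

```python
def demultiplex(reads, index_to_sample, index_len=8):
    """Assign reads to samples by their leading index sequence.

    reads: iterable of sequence strings beginning with the index.
    index_to_sample: dict mapping index sequence -> sample name.
    Returns a dict of sample name -> list of reads with the index
    trimmed off; reads with an unknown index are collected under None.
    """
    bins = {}
    for read in reads:
        idx = read[:index_len]
        sample = index_to_sample.get(idx)  # None if index not recognized
        bins.setdefault(sample, []).append(read[index_len:])
    return bins

# Hypothetical indices and reads:
indices = {"ACGTACGT": "soil_A", "TGCATGCA": "soil_B"}
reads = ["ACGTACGTTTGGCCAA", "TGCATGCAGGTTCCAA", "NNNNNNNNAAAA"]
bins = demultiplex(reads, indices)
```

Comparing the resulting per-sample bins for the same environmental extraction is exactly the comparison in which the study observed index-dependent bias.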

2.
Constructing mixtures of tagged or bar-coded DNAs for sequencing is an important requirement for the efficient use of next-generation sequencers in applications where limited sequence data are required per sample. There are many applications in which next-generation sequencing can be used effectively to sequence large mixed samples; an example is the characterization of microbial communities, where ≤1,000 sequences per sample are adequate to address research questions. Thus, it is possible to examine hundreds to thousands of samples per run on massively parallel next-generation sequencers. However, the cost savings from efficient utilization of sequence capacity are realized only if the production and management costs associated with construction of multiplex pools are also scalable. One critical step in multiplex pool construction is the normalization process, whereby equimolar amounts of each amplicon are mixed. Here we compare three approaches (spectroscopy, size-restricted spectroscopy, and quantitative binding) for normalization of large, multiplex amplicon pools for performance and efficiency. We found that the quantitative binding approach was superior and represents an efficient, scalable process for construction of very large multiplex pools with hundreds and perhaps thousands of individual amplicons. We demonstrate the increased sequence diversity identified with higher throughput. Massively parallel sequencing can dramatically accelerate microbial ecology studies by allowing appropriate replication of sequence acquisition to account for temporal and spatial variation. Further, population studies to examine genetic variation, which require even lower levels of sequencing, should be possible where thousands of individual bar-coded amplicons are examined in parallel.

Emergent technologies that generate DNA sequence data are designed primarily to perform resequencing projects at reasonable cost. The result is a substantial decrease in per-base costs relative to traditional methods. However, these next-generation platforms do not readily accommodate projects that require obtaining moderate amounts of sequence from large numbers of samples. These platforms also have significant per-run costs that generally preclude large numbers of single-sample, nonmultiplexed runs. One example of research that is not readily supported is rRNA-directed metagenomic study of some human clinical samples, or environmental rRNA analysis of samples from low-diversity communities, where only thousands of sequences are required. Thus, strategies to utilize next-generation DNA sequencers efficiently for applications that require lower throughput are critical to capitalize on the efficiency and cost benefits of next-generation sequencing platforms.

Directed metagenomics based on amplification of rRNA genes is an important tool to characterize microbial communities in various environmental and clinical settings. In diverse environmental samples, large numbers of sequences are required to fully characterize the microbial communities (15). However, a lower number of sequences is generally adequate to answer specific research questions. In addition, the levels of diversity in human clinical samples are usually lower than those observed in environmental samples (for example, see reference 7).

The Roche 454 genome sequencer system FLX pyrosequencer (hereafter 454 FLX) is the most useful platform for rRNA-directed metagenomics because it currently provides the longest read lengths of any next-generation sequencing platform (1, 14). Computational analysis has shown that the 250-nucleotide read length (available from the 454 FLX-LR chemistry) is adequate for identification of bacteria if the amplified region is properly positioned within variable regions of the small-subunit rRNA (SSU-rRNA) gene (9, 10). In this study, we used the 454 FLX-LR genome sequencing platform and chemistry, which provides >400,000 sequences of ∼250 bp per run. After we conducted this study, a new reagent set (454 FLX-XLR titanium chemistry) was released, which further increases read counts to >1,000,000 and read lengths to >400 bp (Roche). The 454 FLX platform dramatically reduces the per-base cost of obtaining sequence, and physical separation of the plate into 2 to 16 lanes is available; this physical separation reduces overall sequencing output by up to 40% when comparing 2 lanes with 16 lanes. For applications where modest sequencing depth (∼1,000 sequences per sample) is adequate to address research questions, physical separation does not allow adequate sample multiplexing, because even a 1/16 454 FLX-LR plate run is expected to produce ∼15,000 reads. Further, the utility of the platform as a screening tool at 16-plex is limited by cost per run.

A solution to make next-generation sequencing economical for projects such as rRNA-directed metagenomics is to use bar-coded primers to multiplex amplicon pools so they can be sequenced together and computationally separated afterward (6). To accomplish this strategy, precise normalization of the DNA concentrations of the individual amplicons in the multiplex pools is essential when large numbers of pooled samples are sequenced in parallel. Several potential methods are available for normalizing concentrations of amplicons in multiplex pools, but the relative and absolute performance of each approach has not been compared.

In this study, we present a direct quantitative comparison of three available methods for amplicon pool normalization for downstream next-generation sequencing. The central goal of the study was to identify the most effective method for normalizing multiplex pools containing >100 individual amplicons. We evaluated each pooling approach by 454 sequencing and compared the observed frequencies of sequences from different pooled bar-coded amplicons. From these data, we determined the efficacy of each method based on the following factors: (i) how well normalized the sequences within the pool were, (ii) the proportion of samples failing to meet a minimum threshold of sequences per sample, and (iii) the overall efficiency (speed and labor required) of the process to multiplex samples.
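The equimolar normalization step discussed above reduces to simple molar arithmetic: given each amplicon's concentration and length, compute the stock volume that contributes a fixed number of moles to the pool. The sketch below is a generic illustration of that arithmetic, not any of the three methods compared in the study (names and numbers are illustrative; the average mass of a double-stranded base pair is taken as ~660 g/mol):

```python
def equimolar_volumes(amplicons, target_fmol=10.0):
    """Volume (uL) of each amplicon stock contributing target_fmol femtomoles.

    amplicons: dict of name -> (concentration in ng/uL, length in bp).
    Conversion: 1 ng of dsDNA of length L bp is 1e6 / (L * 660) fmol.
    """
    volumes = {}
    for name, (ng_per_ul, length_bp) in amplicons.items():
        fmol_per_ul = ng_per_ul * 1e6 / (length_bp * 660.0)
        volumes[name] = target_fmol / fmol_per_ul
    return volumes

# A 500 bp amplicon at 33 ng/uL is 100 fmol/uL, so 10 fmol needs 0.1 uL.
vols = equimolar_volumes({"amp1": (33.0, 500)})
```

Quantitative-binding kits sidestep this calculation by physically capturing a fixed amount of DNA per sample, which is why they scale better to hundreds of amplicons.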

3.
High-throughput sequencing technologies produce short sequence reads that can contain phase information if they span two or more heterozygous genotypes. This information is not routinely used by current methods that infer haplotypes from genotype data. We have extended the SHAPEIT2 method to use phase-informative sequencing reads to improve phasing accuracy. Our model incorporates the read information in a probabilistic model through base quality scores within each read. The method is primarily designed for high-coverage sequence data or data sets that already have genotypes called. One important application is phasing of single samples sequenced at high coverage for use in medical sequencing and studies of rare diseases. Our method can also use existing panels of reference haplotypes. We tested the method using a mother-father-child trio sequenced at high coverage by Illumina, together with the low-coverage sequence data from the 1000 Genomes Project (1000GP). We found that use of phase-informative reads increases the mean distance between switch errors by 22%, from 274.4 kb to 328.6 kb. We also used male chromosome X haplotypes from the 1000GP samples to simulate sequencing reads with varying insert size, read length, and base error rate. When using short 100 bp paired-end reads, we found that using mixtures of insert sizes produced the best results. When using longer reads with high error rates (5–20 kb reads with 4%–15% error per base), phasing performance was substantially improved.
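The switch-error metric reported above can be computed by comparing inferred and true haplotypes at successive heterozygous sites: a switch error is a flip in agreement between consecutive sites. A minimal sketch of the count (not the SHAPEIT2 implementation; haplotypes are represented as allele strings over het sites only):

```python
def switch_errors(true_hap, inferred_hap):
    """Count switch errors between two haplotype allele strings.

    Each string gives the allele ('0' or '1') carried by one haplotype
    at successive heterozygous positions. A switch error occurs wherever
    the inferred haplotype changes between matching and mismatching the
    truth across two consecutive sites.
    """
    agree = [t == i for t, i in zip(true_hap, inferred_hap)]
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)

# Truth 0000 vs inferred 0011: one switch, between the 2nd and 3rd sites.
n = switch_errors("0000", "0011")
```

Dividing the genomic span phased by (switch errors + 1) gives the mean distance between switch errors quoted in the abstract.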

4.
Impressive progress has been made in the field of Next Generation Sequencing (NGS). Through advances in molecular biology and technical engineering, parallelization of the sequencing reaction has profoundly increased the total number of sequence reads produced per run. Current sequencing platforms allow a previously unprecedented view into complex mixtures of RNA and DNA samples. NGS is evolving into a molecular microscope finding its way into virtually every field of biomedical research. In this chapter we review the technical background of the different commercially available NGS platforms with respect to template generation and the sequencing reaction, and take a small step towards what upcoming NGS technologies will bring. We close with an overview of different implementations of NGS in biomedical research. This article is part of a Special Issue entitled: From Genome to Function.

5.
Formalin fixation and paraffin embedding (FFPE) has been a standard sample preparation method for decades, and archival FFPE samples remain very useful resources. Nonetheless, the use of FFPE samples in cancer genome analysis by next-generation sequencing, a powerful technique for identifying genomic alterations at the nucleotide level, has been challenging due to poor DNA quality and artifactual sequence alterations. In this study, we performed whole-exome sequencing of matched frozen and FFPE tissue samples from 4 cancer patients and compared the next-generation sequencing data obtained from them. The major differences between data from the 2 sample types were the shorter insert size and artifactual base alterations in the FFPE samples. A high proportion of short inserts in the FFPE samples resulted in overlapping paired reads, which could lead to overestimation of certain variants; >20% of the inserts in the FFPE samples were double-sequenced. A large number of soft-clipped reads was found in the sequencing data of the FFPE samples, and about 30% of total bases were soft-clipped. The artifactual base alterations C>T and G>A were observed in FFPE samples only, at a rate of 200 to 1,200 per 1M bases after sequencing errors were removed. Although high-confidence mutation calls in the FFPE samples were concordant with those in the frozen samples, caution should be exercised regarding the artifacts, especially for low-confidence calls. Despite the clearly observed artifacts, archival FFPE samples can be a good resource for discovery or validation of biomarkers in cancer research based on whole-exome sequencing.
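A quick screen for the C>T/G>A deamination artifacts described above is to tally substitution classes among called variants; an inflated C>T/G>A fraction flags FFPE damage. A hedged sketch of that tally (the variant representation is hypothetical, not from the study's pipeline):

```python
def deamination_fraction(variants):
    """Fraction of single-nucleotide substitutions that are C>T or G>A.

    variants: iterable of (ref_base, alt_base) pairs. With 12 possible
    substitution classes, an unbiased sample would put roughly 1/6 of
    substitutions in {C>T, G>A}; FFPE deamination inflates this fraction.
    """
    subs = [(r, a) for r, a in variants if r != a]
    if not subs:
        return 0.0
    damaged = sum(1 for r, a in subs if (r, a) in {("C", "T"), ("G", "A")})
    return damaged / len(subs)
```

Applied per sample, a markedly higher fraction in the FFPE data than in the matched frozen data would reproduce the artifact pattern the study reports.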

6.
With the decreasing cost of next-generation sequencing, deep sequencing of clinical samples provides unique opportunities to understand host-associated microbial communities. Among the primary challenges of clinical metagenomic sequencing is the rapid filtering of human reads to survey for pathogens with high specificity and sensitivity. Metagenomes are inherently variable due to different microbes in the samples and their relative abundance, the size and architecture of genomes, and factors such as target DNA amounts in tissue samples (i.e. human DNA versus pathogen DNA concentration). This variation in metagenomes typically manifests in sequencing datasets as low pathogen abundance, a high number of host reads, and the presence of close relatives and complex microbial communities. In addition to these challenges posed by the composition of metagenomes, the high numbers of reads generated by high-throughput deep sequencing pose immense computational challenges. Accurate identification of pathogens is confounded by individual reads mapping to multiple reference genomes, due to gene similarity across taxa present in the community or close relatives in the reference database. Available global and local sequence aligners also vary in sensitivity, specificity, and speed of detection. The efficiency of pathogen detection in clinical samples is largely dependent on the desired taxonomic resolution. We have developed an efficient strategy that identifies "all against all" relationships between sequencing reads and reference genomes. Our approach allows for scaling to large reference databases and then genome reconstruction by aggregating global and local alignments, thus allowing genetic characterization of pathogens at higher taxonomic resolution. These results were consistent with strain-level SNP genotyping and bacterial identification from laboratory culture.

7.
Determination of sequence variation within a genetic locus to develop clinically relevant databases is critical for molecular assay design and clinical test interpretation, so multisample pooling for Illumina Genome Analyzer (GA) sequencing was investigated using the RET proto-oncogene as a model. Samples were Sanger-sequenced for RET exons 10, 11, and 13–16. Ten samples with 13 known unique variants ("singleton variants" within the pool) and seven common changes were amplified and then pooled in equimolar amounts before sequencing on a single flow cell lane, generating 36-base reads. For comparison, a single control sample was run in a different lane. After alignment, a quality-score screening threshold of 24 and trimming of three bases from the 3′ read ends yielded low background error rates, at the cost of a 27% decrease in aligned read coverage. Sequencing data were evaluated using an established variant detection method (percent variant reads), the subtractive correction method presented here, and SNPSeeker software. In total, 41 variants (23 of them singletons) were detected in the 10-sample pool data, including all Sanger-identified variants. The 23 singleton variants were detected near the expected 5% allele frequency (average 5.17%±0.90% variant reads), well above the highest background error (1.25%). Based on background error rates, read coverage, simulated 30-, 40-, and 50-sample pool data, expected singleton allele frequencies within pools, and variant detection methods, ≥30 samples (which demonstrated a minimum of 1% variant reads for singletons) could be pooled to reliably detect singleton variants by GA sequencing.
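In a pool of N diploid samples, a heterozygous singleton is expected at an allele frequency of 1/(2N), which is the 5% figure for the 10-sample pool above. Singleton calling then amounts to comparing the variant-read fraction at a site against the background error rate. A simplified sketch of that comparison, not the subtractive correction method itself (the half-of-expected threshold is an illustrative choice; the 1.25% error rate is the abstract's highest observed background):

```python
def call_singleton(variant_reads, total_reads, n_samples, error_rate=0.0125):
    """Flag a site as a candidate singleton variant in a pooled sample.

    A heterozygous singleton in a pool of n_samples diploid samples is
    expected at allele frequency 1/(2*n_samples). We require the observed
    variant-read fraction to exceed the background error rate and to reach
    at least half the expected singleton frequency.
    """
    expected = 1.0 / (2 * n_samples)
    observed = variant_reads / total_reads
    return observed > error_rate and observed >= expected / 2

# 10-sample pool: singletons are expected near 5% variant reads.
hit = call_singleton(52, 1000, 10)     # ~5.2% -> candidate
miss = call_singleton(8, 1000, 10)     # 0.8% -> within background error
```

The abstract's conclusion follows directly from this arithmetic: at 30 samples the expected singleton frequency is ~1.7%, so the observed ≥1% variant-read minimum still clears a 1.25%-style error floor only with adequate coverage.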

8.
Genome copy number is an important source of genetic variation in health and disease. In cancer, copy number alterations (CNAs) can be inferred from short-read sequencing data, enabling genomics-based precision oncology. Emerging nanopore sequencing technologies offer the potential for broader clinical utility, for example in smaller hospitals, due to lower instrument cost, higher portability, and ease of use. Nonetheless, nanopore sequencing devices retrieve fewer sequencing reads/molecules than short-read sequencing platforms, limiting CNA inference accuracy. To address this limitation, we targeted the sequencing of short DNA molecules loaded at optimized concentration in an effort to increase the read/molecule yield of a single nanopore run. We show that short-molecule nanopore sequencing reproducibly returns high read counts and allows high-quality CNA inference. We demonstrate the clinical relevance of this approach by accurately inferring CNAs in acute myeloid leukemia samples. Compared to traditional approaches such as chromosome analysis/cytogenetics, short-molecule nanopore sequencing returns more sensitive and accurate copy number information in a cost-effective and expeditious manner, including for multiplexed samples. Our results provide a framework for short-molecule nanopore sequencing with applications in research and medicine that include, but are not limited to, CNA analysis.
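Read-count-based CNA inference, as used above, is commonly done by binning reads along the genome, normalizing by total depth, and taking a per-bin log2 ratio against a matched normal or reference; this is a generic sketch of that idea, not the authors' pipeline:

```python
import math

def log2_ratios(tumor_counts, normal_counts):
    """Per-bin log2 copy-number ratios from binned read counts.

    Counts are normalized by each sample's total so that differing
    sequencing depths cancel out. A ratio near 0 is copy-neutral;
    positive values suggest gain and negative values suggest loss.
    """
    t_total = sum(tumor_counts)
    n_total = sum(normal_counts)
    return [math.log2((t / t_total) / (n / n_total))
            for t, n in zip(tumor_counts, normal_counts)]

# Four bins; the third has half the relative depth -> log2 ratio of -1.
ratios = log2_ratios([150, 100, 50, 100], [100, 100, 100, 100])
```

Because the variance of each bin's ratio shrinks as read counts grow, the paper's push to maximize reads per nanopore run translates directly into tighter log2 ratios and more confident CNA calls.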

12.

Background  

High-throughput sequencing (HTS) platforms produce gigabases of short-read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate-size genomes from such reads remains a significant challenge. These limitations could be partially overcome by utilizing mate-pair technology, which provides pairs of short reads separated by a known distance along the genome.

13.

Background

Massively parallel sequencing offers enormous potential for expression profiling, in particular for interspecific comparisons. Currently, different platforms for massively parallel sequencing are available, which differ in read length and sequencing costs. The 454 technology offers the longest reads, while the other sequencing technologies are more cost-effective at the expense of shorter reads. Reliable expression profiling by massively parallel sequencing depends crucially on the accuracy with which reads can be mapped to the corresponding genes.

Methodology/Principal Findings

We performed an in silico analysis to evaluate whether incorrect mapping of the sequence reads results in a biased expression pattern. A comparison of six available mapping software tools indicated considerable heterogeneity in mapping speed and accuracy. Independently of the software used to map the reads, we found that for compact genomes both short (35 bp, 50 bp) and long sequence reads (100 bp) result in an almost unbiased expression pattern. In contrast, for species with a larger genome containing more gene families and repetitive DNA, shorter reads (35–50 bp) produced a considerable bias in gene expression. In humans, about 10% of the genes had fewer than 50% of their sequence reads correctly mapped. Sequence polymorphism of up to 9% had almost no effect on the mapping accuracy of 100 bp reads; for 35 bp reads, sequence divergence of up to 3% did not strongly affect mapping accuracy. The effect of indels on mapping efficiency depends strongly on the mapping software.

Conclusions/Significance

In complex genomes, expression profiling by massively parallel sequencing can introduce a considerable bias due to incorrectly mapped sequence reads if the read length is short. Nevertheless, this bias can be accounted for if the genomic sequence is known. Furthermore, sequence polymorphisms and indels also affect mapping accuracy and may cause biased gene expression measurements. The choice of mapping software is highly critical, and reliability depends on the presence or absence of indels and on the divergence between reads and the reference genome. Overall, we found SSAHA2 and CLC to produce the most reliable mapping results.

15.
DNA sample contamination is a serious problem in DNA sequencing studies and may result in systematic genotype misclassification and false-positive associations. Although methods exist to detect and filter out cross-species contamination, few methods are available to detect within-species sample contamination. In this paper, we describe methods to identify within-species DNA sample contamination based on (1) a combination of sequencing reads and array-based genotype data, (2) sequence reads alone, and (3) array-based genotype data alone. Analysis of sequencing reads allows contamination detection after sequence data are generated but prior to variant calling; analysis of array-based genotype data allows contamination detection prior to generation of costly sequence data. Through analysis of both in silico and experimentally contaminated samples, we show that our methods can reliably detect and estimate contamination levels as low as 1%. We evaluate the impact of DNA contamination on genotype accuracy and propose effective strategies to screen for and prevent DNA contamination in sequencing studies.
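One common way to estimate within-species contamination from reads, in the spirit of the methods above, is to examine sites genotyped homozygous-reference: non-reference reads there arise (apart from sequencing error) from contaminating DNA, in proportion to the contamination level times the contaminant's allele frequency. Below is a deliberately simplified estimator, not the authors' method; it ignores sequencing error and uses population allele frequencies as a stand-in for the contaminant's genotype:

```python
def estimate_contamination(sites):
    """Rough contamination estimate from homozygous-reference sites.

    sites: list of (alt_reads, total_reads, pop_alt_freq) tuples at
    positions where the sample is genotyped homozygous reference.
    The mean observed alt-read fraction is divided by the mean
    population alt-allele frequency, since a contamination level c
    contributes alt reads at rate c * pop_alt_freq per site.
    """
    obs = sum(a / t for a, t, _ in sites) / len(sites)
    per_unit = sum(f for _, _, f in sites) / len(sites)
    return obs / per_unit

# 1% contamination at sites with 50% population alt frequency yields
# an alt-read fraction near 0.5%.
est = estimate_contamination([(5, 1000, 0.5), (5, 1000, 0.5)])
```

In practice, maximum-likelihood formulations over many thousands of sites are used instead, precisely so that sequencing error and genotype uncertainty can be modeled rather than ignored.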

16.
Next-generation sequencing technology has revolutionised microbiology by allowing concurrent analysis of whole microbial communities. Here we developed and verified similar methods for the analysis of fungal communities using a proton-release sequencing platform able to sequence reads of up to 400 bp at significant depth. This read length permits sequencing of amplicons from commonly used fungal identification regions and thereby taxonomic classification. Using the 400 bp sequencing capability, we sequenced amplicons from the ITS1, ITS2 and LSU fungal regions to a depth of approximately 700,000 raw reads per sample. Representative operational taxonomic units (OTUs) were chosen by the USEARCH algorithm and identified taxonomically through nucleotide BLAST (BLASTn). Combining this sequencing technology with the bioinformatics pipeline allowed species recognition in two controlled fungal spore populations containing members of known identity and concentration. Each species included within the two controlled populations was found to correspond to a representative OTU, and these OTUs were found to be highly accurate representations of true biological sequences. However, the absolute number of reads attributed to each OTU differed among species. The majority of species were represented by an OTU derived from all three genomic regions, although in some cases species were represented in only two of the regions due to the absence of conserved primer binding sites or due to sequence composition. It is apparent from our data that proton-release sequencing technologies can deliver a qualitative assessment of the fungal members comprising a sample. The fact that some fungi cannot be amplified by specific "conserved" primer pairs supports our recommendation that a multi-region approach be taken for other amplicon-based metagenomic studies.

17.
Genomic structural variation (SV), a common hallmark of cancer, has important predictive and therapeutic implications. However, accurately detecting SV using high-throughput sequencing data remains challenging, especially for targeted resequencing efforts. This is critically important in the clinical setting, where targeted resequencing is frequently applied to rapidly assess clinically actionable mutations in tumor biopsies in a cost-effective manner. We present BreaKmer, a novel approach that uses a k-mer strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications and translocations at base-pair resolution in targeted resequencing data. Variants are predicted by realigning an assembled consensus sequence created from sequence reads that were abnormally aligned to the reference genome. Using targeted resequencing data from tumor specimens with orthogonally validated SV, non-tumor samples and whole-genome sequencing data, BreaKmer had a 97.4% overall sensitivity for known events and predicted 17 positively validated, novel variants. Relative to four publicly available algorithms, BreaKmer detected SV with increased sensitivity and limited calls in non-tumor samples, key features for variant analysis of tumor specimens in both the clinical and research settings.

18.
With the severe acute respiratory syndrome epidemic of 2003 and renewed attention on avian influenza pandemics, new surveillance systems are needed for earlier detection of emerging infectious diseases. We applied a "next-generation" parallel sequencing platform for viral detection in nasopharyngeal and fecal samples collected during seasonal influenza virus (Flu) infections and norovirus outbreaks from 2005 to 2007 in Osaka, Japan. Random RT-PCR was performed to amplify RNA extracted from 0.1–0.25 ml of nasopharyngeal aspirates (N = 3) and fecal specimens (N = 5), and more than 10 µg of cDNA was synthesized. Unbiased high-throughput sequencing of these 8 samples yielded 15,298–32,335 (average 24,738) reads in a single 7.5 h run. In the nasopharyngeal samples, although whole-genome analysis was not possible because the majority (>90%) of reads were host genome-derived, 20–460 Flu reads were detected, sufficient for subtype identification. In the fecal samples, bacteria and host cells were removed by centrifugation, yielding 484–15,260 reads of norovirus sequence (78–98% of the whole genome was covered), except for one specimen that was also under-detectable by RT-PCR. These results suggest that our unbiased high-throughput sequencing approach is useful for directly detecting pathogenic viruses without prior genetic information. Although its cost and technological availability make it unlikely that this system will soon be the diagnostic standard worldwide, it could be useful for earlier discovery of novel emerging viruses and bioterrorism agents, which are difficult to detect with conventional procedures.

20.
The classical theory of shotgun DNA sequencing accounts for neither the placement dependencies that are a fundamental consequence of the forward-reverse sequencing strategy, nor the edge effect that arises for small to moderate-sized genomic targets. These phenomena are relevant to a number of sequencing scenarios, including large-insert BAC and fosmid clones, filtered genomic libraries, and macronuclear chromosomes. Here, we report a model that considers these two effects and provides both the expected value of coverage and its variance. Comparison to methyl-filtered maize data shows significant improvement over classical theory. The model is used to analyze coverage performance over a range of small to moderately sized genomic targets. We find that the read-pairing effect and the edge effect interact in a non-trivial fashion. Shorter reads give superior coverage per unit sequence depth relative to longer ones. In principle, end-sequences can be optimized with respect to template insert length; however, optimal performance is unlikely to be realized in most cases because of inherent size variation in any set of targets. Conversely, single-stranded reads exhibit roughly the same coverage attributes as optimized end-reads. Although linking information is lost, single-stranded data should not pose a significant assembly liability if the target represents predominantly low-copy sequence. We also find that random sequencing should be halted at substantially lower redundancies than those now associated with larger projects. Given the enormous amount of data generated per cycle on pyrosequencing instruments, this observation suggests devising schemes to split each run cycle between two or more projects. This would prevent over-sequencing and further leverage the pyrosequencing method.
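The classical theory referenced above is the Lander-Waterman model: a target sequenced to redundancy c (total read bases divided by target length) has an expected covered fraction of 1 − e^(−c). The paper's contribution is modeling the pairing and edge effects this expression ignores. A quick sketch of the classical expectation:

```python
import math

def expected_coverage_fraction(n_reads, read_len, target_len):
    """Classical (Lander-Waterman) expected covered fraction of a target.

    Redundancy c = n_reads * read_len / target_len; the expected
    fraction of the target covered by at least one read is 1 - e^(-c).
    This deliberately ignores the read-pairing and edge effects that
    the paper's model adds.
    """
    c = n_reads * read_len / target_len
    return 1 - math.exp(-c)

# 1,000 reads of 100 bp on a 100 kb target: c = 1, coverage ~63.2%.
frac = expected_coverage_fraction(1000, 100, 100_000)
```

The diminishing-returns shape of 1 − e^(−c) is also what underlies the paper's recommendation to halt random sequencing at lower redundancies: going from c = 4 to c = 8 closes less than 2% of the remaining gaps.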
