Similar Articles
20 similar articles found.
1.
Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads by filtering mismatched reads that remain in alignments after local realignment and error correction of mismatched reads. The error correction is executed based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data of rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in ‘targeted’ alignments, where whole-genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.
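To illustrate the error-correction idea described above (a minimal sketch, not Coval's actual implementation), the toy function below corrects a non-reference call to the reference base when the alternative allele is rare in the pileup column and the call itself is low quality; the function name and thresholds are hypothetical:

```python
# A minimal sketch (not Coval's actual implementation) of quality- and
# allele-frequency-based correction of mismatched bases in a pileup column.
def correct_column(ref_base, calls, min_alt_frac=0.2, min_qual=20):
    """calls: list of (base, phred_quality) observed at one reference position.
    Returns corrected calls: likely-erroneous non-reference bases are
    replaced by the reference base."""
    alt = [b for b, q in calls if b != ref_base]
    alt_frac = len(alt) / len(calls) if calls else 0.0
    corrected = []
    for base, qual in calls:
        # A non-reference base is treated as a sequencing error when the
        # alternative allele is rare overall and this call is low quality.
        if base != ref_base and alt_frac < min_alt_frac and qual < min_qual:
            corrected.append((ref_base, qual))
        else:
            corrected.append((base, qual))
    return corrected

# Example: one low-quality, low-frequency 'G' is corrected back to 'A'.
print(correct_column("A", [("A", 35), ("G", 12), ("A", 30), ("A", 33)]))
```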

2.
Smeds L, Künstner A (2011) PLoS ONE 6(10): e26314
During the last few years, DNA and RNA sequencing have started to play an increasingly important role in biological and medical applications, especially owing to the greater amount of data yielded by the new sequencing machines and the enormous decrease in sequencing costs. In particular, Illumina/Solexa sequencing has had an increasing impact on gathering data from model and non-model organisms. However, accurate and easy-to-use tools for quality filtering have not yet been established. We present ConDeTri, a method for content-dependent read trimming of next-generation sequencing data using the quality score of each individual base. The main focus of the method is to remove sequencing errors from reads so that sequencing reads can be standardized. Another aspect of the method is to incorporate read trimming into next-generation sequencing data processing and analysis pipelines. It can process single-end and paired-end sequence data of arbitrary length, is independent of sequencing coverage, and requires no user interaction. ConDeTri is able to trim and remove reads with low quality scores, saving computational time and memory during de novo assemblies. Low-coverage and large-genome sequencing projects in particular will gain from trimming reads. The method can easily be incorporated into preprocessing and analysis pipelines for Illumina data. AVAILABILITY AND IMPLEMENTATION: Freely available on the web at http://code.google.com/p/condetri.
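A minimal sketch of per-base quality trimming in the spirit of ConDeTri (not ConDeTri's actual, more elaborate algorithm): trim the 3' end while bases fall below a quality cutoff and discard reads that become too short. Names and defaults are illustrative:

```python
def trim_3prime(seq, quals, q_cutoff=25, min_len=30):
    """Trim low-quality bases from the 3' end of a read; drop the read
    entirely if the trimmed sequence is too short. `quals` are Phred scores."""
    end = len(seq)
    while end > 0 and quals[end - 1] < q_cutoff:
        end -= 1
    return (seq[:end], quals[:end]) if end >= min_len else None

read = "ACGTACGTACGTACGTACGTACGTACGTACGTAA"
quals = [38] * 30 + [10, 8, 5, 2]
print(trim_3prime(read, quals))   # trims the four low-quality tail bases
```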

3.
The emergence of next-generation sequencing (NGS) technologies has significantly improved sequencing throughput and reduced costs. However, the short read length, duplicate reads and massive volume of data make data processing much more difficult and complicated than with first-generation sequencing technology. Although some software packages have been developed to assess data quality, they either are not easily available to users or require bioinformatics skills and computing resources. Moreover, almost none of the quality assessment software currently available takes sequencing errors into account when assessing duplicates in NGS data. Here, we present a new user-friendly quality assessment software package called BIGpre, which works for both Illumina and 454 platforms. BIGpre contains all the functions of other quality assessment software, such as the correlation between forward and reverse reads, the read GC-content distribution, and per-base N content and quality. More importantly, BIGpre incorporates associated programs to detect and remove duplicate reads after taking sequencing errors into account, as well as to trim low-quality reads from raw data. BIGpre is primarily written in Perl and integrates graphical capability from the statistics package R. The package produces both tabular and graphical summaries of data quality for sequencing datasets from Illumina and 454 platforms. Processing hundreds of millions of reads within minutes, it provides immediate diagnostic information for users to manipulate sequencing data for downstream analyses. BIGpre is freely available at http://bigpre.sourceforge.net/.
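To illustrate duplicate assessment that tolerates sequencing errors (a toy sketch, not BIGpre's implementation), reads sharing an exact 5' seed can be compared by Hamming distance, so a duplicate carrying one or two errors is still recognized:

```python
from collections import defaultdict

def find_duplicates(reads, seed_len=12, max_mismatch=2):
    """Toy duplicate detection tolerating sequencing errors: reads sharing
    an exact 5' seed are compared, and near-identical pairs (Hamming
    distance <= max_mismatch) are flagged as duplicates."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    buckets = defaultdict(list)
    for i, r in enumerate(reads):
        buckets[r[:seed_len]].append(i)
    dup_of = {}
    for members in buckets.values():
        for pos, i in enumerate(members):
            if i in dup_of:
                continue
            for j in members[pos + 1:]:
                if j not in dup_of and hamming(reads[i], reads[j]) <= max_mismatch:
                    dup_of[j] = i      # j is a duplicate of representative i
    return dup_of

reads = ["ACGTACGTACGTAAAT", "ACGTACGTACGTAAAC", "TTTTACGTACGTAAAT"]
print(find_duplicates(reads))  # read 1 flagged as an error-bearing duplicate of read 0
```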

4.
MOTIVATION: Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool most likely to produce the best performance under a specific set of conditions. RESULTS: We studied and compared the performance of commonly used de novo assembly tools specifically designed for next-generation sequencing data, including SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were compared using several performance criteria, including N50 length, sequence coverage and assembly accuracy. Various properties of read data, including single-end/paired-end type, sequence GC content, depth of coverage and base calling error rates, were investigated for their effects on the performance of different assembly tools. We also compared the computation time and memory usage of these seven tools. Based on the results of our comparison, the relative performance of the individual tools is summarized, and tentative guidelines for optimal selection of assembly tools under different conditions are provided.
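One of the comparison criteria, N50 length, is simple to compute; a minimal sketch:

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L contain
    at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([100, 80, 60, 40, 20]))  # total 300; 100+80 >= 150, so N50 = 80
```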

5.
BACKGROUND: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of these data can be affected by several factors, and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. RESULTS: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number of a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the differences between the sample of interest and the reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html, and will be submitted to Bioconductor upon publication of this article. CONCLUSIONS: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
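The core idea can be sketched under a simplifying assumption: if each copy of a true read independently acquires at most one error with probability e, the expected shadow count is roughly e times the copy number, so the fitted slope estimates the per-read error rate (the published method's transformation is more careful than this):

```python
import numpy as np

def shadow_error_rate(copy_counts, shadow_counts, read_len):
    """Fit shadow_count ~ slope * copy_count through the origin; under the
    simplifying assumption that each copy independently acquires at most one
    error with probability e, slope ~= e and the per-base rate ~= e / read_len."""
    x = np.asarray(copy_counts, dtype=float)
    y = np.asarray(shadow_counts, dtype=float)
    slope = (x @ y) / (x @ x)          # least squares, no intercept
    return slope, slope / read_len

# Simulated example: per-read error rate 0.02, reads of length 100.
copies = np.array([50, 200, 1000, 5000])
shadows = 0.02 * copies + np.random.default_rng(0).normal(0, 1, 4)
print(shadow_error_rate(copies, shadows, read_len=100))
```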

6.
7.
8.
Fragment assembly with short reads (cited 5 times: 0 self-citations, 5 by others)
MOTIVATION: Current DNA sequencing technology produces reads of about 500-750 bp, with typical coverage under 10x. New sequencing technologies are emerging that produce shorter reads (length 80-200 bp) but allow one to generate significantly higher coverage (30x and higher) at low cost. Modern assembly programs and error correction routines have been tuned to work well with current read technology but were not designed for assembly of short reads. RESULTS: We analyze the limitations of assembling reads generated by these new technologies and present a routine for base-calling in reads prior to their assembly. We demonstrate that while it is feasible to assemble such short reads, the resulting contigs will require significant (if not prohibitive) finishing efforts. AVAILABILITY: Available from the web at http://www.cse.ucsd.edu/groups/bioinformatics/software.html

9.
He F, Li Y, Tang Y-H, Ma J, Zhu H (2016) BMC Genomics 17(1): 141-151
Background

The identification of inversions of DNA segments shorter than the read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging for next-generation sequencing reads. MIs are recognized as an important class of genomic variation and may play a role in causing genetic disease. However, current alignment methods are generally too insensitive to detect MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads.

Results

The algorithm of MID is designed based on a dynamic programming path-finding approach. What makes MID different from other variant detection tools is that MID can handle small MIs and multiple breakpoints within an unmapped read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. Our evaluation demonstrated that MID outperforms Gustaf, which can currently detect inversions from 30 bp to 500 bp.

Conclusions

To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short next-generation sequencing reads. MID is reliable on low coverage data, which is suitable for large-scale projects such as the 1000 Genomes Project (1KGP). MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from Cancer Cell Line Encyclopedia (CCLE). Therefore our tool is expected to be useful to improve the study of MIs as a type of genetic variant in the human genome. The source code can be downloaded from: http://cqb.pku.edu.cn/ZhuLab/MID.

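A brute-force sketch of the underlying idea (not MID's dynamic-programming algorithm): an unmapped read harbors a candidate MI if reverse-complementing some internal segment makes it match the reference again:

```python
def revcomp(s):
    return s[::-1].translate(str.maketrans("ACGT", "TGCA"))

def find_micro_inversion(read, ref_window):
    """Brute-force sketch: if the read fails to match the reference window
    directly, try reverse-complementing every internal segment and report
    the first pair of breakpoints that restores an exact match."""
    if read in ref_window:
        return None                      # read maps without an inversion
    for start in range(len(read)):
        for end in range(start + 2, len(read) + 1):
            candidate = read[:start] + revcomp(read[start:end]) + read[end:]
            if candidate in ref_window:
                return (start, end)      # inverted segment is read[start:end]
    return None

ref_window = "AACGTACGTTTACG"
read = "CG" + revcomp("TACGTT") + "TA"     # a read whose middle 6 bp are inverted
print(find_micro_inversion(read, ref_window))  # (2, 8)
```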

10.
In addition to the rapid advancement of Next-Generation Sequencing (NGS) technology, clinical panel sequencing is being used increasingly in clinical studies and tests. However, the tools used in NGS data analysis have not been comparatively evaluated for panel sequencing. This study aimed to evaluate the tools used in the alignment process, the first procedure in bioinformatics analysis, by comparing tools that have been widely used with ones introduced recently. From accumulated panel sequencing data, detected variant lists were cataloged and inserted into simulated reads produced from the reference genome (hg19). The numbers of unmapped and misaligned reads, the mapping quality distribution, and the runtime were measured as standards for comparison. The most widely used tools, Bowtie2 and BWA-MEM, showed excellent performance, with AUCs of 0.9984 and 0.9970, respectively. Kart, with superior runtime and fewer misaligned reads, also achieved a similarly high AUC (0.9723). This approach to selecting and optimizing tools for panel sequencing can be applied in fields requiring error minimization, such as clinical applications and liquid biopsy studies.
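The AUC comparison can be reproduced in miniature: with simulated reads whose true origins are known, an aligner's mapping quality is treated as a classifier score for "read is correctly placed" and scored with the Mann-Whitney form of the ROC AUC. A sketch with hypothetical inputs:

```python
def mapping_auc(records):
    """records: list of (mapq, correctly_mapped) for simulated reads whose
    true origin is known. Treats MAPQ as a classifier score for 'read is
    correctly placed' and returns the ROC AUC (ties count as half wins)."""
    pos = [m for m, ok in records if ok]
    neg = [m for m, ok in records if not ok]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

records = [(60, True), (60, True), (42, True), (13, False), (0, False), (42, False)]
print(round(mapping_auc(records), 4))   # 0.9444
```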

11.
Kong Y (2011) Genomics 98(2): 152-153
Btrim is a fast and lightweight software tool that trims adapters and low-quality regions in reads from ultra-high-throughput next-generation sequencing machines. It can also reliably identify barcodes and assign reads to their original samples. Based on a modified Myers bit-vector dynamic programming algorithm, Btrim can handle indels in adapters and barcodes. It removes low-quality regions and trims off adapters at both or either end of the reads. A typical trimming of 30 M reads with two sets of adapter pairs can be done in about a minute with a small memory footprint. Btrim is a versatile stand-alone tool that can be used as the first step in virtually all next-generation sequence analysis pipelines. The program is available at http://graphics.med.yale.edu/trim/.
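For reference, the search variant of the Myers bit-vector algorithm (in Hyyrö's formulation) fits in a few lines; this sketch illustrates the technique Btrim builds on, not Btrim's own code:

```python
def myers_find(pattern, text, max_dist):
    """Bit-vector approximate search (Myers 1999): returns (end_index,
    distance) for every position in `text` where `pattern` matches with
    edit distance <= max_dist (substitutions and indels allowed)."""
    m = len(pattern)
    mask = (1 << m) - 1
    msb = 1 << (m - 1)
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    vp, vn, dist = mask, 0, m
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        x = eq | vn
        d0 = (((x & vp) + vp) ^ vp) | x
        hn = vp & d0
        hp = (vn | ~(vp | d0)) & mask
        if hp & msb:                  # distance at the last row grows...
            dist += 1
        elif hn & msb:                # ...or shrinks by one per column
            dist -= 1
        shifted = (hp << 1) & mask
        vn = shifted & d0
        vp = ((hn << 1) | ~(shifted | d0)) & mask
        if dist <= max_dist:
            hits.append((j, dist))
    return hits

adapter = "AGATCGGAAGAGC"
read = "ACGTACGTAGATCGGTAGAGCACGT"    # contains the adapter with one substitution
print(myers_find(adapter, read, max_dist=1))   # end positions of approximate hits
```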

12.
13.

Background

Different high-throughput nucleic acid sequencing platforms are currently available, but a trade-off exists between the cost and number of reads that can be generated and the read length that can be achieved.

Methodology/Principal Findings

We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read.

Conclusions/Significance

This strategy is broadly applicable to sequencing applications that benefit from low-cost high-throughput sequencing but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the Illumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing.
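A toy version of the overlap-joining step (not the SHERA software itself): slide the reverse complement of the second mate along the first, accept the longest low-mismatch overlap, and resolve disagreements by base quality:

```python
def revcomp(s):
    return s[::-1].translate(str.maketrans("ACGT", "TGCA"))

def merge_pair(r1, q1, r2, q2, min_overlap=10, max_mismatch_frac=0.1):
    """Toy mate-pair joining in the spirit of SHERA (not its actual code):
    find the longest overlap between read 1 and the reverse complement of
    read 2, and take the higher-quality base wherever the reads disagree."""
    r2rc, q2rc = revcomp(r2), q2[::-1]
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        a, b = r1[-olen:], r2rc[:olen]
        mism = sum(x != y for x, y in zip(a, b))
        if mism <= max_mismatch_frac * olen:
            overlap = [x if (x == y or qa >= qb) else y
                       for x, y, qa, qb in zip(a, b, q1[-olen:], q2rc[:olen])]
            return r1[:-olen] + "".join(overlap) + r2rc[olen:]
    return None   # no confident overlap found

r1 = "ACGTACGTACGTAAGG"
r2 = revcomp("ACGTAAGGTTTTCCCC")   # mate read, sequenced from the other strand
print(merge_pair(r1, [30]*16, r2, [30]*16))  # ACGTACGTACGTAAGGTTTTCCCC
```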

14.

Background

Advances in Illumina DNA sequencing technology have produced longer paired-end reads that increasingly have sequence overlaps. These reads can be merged into a single read that spans the full length of the original DNA fragment, allowing for error correction and accurate determination of read coverage. Extant merging programs utilize simplistic or unverified models for the selection of bases and quality scores for the overlapping region of merged reads.

Results

We first examined the baseline quality score–error rate relationship using sequence reads derived from PhiX. In contrast to numerous published reports, we found that the quality scores produced by Illumina were not substantially inflated above the theoretical values, once the reference genome was corrected for unreported sequence variants. The PhiX reads were then used to create empirical models of sequencing errors in overlapping regions of paired-end reads, and these models were incorporated into a novel merging program, NGmerge. We demonstrate that NGmerge corrects errors and ambiguous bases better than other merging programs, and that it assigns quality scores for merged bases that accurately reflect the error rates. Our results also show that, contrary to published analyses, the sequencing errors of paired-end reads are not independent.

Conclusions

We provide a free and open-source program, NGmerge, that performs better than existing read merging programs. NGmerge is available on GitHub (https://github.com/harvardinformatics/NGmerge) under the MIT License; it is written in C and supported on Linux.
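For contrast, the kind of simplistic quality-combination rule the authors argue against (and which NGmerge replaces with empirical, PhiX-derived models) can be sketched as follows: agreement sums the two Phred scores, disagreement keeps the higher-quality base at a reduced score:

```python
# A common simplistic model for merged-base quality (the sort of unverified
# rule the abstract criticizes, not NGmerge's empirical model): when the two
# mates agree at an overlap position, their Phred scores are summed (capped);
# when they disagree, the higher-quality base wins with a reduced score.
def combine_quals(base1, q1, base2, q2, q_max=41):
    if base1 == base2:
        return base1, min(q1 + q2, q_max)
    if q1 >= q2:
        return base1, q1 - q2
    return base2, q2 - q1

print(combine_quals("A", 30, "A", 25))   # ('A', 41): capped sum
print(combine_quals("A", 30, "G", 25))   # ('A', 5)
```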

15.
Third-generation sequencing technologies, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), have gained popularity over recent years. These platforms can generate millions of long-read sequences, which is advantageous not only for genome sequencing projects but also for amplicon-based high-throughput sequencing experiments, such as DNA barcoding. However, the relatively high error rates associated with these technologies still pose challenges for generating high-quality consensus sequences. Here, we present NGSpeciesID, a program that generates highly accurate consensus sequences from long-read amplicon sequencing technologies, including ONT and PacBio. The tool clusters the reads to help filter out contaminants or reads with high error rates, and employs polishing strategies specific to the sequencing platform. We show that NGSpeciesID produces consensus sequences with improved usability (minimal preprocessing and software installation) and scalability (rapid processing of hundreds to thousands of samples), while maintaining consensus accuracy similar to that of current pipelines.
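The consensus step can be caricatured as a position-wise majority vote over aligned reads. This is a toy sketch only; the actual pipeline clusters reads first and applies platform-specific polishing:

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Toy consensus caller: given reads already padded/aligned to equal
    length (gaps as '-'), emit the most common symbol per column. Real
    long-read pipelines use partial-order alignment and platform-specific
    polishing instead of this naive vote."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)

reads = ["ACGTTACG", "ACGATACG", "ACGTTACG", "AC-TTACG"]
print(majority_consensus(reads))   # ACGTTACG
```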

16.

Background

Massively parallel sequencing offers enormous potential for expression profiling, in particular for interspecific comparisons. Currently, different platforms for massively parallel sequencing are available; they differ in read length and sequencing costs. The 454 technology offers the longest reads. The other sequencing technologies are more cost-effective, at the expense of shorter reads. Reliable expression profiling by massively parallel sequencing depends crucially on the accuracy with which the reads can be mapped to the corresponding genes.

Methodology/Principal Findings

We performed an in silico analysis to evaluate whether incorrect mapping of sequence reads results in a biased expression pattern. A comparison of six available mapping software tools indicated considerable heterogeneity in mapping speed and accuracy. Independently of the software used to map the reads, we found that for compact genomes both short (35 bp, 50 bp) and long sequence reads (100 bp) result in an almost unbiased expression pattern. In contrast, for species with larger genomes containing more gene families and repetitive DNA, shorter reads (35–50 bp) produced a considerable bias in gene expression. In humans, about 10% of the genes had fewer than 50% of their sequence reads correctly mapped. Sequence polymorphism of up to 9% had almost no effect on the mapping accuracy of 100 bp reads. For 35 bp reads, up to 3% sequence divergence did not strongly affect the mapping accuracy. The effect of indels on mapping efficiency depends strongly on the mapping software.

Conclusions/Significance

In complex genomes, expression profiling by massively parallel sequencing can introduce a considerable bias due to incorrectly mapped sequence reads if the read length is short. Nevertheless, this bias can be accounted for if the genomic sequence is known. Furthermore, sequence polymorphisms and indels also affect the mapping accuracy and may cause biased gene expression measurements. The choice of mapping software is highly critical, and its reliability depends on the presence/absence of indels and on the divergence between reads and the reference genome. Overall, we found SSAHA2 and CLC to produce the most reliable mapping results.

17.
Next-generation sequencing (NGS) technologies have been widely used in the life sciences. However, several kinds of sequencing artifacts, including low-quality reads and contaminating reads, are quite common in raw sequencing data and compromise downstream analysis. Therefore, quality control (QC) is essential for raw NGS data. Although a few NGS data quality control tools are publicly available, they have two limitations: first, their processing speed cannot cope with the rapid growth in data volume; second, with respect to removing contaminating reads, none of them can identify contaminating sources de novo, relying instead on prior information about the contaminating species, which is usually not available in advance. Here we report QC-Chain, a fast, accurate and holistic NGS data quality-control method. The tool synergistically combines user-friendly components for (1) quality assessment and trimming of raw reads using Parallel-QC, a fast read-processing tool, and (2) identification, quantification and filtration of unknown contamination to obtain high-quality clean reads. Because it is optimized for parallel computation, its processing speed is significantly higher than that of other QC methods. Experiments on simulated and real NGS data showed that reads with low sequencing quality can be identified and filtered, and that possible contaminating sources can be identified and quantified de novo, accurately and quickly. Comparisons between raw and processed reads also showed that downstream analyses (genome assembly, gene prediction, gene annotation, etc.) improved significantly in completeness and accuracy when based on processed reads. With regard to processing speed, QC-Chain achieves a 7–8× speed-up over traditional methods through parallel computation. QC-Chain is therefore a fast and useful tool for read-quality processing and de novo contamination filtration of NGS reads, which can significantly facilitate downstream analysis. QC-Chain is publicly available at: http://www.computationalbioenergy.org/qc-chain.html.
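The parallel-computation strategy can be sketched with a simple per-read QC criterion farmed out to worker processes. This is illustrative only: the function names and mean-quality cutoff are hypothetical, and real tools stream FASTQ rather than holding reads in memory:

```python
from multiprocessing import Pool

def mean_qual_ok(read, cutoff=25):
    """read: (sequence, list of Phred scores). Keep reads whose mean quality
    reaches the cutoff, one simple per-read QC criterion."""
    seq, quals = read
    return sum(quals) / len(quals) >= cutoff

def parallel_filter(reads, workers=4, chunk=10_000):
    """Toy parallel QC: score reads across worker processes, then keep the
    passing ones."""
    with Pool(workers) as pool:
        keep = pool.map(mean_qual_ok, reads, chunksize=chunk)
    return [r for r, ok in zip(reads, keep) if ok]

if __name__ == "__main__":
    reads = [("ACGT", [30, 32, 28, 31]), ("ACGT", [8, 9, 11, 10])]
    print(len(parallel_filter(reads, workers=2)))   # 1
```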

18.
High-throughput sequencing technologies produce short sequence reads that can contain phase information if they span two or more heterozygous genotypes. This information is not routinely used by current methods that infer haplotypes from genotype data. We have extended the SHAPEIT2 method to use phase-informative sequencing reads to improve phasing accuracy. Our model incorporates the read information in a probabilistic model through the base quality scores within each read. The method is primarily designed for high-coverage sequence data or data sets that already have genotypes called. One important application is the phasing of single samples sequenced at high coverage, for use in medical sequencing and studies of rare diseases. Our method can also use existing panels of reference haplotypes. We tested the method using a mother-father-child trio sequenced at high coverage by Illumina, together with the low-coverage sequence data from the 1000 Genomes Project (1000GP). We found that the use of phase-informative reads increases the mean distance between switch errors by 22%, from 274.4 kb to 328.6 kb. We also used male chromosome X haplotypes from the 1000GP samples to simulate sequencing reads with varying insert size, read length, and base error rate. When using short 100 bp paired-end reads, we found that using mixtures of insert sizes produced the best results. When using longer reads with high error rates (5–20 kb reads with 4%–15% error per base), phasing performance was substantially improved.
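The core computation can be sketched as follows: for a read spanning two heterozygous sites, compare the likelihood of the observed bases under the cis and trans haplotype configurations, with per-base error probabilities derived from the Phred scores (a simplified stand-in for the full SHAPEIT2 model):

```python
def phase_likelihoods(read_bases, read_quals, hets):
    """Sketch of using a phase-informative read: the read covers two
    heterozygous sites, `hets` = [(ref_allele, alt_allele), ...]. Returns
    likelihoods of the read under the cis (ref-ref / alt-alt) and trans
    (ref-alt / alt-ref) configurations, using per-base error probabilities
    derived from Phred scores."""
    errs = [10 ** (-q / 10) for q in read_quals]

    def base_lik(obs, allele, e):
        return 1 - e if obs == allele else e / 3

    def hap_lik(hap):   # probability of the observed bases given one haplotype
        p = 1.0
        for obs, e, allele in zip(read_bases, errs, hap):
            p *= base_lik(obs, allele, e)
        return p

    (r1, a1), (r2, a2) = hets
    cis = 0.5 * hap_lik((r1, r2)) + 0.5 * hap_lik((a1, a2))
    trans = 0.5 * hap_lik((r1, a2)) + 0.5 * hap_lik((a1, r2))
    return cis, trans

# A read calling 'A' then 'T' at A/G and T/C het sites strongly supports cis.
cis, trans = phase_likelihoods("AT", [35, 30], [("A", "G"), ("T", "C")])
print(cis > trans, round(cis / trans, 1))
```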

19.
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers still remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, graph construction, graph simplification, and postprocessing filtering. We discuss these stages as a framework for data analysis and processing, and survey a variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges facing current assemblers in the next-generation environment, in order to characterize the current state of the art. We recommend a layered-architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
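As an example of the graph-construction stage, a minimal de Bruijn graph builder: each read contributes its k-mers as edges between (k-1)-mer nodes. The simplification stage (tip clipping, bubble removal) would then operate on this structure:

```python
from collections import defaultdict

def build_de_bruijn(reads, k=4):
    """Graph-construction stage: each k-mer becomes an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

for node, successors in sorted(build_de_bruijn(["ACGTACG", "CGTACGT"]).items()):
    print(node, "->", successors)
```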

20.
SUMMARY: The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues, we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information while still allowing many functional genomics studies to be carried out. We have developed a suite of tools (RSEQtools) that use this format for the analysis of RNA-Seq experiments. These tools consist of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. Moreover, the tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by MRF, this format also facilitates the decoupling of the alignment of reads from downstream analyses. Availability and implementation: RSEQtools is implemented in C and the source code is available at http://rseqtools.gersteinlab.org/.
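As an example of the kind of expression value computable from mapped reads, a minimal RPKM calculation (illustrative; RSEQtools' own modules are written in C):

```python
def rpkm(read_counts, gene_lengths_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads, one of
    the standard expression summaries computable from mapped-read formats."""
    scale = total_mapped_reads / 1e6
    return {g: read_counts[g] / (gene_lengths_bp[g] / 1e3) / scale
            for g in read_counts}

counts = {"geneA": 500, "geneB": 500}
lengths = {"geneA": 2_000, "geneB": 500}
print(rpkm(counts, lengths, total_mapped_reads=10_000_000))
# geneB gets 4x geneA's RPKM: same counts from a 4x shorter gene
```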
