1.

CGAP-Align: A High Performance DNA Short Read Alignment Tool

Yaoliang Chen Ji Hong Wanyun Cui Jacques Zaneveld Wei Wang Richard Gibbs Yanghua Xiao Rui Chen 《PloS one》2013,8(4)

Background

Next generation sequencing platforms have greatly reduced sequencing costs, leading to the production of unprecedented amounts of sequence data. BWA is one of the most popular alignment tools due to its relatively high accuracy. However, mapping reads using BWA is still the most time consuming step in sequence analysis. Increasing mapping efficiency would allow the community to better cope with ever expanding volumes of sequence data.

Results

We designed a new program, CGAP-align, that achieves a performance improvement over BWA without sacrificing recall or precision. This is accomplished through the use of Suffix Tarray, a novel data structure combining elements of Suffix Array and Suffix Tree. We also utilize a tighter lower bound estimation for the number of mismatches in a read, allowing for more effective pruning during inexact mapping. Evaluation of both simulated and real data suggests that CGAP-align consistently outperforms the current version of BWA and can achieve over twice its speed under certain conditions, all while obtaining nearly identical results.
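
As background for the index structures mentioned above, here is a minimal sketch of exact read lookup with a plain suffix array. It is not the Suffix Tarray described in the abstract and not BWA's index. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the quadratic-memory construction is only suitable for toy inputs.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction: fine for short strings, far too slow for real genomes.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions at which pattern occurs in text."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


reference = "ACGTACGTTACG"
suffix_array = build_suffix_array(reference)
hits = find_occurrences(reference, suffix_array, "ACG")
```

All occurrences of a read form one contiguous block of the suffix array, which is why two binary searches are enough. Inexact matching, as in BWA or CGAP-align, additionally needs a bounded search over mismatches, which this sketch does not attempt.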

Conclusion

CGAP-align is a new time efficient read alignment tool that extends and improves BWA. The increase in alignment speed will be of critical assistance to all sequence-based research and medicine. CGAP-align is freely available to the academic community at http://sourceforge.net/p/cgap-align under the GNU General Public License (GPL).

2.

BFAST: An Alignment Tool for Large Scale Genome Resequencing

Nils Homer Barry Merriman Stanley F. Nelson 《PloS one》2009,4(11)

Background

The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation.

Methodology

We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired-end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole-genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
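
The idea of using an index to map reads to candidate locations can be illustrated with a simple k-mer hash index. This is only a sketch under assumed names (`build_index`, `candidate_locations`). BFAST's own index layout is not described in the abstract and is not reproduced here.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_locations(read, index, k):
    """Vote for the reference offset implied by each k-mer hit of the read."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            votes[pos - j] += 1  # offset at which the read would start
    return votes.most_common()


reference = "GATTACAGGCTTACCGATAGC"
index = build_index(reference, 4)
# Copy of reference[6:16] with a single substitution in the middle.
read = "AGGCATACCG"
candidates = candidate_locations(read, index, 4)
```

Even with the substitution, the k-mers on either side of it still vote for the same offset, so the true location ranks first. A local aligner such as Smith-Waterman would then be run only around the top-ranked candidates.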

Conclusions

We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.

3.

SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications

Mengyao Zhao Wan-Ping Lee Erik P. Garrison Gabor T. Marth 《PloS one》2013,8(12)

Background

The Smith-Waterman algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools for next-generation sequencing data. Although various fast Smith-Waterman implementations have been developed, they are either designed as monolithic protein database search tools, which do not return detailed alignments, or embedded in other tools. These issues make reusing these efficient Smith-Waterman implementations impractical.

Results

To facilitate easy integration of the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm into third-party software, we wrote a C/C++ library, which extends Farrar’s Striped Smith-Waterman (SSW) to return alignment information in addition to the optimal Smith-Waterman score. In this library we developed a new method to generate the full optimal alignment results and a suboptimal score in linear space at little cost in efficiency. This improvement makes the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm genuinely useful in genomic applications. SSW is available both as a C/C++ software library and as a stand-alone alignment tool at: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.
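
To make concrete the difference between returning only a score and returning the alignment itself, the following is a textbook Smith-Waterman sketch with traceback. It is a plain quadratic-space implementation with a linear gap penalty, not the striped SIMD method and not the linear-space technique the abstract describes. The scoring values and the function name `smith_waterman` are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the score matrix; local alignment never lets a cell drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, aligned_a, aligned_b = smith_waterman("TTACGTAA", "GGACGTCC")
```

The traceback is the part that needs the whole matrix here. Producing the same alignment in linear space, as the library is said to do, requires a different strategy.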

Conclusions

The SSW library has been used in the primary read mapping tool MOSAIK, the split-read mapping program SCISSORS, the MEI detector TANGRAM, and the read-overlap graph generation program RZMBLR. The speeds of the mentioned software are improved significantly by replacing their ordinary Smith-Waterman or banded Smith-Waterman module with the SSW Library.

4.

Mapping Accuracy of Short Reads from Massively Parallel Sequencing and the Implications for Quantitative Expression Profiling

Nicola Palmieri Christian Schlötterer 《PloS one》2009,4(7)

Background

Massively parallel sequencing offers an enormous potential for expression profiling, in particular for interspecific comparisons. Currently, different platforms for massively parallel sequencing are available, which differ in read length and sequencing costs. The 454 technology offers the highest read length. The other sequencing technologies are more cost effective, at the expense of shorter reads. Reliable expression profiling by massively parallel sequencing depends crucially on the accuracy with which the reads can be mapped to the corresponding genes.

Methodology/Principal Findings

We performed an in silico analysis to evaluate whether incorrect mapping of the sequence reads results in a biased expression pattern. A comparison of six available mapping software tools indicated a considerable heterogeneity in mapping speed and accuracy. Independently of the software used to map the reads, we found that for compact genomes both short (35 bp, 50 bp) and long sequence reads (100 bp) result in an almost unbiased expression pattern. In contrast, for species with a larger genome containing more gene families and repetitive DNA, shorter reads (35–50 bp) produced a considerable bias in gene expression. In humans, about 10% of the genes had fewer than 50% of the sequence reads correctly mapped. Sequence polymorphism up to 9% had almost no effect on the mapping accuracy of 100 bp reads. For 35 bp reads up to 3% sequence divergence did not affect the mapping accuracy strongly. The effect of indels on the mapping efficiency strongly depends on the mapping software.

Conclusions/Significance

In complex genomes, expression profiling by massively parallel sequencing could introduce a considerable bias due to incorrectly mapped sequence reads if the read length is short. Nevertheless, this bias could be accounted for if the genomic sequence is known. Furthermore, sequence polymorphisms and indels also affect the mapping accuracy and may cause a biased gene expression measurement. The choice of the mapping software is highly critical and the reliability depends on the presence/absence of indels and the divergence between reads and the reference genome. Overall, we found SSAHA2 and CLC to produce the most reliable mapping results.

5.

Sealer: a scalable gap-closing application for finishing draft genomes

Daniel Paulino René L. Warren Benjamin P. Vandervalk Anthony Raymond Shaun D. Jackman Inanç Birol 《BMC bioinformatics》2015,16(1)

Background

While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment “gaps” – uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes.

Results

Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8 % and 13.8 % of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively – a feat that is not possible with other leading tools with the breadth of data used in our study.
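
The combination of a Bloom filter and a de Bruijn graph can be sketched as follows: k-mers from the reads are stored in a Bloom filter, and a gap is closed by searching for a path of overlapping k-mers between the two flanks. This is a toy illustration under assumed names (`BloomFilter`, `close_gap`) and arbitrary parameters. Sealer's actual search strategy, k-mer sizes, and hash functions are not given in the abstract.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Fixed-size bit array with several hash functions; may report false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def close_gap(bloom, left_kmer, right_kmer, max_len):
    """Breadth-first search for a k-mer path from left_kmer to right_kmer."""
    queue = deque([(left_kmer, left_kmer)])
    while queue:
        kmer, seq = queue.popleft()
        if kmer == right_kmer:
            return seq
        if len(seq) >= max_len:
            continue  # give up on paths longer than the allowed gap size
        for base in "ACGT":
            nxt = kmer[1:] + base  # implicit de Bruijn edge
            if nxt in bloom:
                queue.append((nxt, seq + base))
    return None


k = 4
source = "ACGTTGCAATGG"  # stands in for the sequence covered by reads
bloom = BloomFilter()
for i in range(len(source) - k + 1):
    bloom.add(source[i:i + k])
closed = close_gap(bloom, "ACGT", "ATGG", max_len=50)
```

The graph is never stored explicitly. Each k-mer's neighbours are found by testing its four possible extensions against the filter, which is what keeps memory use low for very large genomes.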

Conclusion

Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0663-4) contains supplementary material, which is available to authorized users.

6.

CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers

Rachid Ounit Steve Wanamaker Timothy J Close Stefano Lonardi 《BMC genomics》2015,16(1)

7.

miRPlant: an integrated tool for identification of plant miRNA from RNA sequencing data

Jiyuan An John Lai Atul Sajjanhar Melanie L Lehman Colleen C Nelson 《BMC bioinformatics》2014,15(1)

Background

Small RNA sequencing is commonly used to identify novel miRNAs and to determine their expression levels in plants. There are several miRNA identification tools for animals such as miRDeep, miRDeep2 and miRDeep*. miRDeep-P was developed to identify plant miRNA using miRDeep’s probabilistic model of miRNA biogenesis, but it depends on several third party tools and lacks a user-friendly interface. The objective of our miRPlant program is to predict novel plant miRNA, while providing a user-friendly interface with improved accuracy of prediction.

Result

We have developed a user-friendly plant miRNA prediction tool called miRPlant. We show using 16 plant miRNA datasets from four different plant species that miRPlant has at least a 10% improvement in accuracy compared to miRDeep-P, which is the most popular plant miRNA prediction tool. Furthermore, miRPlant uses a Graphical User Interface for data input and output, and identified miRNA are shown with all RNAseq reads in a hairpin diagram.

Conclusions

We have developed miRPlant, which extends miRDeep* to various plant species by adopting suitable strategies to identify hairpin excision regions and hairpin structure filtering for plants. miRPlant does not require any third party tools such as mapping or RNA secondary structure prediction tools. miRPlant is also the first plant miRNA prediction tool that dynamically plots miRNA hairpin structure with small reads for identified novel miRNAs. This feature will enable biologists to visualize novel pre-miRNA structure and the location of small RNA reads relative to the hairpin. Moreover, miRPlant can be easily used by biologists with limited bioinformatics skills. miRPlant and its manual are freely available at http://www.australianprostatecentre.org/research/software/mirplant or http://sourceforge.net/projects/mirplant/.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-275) contains supplementary material, which is available to authorized users.

8.

IM-TORNADO: A Tool for Comparison of 16S Reads from Paired-End Libraries

Patricio Jeraldo Krishna Kalari Xianfeng Chen Jaysheel Bhavsar Ashutosh Mangalam Bryan White Heidi Nelson Jean-Pierre Kocher Nicholas Chia 《PloS one》2014,9(12)

Motivation

16S rDNA hypervariable tag sequencing has become the de facto method for accessing microbial diversity. Illumina paired-end sequencing, which produces two separate reads for each DNA fragment, has become the platform of choice for this application. However, when the two reads do not overlap, existing computational pipelines analyze the data from each read separately and underutilize the information contained in the paired-end reads.

Results

We created a workflow known as Illinois Mayo Taxon Organization from RNA Dataset Operations (IM-TORNADO) for processing non-overlapping reads while retaining maximal information content. Using synthetic mock datasets, we show that the use of both reads produced answers with greater correlation to those from full length 16S rDNA when looking at taxonomy, phylogeny, and beta-diversity.

Availability and Implementation

IM-TORNADO is freely available at http://sourceforge.net/projects/imtornado and produces BIOM format output for cross compatibility with other pipelines such as QIIME, mothur, and phyloseq.

9.

SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming

Shreepriya Das Haris Vikalo 《BMC genomics》2015,16(1)

Background

The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. Limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. Dimensions (and, therefore, difficulty) of the haplotype assembly problems keep increasing as the sequencing technology advances and the length of reads and inserts grow. The computational challenges are even more pronounced in the case of polyploid haplotypes, whose assembly is considerably more difficult than in the case of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed.

Results

We develop a novel framework for diploid/polyploid haplotype assembly from high-throughput sequencing data. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure – namely, the low rank of the underlying solution – to solve it rapidly and with high accuracy. The developed framework is applicable to both diploid and polyploid species. The code for SDhaP is freely available at https://sourceforge.net/projects/sdhap.

Conclusion

Extensive benchmarking tests on both real and simulated data show that the proposed algorithms outperform several well-known haplotype assembly methods in terms of either accuracy or speed or both. Useful recommendations for coverages needed to achieve near-optimal solutions are also provided.

10.

Read Length and Repeat Resolution: Exploring Prokaryote Genomes Using Next-Generation Sequencing Technologies

Matt J. Cahill Claudio U. Köser Nicholas E. Ross John A. C. Archer 《PloS one》2010,5(7)

Background

There are a growing number of next-generation sequencing technologies. At present, the most cost-effective options also produce the shortest reads. However, even for prokaryotes, there is uncertainty concerning the utility of these technologies for the de novo assembly of complete genomes. This reflects an expectation that short reads will be unable to resolve small, but presumably abundant, repeats.

Methodology/Principal Findings

Using a simple model of repeat assembly, we develop and test a technique that, for any read length, can estimate the occurrence of unresolvable repeats in a genome, and thus predict the number of gaps that would need to be closed to produce a complete sequence. We apply this technique to 818 prokaryote genome sequences. This provides a quantitative assessment of the relative performance of various lengths. Notably, unpaired reads of only 150nt can reconstruct approximately 50% of the analysed genomes with fewer than 96 repeat-induced gaps. Nonetheless, there is considerable variation amongst prokaryotes. Some genomes can be assembled to near contiguity using very short reads while others require much longer reads.
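
The link between read length and unresolvable repeats can be illustrated with a very rough proxy: any substring at least as long as a read that occurs more than once cannot be spanned by a single unpaired read. The sketch below counts such substrings. It is an assumption-laden simplification, not the model used in the paper, and the function name `repeated_kmers` is hypothetical.

```python
from collections import Counter


def repeated_kmers(genome, read_length):
    """Return substrings of length read_length that occur more than once."""
    counts = Counter(
        genome[i:i + read_length] for i in range(len(genome) - read_length + 1)
    )
    return {kmer: n for kmer, n in counts.items() if n > 1}


genome = "AAACGTACGTTT"
short_read_repeats = repeated_kmers(genome, 4)
long_read_repeats = repeated_kmers(genome, 5)
```

In this toy sequence the 4-base repeat disappears once the read length reaches 5, which mirrors the abstract's point that the number of repeat-induced gaps depends on read length and on the genome in question.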

Conclusions

Given the diversity of prokaryote genomes, a sequencing strategy should be tailored to the organism under study. Our results will provide researchers with a practical resource to guide the selection of the appropriate read length.

11.

An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data

Huan Fan Anthony R. Ives Yann Surget-Groba Charles H. Cannon 《BMC genomics》2015,16(1)

Background

Next-generation sequencing technologies are rapidly generating whole-genome datasets for an increasing number of organisms. However, phylogenetic reconstruction of genomic data remains difficult because de novo assembly for non-model genomes and multi-genome alignment are challenging.

Results

To greatly simplify the analysis, we present an Assembly and Alignment-Free (AAF) method (https://sourceforge.net/projects/aaf-phylogeny) that constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment. Using mathematical calculations, models of sequence evolution, and simulated sequencing of published genomes, we address both evolutionary and sampling issues caused by direct reconstruction, including homoplasy, sequencing errors, and incomplete sequencing coverage. From these results, we calculate the statistical properties of the pairwise distances between genomes, allowing us to optimize parameter selection and perform bootstrapping. As a test case with real data, we successfully reconstructed the phylogeny of 12 mammals using raw sequencing reads. We also applied AAF to 21 tropical tree genome datasets with low coverage to demonstrate its effectiveness on non-model organisms.
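
A pairwise distance computed directly from k-mers, without assembly or alignment, might look like the sketch below. The formula used here, the negative logarithm of the shared k-mer fraction divided by k, is an assumed form for illustration. The abstract does not state the distance AAF uses, and the names `kmer_set` and `kmer_distance` are hypothetical.

```python
import math


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(seq_a, seq_b, k):
    """Assumed distance: -ln(shared / smaller set size) / k; infinite if nothing is shared."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    shared = len(a & b)
    if shared == 0:
        return math.inf
    return -math.log(shared / min(len(a), len(b))) / k


same = kmer_distance("ACGTACGGTCA", "ACGTACGGTCA", 3)
close = kmer_distance("ACGTACGGTCA", "ACGTACGCTCA", 3)
unrelated = kmer_distance("AAAAAAA", "CCCCCCC", 3)
```

With real data the k-mer sets would be built from raw reads rather than from finished sequences, so sequencing errors and incomplete coverage affect the shared fraction. Those are the sampling issues the abstract says the method addresses.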

Conclusion

Our AAF method opens up phylogenomics for species without an appropriate reference genome or high sequence coverage, and rapidly creates a phylogenetic framework for further analysis of genome structure and diversity among non-model organisms.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1647-5) contains supplementary material, which is available to authorized users.

12.

NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly

Jamison M McCorrison Pratap Venepally Indresh Singh Derrick E Fouts Roger S Lasken Barbara A Methé 《BMC bioinformatics》2014,15(1)

13.

SHEAR: sample heterogeneity estimation and assembly by reference

Sean R Landman Tae Hyun Hwang Kevin AT Silverstein Yingming Li Scott M Dehm Michael Steinbach Vipin Kumar 《BMC genomics》2014,15(1)

Background

Personal genome assembly is a critical process when studying tumor genomes and other highly divergent sequences. The accuracy of downstream analyses, such as RNA-seq and ChIP-seq, can be greatly enhanced by using personal genomic sequences rather than standard references. Unfortunately, reads sequenced from these types of samples often have a heterogeneous mix of various subpopulations with different variants, making assembly extremely difficult using existing assembly tools. To address these challenges, we developed SHEAR (Sample Heterogeneity Estimation and Assembly by Reference; http://vk.cs.umn.edu/SHEAR), a tool that predicts SVs, accounts for heterogeneous variants by estimating their representative percentages, and generates personal genomic sequences to be used for downstream analysis.

Results

By making use of structural variant detection algorithms, SHEAR offers improved performance in the form of a stronger ability to handle difficult structural variant types and better computational efficiency. We compare against the lead competing approach using a variety of simulated scenarios as well as real tumor cell line data with known heterogeneous variants. SHEAR is shown to successfully estimate heterogeneity percentages in both cases, and demonstrates an improved efficiency and better ability to handle tandem duplications.

Conclusion

SHEAR allows for accurate and efficient SV detection and personal genomic sequence generation. It is also able to account for heterogeneous sequencing samples, such as from tumor tissue, by estimating the subpopulation percentage for each heterogeneous variant.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-84) contains supplementary material, which is available to authorized users.

14.

Census-based rapid and accurate metagenome taxonomic profiling

Amirhossein Shamsaddini Yang Pan W Evan Johnson Konstantinos Krampis Mariya Shcheglovitova Vahan Simonyan Amy Zanne Raja Mazumder 《BMC genomics》2014,15(1)

Background

Understanding the taxonomic composition of a sample, whether from patient, food or environment, is important to several types of studies including pathogen diagnostics, epidemiological studies, biodiversity analysis and food quality regulation. With the decreasing costs of sequencing, metagenomic data is quickly becoming the preferred type of data for such analysis.

Results

Rapidly defining the taxonomic composition (both taxonomic profile and relative frequency) in a metagenomic sequence dataset is challenging because mapping millions of sequence reads from a metagenomic study to a non-redundant nucleotide database such as the NCBI non-redundant nucleotide database (nt) is computationally intensive. We have developed a robust subsampling-based algorithm implemented in a tool called CensuScope meant to take a ‘sneak peek’ into the population distribution and estimate taxonomic composition as if a census was taken of the metagenomic landscape. CensuScope is a rapid and accurate metagenome taxonomic profiling tool that randomly extracts a small number of reads (based on user input) and maps them to NCBI’s nt database. This process is repeated multiple times to ascertain the taxonomic composition that is found in the majority of the iterations, thereby providing a robust estimate of the population and measures of the accuracy for the results.
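
The subsample-and-repeat procedure described above can be sketched in a few lines. Here the `classify` callback stands in for mapping a read to the nt database, which is the expensive step in practice. The function name `census`, its parameters, and the toy data are assumptions for illustration and do not reflect CensuScope's interface.

```python
import random
from collections import Counter


def census(reads, classify, sample_size, iterations, seed=0):
    """Estimate taxon frequencies from repeated random subsamples of the reads."""
    rng = random.Random(seed)
    seen_in = Counter()   # number of iterations in which each taxon appeared
    freq_sum = Counter()  # sum of per-iteration relative frequencies
    for _ in range(iterations):
        sample = rng.sample(reads, sample_size)
        counts = Counter(classify(read) for read in sample)
        for taxon, n in counts.items():
            seen_in[taxon] += 1
            freq_sum[taxon] += n / sample_size
    # Report only taxa found in the majority of iterations.
    return {
        taxon: freq_sum[taxon] / iterations
        for taxon in seen_in
        if seen_in[taxon] > iterations / 2
    }


# Toy data: each "read" is already labelled with its taxon, so classify is the identity.
reads = ["E. coli"] * 900 + ["B. subtilis"] * 95 + ["rare taxon"] * 5
profile = census(reads, classify=lambda read: read, sample_size=100, iterations=20)
```

Only `sample_size * iterations` reads are ever classified, regardless of how large the dataset is, which is where the speed-up comes from.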

Conclusion

CensuScope can be run on a laptop or on a high-performance computer. Based on our analysis we are able to provide some recommendations in terms of the number of sequence reads to analyze and the number of iterations to use. For example, to quantify taxonomic groups present in the sample at a level of 1% or higher a subsampling size of 250 random reads with 50 iterations yields a statistical power of >99%. Windows and UNIX versions of CensuScope are available for download at https://hive.biochemistry.gwu.edu/dna.cgi?cmd=censuscope. CensuScope is also available through the High-performance Integrated Virtual Environment (HIVE) and can be used in conjunction with other HIVE analysis and visualization tools.
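
A back-of-the-envelope calculation helps explain why a subsample of 250 reads repeated 50 times can be enough for taxa at 1% abundance. The sketch below assumes independent draws from a very large read pool and independent iterations. It is not necessarily the power calculation used in the paper, and the function names are hypothetical.

```python
from math import comb


def detection_probability(abundance, sample_size):
    """Chance that at least one of sample_size random reads comes from the taxon."""
    return 1 - (1 - abundance) ** sample_size


def majority_probability(p, iterations):
    """Chance that an event with per-iteration probability p occurs in more than half the iterations."""
    threshold = iterations // 2 + 1
    return sum(
        comb(iterations, j) * p ** j * (1 - p) ** (iterations - j)
        for j in range(threshold, iterations + 1)
    )


p_single = detection_probability(0.01, 250)      # one subsample of 250 reads
p_majority = majority_probability(p_single, 50)  # seen in most of 50 iterations
```

Under these assumptions a single subsample detects a 1% taxon with probability of roughly 0.92, and requiring detection in the majority of 50 iterations pushes the overall probability above 0.99.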

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-918) contains supplementary material, which is available to authorized users.

15.

Bis-class: a new classification tool of methylation status using bayes classifier and local methylation information

Iksoo Huh Xingyu Yang Taesung Park Soojin V Yi 《BMC genomics》2014,15(1)

Background

Whole genome sequencing of bisulfite converted DNA (‘methylC-seq’) method provides comprehensive information of DNA methylation. An important application of these whole genome methylation maps is classifying each position as a methylated versus non-methylated nucleotide. A widely used current method for this purpose, the so-called binomial method, is intuitive and straightforward, but lacks power when the sequence coverage and the genome-wide methylation level are low. These problems present a particular challenge when analyzing sparsely methylated genomes, such as those of many invertebrates and plants.

Results

We demonstrate that the number of sequence reads per position from methylC-seq data displays a large variance and can be modeled as a shifted negative binomial distribution. We also show that DNA methylation levels of adjacent CpG sites are correlated, and this similarity in local DNA methylation levels extends several kilobases. Taking these observations into account, we propose a new method based on Bayesian classification to infer DNA methylation status while considering the neighborhood DNA methylation levels of a specific site. We show that our approach has higher sensitivity and better classification performance than the binomial method via multiple analyses, including computational simulations, Area Under Curve (AUC) analyses, and improved consistencies across biological replicates. This method is especially advantageous in the analyses of sparsely methylated genomes with low coverage.
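
For comparison, the baseline binomial method can be sketched as a one-sided test: a site is called methylated when the number of methylated reads is unlikely to arise from conversion and sequencing errors alone. The error rate, significance threshold, and function names below are assumptions, and multiple-testing correction is omitted. This shows the baseline only, not the Bayesian classifier proposed in the paper.

```python
from math import comb


def binomial_pvalue(methylated, coverage, error_rate):
    """P(X >= methylated) for X ~ Binomial(coverage, error_rate)."""
    return sum(
        comb(coverage, k) * error_rate ** k * (1 - error_rate) ** (coverage - k)
        for k in range(methylated, coverage + 1)
    )


def is_methylated(methylated, coverage, error_rate=0.01, alpha=0.05):
    """Call the site methylated if errors alone are an unlikely explanation."""
    return binomial_pvalue(methylated, coverage, error_rate) < alpha


high_level_call = is_methylated(5, 10)  # half of the reads are methylated
low_level_call = is_methylated(1, 10)   # a single methylated read out of ten
```

The second case shows the loss of power the abstract refers to: with one methylated read in ten, the test cannot distinguish a sparsely methylated site from noise.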

Conclusions

Our method improves the existing binomial method for binary methylation calls by utilizing a posterior odds framework and incorporating local methylation information. This method should be widely applicable to the analyses of methylC-seq data from diverse sparsely methylated genomes. Bis-Class and example data are provided at a dedicated website (http://bibs.snu.ac.kr/software/Bisclass).

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-608) contains supplementary material, which is available to authorized users.

16.

Identification of indels in next-generation sequencing data

Aakrosh Ratan Thomas L Olson Thomas P Loughran Jr Webb Miller 《BMC bioinformatics》2015,16(1)

Background

The discovery and mapping of genomic variants is an essential step in most analyses that use sequencing reads. There are a number of mature software packages and associated pipelines that can identify single nucleotide polymorphisms (SNPs) with a high degree of concordance. However, the same cannot be said for tools that are used to identify the other types of variants. Indels represent the second most frequent class of variants in the human genome, after single nucleotide polymorphisms. The reliable detection of indels is still a challenging problem, especially for variants that are longer than a few bases.

Results

We have developed a set of algorithms and heuristics collectively called indelMINER to identify indels from whole genome resequencing datasets using paired-end reads. indelMINER uses a split-read approach to identify the precise breakpoints for indels of size less than a user specified threshold, and supplements that with a paired-end approach to identify larger variants that are frequently missed with the split-read approach. We use simulated and real datasets to show that an implementation of the algorithm performs favorably when compared to several existing tools.

Conclusions

indelMINER can be used effectively to identify indels in whole-genome resequencing projects. The output is provided in the VCF format along with additional information about the variant, including information about its presence or absence in another sample. The source code and documentation for indelMINER can be freely downloaded from www.bx.psu.edu/miller_lab/indelMINER.tar.gz.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0483-6) contains supplementary material, which is available to authorized users.

17.

Effective Extraction and Assembly Methods for Simultaneously Obtaining Plastid and Mitochondrial Genomes

Wanjun Hao Shihang Fan Wei Hua Hanzhong Wang 《PloS one》2014,9(9)

Background

In conventional approaches to plastid and mitochondrial genome sequencing, the sequencing steps are performed separately; thus, plastid DNA (ptDNA) and mitochondrial DNA (mtDNA) should be prepared independently. However, it is difficult to extract pure ptDNA and mtDNA from plant tissue. Following the development of high-throughput sequencing technology, many researchers have attempted to obtain plastid genomes or mitochondrial genomes using high-throughput sequencing data from total DNA. Unfortunately, the huge datasets generated consume massive computing and storage resources and cost a great deal, and even more importantly, excessive contaminating reads affect the accuracy of the assembly. Therefore, it is necessary to develop an effective method that can generate base sequences from plant tissue and that is suitable for all plant species. Here, we describe a highly effective, low-cost method for obtaining plastid and mitochondrial genomes simultaneously.

Results

First, we obtained high-quality DNA employing Partial Concentration Extraction. Second, we evaluated the purity of the DNA sample and determined the sequencing dataset size employing Vector Control Quantitative Analysis. Third, paired-end reads were obtained using a high-throughput sequencing platform. Fourth, we obtained scaffolds employing Two-step Assembly. Finally, we filled in gaps using specific methods and obtained complete plastid and mitochondrial genomes. To ensure the accuracy of plastid and mitochondrial genomes, we validated the assembly using PCR and Sanger sequencing. Using this method, we obtained complete plastid and mitochondrial genomes with lengths of 153,533 nt and 223,412 nt, respectively.

Conclusion

A simple method for extracting, evaluating, sequencing and assembling plastid and mitochondrial genomes was developed. This method has many advantages: it is timesaving, inexpensive and reproducible and produces high-quality sequence. Furthermore, this method can produce plastid and mitochondrial genomes simultaneously and be used for other plant species. Due to its simplicity and extensive applicability, this method will support research on plant cytoplasmic genomes.

18.

Light-weight reference-based compression of FASTQ data

Yongpeng Zhang Linsen Li Yanli Yang Xiao Yang Shan He Zexuan Zhu 《BMC bioinformatics》2015,16(1)

Background

The exponential growth of next generation sequencing (NGS) data has posed big challenges to data storage, management and archive. Data compression is one of the effective solutions, where reference-based compression strategies can typically achieve superior compression ratios compared to the ones not relying on any reference.

Results

This paper presents a lossless, light-weight, reference-based compression algorithm, LW-FQZip, for FASTQ data. The three components of any given input, i.e., metadata, short reads and quality score strings, are first parsed into three data streams in which redundant information is identified and eliminated independently. In particular, well-designed incremental and run-length-limited encoding schemes are used to compress the metadata and quality score streams, respectively. To handle the short reads, LW-FQZip uses a novel light-weight mapping model to quickly map them against external reference sequence(s) and produce concise alignment results for storage. The three processed data streams are then packed together with a general-purpose compression algorithm such as LZMA. LW-FQZip was evaluated on eight real-world NGS data sets and achieved compression ratios in the range of 0.111-0.201. This is comparable or superior to other state-of-the-art lossless NGS data compression algorithms.
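
The two stream-specific encodings named above can be illustrated generically: incremental encoding stores each metadata line as the length of the prefix it shares with the previous line plus the differing suffix, and run-length-limited encoding stores quality strings as (symbol, count) runs with a capped count. The sketch below shows these general techniques under assumed names and an assumed run limit. It does not reproduce LW-FQZip's actual on-disk format.

```python
def delta_encode(previous, current):
    """Incremental encoding: (shared prefix length, remaining suffix)."""
    n = 0
    while n < min(len(previous), len(current)) and previous[n] == current[n]:
        n += 1
    return n, current[n:]


def delta_decode(previous, encoded):
    n, suffix = encoded
    return previous[:n] + suffix


def rle_encode(quality, max_run=255):
    """Run-length encoding with runs capped at max_run symbols."""
    runs = []
    for ch in quality:
        if runs and runs[-1][0] == ch and runs[-1][1] < max_run:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs):
    return "".join(ch * n for ch, n in runs)


header_1 = "@SRR001.1 length=36"
header_2 = "@SRR001.2 length=36"
encoded_header = delta_encode(header_1, header_2)
encoded_quality = rle_encode("IIIIHH##")
```

Capping the run length lets each count fit in a fixed number of bits, at the cost of splitting very long runs into several entries.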

Conclusions

LW-FQZip is a program that enables efficient lossless FASTQ data compression. It contributes to the state-of-the-art applications for NGS data storage and transmission. LW-FQZip is freely available online at: http://csse.szu.edu.cn/staff/zhuzx/LWFQZip.

19.

Is the whole greater than the sum of its parts? De novo assembly strategies for bacterial genomes based on paired-end sequencing

Ting-Wen Chen Ruei-Chi Gan Yi-Feng Chang Wei-Chao Liao Timothy H. Wu Chi-Ching Lee Po-Jung Huang Cheng-Yang Lee Yi-Ywan M. Chen Cheng-Hsun Chiu Petrus Tang 《BMC genomics》2015,16(1)

Background

Whole genome sequence construction is becoming increasingly feasible because of advances in next generation sequencing (NGS), including increasing throughput and read length. By simply overlapping paired-end reads, we can obtain longer reads with higher accuracy, which can facilitate the assembly process. However, the influences of different library sizes and assembly methods on paired-end sequencing-based de novo assembly remain poorly understood.
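As an illustration of what overlapping a read pair involves, here is a minimal Python sketch. It assumes exact matching in the overlap, which is a simplification: practical read mergers tolerate mismatches and weigh base qualities, and the abstract does not say which tool the authors used. The function names and the `min_overlap` parameter are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair into one longer read, or return None if no overlap is found.

    The reverse read is reverse-complemented so that both reads are on the same
    strand; the longest exact suffix/prefix overlap of at least min_overlap bases
    is then used to join them.
    """
    rc = reverse_complement(reverse)
    for k in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-k:] == rc[:k]:
            return forward + rc[k:]
    return None
```

A spurious overlap, for example inside a repeat, joins the pair at the wrong offset and shortens or lengthens the merged read, which is one way merging can introduce insertions or deletions.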

Results

We used 250 bp Illumina MiSeq paired-end reads of different library sizes generated from genomic DNA from Escherichia coli DH1 and Streptococcus parasanguinis FW213 to compare the assembly results of different library sizes and assembly approaches. Our data indicate that overlapping paired-end reads can increase read accuracy but sometimes causes insertions or deletions. Regarding genome assembly, merged reads only outcompete original paired-end reads when coverage depth is low, and larger libraries tend to yield better assembly results. These results imply that distance information is the most critical factor during assembly. Our results also indicate that when depth is sufficiently high, assembly from subsets can sometimes produce better results.

Conclusions

In summary, this study provides systematic evaluations of de novo assembly from paired-end sequencing data. Among the assembly strategies, we find that overlapping paired-end reads is not always beneficial for bacterial genome assembly and should be avoided or used with caution, especially for genomes containing a high fraction of repetitive sequences. Because increasing numbers of projects aim at bacterial genome sequencing, our study provides valuable suggestions for the field of genomic sequence construction.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1859-8) contains supplementary material, which is available to authorized users.

20.

The Feasibility Study of Non-Invasive Fetal Trisomy 18 and 21 Detection with Semiconductor Sequencing Platform

Young Joo Jeon Yulin Zhou Yihan Li Qiwei Guo Jinchun Chen Shengmao Quan Ahong Zhang Hailing Zheng Xingqiang Zhu Jin Lin Huan Xu Ayang Wu Sin-Gi Park Byung Chul Kim Hee Jae Joo Hongliang Chen Jong Bhak 《PloS one》2014,9(10)

Objective

Recent non-invasive prenatal testing (NIPT) technologies are based on next-generation sequencing (NGS). NGS allows rapid and effective clinical diagnoses to be determined with two common sequencing systems: the Illumina and Ion Torrent platforms. The majority of NIPT technology is associated with the Illumina platform. We investigated whether fetal trisomy 18 and 21 could be detected sensitively and specifically with a semiconductor sequencer, the Ion Proton.

Methods

From March 2012 to October 2013, we enrolled 155 pregnant women whose fetuses were diagnosed as being at high risk of fetal defects at Xiamen Maternal & Child Health Care Hospital (Xiamen, Fujian, China). Adapter-ligated DNA libraries were analyzed by the Ion Proton™ System (Life Technologies, Grand Island, NY, USA) with an average 0.3× sequencing coverage per nucleotide. The average number of total raw reads per sample was 6.5 million and the mean rate of uniquely mapped reads was 59.0%. The results of this study were derived from BWA mapping. A z-score was used for fetal trisomy 18 and 21 detection.

Results

Interactive dot diagrams showed the minimal z-score values to discriminate negative versus positive cases of fetal trisomy 18 and 21. For fetal trisomy 18, the minimal z-score value of 2.459 showed 100% positive predictive and negative predictive values. The minimal z-score of 2.566 was used to classify negative versus positive cases of fetal trisomy 21.
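A z-score test of this kind can be sketched in Python as follows. The abstract does not describe how the z-score was computed, so the formulation here is an assumption: the fraction of uniquely mapped reads on the chromosome of interest is compared with the mean and standard deviation of the same fraction in a reference set of unaffected samples. The function names and the dictionary of cutoffs are hypothetical; only the cutoff values 2.459 and 2.566 come from the results above.

```python
from statistics import mean, stdev

# Minimal z-score cutoffs reported in the abstract, keyed by chromosome.
Z_CUTOFFS = {"chr18": 2.459, "chr21": 2.566}


def chromosome_fraction(chrom_reads, total_reads):
    """Fraction of uniquely mapped reads that fall on one chromosome."""
    return chrom_reads / total_reads


def z_score(sample_fraction, reference_fractions):
    """Standardize a sample's chromosome fraction against a reference set (assumed method)."""
    mu = mean(reference_fractions)
    sigma = stdev(reference_fractions)  # sample standard deviation
    return (sample_fraction - mu) / sigma


def is_positive(chrom, sample_fraction, reference_fractions):
    """Call a sample positive when its z-score reaches the cutoff for that chromosome."""
    return z_score(sample_fraction, reference_fractions) >= Z_CUTOFFS[chrom]
```

An excess of reads from chromosome 18 or 21 raises that chromosome's fraction above the reference distribution, so a trisomy shows up as a large positive z-score.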

Conclusion

These results provide evidence that fetal trisomy 18 and 21 detection can be performed with a semiconductor sequencer. Our data also suggest that a prospective study should be performed with a larger cohort of clinically diverse obstetrics patients.