Similar Articles

20 similar articles found.
1.
The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback of most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs on the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than threefold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq studies. Compared with another hybrid correction tool, LSC achieves over double the sensitivity and similar specificity.
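The homopolymer compression (HC) transformation at the heart of LSC is easy to illustrate: every run of identical bases is collapsed to a single base, so homopolymer-length errors no longer break SR-LR alignment. A minimal sketch, assuming a simple string representation (the function name and return format are illustrative, not taken from the LSC source):

```python
def hc_compress(seq):
    """Collapse each homopolymer run to one base and record the run lengths.

    The run lengths are kept so alignment coordinates on the compressed
    sequence can be mapped back to the original read for correction.
    """
    if not seq:
        return "", []
    bases, lengths = [seq[0]], [1]
    for b in seq[1:]:
        if b == bases[-1]:
            lengths[-1] += 1
        else:
            bases.append(b)
            lengths.append(1)
    return "".join(bases), lengths

# A PacBio-style read whose homopolymer lengths may be miscalled:
print(hc_compress("AAACGGGGTTA"))  # ('ACGTA', [3, 1, 4, 2, 1])
```

Because both the short and long reads are compressed before alignment, a read with a miscalled homopolymer length still compresses to the same sequence, which is what raises alignment sensitivity without sacrificing accuracy.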

2.

Background

High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked.

Results

We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods involve the use of three different measures: analysis of base calls; analysis of k-mers; and analysis of distributions of k-mers. We apply our methodology to a wide range of short read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others.

Conclusions

The existence of biases in short read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.
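The reference-free measures above are simple enough to prototype directly. A minimal sketch of two of the three measures, per-position base-call composition and k-mer counting (names and the choice of k are illustrative, not the paper's exact parameters):

```python
from collections import Counter, defaultdict

def per_position_base_counts(reads):
    """Tally base calls at each read position; a strong skew toward one
    base at certain positions suggests positional bias."""
    counts = defaultdict(Counter)
    for read in reads:
        for pos, base in enumerate(read):
            counts[pos][base] += 1
    return counts

def kmer_counts(reads, k=5):
    """Count all k-mers across the read set; gross overrepresentation of,
    e.g., poly-base k-mers is a warning sign."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ACGTACGT", "AAAAAAAA", "ACGTTTTT"]
print(per_position_base_counts(reads)[0])      # composition at position 0
print(kmer_counts(reads, k=4).most_common(3))  # most frequent 4-mers
```

Neither measure needs a reference genome; both operate on the reads alone, which is the point of the methodology.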

3.
With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck, we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in SAM/BAM format. DistMap supports both paired-end and single-end reads, thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/
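DistMap delegates the actual distribution to Hadoop, but the underlying split-and-map pattern can be illustrated on one machine. A conceptual sketch only, using Python multiprocessing in place of Hadoop and a placeholder map_chunk standing in for whichever supported mapper is configured:

```python
from multiprocessing import Pool

def read_fastq_chunks(path, reads_per_chunk=100_000):
    """Yield chunks of FASTQ records (4 lines each) for independent mapping."""
    chunk, record = [], []
    with open(path) as fh:
        for line in fh:
            record.append(line)
            if len(record) == 4:
                chunk.append("".join(record))
                record = []
                if len(chunk) == reads_per_chunk:
                    yield chunk
                    chunk = []
    if chunk:
        yield chunk

def map_chunk(chunk):
    # Placeholder: in DistMap this step runs one of the nine supported
    # short-read mappers on the chunk and emits SAM records.
    return f"mapped {len(chunk)} reads"

if __name__ == "__main__":
    with Pool() as pool:
        for result in pool.imap(map_chunk, read_fastq_chunks("reads.fastq")):
            print(result)
```

Because each chunk is mapped independently, the work parallelizes almost perfectly, which is the property Hadoop exploits across a cluster.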

4.

Background

The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy for short reads, in the 25–100 base range, in the presence of errors and true biological variation.

Methodology

We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any current sequencing platform, allows for user-customizable levels of speed and accuracy, supports paired-end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.

Conclusions

We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
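The final local alignment step is standard Smith-Waterman with gaps. A compact textbook sketch of the scoring recursion (the score values are illustrative defaults, not BFAST's):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with linear gap costs;
    the gap moves allow small indels to be detected."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGTTGAC", "ACGTGAC"))  # 12: seven matches, one gap
```

In BFAST this step is applied only to the candidate locations returned by the index lookup, which is what keeps the quadratic-time alignment affordable at genome scale.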

5.
The perpetually increasing rate at which viral full-genome sequences are being determined is creating a pressing demand for computational tools that will aid the objective classification of these genome sequences. Taxonomic classification approaches that are based on pairwise genetic identity measures are potentially highly automatable and are progressively gaining favour with the International Committee on Taxonomy of Viruses (ICTV). There are, however, various issues with the calculation of such measures that could potentially undermine the accuracy and consistency with which they can be applied to virus classification. Firstly, pairwise sequence identities computed based on multiple sequence alignments rather than on multiple independent pairwise alignments can lead to the deflation of identity scores with increasing dataset sizes. Also, when gap characters need to be introduced during sequence alignments to account for insertions and deletions, methodological variations in the way that these characters are introduced and handled during pairwise genetic identity calculations can cause high degrees of inconsistency in the way that different methods classify the same sets of sequences. Here we present Sequence Demarcation Tool (SDT), a free, user-friendly computer program that aims to provide a robust and highly reproducible means of objectively using pairwise genetic identity calculations to classify any set of nucleotide or amino acid sequences. SDT can produce publication-quality pairwise identity plots and colour-coded distance matrices to further aid the classification of sequences according to ICTV-approved taxonomic demarcation criteria. Besides a graphical interface version of the program for Windows computers, command-line versions of the program are available for a variety of different operating systems (including a parallel version for cluster computing platforms).
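The gap-handling inconsistency that motivates SDT is easy to demonstrate: the same pairwise alignment yields different identity scores depending on how gapped columns are counted. A minimal sketch over two pre-aligned sequences (the policy names are illustrative, not SDT's actual options):

```python
def pairwise_identity(aln1, aln2, gap_policy="ignore"):
    """Percent identity between two aligned, equal-length sequences.

    gap_policy "ignore":   gapped columns are excluded from the denominator.
    gap_policy "mismatch": gapped columns count as mismatches.
    """
    assert len(aln1) == len(aln2)
    matches = compared = 0
    for x, y in zip(aln1, aln2):
        if x == "-" or y == "-":
            if gap_policy == "mismatch":
                compared += 1
            continue
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0

a, b = "ACGT--ACGT", "ACGTGGACGT"
print(pairwise_identity(a, b, "ignore"))    # 100.0
print(pairwise_identity(a, b, "mismatch"))  # 80.0
```

A 20-point swing from one methodological choice alone illustrates why a single reproducible convention, applied over independent pairwise alignments, matters for taxonomic demarcation.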

6.
Accelerated in vitro release testing methodology has been developed as an indicator of product performance to be used as a discriminatory quality control (QC) technique for the release of clinical and commercial batches of biodegradable microspheres. While product performance of biodegradable microspheres can be verified by in vivo and/or in vitro experiments, such evaluation can be particularly challenging because of slow polymer degradation, resulting in extended study times, labor, and expense. Three batches of leuprolide poly(lactic-co-glycolic acid) (PLGA) microspheres having varying morphology (process variants with different particle size and specific surface area) were manufactured by the solvent extraction/evaporation technique. Tests involving in vitro release, polymer degradation and hydration of the microspheres were performed on the three batches at 55°C. In vitro peptide release at 55°C was analyzed using a previously derived modification of the Weibull function termed the modified Weibull equation (MWE). Experimental observations and data analysis confirm excellent reproducibility within and between batches of the microsphere formulations, demonstrating the predictability of the accelerated experiments at 55°C. The accelerated test method also successfully distinguished the in vitro product performance of the three batches having varying morphology (process variants), indicating that it is a suitable QC tool to discriminate product or process variants in clinical or commercial batches of microspheres. Additionally, data analysis utilized the MWE to further quantify the differences obtained from the accelerated in vitro product performance test between process variants, thereby enhancing the discriminatory power of the accelerated methodology at 55°C.
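The modified Weibull equation itself is defined in the cited prior work and is not reproduced in the abstract, but the classic Weibull release function it builds on is standard: F(t) = Fmax * (1 - exp(-(t/td)^beta)). A sketch of fitting that base form to accelerated-release data (the data points below are hypothetical, for illustration only):

```python
import numpy as np
from scipy.optimize import curve_fit

def weibull_release(t, f_max, td, beta):
    """Cumulative percent released at time t (classic Weibull form).

    f_max: release plateau (%); td: time-scale parameter; beta: shape.
    The study's modified Weibull equation (MWE) alters this base form.
    """
    return f_max * (1.0 - np.exp(-(t / td) ** beta))

# Hypothetical 55°C accelerated-release measurements (illustrative only):
t_hours = np.array([0.5, 1, 2, 4, 8, 16, 24])
released = np.array([12, 22, 38, 60, 82, 95, 98])

params, _ = curve_fit(weibull_release, t_hours, released, p0=[100, 4, 1])
print("f_max=%.1f%%  td=%.2f h  beta=%.2f" % tuple(params))
```

Comparing fitted td and beta values across process variants is one way such a model quantifies differences in release behavior between batches.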

7.
Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission hence becomes an increasingly important challenge for NGS experiments. In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing a fast string-sorting algorithm called burstsort to sort read sequences in lexicographical order and then Elias omega-based integer coding to encode the sorted read sequences. SRComp has been benchmarked on four large NGS datasets, where experimental results show that it can run 5–35 times faster than current state-of-the-art read sequence compression tools such as BEETL and SCALCE, while retaining comparable compression efficiency for large collections of short read sequences. SRComp is a read sequence compression tool that is particularly valuable in certain applications where compression time is of major concern.
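The Elias omega code used for the second stage is a well-defined universal integer code and is short to sketch (encoder only, positive integers; this is the textbook construction, not SRComp's implementation):

```python
def elias_omega_encode(n):
    """Return the Elias omega code of a positive integer as a bit string.

    Repeatedly prepend the binary form of n, then replace n by one less
    than that form's bit length, until n reaches 1; a trailing '0' bit
    terminates the code.
    """
    assert n >= 1
    code = "0"
    while n > 1:
        binary = bin(n)[2:]
        code = binary + code
        n = len(binary) - 1
    return code

for n in (1, 2, 3, 16, 100):
    print(n, elias_omega_encode(n))
# 1 -> '0', 2 -> '100', 3 -> '110', 16 -> '10100100000', ...
```

Omega codes spend fewer bits on small values, which pays off once burstsort has placed lexicographically similar reads next to each other; the paper's exact read-to-integer mapping is not reproduced here.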

8.
9.
The presence of duplicates introduced by PCR amplification is a major issue in paired short reads from next-generation sequencing platforms. These duplicates might have a serious impact on research applications, such as scaffolding in whole-genome sequencing and discovering large-scale genome variations, and are usually removed. We present FastUniq as a fast de novo tool for removal of duplicates in paired short reads. FastUniq identifies duplicates by comparing sequences between read pairs and does not require complete genome sequences as prerequisites. FastUniq is capable of simultaneously handling reads of different lengths and runs efficiently, with running time increasing linearly at an average speed of 87 million reads per 10 minutes. FastUniq is freely available at http://sourceforge.net/projects/fastuniq/.
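The de novo duplicate test, comparing the two mate sequences of each pair against pairs already seen, can be sketched in a few lines (a conceptual sketch; FastUniq's actual implementation is engineered for speed and variable-length reads):

```python
def dedupe_pairs(pairs):
    """Drop PCR duplicates from paired reads without any reference genome.

    pairs: iterable of (read1_seq, read2_seq) tuples. A pair is a duplicate
    when both mates exactly match a previously seen pair.
    """
    seen = set()
    unique = []
    for r1, r2 in pairs:
        if (r1, r2) not in seen:
            seen.add((r1, r2))
            unique.append((r1, r2))
    return unique

pairs = [("ACGT", "TTGA"), ("ACGT", "TTGA"), ("ACGA", "TTGA")]
print(dedupe_pairs(pairs))  # the exact duplicate is removed
```

Because the test is pure sequence comparison, no alignment to a reference is needed, which is what makes the approach usable before assembly.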

10.
An algorithm is presented that returns the optimal pairwise gapped alignment of two sets of signed numerical sequence values. One distinguishing feature of this algorithm is a flexible comparison engine (based on both relative shape and absolute similarity measures) that does not rely on explicit gap penalties. Additionally, an empirical probability model is developed to estimate the significance of the returned alignment with respect to randomized data. The algorithm's utility for biological hypothesis formulation is demonstrated with test cases including database search and pairwise alignment of protein hydropathy. However, the algorithm and probability model could possibly be extended to accommodate other diverse types of protein or nucleic acid data, including positional thermodynamic stability and mRNA translation efficiency. The algorithm requires only numerical values as input and will readily compare data other than protein hydropathy. The tool is therefore expected to complement, rather than replace, existing sequence and structure based tools and may inform medical discovery, as exemplified by proposed similarity between a chlamydial ORFan protein and bacterial colicin pore-forming domain. The source code, documentation, and a basic web-server application are available.

11.
The presented study describes the development of a membrane permeation non-sink dissolution method that can provide analysis of complete drug speciation and emulate the in vivo performance of poorly water-soluble Biopharmaceutical Classification System class II compounds. The designed membrane permeation methodology permits evaluation of free/dissolved/unbound drug from amorphous solid dispersion formulations with the use of a two-cell apparatus, biorelevant dissolution media, and a biomimetic polymer membrane. It offers insight into oral drug dissolution, permeation, and absorption. Amorphous solid dispersions of felodipine were prepared by hot melt extrusion and spray drying techniques and evaluated for in vitro performance. Prior to ranking performance of extruded and spray-dried felodipine solid dispersions, optimization of the dissolution methodology was performed for parameters such as agitation rate, membrane type, and membrane pore size. The particle size and zeta potential were analyzed during dissolution experiments to understand drug/polymer speciation and supersaturation sustainment of felodipine solid dispersions. Bland-Altman analysis was performed to measure the agreement or equivalence between dissolution profiles acquired using polymer membranes and porcine intestines and to establish the biomimetic nature of the treated polymer membranes. The utility of the membrane permeation dissolution methodology is seen during the evaluation of felodipine solid dispersions produced by spray drying and hot melt extrusion. The membrane permeation dissolution methodology can suggest formulation performance and be employed as a screening tool for selection of candidates to move forward to pharmacokinetic studies. Furthermore, the presented model is a cost-effective technique.

12.
Defining the architecture of a specific cancer genome, including its structural variants, is essential for understanding tumor biology, mechanisms of oncogenesis, and for designing effective personalized therapies. Short read paired-end sequencing is currently the most sensitive method for detecting somatic mutations that arise during tumor development. However, mapping structural variants using this method leads to a large number of false positive calls, mostly due to the repetitive nature of the genome and the difficulty of assigning correct mapping positions to short reads. This study describes a method to efficiently identify large tumor-specific deletions, inversions, duplications and translocations from low coverage data using SVDetect or BreakDancer software and a set of novel filtering procedures designed to reduce false positive calls. Applying our method to a spontaneous T cell lymphoma arising in a core RAG2/p53-deficient mouse, we identified 40 validated tumor-specific structural rearrangements supported by as few as 2 independent read pairs.
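One filtering idea, demanding a minimum number of independent supporting read pairs per candidate call, is simple to express (field names are illustrative; the study's filter set is considerably more extensive):

```python
def filter_sv_calls(calls, min_support=2):
    """Keep structural-variant candidates backed by enough independent
    (non-duplicate) read pairs spanning the breakpoint."""
    return [c for c in calls if c["support"] >= min_support]

calls = [
    {"type": "DEL", "chrom": "chr6",  "support": 5},
    {"type": "INV", "chrom": "chr12", "support": 1},  # likely false positive
    {"type": "TRA", "chrom": "chr14", "support": 2},
]
print(filter_sv_calls(calls))  # drops the single-pair inversion call
```

A threshold as low as 2 read pairs is viable only in combination with the other filtering procedures the study describes.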

13.
Modern applications of Sanger DNA sequencing often require converting a large number of chromatogram trace files into high-quality DNA sequences for downstream analyses. Relatively few nonproprietary software tools are available to assist with this process. SeqTrace is a new, free, and open-source software application that is designed to automate the entire workflow by facilitating easy batch processing of large numbers of trace files. SeqTrace can identify, align, and compute consensus sequences from matching forward and reverse traces, filter low-quality base calls, and end-trim finished sequences. The software features a graphical interface that includes a full-featured chromatogram viewer and sequence editor. SeqTrace runs on most popular operating systems and is freely available, along with supporting documentation, at http://seqtrace.googlecode.com/.
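Of the steps SeqTrace automates, end-trimming of low-quality base calls is the simplest to sketch (the threshold is illustrative; SeqTrace exposes its own configurable trimming settings):

```python
def end_trim(seq, quals, min_qual=20):
    """Trim consecutive low-quality base calls from both ends of a
    finished sequence, given per-base Phred-style quality scores."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return seq[start:end]

seq = "NNACGTACGTACGTNN"
quals = [5, 8] + [30] * 12 + [6, 4]
print(end_trim(seq, quals))  # -> "ACGTACGTACGT"
```

The same quality scores also drive the low-quality base-call filtering step, where uncertain calls inside the read are flagged for review rather than trimmed away.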

14.
Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.

15.
16.
A hybrid de novo assembly pipeline was constructed to utilize both MiSeq and SOLiD short read data in combination in the assembly. The short read data were converted to a standard format of the pipeline, and were supplied to the pipeline components such as ABySS and SOAPdenovo. The assembly pipeline proceeded through several stages, and either MiSeq paired-end data, SOLiD mate-paired data, or both could be specified as input data at each stage separately. The pipeline was examined on the filamentous fungus Aspergillus oryzae RIB40 by aligning the assembly results against the reference sequences. Using both the MiSeq and the SOLiD data in the hybrid assembly, the alignment length was improved by a factor of 3 to 8 compared with the assemblies using either data type alone. The number of reproduced gene cluster regions encoding secondary metabolite biosyntheses (SMB) was also improved by the hybrid assemblies. These results imply that the MiSeq data, with their long read length, are essential for constructing accurate nucleotide sequences, while the SOLiD mate-paired reads, with their long insert length, enhance long-range arrangement of the sequences. The pipeline was also tested on the actinomycete Streptomyces avermitilis MA-4680, whose genome is known to have high GC content. Although the quality of the SOLiD reads was too low to perform any meaningful assemblies by themselves, the alignment length to the reference was improved by a factor of 2 compared with the assembly using only the MiSeq data.

17.
Pursuit of the triple bottom line of economic, community and ecological sustainability has increased the complexity of fishery management; fisheries assessments require new types of data and analysis to guide science-based policy in addition to traditional biological information and modeling. We introduce the Fishery Performance Indicators (FPIs), a broadly applicable and flexible tool for assessing performance in individual fisheries, and for establishing cross-sectional links between enabling conditions, management strategies and triple bottom line outcomes. Conceptually separating measures of performance, the FPIs use 68 individual outcome metrics—coded on a 1 to 5 scale based on expert assessment to facilitate application to data-poor fisheries and sectors—that can be partitioned into sector-based or triple-bottom-line sustainability-based interpretative indicators. Variation among outcomes is explained with 54 similarly structured metrics of inputs, management approaches and enabling conditions. Using 61 initial fishery case studies drawn from industrial and developing countries around the world, we demonstrate the inferential importance of tracking economic and community outcomes, in addition to resource status.

18.
Ion exchange chromatography has emerged as a reliable alternative to classic CsCl-ethidium bromide gradients for isolating nucleic acids of the highest purity. A plasmid purification method based on a unique anion exchange membrane (IEXM) was developed for the production of superior quality plasmids. This method was simpler and more efficient than conventional bead-based methods. Plasmids were extracted from bacterial cells through alkaline lysis. The crude lysate was clarified by a sequential filtration device that removed not only cell debris but micellar aggregates as well. The clarified lysate was mixed with an extraction solution and loaded into a spin column containing IEXM. Binding, washing, and elution conditions were optimized to achieve efficient isolation of plasmids from the impurities. IEXM had an exceedingly high dynamic binding capacity, excellent selectivity, and a near 100% recovery for plasmids. The binding capacity for pUC19 was 2.93 mg/cm3 of IEXM, which is several times greater than the values for conventional ion exchange beads. The superior selectivity of the method was reflected in the extremely low levels of endotoxin, and thus it is well-suited for critical applications in eukaryotic systems.

19.
Optical mapping by direct visualization of individual DNA molecules, stretched in nanochannels with sequence-specific fluorescent labeling, represents a promising tool for disease diagnostics and genomics. An important challenge for this technique is thermal motion of the DNA as it undergoes imaging; this blurs fluorescent patterns along the DNA and results in information loss. Correcting for this effect (a process referred to as kymograph alignment) is a common preprocessing step in nanochannel-based optical mapping workflows, and we present here a highly efficient algorithm to accomplish this via pattern recognition. We compare our method with the single previously published approach, and we find that our method is orders of magnitude faster while producing data of similar quality. We demonstrate proof of principle of our approach on experimental data consisting of melt-mapped bacteriophage DNA.
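Kymograph alignment corrects for the molecule's thermal drift between frames. The paper's pattern-recognition algorithm is not reproduced here, but the basic operation, shifting each time trace to best agree with a reference, can be sketched with plain cross-correlation (numpy; an illustrative stand-in, not the published method):

```python
import numpy as np

def align_kymograph(kymo, max_shift=20):
    """Shift each row (time frame) of a kymograph to best match frame 0.

    kymo: 2D float array, rows = frames, columns = intensity along the
    channel. Uses brute-force cross-correlation over small shifts;
    np.roll wraps at the edges, which is acceptable when the molecule
    sits well inside the field of view.
    """
    ref = kymo[0] - kymo[0].mean()
    aligned = np.empty_like(kymo)
    aligned[0] = kymo[0]
    shifts = range(-max_shift, max_shift + 1)
    for i in range(1, len(kymo)):
        row = kymo[i] - kymo[i].mean()
        scores = [np.dot(ref, np.roll(row, s)) for s in shifts]
        aligned[i] = np.roll(kymo[i], shifts[int(np.argmax(scores))])
    return aligned
```

Averaging the aligned rows then recovers a sharp fluorescence barcode instead of a motion-blurred one.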

20.
In this issue of The Journal, an article by Schalkwyk et al.1 shows the landscape of allele-specific DNA methylation (ASM) in the human genome. ASM has long been studied as a hallmark of imprinted genes, and a chromosome-wide version of this phenomenon occurs, in a random fashion, during X chromosome inactivation in female cells. But the type of ASM motivating the study by Schalkwyk et al. is different. They used a high-resolution, methylation-sensitive SNP array (MSNP) method for genome-wide profiling of ASM in total peripheral-blood leukocytes (PBL) and buccal cells from a series of monozygotic twin pairs. Their data bring a new level of detail to our knowledge of a newly recognized phenomenon—nonimprinted, sequence-dependent ASM. They document the widespread occurrence of this phenomenon among human genes and discuss its basic implications for gene regulation and genetic-epigenetic interactions. But this paper and recent work from other laboratories2,3 raise the possibility of a more immediate and practical application for ASM mapping, namely to help extract maximum information from genome-wide association studies.
