Related Articles
20 related articles retrieved.
1.

Background

NGS data contain many machine-induced errors. The most advanced error correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in the NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, whereas a weak k-mer most likely does. An intensively investigated problem is how to find a good frequency cutoff f0 that balances the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to (i) remove the small subset of solid k-mers that are likely to contain errors, and (ii) add to the remaining solid k-mers the small subset of weak k-mers that are likely to be error-free. Identifying these two subsets of k-mers can improve the correction performance.

Results

We propose to model the frequencies of erroneous k-mers with a Gamma distribution and the frequencies of correct k-mers with a mixture of Gaussian distributions, and to combine the two models to determine f0. To identify the two special subsets of k-mers, we use the z-score of a k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically solid k-mers are then used to construct a Bloom filter for error correction. Our method is markedly superior to state-of-the-art methods on both real and synthetic NGS data sets.
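As an illustration of the z-score idea, the sketch below (Python, not the authors' implementation) counts k-mers, estimates the mean and standard deviation of frequencies above the cutoff f0 as a crude stand-in for the fitted Gaussian component of correct k-mers, and rescues weak k-mers whose frequencies are statistically close to that mean; the threshold z_min is an assumed, illustrative value.

```python
from collections import Counter
from statistics import mean, stdev

def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def statistically_solid_kmers(counts, f0, z_min=-2.0):
    """Refine the cutoff-based solid/weak split with a z-score.

    The mean and standard deviation are estimated from k-mers whose
    frequency is at least f0 (a rough proxy for the correct-k-mer
    distribution); z_min is an illustrative threshold, not a value
    from the paper.  Weak k-mers whose z-score is above z_min are
    rescued into the solid set.
    """
    solid_freqs = [f for f in counts.values() if f >= f0]
    mu, sigma = mean(solid_freqs), stdev(solid_freqs)
    solid = set()
    for kmer, f in counts.items():
        z = (f - mu) / sigma
        if f >= f0 or z >= z_min:
            solid.add(kmer)
    return solid
```

In the pipeline described above, the resulting set would then be loaded into a Bloom filter and queried during read correction.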

Conclusion

The z-score is adequate for distinguishing solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers of very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.

2.
3.
4.
5.
Numerous water relation studies have routinely used thermocouple hygrometers. However, accurate temperature correction of hygrometer calibration curve slopes seems to have been largely neglected in both psychrometric and dewpoint techniques.

6.
Protein-mediated error correction for de novo DNA synthesis
The availability of inexpensive, on-demand synthetic DNA has enabled numerous powerful applications in biotechnology, in turn driving considerable present interest in the de novo synthesis of increasingly longer DNA constructs. The synthesis of DNA from oligonucleotides into products even as large as small viral genomes has been accomplished. Despite such achievements, the costs and time required to generate such long constructs have, to date, precluded gene-length (and longer) DNA synthesis from being an everyday research tool in the same manner as PCR and DNA sequencing. A critical barrier to low-cost, high-throughput de novo DNA synthesis is the frequency at which errors pervade the final product. Here, we employ a DNA mismatch-binding protein, MutS (from Thermus aquaticus), to remove failure products from synthetic genes. This method reduced errors by >15-fold relative to conventional gene synthesis techniques, yielding DNA with one error per 10,000 base pairs. The approach is general, scalable and can be iterated multiple times for greater fidelity. Reductions in both the cost and the time required are demonstrated for the synthesis of a 2.5 kb gene.

7.
8.

Background

Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, the underlying algorithms are challenged by the high error rate. Hybrid error correction methods, in which accurate short reads are used to correct noisy long reads, are therefore an attractive way to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correction method has been proposed, in which the second generation data is first assembled into a de Bruijn graph, onto which the long reads are then aligned.

Results

In this context we present Jabba, a hybrid method that corrects long third generation reads by mapping them onto a corrected de Bruijn graph constructed from second generation data. Unique to our method is the use of a pseudo-alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, we present theoretical results on the possibilities and limitations of using MEMs in the context of third generation reads.
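The seeding stage can be pictured with the following minimal sketch (a simplification under assumed parameters, not Jabba's actual data structures): it finds all maximal exact matches of length at least k between a long read and one node sequence of the graph by hashing the node's k-mers and extending each shared k-mer in both directions.

```python
def find_mems(read, ref, k=15):
    """Find maximal exact matches (MEMs) of length >= k between a read
    and a reference string (e.g. one de Bruijn graph node sequence).

    k=15 is an illustrative seed length.  Each shared k-mer is extended
    left and right as far as the two sequences agree; deduplication via
    a set leaves one entry per MEM.
    """
    index = {}
    for j in range(len(ref) - k + 1):
        index.setdefault(ref[j:j + k], []).append(j)

    mems = set()
    for i in range(len(read) - k + 1):
        for j in index.get(read[i:i + k], []):
            a, b = i, j
            while a > 0 and b > 0 and read[a - 1] == ref[b - 1]:
                a, b = a - 1, b - 1            # extend to the left
            x, y = i + k, j + k
            while x < len(read) and y < len(ref) and read[x] == ref[y]:
                x, y = x + 1, y + 1            # extend to the right
            mems.add((a, b, x - a))            # (read start, ref start, length)
    return sorted(mems)
```

In a full pipeline, such seeds would then be chained along graph paths and the read corrected from the spelled path; that chaining step is the substance of the method and is not reproduced here.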

Conclusion

Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using very little CPU time. From this we conclude that pseudo-alignment with MEMs is a fast and reliable method to map long, highly erroneous sequences onto a de Bruijn graph.

9.
There has been a rapid proliferation of approaches for processing and manipulating second generation DNA sequence data. However, users are often left with uncertainties about how the choice of processing methods may impact biological interpretation of data. In this report, we probe differences in output between two different processing pipelines: a de-noising approach using the AmpliconNoise algorithm for error correction, and a standard approach using quality filtering and preclustering to reduce error. There was a large overlap in reads culled by each method, although AmpliconNoise removed a greater net number of reads. Most OTUs produced by one method had a clearly corresponding partner in the other. Although each method resulted in OTUs consisting entirely of reads that were culled by the other method, there were many more such OTUs formed in the standard pipeline. Total OTU richness was reduced by AmpliconNoise processing, but per-sample OTU richness, diversity and evenness were increased. Increases in per-sample richness and diversity may be a result of AmpliconNoise processing producing a more even OTU rank-abundance distribution. Because communities were randomly subsampled to equalize sample size across communities, and because rare sequence variants are less likely to be selected during subsampling, fewer OTUs were lost from individual communities when subsampling AmpliconNoise-processed data. In contrast to taxon-based diversity estimates, phylogenetic diversity was reduced even on a per-sample basis by de-noising, and samples switched widely in diversity rankings. This work illustrates the significant impacts of processing pipelines on the biological interpretations that can be made from pyrosequencing surveys. This study provides important cautions for analyses of contemporary data, for requisite data archiving (processed vs. non-processed data), and for drawing comparisons among studies performed using distinct data processing pipelines.
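The per-sample comparisons described above rest on subsampling communities to equal depth and then computing richness, diversity and evenness; a generic sketch of those calculations (Python with NumPy, not the authors' pipeline, and the function names are assumptions) is:

```python
import numpy as np

def rarefy(otu_counts, depth, rng=None):
    """Randomly subsample one community to a fixed read depth.

    otu_counts: 1-D integer array of reads per OTU for a single sample.
    Raises ValueError if depth exceeds the total read count.
    """
    rng = rng or np.random.default_rng()
    reads = np.repeat(np.arange(len(otu_counts)), otu_counts)
    picked = rng.choice(reads, size=depth, replace=False)
    return np.bincount(picked, minlength=len(otu_counts))

def richness_shannon_evenness(otu_counts):
    """Per-sample OTU richness, Shannon diversity and Pielou evenness."""
    counts = otu_counts[otu_counts > 0]
    p = counts / counts.sum()
    richness = len(counts)
    shannon = float(-(p * np.log(p)).sum())
    evenness = shannon / np.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness
```

Because rare sequence variants are unlikely to survive the random draw, a pipeline that retains many low-abundance OTUs loses more of them per sample at this step, consistent with the per-sample richness pattern reported above.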

10.
Generation of high-affinity monoclonal antibodies by immunization of chickens is a valuable strategy, particularly for obtaining antibodies directed against epitopes that are conserved in mammals. A generic procedure is established for the humanization of chicken-derived antibodies. To this end, high-affinity binders of the epidermal growth factor receptor extracellular domain are isolated from immunized chickens using yeast surface display. Complementarity determining regions (CDRs) of two high-affinity binders are grafted onto a human acceptor framework. Simultaneously, Vernier zone residues, responsible for the spatial arrangement of the CDRs, are partially randomized. A yeast surface display library comprising ≈300,000 variants is screened for high-affinity binders in the scFv and Fab formats. Next-generation sequencing reveals humanized antibody variants with restored affinity and improved protein characteristics compared to the parental chicken antibodies. Furthermore, the sequencing data give new insights into the importance of the antibody format used during the humanization process. Starting from the antibody repertoire of immunized chickens, this work features an effective and fast high-throughput approach for the generation of multiple humanized antibodies with potential therapeutic relevance.

11.
Stiber M. Bio Systems 2007, 89(1-3): 24-29.
This paper presents an investigation into the responses of neurons to errors in presynaptic spike trains. Errors are viewed, in nonlinear dynamical terms, as brief-duration changes in stationary presynaptic spike trains that induce transient responses in the postsynaptic cell. Because these transients are generally of large magnitude, linearized neural models are not helpful. Instead, the responses of a full, nonlinear physiological model of a neuron that includes the recognized living prototype of an inhibitory synapse are analyzed. More specifically, the transients are examined in the context of the stationary behaviors that precede and succeed each error. It is shown that one- and two-dimensional bifurcation diagrams can be constructed from the transient responses: there are marked changes in the transient responses at points that correspond to bifurcations in the stationary responses, qualitative changes in transients on either side of bifurcations, and only quantitative changes in transients between bifurcations.

12.
EST clustering error evaluation and correction
MOTIVATION: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in these analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated. RESULTS: We identify and quantify two types of EST clustering error, Type I and Type II, in EST clustering with the CAP3 assembly program. A Type I error occurs when ESTs from the same gene fail to form a single cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5' and 3' EST clustering, the Type I error rate in the 5' EST case is approximately 10 times higher than in the 3' EST case (30% versus 3%). An over-stringent identity rule, e.g., P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that approximately 80% of the Type I error is due to insufficient overlap among sibling ESTs (ISO error) in 5' EST clustering. A novel statistical approach is proposed to correct the ISO error and provide more accurate estimates of the true gene cluster profile.
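Given a set of ESTs with known true gene origins (e.g., in a simulation) and a clustering produced by an assembler, the two error types can be quantified with a simple sketch like the one below (Python; one straightforward way to operationalize the definitions, not necessarily the estimator used in the paper):

```python
from collections import defaultdict

def clustering_error_rates(true_gene, cluster):
    """Estimate Type I and Type II EST clustering error rates.

    true_gene: dict mapping EST id -> true gene id.
    cluster:   dict mapping EST id -> assigned cluster id.
    Type I  : the ESTs of one gene are split over more than one cluster
              (rate reported per gene).
    Type II : one cluster contains ESTs from more than one gene
              (rate reported per cluster).
    """
    gene_to_clusters = defaultdict(set)
    cluster_to_genes = defaultdict(set)
    for est, gene in true_gene.items():
        gene_to_clusters[gene].add(cluster[est])
        cluster_to_genes[cluster[est]].add(gene)

    type1 = sum(len(c) > 1 for c in gene_to_clusters.values()) / len(gene_to_clusters)
    type2 = sum(len(g) > 1 for g in cluster_to_genes.values()) / len(cluster_to_genes)
    return type1, type2
```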

13.
Stratified Cox regression models with a large number of strata and small stratum sizes are useful in many settings, including matched case-control family studies. In the presence of measurement error in covariates and a large number of strata, we show that extensions of existing methods fail either to reduce the bias or to correct the bias under nonsymmetric distributions of the true covariate or the error term. We propose a nonparametric correction method for the estimation of regression coefficients, and show that the estimators are asymptotically consistent for the true parameters. Small-sample properties are evaluated in a simulation study. The method is illustrated with an analysis of Framingham data.

14.
Efficient measurement error correction with spatially misaligned data
Association studies in environmental statistics often involve exposure and outcome data that are misaligned in space. A common strategy is to employ a spatial model such as universal kriging to predict exposures at locations with outcome data and then estimate a regression parameter of interest using the predicted exposures. This results in measurement error because the predicted exposures do not correspond exactly to the true values. We characterize the measurement error by decomposing it into Berkson-like and classical-like components. One correction approach is the parametric bootstrap, which is effective but computationally intensive since it requires solving a nonlinear optimization problem for the exposure model parameters in each bootstrap sample. We propose a less computationally intensive alternative termed the "parameter bootstrap" that only requires solving one nonlinear optimization problem, and we also compare bootstrap methods to other recently proposed methods. We illustrate our methodology in simulations and with publicly available data from the Environmental Protection Agency.
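Why the Berkson/classical decomposition matters can be seen in a toy simulation (Python with NumPy; an illustration of the two error types in ordinary linear regression, not the kriging-based methodology of the paper): classical error, where noise is added to the true exposure, attenuates the slope, whereas Berkson error, where the truth scatters around the assigned value, leaves it approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 50_000, 1.0

# Classical-like error: we observe w = x + u, noise added to the truth.
x = rng.normal(size=n)
w = x + rng.normal(scale=1.0, size=n)
y = beta * x + rng.normal(scale=0.5, size=n)

# Berkson-like error: the assigned exposure w_b is smooth/predicted,
# and the truth x_b scatters around it.
w_b = rng.normal(size=n)
x_b = w_b + rng.normal(scale=1.0, size=n)
y_b = beta * x_b + rng.normal(scale=0.5, size=n)

slope_classical = np.polyfit(w, y, 1)[0]    # attenuated, roughly 0.5 here
slope_berkson = np.polyfit(w_b, y_b, 1)[0]  # approximately 1.0
print(round(slope_classical, 2), round(slope_berkson, 2))
```

The kriging-based exposure predictions discussed in the abstract contain both kinds of components, which is what motivates the decomposition.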

15.
16.
Binary regression models for spatial data are commonly used in disciplines such as epidemiology and ecology. Many spatially referenced binary data sets suffer from location error, which occurs when the recorded location of an observation differs from its true location. When location error occurs, values of the covariates associated with the true spatial locations of the observations cannot be obtained. We show how a change of support (COS) can be applied to regression models for binary data to provide coefficient estimates when the true values of the covariates are unavailable, but the unknown locations of the observations are contained within nonoverlapping, arbitrarily shaped polygons. The COS accommodates spatial and nonspatial covariates and preserves the convenient interpretation of methods such as logistic and probit regression. Using a simulation experiment, we compare binary regression models with a COS to naive approaches that ignore location error. We illustrate the flexibility of the COS by modeling individual-level disease risk in a population using a binary data set in which the locations of the observations are unknown but contained within administrative units. Our simulation experiment and data illustration corroborate that conventional regression models for binary data that ignore location error are unreliable, but that the COS can be used to eliminate bias while preserving model choice.
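A toy one-dimensional sketch of the basic idea (Python with NumPy and statsmodels; the intervals, covariate surface, and coefficients are invented for illustration, and this is not the paper's spatial model): the point-level covariate is unavailable, so each observation is given the covariate averaged over the polygon (here, an interval) known to contain it, and a logistic regression is fitted to those change-of-support covariates.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Covariate surface on a fine grid over [0, 10]; observations fall into
# one of 20 "polygons" (intervals), but their exact locations are unknown.
grid = np.linspace(0, 10, 1000)
surface = np.sin(grid) + 0.3 * grid
edges = np.linspace(0, 10, 21)

n = 5000
true_loc = rng.uniform(0, 10, n)
x_true = np.interp(true_loc, grid, surface)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * x_true))))

# Change of support: replace the unavailable point-level covariate by its
# average over the polygon containing the observation.
poly = np.clip(np.digitize(true_loc, edges) - 1, 0, len(edges) - 2)
poly_mean = np.array([surface[(grid >= edges[j]) & (grid < edges[j + 1])].mean()
                      for j in range(len(edges) - 1)])
x_cos = poly_mean[poly]

fit = sm.Logit(y, sm.add_constant(x_cos)).fit(disp=0)
print(fit.params)   # slope recovered close to the true value of 0.8
```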

17.
The disulphide bond-introducing enzyme of bacteria, DsbA, sometimes oxidizes non-native cysteine pairs. DsbC should rearrange the resulting incorrect disulphide bonds into those with correct connectivity. DsbA and DsbC receive oxidizing and reducing equivalents, respectively, from the corresponding redox components of the cell (quinones and NADPH). Two mechanisms of disulphide bond rearrangement have been proposed. In the redox-neutral 'shuffling' mechanism, the nucleophilic cysteine in the DsbC active site forms a mixed disulphide with a substrate and induces disulphide shuffling within the substrate part of the enzyme–substrate complex, followed by resolution into a reduced enzyme and a disulphide-rearranged substrate. In the 'reduction–oxidation' mechanism, DsbC reduces substrates with wrong disulphides so that DsbA can oxidize them again. In this issue of Molecular Microbiology, Berkmen and his collaborators show that a disulphide reductase, TrxP, from an anaerobic bacterium can substitute for DsbC in Escherichia coli. They propose that the reduction–oxidation mechanism of disulphide rearrangement can indeed operate in vivo. An implication of this work is that correcting errors in disulphide bonds can be coupled to cellular metabolism and is conceptually similar to the proofreading processes observed in numerous synthesis and maturation reactions of biological macromolecules.

18.
19.
Haplotype reconstruction from SNP fragments by minimum error correction
MOTIVATION: Haplotype reconstruction based on aligned single nucleotide polymorphism (SNP) fragments aims to infer a pair of haplotypes from localized polymorphism data gathered through short genome fragment assembly. An important computational model of this problem is the minimum error correction (MEC) model, which has been discussed in several publications. The model retrieves a pair of haplotypes by correcting a minimum number of SNP values in the given genome fragments coming from an individual's DNA. RESULTS: In the first part of this paper, an exact algorithm for the MEC model is presented. Owing to the NP-hardness of the MEC model, we also design a genetic algorithm (GA). The GA is intended to solve large problem instances and performs very well. The strengths and weaknesses of the MEC model are shown using experimental results on real and simulated data. In the second part of this paper, to improve the MEC model for haplotype reconstruction, a new computational model is proposed that additionally employs the genotype information of the individual during SNP correction; it is called MEC with genotype information (MEC/GI for short). Computational results on extensive datasets show that the new model has much higher accuracy in haplotype reconstruction than the pure MEC model.
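The MEC objective that both algorithms minimize can be stated compactly: assign each fragment to whichever candidate haplotype it disagrees with least, and count the SNP values that would have to be corrected. A minimal sketch (Python; the data layout and names are assumptions for illustration):

```python
def mec_cost(fragments, h1, h2):
    """Minimum error correction (MEC) cost of a candidate haplotype pair.

    fragments: list of dicts mapping SNP index -> observed allele (0/1),
               one dict per sequenced fragment (uncovered sites absent).
    h1, h2   : candidate haplotypes as 0/1 lists over all SNP sites.
    """
    total = 0
    for frag in fragments:
        d1 = sum(allele != h1[i] for i, allele in frag.items())
        d2 = sum(allele != h2[i] for i, allele in frag.items())
        total += min(d1, d2)   # fragment is assigned to the closer haplotype
    return total

# Two fragments scored against a candidate haplotype pair: the cost is 1,
# since the second fragment needs one SNP value corrected.
frags = [{0: 0, 1: 0, 2: 1}, {1: 1, 2: 0, 3: 1}]
print(mec_cost(frags, h1=[0, 0, 1, 1], h2=[1, 1, 0, 0]))   # -> 1
```

The exact algorithm and the GA described above both search the space of haplotype pairs for one minimizing this cost; MEC/GI additionally brings the individual's genotype information into the correction step.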

20.
Jain et al. introduced the Local Pooled Error (LPE) statistical test designed for use with small-sample-size microarray gene-expression data. Based on an asymptotic proof, the test multiplicatively adjusts the standard error for a test of differences between two classes of observations by π/2, owing to the use of medians rather than means as measures of central tendency. The adjustment is upwardly biased at small sample sizes, however, producing fewer than expected small P-values with a consequent loss of statistical power. We present an empirical correction to the adjustment factor which removes the bias and produces theoretically expected P-values when distributional assumptions are met. Our adjusted LPE measure should prove useful to ongoing methodological studies designed to improve the LPE's performance for microarray and proteomics applications and to future work on other high-throughput biotechnologies. AVAILABILITY: The software is implemented in the R language and can be downloaded from the Bioconductor project website (http://www.bioconductor.org).
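The asymptotic fact behind the π/2 factor, and the small-sample bias described above, can be checked with a short simulation (Python with NumPy, an illustration rather than the LPE code itself): for normal data the variance of the sample median tends to (π/2)·σ²/n, but the inflation ratio is below π/2 at small n, so applying the flat asymptotic factor overstates the standard error there.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, reps = 1.0, 100_000

for n in (3, 10, 100):
    medians = np.median(rng.normal(0.0, sigma, size=(reps, n)), axis=1)
    ratio = medians.var() / (sigma**2 / n)   # variance inflation of the median
    print(n, round(ratio, 3))                # approaches pi/2 ~ 1.571 as n grows
```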
