Related Articles
20 related articles retrieved.
1.

Background

NGS data contain many machine-induced errors. The most advanced error correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in the NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, whereas a weak k-mer most likely does. An intensively investigated problem is how to find a good frequency cutoff f0 that balances the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to (i) remove the small subset of solid k-mers that are likely to contain errors, and (ii) add to the remaining solid k-mers the small subset of weak k-mers that are likely to be error-free. Identifying these two subsets of k-mers can improve the correction performance.

Results

We propose to model the frequencies of erroneous k-mers with a Gamma distribution and the frequencies of correct k-mers with a mixture of Gaussian distributions, and to combine the two models to determine f0. To identify the two special subsets of k-mers, we use the z-score of a k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically solid k-mers are then used to construct a Bloom filter for error correction. Our method is markedly superior to state-of-the-art methods on both real and synthetic NGS data sets.
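As an illustration of the z-score idea, the sketch below (Python, not the authors' implementation) counts k-mers, estimates the mean and standard deviation of frequencies above the cutoff f0 as a crude stand-in for the fitted Gaussian component of correct k-mers, and rescues weak k-mers whose frequencies are statistically close to that mean; the threshold z_min is an assumed, illustrative value.

```python
from collections import Counter
from statistics import mean, stdev

def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def statistically_solid_kmers(counts, f0, z_min=-2.0):
    """Refine the cutoff-based solid/weak split with a z-score.

    The mean and standard deviation are estimated from k-mers whose
    frequency is at least f0 (a rough proxy for the correct-k-mer
    distribution); z_min is an illustrative threshold, not a value
    from the paper.  Weak k-mers whose z-score is above z_min are
    rescued into the solid set.
    """
    solid_freqs = [f for f in counts.values() if f >= f0]
    mu, sigma = mean(solid_freqs), stdev(solid_freqs)
    solid = set()
    for kmer, f in counts.items():
        z = (f - mu) / sigma
        if f >= f0 or z >= z_min:
            solid.add(kmer)
    return solid
```

In the pipeline described above, the resulting set would then be loaded into a Bloom filter and queried during read correction.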

Conclusion

The z-score is adequate for distinguishing solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers of very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.

2.
3.
4.
5.
Numerous water relation studies have routinely used thermocouple hygrometers. However, accurate temperature correction of hygrometer calibration curve slopes seems to have been largely neglected in both psychrometric and dewpoint techniques.

6.
Protein-mediated error correction for de novo DNA synthesis
The availability of inexpensive, on-demand synthetic DNA has enabled numerous powerful applications in biotechnology, in turn driving considerable present interest in the de novo synthesis of increasingly longer DNA constructs. The synthesis of DNA from oligonucleotides into products even as large as small viral genomes has been accomplished. Despite such achievements, the costs and time required to generate such long constructs have, to date, precluded gene-length (and longer) DNA synthesis from being an everyday research tool in the same manner as PCR and DNA sequencing. A critical barrier to low-cost, high-throughput de novo DNA synthesis is the frequency at which errors pervade the final product. Here, we employ a DNA mismatch-binding protein, MutS (from Thermus aquaticus), to remove failure products from synthetic genes. This method reduced errors by >15-fold relative to conventional gene synthesis techniques, yielding DNA with one error per 10,000 base pairs. The approach is general, scalable and can be iterated multiple times for greater fidelity. Reductions in both the cost and the time required are demonstrated for the synthesis of a 2.5 kb gene.

7.
8.

Background

Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, the underlying algorithms are challenged by the high error rate. Hybrid error correction methods, in which accurate short reads are used to correct noisy long reads, are therefore an attractive way to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correction method has been proposed, in which the second generation data is first assembled into a de Bruijn graph, onto which the long reads are then aligned.

Results

In this context we present Jabba, a hybrid method that corrects long third generation reads by mapping them onto a corrected de Bruijn graph constructed from second generation data. Unique to our method is the use of a pseudo-alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, we present theoretical results on the possibilities and limitations of using MEMs in the context of third generation reads.
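The seeding stage can be pictured with the following minimal sketch (a simplification under assumed parameters, not Jabba's actual data structures): it finds all maximal exact matches of length at least k between a long read and one node sequence of the graph by hashing the node's k-mers and extending each shared k-mer in both directions.

```python
def find_mems(read, ref, k=15):
    """Find maximal exact matches (MEMs) of length >= k between a read
    and a reference string (e.g. one de Bruijn graph node sequence).

    k=15 is an illustrative seed length.  Each shared k-mer is extended
    left and right as far as the two sequences agree; deduplication via
    a set leaves one entry per MEM.
    """
    index = {}
    for j in range(len(ref) - k + 1):
        index.setdefault(ref[j:j + k], []).append(j)

    mems = set()
    for i in range(len(read) - k + 1):
        for j in index.get(read[i:i + k], []):
            a, b = i, j
            while a > 0 and b > 0 and read[a - 1] == ref[b - 1]:
                a, b = a - 1, b - 1            # extend to the left
            x, y = i + k, j + k
            while x < len(read) and y < len(ref) and read[x] == ref[y]:
                x, y = x + 1, y + 1            # extend to the right
            mems.add((a, b, x - a))            # (read start, ref start, length)
    return sorted(mems)
```

In a full pipeline, such seeds would then be chained along graph paths and the read corrected from the spelled path; that chaining step is the substance of the method and is not reproduced here.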

Conclusion

Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using very little CPU time. From this we conclude that pseudo-alignment with MEMs is a fast and reliable method to map long, highly erroneous sequences onto a de Bruijn graph.

9.
There has been a rapid proliferation of approaches for processing and manipulating second generation DNA sequence data. However, users are often left with uncertainties about how the choice of processing methods may impact biological interpretation of data. In this report, we probe differences in output between two different processing pipelines: a de-noising approach using the AmpliconNoise algorithm for error correction, and a standard approach using quality filtering and preclustering to reduce error. There was a large overlap in reads culled by each method, although AmpliconNoise removed a greater net number of reads. Most OTUs produced by one method had a clearly corresponding partner in the other. Although each method resulted in OTUs consisting entirely of reads that were culled by the other method, there were many more such OTUs formed in the standard pipeline. Total OTU richness was reduced by AmpliconNoise processing, but per-sample OTU richness, diversity and evenness were increased. Increases in per-sample richness and diversity may be a result of AmpliconNoise processing producing a more even OTU rank-abundance distribution. Because communities were randomly subsampled to equalize sample size across communities, and because rare sequence variants are less likely to be selected during subsampling, fewer OTUs were lost from individual communities when subsampling AmpliconNoise-processed data. In contrast to taxon-based diversity estimates, phylogenetic diversity was reduced even on a per-sample basis by de-noising, and samples switched widely in diversity rankings. This work illustrates the significant impacts of processing pipelines on the biological interpretations that can be made from pyrosequencing surveys. This study provides important cautions for analyses of contemporary data, for requisite data archiving (processed vs. non-processed data), and for drawing comparisons among studies performed using distinct data processing pipelines.
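The per-sample comparisons described above rest on subsampling communities to equal depth and then computing richness, diversity and evenness; a generic sketch of those calculations (Python with NumPy, not the authors' pipeline, and the function names are assumptions) is:

```python
import numpy as np

def rarefy(otu_counts, depth, rng=None):
    """Randomly subsample one community to a fixed read depth.

    otu_counts: 1-D integer array of reads per OTU for a single sample.
    Raises ValueError if depth exceeds the total read count.
    """
    rng = rng or np.random.default_rng()
    reads = np.repeat(np.arange(len(otu_counts)), otu_counts)
    picked = rng.choice(reads, size=depth, replace=False)
    return np.bincount(picked, minlength=len(otu_counts))

def richness_shannon_evenness(otu_counts):
    """Per-sample OTU richness, Shannon diversity and Pielou evenness."""
    counts = otu_counts[otu_counts > 0]
    p = counts / counts.sum()
    richness = len(counts)
    shannon = float(-(p * np.log(p)).sum())
    evenness = shannon / np.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness
```

Because rare sequence variants are unlikely to survive the random draw, a pipeline that retains many low-abundance OTUs loses more of them per sample at this step, consistent with the per-sample richness pattern reported above.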

10.
Generation of high-affinity monoclonal antibodies by immunization of chickens is a valuable strategy, particularly for obtaining antibodies directed against epitopes that are conserved in mammals. A generic procedure is established for the humanization of chicken-derived antibodies. To this end, high-affinity binders of the epidermal growth factor receptor extracellular domain are isolated from immunized chickens using yeast surface display. Complementarity determining regions (CDRs) of two high-affinity binders are grafted onto a human acceptor framework. Simultaneously, Vernier zone residues, responsible for the spatial arrangement of the CDRs, are partially randomized. A yeast surface display library comprising ≈300,000 variants is screened for high-affinity binders in the scFv and Fab formats. Next-generation sequencing reveals humanized antibody variants with restored affinity and improved protein characteristics compared to the parental chicken antibodies. Furthermore, the sequencing data give new insights into the importance of the antibody format used during the humanization process. Starting from the antibody repertoire of immunized chickens, this work features an effective and fast high-throughput approach for the generation of multiple humanized antibodies with potential therapeutic relevance.

11.
Stiber M. Bio Systems 2007, 89(1-3): 24-29.
This paper presents an investigation into the responses of neurons to errors in presynaptic spike trains. Errors are viewed, in nonlinear dynamical terms, as brief-duration changes in stationary presynaptic spike trains that induce transient responses in the postsynaptic cell. Because these transients are generally of large magnitude, linearized neural models are not helpful. Instead, the responses of a full, nonlinear physiological model of a neuron that includes the recognized living prototype of an inhibitory synapse are analyzed. More specifically, the transients are examined in the context of the stationary behaviors that precede and succeed each error. It is shown that one- and two-dimensional bifurcation diagrams can be constructed from the transient responses: there are marked changes in the transient responses at points that correspond to bifurcations in the stationary responses, qualitative changes in transients on either side of bifurcations, and only quantitative changes in transients between bifurcations.

12.
EST clustering error evaluation and correction
MOTIVATION: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in these analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated. RESULTS: We identify and quantify two types of EST clustering error, Type I and Type II, in EST clustering with the CAP3 assembly program. A Type I error occurs when ESTs from the same gene fail to form a single cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5' and 3' EST clustering, the Type I error rate in the 5' EST case is approximately 10 times higher than in the 3' EST case (30% versus 3%). An over-stringent identity rule, e.g., P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that approximately 80% of the Type I error is due to insufficient overlap among sibling ESTs (ISO error) in 5' EST clustering. A novel statistical approach is proposed to correct the ISO error and provide more accurate estimates of the true gene cluster profile.
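Given a set of ESTs with known true gene origins (e.g., in a simulation) and a clustering produced by an assembler, the two error types can be quantified with a simple sketch like the one below (Python; one straightforward way to operationalize the definitions, not necessarily the estimator used in the paper):

```python
from collections import defaultdict

def clustering_error_rates(true_gene, cluster):
    """Estimate Type I and Type II EST clustering error rates.

    true_gene: dict mapping EST id -> true gene id.
    cluster:   dict mapping EST id -> assigned cluster id.
    Type I  : the ESTs of one gene are split over more than one cluster
              (rate reported per gene).
    Type II : one cluster contains ESTs from more than one gene
              (rate reported per cluster).
    """
    gene_to_clusters = defaultdict(set)
    cluster_to_genes = defaultdict(set)
    for est, gene in true_gene.items():
        gene_to_clusters[gene].add(cluster[est])
        cluster_to_genes[cluster[est]].add(gene)

    type1 = sum(len(c) > 1 for c in gene_to_clusters.values()) / len(gene_to_clusters)
    type2 = sum(len(g) > 1 for g in cluster_to_genes.values()) / len(cluster_to_genes)
    return type1, type2
```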

13.
Stratified Cox regression models with a large number of strata and small stratum sizes are useful in many settings, including matched case-control family studies. In the presence of measurement error in covariates and a large number of strata, we show that extensions of existing methods fail either to reduce the bias or to correct the bias under nonsymmetric distributions of the true covariate or the error term. We propose a nonparametric correction method for the estimation of regression coefficients, and show that the estimators are asymptotically consistent for the true parameters. Small-sample properties are evaluated in a simulation study. The method is illustrated with an analysis of Framingham data.

14.
Efficient measurement error correction with spatially misaligned data
Association studies in environmental statistics often involve exposure and outcome data that are misaligned in space. A common strategy is to employ a spatial model such as universal kriging to predict exposures at locations with outcome data and then estimate a regression parameter of interest using the predicted exposures. This results in measurement error because the predicted exposures do not correspond exactly to the true values. We characterize the measurement error by decomposing it into Berkson-like and classical-like components. One correction approach is the parametric bootstrap, which is effective but computationally intensive since it requires solving a nonlinear optimization problem for the exposure model parameters in each bootstrap sample. We propose a less computationally intensive alternative termed the "parameter bootstrap" that only requires solving one nonlinear optimization problem, and we also compare bootstrap methods to other recently proposed methods. We illustrate our methodology in simulations and with publicly available data from the Environmental Protection Agency.
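Why the Berkson/classical decomposition matters can be seen in a toy simulation (Python with NumPy; an illustration of the two error types in ordinary linear regression, not the kriging-based methodology of the paper): classical error, where noise is added to the true exposure, attenuates the slope, whereas Berkson error, where the truth scatters around the assigned value, leaves it approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 50_000, 1.0

# Classical-like error: we observe w = x + u, noise added to the truth.
x = rng.normal(size=n)
w = x + rng.normal(scale=1.0, size=n)
y = beta * x + rng.normal(scale=0.5, size=n)

# Berkson-like error: the assigned exposure w_b is smooth/predicted,
# and the truth x_b scatters around it.
w_b = rng.normal(size=n)
x_b = w_b + rng.normal(scale=1.0, size=n)
y_b = beta * x_b + rng.normal(scale=0.5, size=n)

slope_classical = np.polyfit(w, y, 1)[0]    # attenuated, roughly 0.5 here
slope_berkson = np.polyfit(w_b, y_b, 1)[0]  # approximately 1.0
print(round(slope_classical, 2), round(slope_berkson, 2))
```

The kriging-based exposure predictions discussed in the abstract contain both kinds of components, which is what motivates the decomposition.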

15.
16.
Binary regression models for spatial data are commonly used in disciplines such as epidemiology and ecology. Many spatially referenced binary data sets suffer from location error, which occurs when the recorded location of an observation differs from its true location. When location error occurs, values of the covariates associated with the true spatial locations of the observations cannot be obtained. We show how a change of support (COS) can be applied to regression models for binary data to provide coefficient estimates when the true values of the covariates are unavailable, but the unknown locations of the observations are contained within nonoverlapping, arbitrarily shaped polygons. The COS accommodates spatial and nonspatial covariates and preserves the convenient interpretation of methods such as logistic and probit regression. Using a simulation experiment, we compare binary regression models with a COS to naive approaches that ignore location error. We illustrate the flexibility of the COS by modeling individual-level disease risk in a population using a binary data set in which the locations of the observations are unknown but contained within administrative units. Our simulation experiment and data illustration corroborate that conventional regression models for binary data that ignore location error are unreliable, but that the COS can be used to eliminate bias while preserving model choice.
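A toy one-dimensional sketch of the basic idea (Python with NumPy and statsmodels; the intervals, covariate surface, and coefficients are invented for illustration, and this is not the paper's spatial model): the point-level covariate is unavailable, so each observation is given the covariate averaged over the polygon (here, an interval) known to contain it, and a logistic regression is fitted to those change-of-support covariates.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Covariate surface on a fine grid over [0, 10]; observations fall into
# one of 20 "polygons" (intervals), but their exact locations are unknown.
grid = np.linspace(0, 10, 1000)
surface = np.sin(grid) + 0.3 * grid
edges = np.linspace(0, 10, 21)

n = 5000
true_loc = rng.uniform(0, 10, n)
x_true = np.interp(true_loc, grid, surface)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * x_true))))

# Change of support: replace the unavailable point-level covariate by its
# average over the polygon containing the observation.
poly = np.clip(np.digitize(true_loc, edges) - 1, 0, len(edges) - 2)
poly_mean = np.array([surface[(grid >= edges[j]) & (grid < edges[j + 1])].mean()
                      for j in range(len(edges) - 1)])
x_cos = poly_mean[poly]

fit = sm.Logit(y, sm.add_constant(x_cos)).fit(disp=0)
print(fit.params)   # slope recovered close to the true value of 0.8
```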

17.
The disulphide bond-introducing enzyme of bacteria, DsbA, sometimes oxidizes non-native cysteine pairs. DsbC should rearrange the resulting incorrect disulphide bonds into those with correct connectivity. DsbA and DsbC receive oxidizing and reducing equivalents, respectively, from the corresponding redox components of the cell (quinones and NADPH). Two mechanisms of disulphide bond rearrangement have been proposed. In the redox-neutral 'shuffling' mechanism, the nucleophilic cysteine in the DsbC active site forms a mixed disulphide with a substrate and induces disulphide shuffling within the substrate part of the enzyme–substrate complex, followed by resolution into a reduced enzyme and a disulphide-rearranged substrate. In the 'reduction–oxidation' mechanism, DsbC reduces substrates with wrong disulphides so that DsbA can oxidize them again. In this issue of Molecular Microbiology, Berkmen and his collaborators show that a disulphide reductase, TrxP, from an anaerobic bacterium can substitute for DsbC in Escherichia coli. They propose that the reduction–oxidation mechanism of disulphide rearrangement can indeed operate in vivo. An implication of this work is that correcting errors in disulphide bonds can be coupled to cellular metabolism and is conceptually similar to the proofreading processes observed in numerous synthesis and maturation reactions of biological macromolecules.

18.
19.
Haplotype reconstruction from SNP fragments by minimum error correction
MOTIVATION: Haplotype reconstruction based on aligned single nucleotide polymorphism (SNP) fragments aims to infer a pair of haplotypes from localized polymorphism data gathered through short genome fragment assembly. An important computational model of this problem is the minimum error correction (MEC) model, which has been discussed in several publications. The model retrieves a pair of haplotypes by correcting a minimum number of SNP values in the given genome fragments coming from an individual's DNA. RESULTS: In the first part of this paper, an exact algorithm for the MEC model is presented. Owing to the NP-hardness of the MEC model, we also design a genetic algorithm (GA). The GA is intended to solve large problem instances and performs very well. The strengths and weaknesses of the MEC model are shown using experimental results on real and simulated data. In the second part of this paper, to improve the MEC model for haplotype reconstruction, a new computational model is proposed that additionally employs the genotype information of the individual during SNP correction; it is called MEC with genotype information (MEC/GI for short). Computational results on extensive datasets show that the new model has much higher accuracy in haplotype reconstruction than the pure MEC model.
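The MEC objective that both algorithms minimize can be stated compactly: assign each fragment to whichever candidate haplotype it disagrees with least, and count the SNP values that would have to be corrected. A minimal sketch (Python; the data layout and names are assumptions for illustration):

```python
def mec_cost(fragments, h1, h2):
    """Minimum error correction (MEC) cost of a candidate haplotype pair.

    fragments: list of dicts mapping SNP index -> observed allele (0/1),
               one dict per sequenced fragment (uncovered sites absent).
    h1, h2   : candidate haplotypes as 0/1 lists over all SNP sites.
    """
    total = 0
    for frag in fragments:
        d1 = sum(allele != h1[i] for i, allele in frag.items())
        d2 = sum(allele != h2[i] for i, allele in frag.items())
        total += min(d1, d2)   # fragment is assigned to the closer haplotype
    return total

# Two fragments scored against a candidate haplotype pair: the cost is 1,
# since the second fragment needs one SNP value corrected.
frags = [{0: 0, 1: 0, 2: 1}, {1: 1, 2: 0, 3: 1}]
print(mec_cost(frags, h1=[0, 0, 1, 1], h2=[1, 1, 0, 0]))   # -> 1
```

The exact algorithm and the GA described above both search the space of haplotype pairs for one minimizing this cost; MEC/GI additionally brings the individual's genotype information into the correction step.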

20.
Jain et al. introduced the Local Pooled Error (LPE) statistical test designed for use with small-sample-size microarray gene-expression data. Based on an asymptotic proof, the test multiplicatively adjusts the standard error for a test of differences between two classes of observations by π/2, owing to the use of medians rather than means as measures of central tendency. The adjustment is upwardly biased at small sample sizes, however, producing fewer than expected small P-values with a consequent loss of statistical power. We present an empirical correction to the adjustment factor which removes the bias and produces theoretically expected P-values when distributional assumptions are met. Our adjusted LPE measure should prove useful to ongoing methodological studies designed to improve the LPE's performance for microarray and proteomics applications and to future work on other high-throughput biotechnologies. AVAILABILITY: The software is implemented in the R language and can be downloaded from the Bioconductor project website (http://www.bioconductor.org).
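The asymptotic fact behind the π/2 factor, and the small-sample bias described above, can be checked with a short simulation (Python with NumPy, an illustration rather than the LPE code itself): for normal data the variance of the sample median tends to (π/2)·σ²/n, but the inflation ratio is below π/2 at small n, so applying the flat asymptotic factor overstates the standard error there.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, reps = 1.0, 100_000

for n in (3, 10, 100):
    medians = np.median(rng.normal(0.0, sigma, size=(reps, n)), axis=1)
    ratio = medians.var() / (sigma**2 / n)   # variance inflation of the median
    print(n, round(ratio, 3))                # approaches pi/2 ~ 1.571 as n grows
```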
