首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
DNA sequencing with positive and negative errors.   总被引:7,自引:0,他引:7  
The problem addressed in this paper is concerned with DNA sequencing by hybridization. An algorithm is proposed that solves a computational phase of this approach in the presence of both positive and negative errors resulting from the hybridization experiment. No a priori knowledge of the nature and source of these errors is required. An extensive set of computational experiments showed that the algorithm behaves surprisingly well if only positive errors appear. The general case, where positive and negative errors occur, can be also solved satisfactorily for an error rate up to 10%.  相似文献   

2.
MOTIVATION: A new heuristic algorithm for solving DNA sequencing by hybridization problem with positive and negative errors. RESULTS: A heuristic algorithm providing better solutions than algorithms known from the literature based on tabu search method.  相似文献   

3.
Sequencing by hybridization (SBH) is a DNA sequencing technique, in which the sequence is reconstructed using its k-mer content. This content, which is called the spectrum of the sequence, is obtained by hybridization to a universal DNA array. Standard universal arrays contain all k-mers for some fixed k, typically 8 to 10. Currently, in spite of its promise and elegance, SBH is not competitive with standard gel-based sequencing methods. This is due to two main reasons: lack of tools to handle realistic levels of hybridization errors and an inherent limitation on the length of uniquely reconstructible sequence by standard universal arrays. In this paper, we deal with both problems. We introduce a simple polynomial reconstruction algorithm which can be applied to spectra from standard arrays and has provable performance in the presence of both false negative and false positive errors. We also propose a novel design of chips containing universal bases that differs from the one proposed by Preparata et al. (1999). We give a simple algorithm that uses spectra from such chips to reconstruct with high probability random sequences of length lower only by a squared log factor compared to the information theoretic bound. Our algorithm is very robust to errors and has a provable performance even if there are both false negative and false positive errors. Simulations indicate that its sensitivity to errors is also very small in practice.  相似文献   

4.
MOTIVATION: A realistic approach to sequencing by hybridization must deal with realistic sequencing errors. The results of such a method can surely be applied to similar sequencing tasks. RESULTS: We provide the first algorithms for interactive sequencing by hybridization which are robust in the presence of hybridization errors. Under a strong error model allowing both positive and negative hybridization errors without repeated queries, we demonstrate accurate and efficient reconstruction with error rates up to 7%. Under the weaker traditional error model of Shamir and Tsur (Proceedings of the Fifth International Conference on Computational Molecular Biology (RECOMB-01), pp 269-277, 2000), we obtain accurate reconstructions with up to 20% false negative hybridization errors. Finally, we establish theoretical bounds on the performance of the sequential probing algorithm of Skiena and Sundaram (J. Comput. Biol., 2, 333-353, 1995) under the strong error model. AVAILABILTY: Freely available upon request. CONTACT: skiena@cs.sunysb.edu.  相似文献   

5.
1-Tuple DNA sequencing: computer analysis   总被引:3,自引:0,他引:3  
A new method of DNA reading was proposed at the end of 1988 by Lysov et al. According to the authors' claims it has certain advantages as compared to the Maxam-Gilbert and Sanger methods, which are revealed by automation and rapidity of DNA sequencing. Nevertheless its employment is hampered by a number of biological and mathematical problems. The present study proposes an algorithm that allows to overcome the computational difficulties occurring in the course of the method during reconstruction of the DNA sequence by its l-tuple composition. It is shown also that the biochemical problems connected with the loss of information about the l-tuple DNA composition during hybridization are not crucial and can be overcome by finding the maximal flow of minimal cost in the special graph.  相似文献   

6.
Sequencing by hybridization is a method for reconstructing a DNA sequence based on its k-mer content. This content, called the spectrum of the sequence, can be obtained from hybridization with a universal DNA chip. However, even with a sequencing chip containing all 4(9) 9-mers and assuming no hybridization errors, only about 400-bases-long sequences can be reconstructed unambiguously. Drmanac et al. (1989) suggested sequencing long DNA targets by obtaining spectra of many short overlapping fragments of the target, inferring their relative positions along the target, and then computing spectra of subfragments that are short enough to be uniquely recoverable. Drmanac et al. do not treat the realistic case of errors in the hybridization process. In this paper, we study the effect of such errors. We show that the probability of ambiguous reconstruction in the presence of (false negative) errors is close to the probability in the errorless case. More precisely, the ratio between these probabilities is 1 + O(p = (1 - p)(4). 1 = d) where d is the average length of subfragments, and p is the probability of a false negative. We also obtain lower and upper bounds for the probability of unambiguous reconstruction based on an errorless spectrum. For realistic chip sizes, these bounds are tighter than those given by Arratia et al. (1996). Finally, we report results on simulations with real DNA sequences, showing that even in the presence of 50% false negative errors, a target of cosmid length can be recovered with less than 0.1% miscalled bases.  相似文献   

7.
Bhandarkar SM  Machaka SA  Shete SS  Kota RN 《Genetics》2001,157(3):1021-1043
Reconstructing a physical map of a chromosome from a genomic library presents a central computational problem in genetics. Physical map reconstruction in the presence of errors is a problem of high computational complexity that provides the motivation for parallel computing. Parallelization strategies for a maximum-likelihood estimation-based approach to physical map reconstruction are presented. The estimation procedure entails a gradient descent search for determining the optimal spacings between probes for a given probe ordering. The optimal probe ordering is determined using a stochastic optimization algorithm such as simulated annealing or microcanonical annealing. A two-level parallelization strategy is proposed wherein the gradient descent search is parallelized at the lower level and the stochastic optimization algorithm is simultaneously parallelized at the higher level. Implementation and experimental results on a distributed-memory multiprocessor cluster running the parallel virtual machine (PVM) environment are presented using simulated and real hybridization data.  相似文献   

8.

Background

DNA sequence comparison is a well-studied problem, in which two DNA sequences are compared using a weighted edit distance. Recent DNA sequencing technologies however observe an encoded form of the sequence, rather than each DNA base individually. The encoded DNA sequence may contain technical errors, and therefore encoded sequencing errors must be incorporated when comparing an encoded DNA sequence to a reference DNA sequence.

Results

Although two-base encoding is currently used in practice, many other encoding schemes are possible, whereby two ore more bases are encoded at a time. A generalized k-base encoding scheme is presented, whereby feasible higher order encodings are better able to differentiate errors in the encoded sequence from true DNA sequence variants. A generalized version of the previous two-base encoding DNA sequence comparison algorithm is used to compare a k-base encoded sequence to a DNA reference sequence. Finally, simulations are performed to evaluate the power, the false positive and false negative SNP discovery rates, and the performance time of k-base encoding compared to previous methods as well as to the standard DNA sequence comparison algorithm.

Conclusions

The novel generalized k-base encoding scheme and resulting local alignment algorithm permits the development of higher fidelity ligation-based next generation sequencing technology. This bioinformatic solution affords greater robustness to errors, as well as lower false SNP discovery rates, only at the cost of computational time.  相似文献   

9.
In this paper we describe a method for the statistical reconstruction of a large DNA sequence from a set of sequenced fragments. We assume that the fragments have been assembled and address the problem of determining the degree to which the reconstructed sequence is free from errors, i.e., its accuracy. A consensus distribution is derived from the assembled fragment configuration based upon the rates of sequencing errors in the individual fragments. The consensus distribution can be used to find a minimally redundant consensus sequence that meets a prespecified confidence level, either base by base or across any region of the sequence. A likelihood-based procedure for the estimation of the sequencing error rates, which utilizes an iterative EM algorithm, is described. Prior knowledge of the error rates is easily incorporated into the estimation procedure. The methods are applied to a set of assembled sequence fragments from the human G6PD locus. We close the paper with a brief discussion of the relevance and practical implications of this work.  相似文献   

10.
Abstract

In this paper we consider the efficiency of additional rounds of “continuous stacking” hybridization in DNA sequence reconstruction by hybridization with oligonucleotide matrix (SHOM). After the initial hybridization of target DNA with the matrix of oligonucleotides of fixed length L some additional hybridizations should be carried out in the presence of fluorescently labeled oligonucleotides of another length l. These additional oligonucleotides can hybridize in tandem with matrix tuples (continuous stacking hybridization) thus forming an extended duplex with the target DNA strand. The additional data obtained allows resolutions of branching points arising in the reconstruction procedure. Multiple rounds of continuous stacking hybridization considerably increase the efficiency of the sequencing method, eventually approaching the power of (L+l)-matrix. We develop here an algorithm that allows us to minimize the number of additional hybridization steps, by assembling sets of l-tuples to be added together in each round of continuous stacking hybridization. For SHOM using a matrix of octanucleotides, continuous stacking hybridization with pen- tanucleotides increases the length of unambiguously sequenced DNA from 200 to several thousands of base pairs.  相似文献   

11.
MOTIVATION: Developing a new method of assembling small sequences based on sequencing by hybridization with many positive and negative faults. First, an interpretation of a generic traveling salesman problem is provided (i.e. finding the shortest route for visiting many cities), using genetic algorithms. Second, positive errors are excluded before assembly by a sanitization process. RESULTS: The present method outperforms those described in previous studies, in terms of both time and accuracy. AVAILABILITY: http://kamit.med.u-tokai.ac.jp/~takaho/sbh/index.html  相似文献   

12.
MOTIVATION: With the potential availability of nanopore devices that can sense the bases of translocating single-stranded DNA (ssDNA), it is likely that 'reads' of length approximately 10(5) will be available in large numbers and at high speed. We address the problem of complete DNA sequencing using such reads.We assume that approximately 10(2) copies of a DNA sequence are split into single strands that break into randomly sized pieces as they translocate the nanopore in arbitrary orientations. The nanopore senses and reports each individual base that passes through, but all information about orientation and complementarity of the ssDNA subsequences is lost. Random errors (both biological and transduction) in the reads create further complications. RESULTS: We have developed an algorithm that addresses these issues. It can be considered an extreme variation of the well-known Eulerian path approach. It searches over a space of de Bruijn graphs until it finds one in which (a) the impact of errors is eliminated and (b) both possible orientations of the two ssDNA sequences can be identified separately and unambiguously.Our algorithm is able to correctly reconstruct real DNA sequences of the order of 10(6) bases (e.g. the bacterium Mycoplasma pneumoniae) from simulated erroneous reads on a modest workstation in about 1 h. We describe, and give measured timings of, a parallel implementation of this algorithm on the Cray Multithreaded Architecture (MTA-2) supercomputer, whose architecture is ideally suited to this 'unstructured' problem. Our parallel implementation is crucial to the problem of rapidly sequencing long DNA sequences and also to the situation where multiple nanopores are used to obtain a high-bandwidth stream of reads.  相似文献   

13.
Oligonucleotide fingerprinting is a powerful DNA array-based method to characterize cDNA and ribosomal RNA gene (rDNA) libraries and has many applications including gene expression profiling and DNA clone classification. We are especially interested in the latter application. A key step in the method is the cluster analysis of fingerprint data obtained from DNA array hybridization experiments. Most of the existing approaches to clustering use (normalized) real intensity values and thus do not treat positive and negative hybridization signals equally (positive signals are much more emphasized). In this paper, we consider a discrete approach. Fingerprint data are first normalized and binarized using control DNA clones. Because there may exist unresolved (or missing) values in this binarization process, we formulate the clustering of (binary) oligonucleotide fingerprints as a combinatorial optimization problem that attempts to identify clusters and resolve the missing values in the fingerprints simultaneously. We study the computational complexity of this clustering problem and a natural parameterized version and present an efficient greedy algorithm based on MINIMUM CLIQUE PARTITION on graphs. The algorithm takes advantage of some unique properties of the graphs considered here, which allow us to efficiently find the maximum cliques as well as some special maximal cliques. Our preliminary experimental results on simulated and real data demonstrate that the algorithm runs faster and performs better than some popular hierarchical and graph-based clustering methods. The results on real data from DNA clone classification also suggest that this discrete approach is more accurate than clustering methods based on real intensity values in terms of separating clones that have different characteristics with respect to the given oligonucleotide probes.  相似文献   

14.
Haplotype reconstruction from SNP fragments by minimum error correction   总被引:5,自引:0,他引:5  
MOTIVATION: Haplotype reconstruction based on aligned single nucleotide polymorphism (SNP) fragments is to infer a pair of haplotypes from localized polymorphism data gathered through short genome fragment assembly. An important computational model of this problem is the minimum error correction (MEC) model, which has been mentioned in several literatures. The model retrieves a pair of haplotypes by correcting minimum number of SNPs in given genome fragments coming from an individual's DNA. RESULTS: In the first part of this paper, an exact algorithm for the MEC model is presented. Owing to the NP-hardness of the MEC model, we also design a genetic algorithm (GA). The designed GA is intended to solve large size problems and has very good performance. The strength and weakness of the MEC model are shown using experimental results on real data and simulation data. In the second part of this paper, to improve the MEC model for haplotype reconstruction, a new computational model is proposed, which simultaneously employs genotype information of an individual in the process of SNP correction, and is called MEC with genotype information (shortly, MEC/GI). Computational results on extensive datasets show that the new model has much higher accuracy in haplotype reconstruction than the pure MEC model.  相似文献   

15.
Background: The reconstruction of clonal haplotypes and their evolutionary history in evolving populations is a common problem in both microbial evolutionary biology and cancer biology. The clonal theory of evolution provides a theoretical framework for modeling the evolution of clones.Results: In this paper, we review the theoretical framework and assumptions over which the clonal reconstruction problem is formulated. We formally define the problem and then discuss the complexity and solution space of the problem. Various methods have been proposed to find the phylogeny that best explains the observed data. We categorize these methods based on the type of input data that they use (space-resolved or time-resolved), and also based on their computational formulation as either combinatorial or probabilistic. It is crucial to understand the different types of input data because each provides essential but distinct information for drastically reducing the solution space of the clonal reconstruction problem. Complementary information provided by single cell sequencing or from whole genome sequencing of randomly isolated clones can also improve the accuracy of clonal reconstruction. We briefly review the existing algorithms and their relationships. Finally we summarize the tools that are developed for either directly solving the clonal reconstruction problem or a related computational problem.Conclusions: In this review, we discuss the various formulations of the problem of inferring the clonal evolutionary history from allele frequeny data, review existing algorithms and catergorize them according to their problem formulation and solution approaches. We note that most of the available clonal inference algorithms were developed for elucidating tumor evolution whereas clonal reconstruction for unicellular genomes are less addressed. We conclude the review by discussing more open problems such as the lack of benchmark datasets and comparison of performance between available tools.  相似文献   

16.

Background

The popularity of new sequencing technologies has led to an explosion of possible applications, including new approaches in biodiversity studies. However each of these sequencing technologies suffers from sequencing errors originating from different factors. For 16S rRNA metagenomics studies, the 454 pyrosequencing technology is one of the most frequently used platforms, but sequencing errors still lead to important data analysis issues (e.g. in clustering in taxonomic units and biodiversity estimation). Moreover, retaining a higher portion of the sequencing data by preserving as much of the read length as possible while maintaining the error rate within an acceptable range, will have important consequences at the level of taxonomic precision.

Results

The new error correction algorithm proposed in this work - NoDe (Noise Detector) - is trained to identify those positions in 454 sequencing reads that are likely to have an error, and subsequently clusters those error-prone reads with correct reads resulting in error-free representative read. A benchmarking study with other denoising algorithms shows that NoDe can detect up to 75% more errors in a large scale mock community dataset, and this with a low computational cost compared to the second best algorithm considered in this study. The positive effect of NoDe in 16S rRNA studies was confirmed by the beneficial effect on the precision of the clustering of pyrosequencing reads in operational taxonomic units.

Conclusions

NoDe was shown to be a computational efficient denoising algorithm for pyrosequencing reads, producing the lowest error rates in an extensive benchmarking study with other denoising algorithms.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0520-5) contains supplementary material, which is available to authorized users.  相似文献   

17.
We present an original approach to identifying sequence variants in a mixed DNA population from sequence trace data. The heart of the method is based on parsimony: given a wildtype DNA sequence, a set of observed variations at each position collected from sequencing data, and a complete catalog of all possible mutations, determine the smallest set of mutations from the catalog that could fully explain the observed variations. The algorithmic complexity of the problem is analyzed for several classes of mutations, including block substitutions, single-range deletions, and single-range insertions. The reconstruction problem is shown to be NP-complete for single-range insertions and deletions, while for block substitutions, single character insertion, and single character deletion mutations, polynomial time algorithms are provided. Once a minimum set of mutations compatible with the observed sequence is found, the relative frequency of those mutations is recovered by solving a system of linear equations. Simulation results show the algorithm successfully deconvolving mutations in p53 known to cause cancer. An extension of the algorithm is proposed as a new method of high throughput screening for single nucleotide polymorphisms by multiplexing DNA.  相似文献   

18.
On the complexity of positional sequencing by hybridization.   总被引:2,自引:0,他引:2  
In sequencing by hybridization (SBH), one has to reconstruct a sequence from its l-long substrings. SBH was proposed as an alternative to gel-based DNA sequencing approaches, but in its original form the method is not competitive. Positional SBH (PSBH) is a recently proposed enhancement of SBH in which one has additional information about the possible positions of each substring along the target sequence. We give a linear time algorithm for solving PSBH when each substring has at most two possible positions. On the other hand, we prove that the problem is NP-complete if each substring has at most three possible positions. We also show that PSBH is NP-complete if the set of allowed positions for each substring is an interval of length k and provide a fast algorithm for the latter problem when k is bounded.  相似文献   

19.
A new algorithm for the construction of physical maps from hybridization fingerprints of short oligonucleotide probes has been developed. Extensive simulations in high-noise scenarios show that the algorithm produces an essentially completely correct map in over 95% of trials. Tests for the influence of specific experimental parameters demonstrate that the algorithm is robust to both false positive and false negative experimental errors. The algorithm was also tested in simulations using real DNA sequences of C. elegans, E. coli, S. cerevisiae, and H. sapiens. To overcome the non-randomness of probe frequencies in these sequences, probes were preselected based on sequence statistics and a screening process of the hybridization data was developed. With these modifications, the algorithm produced very encouraging results.  相似文献   

20.
The problem tackled here concerns the feasibility of DNA sequencingusing hybridization methods. We establish algorithms for andcomputational limitations to the reconstruction of a sequencefrom all its subsequences having the same length: in other words,the building of a string that contains all the words of a givenset, and only these ones. Generally there are several possiblestrings. We refer to graph theory and propose an algorithm toenumerate all the strings that are solutions. We then carriedout stimulations using real DNA sequences. They provided somenecessary conditions and give some upper bounds to the lengthof the sequence to recover in relation with the length of oligonucleotides.To avoid limiting ourselves to problems that admit a uniquesolution, we introduce another algorithm that produces a signaturefor each solution string. Each signature can be tested to determinewhich one belongs to the correct sequence.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号