Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
Massively parallel sequencing is an efficient way to obtain sequence data from many individual samples. To ensure that sequencing, replication, and oligonucleotide synthesis errors do not leave tags (or barcodes) unrecoverable or confused with one another, tag sequences should be numerous and sufficiently different. Many recent design methods use error-correcting codes to correct errors in the data. Because existing tag sets contain few tag sequences, we used a modified genetic algorithm in this study to improve the lower bound on the size of tag sets. Compared with previous research, our algorithm is effective for designing sets of DNA tags. Moreover, existing methods constrain GC content only to an imprecise range, so we improved the GC content determination method to obtain tag sets whose GC content falls within a more precise range. Finally, previous studies considered only perfect self-complementarity; we therefore also considered the crossover between different tags and introduced an improved constraint into the design of tag sets.
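The constraints named in this abstract (GC content, self-complementarity, and interactions between different tags) can be illustrated with a small filter. This is only a sketch under stated assumptions: the function names, the GC window of 0.4 to 0.6, the minimum distance of 3, and the reverse-complement check between tags are all hypothetical choices for illustration, not the constraints used in the study.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def gc_content(tag):
    """Fraction of bases in the tag that are G or C."""
    return sum(base in "GC" for base in tag) / len(tag)


def reverse_complement(tag):
    """Reverse the tag and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(tag))


def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))


def passes_filters(tag, accepted, gc_min=0.4, gc_max=0.6, min_dist=3):
    """Hypothetical acceptance test for a candidate tag (thresholds are assumptions)."""
    if not gc_min <= gc_content(tag) <= gc_max:
        return False
    rc = reverse_complement(tag)
    if rc == tag:
        return False  # perfectly self-complementary
    for other in accepted:
        if hamming(tag, other) < min_dist:
            return False  # too close to an existing tag
        if hamming(rc, other) < min_dist:
            return False  # assumed cross-tag complementarity check
    return True
```

A search procedure such as a genetic algorithm would call a test like `passes_filters` on each candidate; the search itself is not sketched here.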

2.
We consider the design and evaluation of short barcodes, with a length between six and eight nucleotides, used for parallel sequencing on platforms where substitution errors dominate. Such codes should not only have good error correction properties; their code words should also fulfil certain biological constraints (experimental parameters). We compare published barcodes with codes obtained by two new construction methods: one based on the best currently known linear codes, and a simple randomized construction. The evaluation is done with respect to error correction capabilities, barcode size, experimental parameters, and fundamental bounds on code size and distance properties. We provide a list of codes for lengths between six and eight nucleotides, where for length eight, two substitution errors can be corrected. In fact, no code with larger minimum distance can exist.
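The link between minimum distance and the number of correctable substitution errors can be made concrete. A code with minimum Hamming distance d can correct up to t = floor((d - 1) / 2) substitutions, so correcting two errors requires d of at least 5. The sketch below uses a toy four-word code chosen purely for illustration; it is not one of the codes from the paper, and the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(code):
    """Smallest pairwise Hamming distance in the code."""
    return min(hamming(a, b) for a, b in combinations(code, 2))


def correctable_substitutions(d_min):
    """Number of substitution errors guaranteed correctable: floor((d - 1) / 2)."""
    return (d_min - 1) // 2


def decode(read, code, t):
    """Return the unique barcode within t substitutions of the read, else None."""
    matches = [word for word in code if hamming(read, word) <= t]
    return matches[0] if len(matches) == 1 else None


# Toy code for illustration only (an assumption, not a published barcode set).
TOY_CODE = ["AAAAAA", "AAACCC", "CCCAAA", "CCCCCC"]
```

With this toy code, `minimum_distance(TOY_CODE)` is 3, so a single substitution per barcode can be corrected.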

3.
We constructed error-correcting DNA barcodes that allow one run of a massively parallel pyrosequencer to process up to 1,544 samples simultaneously. Using these barcodes we processed bacterial 16S rRNA gene sequences representing microbial communities in 286 environmental samples, corrected 92% of sample assignment errors, and thus characterized nearly as many 16S rRNA genes as have been sequenced to date by Sanger sequencing.

4.
Both 454 and Ion Torrent sequencers are capable of producing large amounts of long high-quality sequencing reads. However, as both methods sequence homopolymers in one cycle, they both suffer from homopolymer uncertainty and incorporation asynchronization. In mapping, such sequencing errors could shift alignments around homopolymers and thus induce incorrect mismatches, which have become a critical barrier to the accurate detection of single nucleotide polymorphisms (SNPs). In this article, we propose a hidden Markov model (HMM) to statistically and explicitly formulate homopolymer sequencing errors in terms of overcalls, undercalls, insertions and deletions. We use a hierarchical model to describe the sequencing and base-calling processes, and we estimate parameters of the HMM from resequencing data by an expectation-maximization algorithm. Based on the HMM, we develop a realignment-based SNP-calling program, termed PyroHMMsnp, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach. Simulation experiments show that the performance of PyroHMMsnp is exceptional across various sequencing coverages in terms of sensitivity, specificity and F1 measure, compared with other tools. Analysis of the human resequencing data shows that PyroHMMsnp predicts 12.9% more SNPs than Samtools while achieving a higher specificity (http://code.google.com/p/pyrohmmsnp/).

5.

Background  

DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity.
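A short sketch may help show why decoding is deterministic yet fragile under measurement error. It assumes the common four-color di-base scheme (as used on SOLiD platforms), in which bases are coded A=0, C=1, G=2, T=3 and each color is the XOR of two adjacent base codes; the abstract does not specify its encoding, so this choice and the function names are assumptions.

```python
BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3


def encode(seq):
    """Encode a base sequence as colors; each color is the XOR of adjacent base codes."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Decode colors back to bases, given the known first base."""
    out = [first_base]
    for color in colors:
        # XOR is its own inverse, so previous base XOR color gives the next base.
        out.append(BASES[BASES.index(out[-1]) ^ color])
    return "".join(out)
```

For example, `encode("ATGGC")` gives `[3, 1, 0, 3]`, and decoding those colors from the first base `A` recovers `ATGGC`. Changing the second color to 2 yields `ATCCG`: a single color error alters every base after it, which is why error modes have to be searched jointly with the alignment rather than decoded naively.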

6.
Previously reported applications of the 454 Life Sciences pyrosequencing technology have relied on deep sequence coverage for accurate polymorphism discovery because of frequent insertion and deletion sequence errors. Here we report a new base calling program, Pyrobayes, for pyrosequencing reads. Pyrobayes permits accurate single-nucleotide polymorphism (SNP) calling in resequencing applications, even in shallow read coverage, primarily because it produces more confident base calls than the native base calling program.

7.
Barcoded vectors are promising tools for investigating clonal diversity and dynamics in hematopoietic gene therapy. Analysis of clones marked with barcoded vectors requires accurate identification of potentially large numbers of individually rare barcodes, when the exact number, sequence identity and abundance are unknown. This is an inherently challenging application, and the feasibility of using contemporary next-generation sequencing technologies is unresolved. To explore this potential application empirically, without prior assumptions, we sequenced barcode libraries of known complexity. Libraries containing 1, 10 and 100 Sanger-sequenced barcodes were sequenced using an Illumina platform, with a 100-barcode library also sequenced using a SOLiD platform. Libraries containing 1 and 10 barcodes were distinguished from false barcodes generated by sequencing error by a several log-fold difference in abundance. In 100-barcode libraries, however, expected and false barcodes overlapped and could not be resolved by bioinformatic filtering and clustering strategies. In independent sequencing runs multiple false-positive barcodes appeared to be represented at higher abundance than known barcodes, despite their confirmed absence from the original library. Such errors, which potentially impact barcoding studies in an application-dependent manner, are consistent with the existence of both stochastic and systematic error, the mechanism of which is yet to be fully resolved.

8.
We present an original approach to identifying sequence variants in a mixed DNA population from sequence trace data. The heart of the method is based on parsimony: given a wildtype DNA sequence, a set of observed variations at each position collected from sequencing data, and a complete catalog of all possible mutations, determine the smallest set of mutations from the catalog that could fully explain the observed variations. The algorithmic complexity of the problem is analyzed for several classes of mutations, including block substitutions, single-range deletions, and single-range insertions. The reconstruction problem is shown to be NP-complete for single-range insertions and deletions, while for block substitutions, single character insertion, and single character deletion mutations, polynomial time algorithms are provided. Once a minimum set of mutations compatible with the observed sequence is found, the relative frequency of those mutations is recovered by solving a system of linear equations. Simulation results show the algorithm successfully deconvolving mutations in p53 known to cause cancer. An extension of the algorithm is proposed as a new method of high throughput screening for single nucleotide polymorphisms by multiplexing DNA.

9.
Many parallel applications of Next-Generation Sequencing (NGS) technologies demand short barcodes that can accurately multiplex a large number of samples. To address these competing requirements, the use of error-correcting codes is advised. Current barcoding systems are mostly built from short random error-correcting codes, a feature that strongly limits their multiplexing accuracy and experimental scalability. To overcome these problems on sequencing systems impaired by mismatch errors, the alternative use of binary BCH and pseudo-quaternary Hamming codes has been proposed. However, these codes either fail to offer fine-grained choices of barcode size (BCH) or have intrinsically poor error-correcting abilities (Hamming). Here, the design of barcodes from shortened binary BCH codes and quaternary Low Density Parity Check (LDPC) codes is introduced. Simulation results show that although accurate barcoding systems of high multiplexing capacity can be obtained with any of these codes, using quaternary LDPC codes may be particularly advantageous due to the lower rates of read losses and undetected sample misidentification errors. Even at mismatch error rates of 10⁻² per base, 24-nt LDPC barcodes can be used to multiplex roughly 2000 samples with a sample misidentification error rate on the order of 10⁻⁹, at the expense of a rate of read losses on the order of just 10⁻⁶.

10.
Mononucleotide microsatellites are tandem repeats of a single base pair, abundant within coding exons and frequent sites of mutation in the human genome. Because the repeated unit is one base pair, multiple mechanisms of insertion/deletion (indel) mutagenesis are possible, including strand-slippage, dNTP-stabilized, and misincorporation-misalignment. Here, we examine the effects of polymerase identity (mammalian Pols α, β, κ, and η), template sequence, dNTP pool size, and reaction temperature on indel errors during in vitro synthesis of mononucleotide microsatellites. We utilized the ratio of insertion to deletion errors as a genetic indicator of mechanism. Strikingly, we observed a statistically significant bias toward deletion errors within mononucleotide repeats for the majority of the 28 DNA template and polymerase combinations examined, with notable exceptions based on sequence and polymerase identity. Using mutator forms of Pol β did not substantially alter the error specificity, suggesting that the mispairing-misalignment mechanism is not a primary mechanism. Based on our results for mammalian DNA polymerases representing three structurally distinct families, we suggest that dNTP-stabilized mutagenesis may be an alternative mechanism for mononucleotide microsatellite indel mutation. The change from a predominantly dNTP-stabilized mechanism to a strand-slippage mechanism with increasing microsatellite length may account for the differential rates of tandem repeat mutation that are observed genome-wide.

11.
There is a growing and significant demand for reliable, simple and sensitive methods for repeated scanning of a given gene or gene fragment for detection and characterization of mutations. Solid-phase sequencing by single base primer extension of nested GBA™ primers on miniaturized DNA arrays can be used to effectively scan targeted sequences for missense, insertion and deletion mutations. This paper describes the use of N-GBA arrays designed to scan the sequence of a 33 base region of exon 8 of the p53 gene (codons 272-282) encompassing a hot spot for mutations associated with the development of cancer. Synthetic DNA templates containing various missense, insertion and deletion mutations, as well as DNA prepared from pancreatic and biliary tumor cells, were genotyped using the exon 8 arrays.

12.
Three artificial neural networks (ANNs) are proposed for solving a variety of on- and off-line string matching problems. The ANN structure employed as the building block of these ANNs is derived from the harmony theory (HT) ANN, whereby the resulting string matching ANNs are characterized by fast match-mismatch decisions, low computational complexity, and activation values of the ANN output nodes that can be used as indicators of substitution, insertion (addition) and deletion spelling errors.

13.
BC Faircloth, TC Glenn. PLoS ONE, 2012, 7(8): e42543
Ligating adapters with unique synthetic oligonucleotide sequences (sequence tags) onto individual DNA samples before massively parallel sequencing is a popular and efficient way to obtain sequence data from many individual samples. Tag sequences should be numerous and sufficiently different to ensure sequencing, replication, and oligonucleotide synthesis errors do not cause tags to be unrecoverable or confused. However, many design approaches only protect against substitution errors during sequencing and extant tag sets contain too few tag sequences. We developed an open-source software package to validate sequence tags for conformance to two distance metrics and design sequence tags robust to indel and substitution errors. We use this software package to evaluate several commercial and non-commercial sequence tag sets, design several large sets (maxcount = 7,198) of edit metric sequence tags having different lengths and degrees of error correction, and integrate a subset of these edit metric tags to polymerase chain reaction (PCR) primers and sequencing adapters. We validate a subset of these edit metric tagged PCR primers and sequencing adapters by sequencing on several platforms and subsequent comparison to commercially available alternatives. We find that several commonly used sets of sequence tags or design methodologies used to produce sequence tags do not meet the minimum expectations of their underlying distance metric, and we find that PCR primers and sequencing adapters incorporating edit metric sequence tags designed by our software package perform as well as their commercial counterparts. We suggest that researchers evaluate sequence tags prior to use or evaluate tags that they have been using. The sequence tag sets we design improve on extant sets because they are large, valid across the set, and robust to the suite of substitution, insertion, and deletion errors affecting massively parallel sequencing workflows on all currently used platforms.
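Validating a tag set against an edit metric can be sketched as follows, assuming that "edit metric" here means Levenshtein distance (substitutions, insertions, and deletions each cost 1). The function names are hypothetical and this is not the authors' software package, only an illustration of the kind of check the abstract describes.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def violating_pairs(tags, min_dist):
    """List every pair of tags whose edit distance is below min_dist."""
    bad = []
    for a, b in combinations(tags, 2):
        d = edit_distance(a, b)
        if d < min_dist:
            bad.append((a, b, d))
    return bad
```

The pair `ACGT` and `CGTA` shows why a substitution-only metric can be misleading: the two tags differ at all four positions, yet one deletion plus one insertion converts one into the other, so their edit distance is only 2.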

14.
Formalin fixing with paraffin embedding (FFPE) has been a standard sample preparation method for decades, and archival FFPE samples are still very useful resources. Nonetheless, the use of FFPE samples in cancer genome analysis using next-generation sequencing, which is a powerful technique for the identification of genomic alterations at the nucleotide level, has been challenging due to poor DNA quality and artificial sequence alterations. In this study, we performed whole-exome sequencing of matched frozen samples and FFPE samples of tissues from 4 cancer patients and compared the next-generation sequencing data obtained from these samples. The major differences between data obtained from the 2 types of sample were the shorter insert size and artificial base alterations in the FFPE samples. A high proportion of short inserts in the FFPE samples resulted in overlapping paired reads, which could lead to overestimation of certain variants; >20% of the inserts in the FFPE samples were double sequenced. A large number of soft clipped reads was found in the sequencing data of the FFPE samples, and about 30% of total bases were soft clipped. The artificial base alterations, C>T and G>A, were observed in FFPE samples only, and the alteration rate ranged from 200 to 1,200 per 1M bases when sequencing errors were removed. Although high-confidence mutation calls in the FFPE samples were comparable to those in the frozen samples, caution should be exercised with regard to the artifacts, especially for low-confidence calls. Despite the clearly observed artifacts, archival FFPE samples can be a good resource for discovery or validation of biomarkers in cancer research based on whole-exome sequencing.

15.
Analyzing mutation spectra is a very powerful method to determine the effects of various types of DNA damage and to understand the workings of various DNA repair pathways. However, compiling sequence-specific mutation spectra is laborious; even with modern sequencing technology, it is rare to obtain spectra with more than several hundred data points. Two assay systems are described for yeast, one for insertion/deletion mutations and one for base substitution mutations, that allow determination of specific mutations without the necessity of DNA sequencing. The assay for insertion/deletion mutations uses a variety of different simple repeats placed in frame with URA3 such that insertions or deletions lead to a selectable Ura(-) phenotype; essentially all such mutations are in the simple repeat sequence. The assay for base substitution mutations uses a series of six strains with different mutations in one essential codon of the CYC1 gene. Because only true reversions lead to a selectable phenotype, the bases mutated in any reversion event are known. The advantage of these assays is that they can quantitatively determine over several orders of magnitude the types of mutations that occur under a given set of conditions, without DNA sequencing.

16.
Kong Y. Genomics, 2011, 98(2): 152-153
Btrim is a fast, lightweight program for trimming adapters and low-quality regions from reads produced by ultra-high-throughput next-generation sequencing machines. It can also reliably identify barcodes and assign the reads to the original samples. Based on a modified Myers's bit-vector dynamic programming algorithm, Btrim can handle indels in adapters and barcodes. It removes low-quality regions and trims off adapters at both or either end of the reads. A typical trimming of 30 M reads with two sets of adapter pairs can be done in about a minute with a small memory footprint. Btrim is a versatile stand-alone tool that can be used as the first step in virtually all next-generation sequence analysis pipelines. The program is available at http://graphics.med.yale.edu/trim/.

17.
Removing Noise From Pyrosequenced Amplicons   Cited by: 2 (self-citations: 0, citations by others: 2)

Background  

In many environmental genomics applications a homologous region of DNA from a diverse sample is first amplified by PCR and then sequenced. The next generation sequencing technology, 454 pyrosequencing, has allowed much larger read numbers from PCR amplicons than ever before. This has revolutionised the study of microbial diversity as it is now possible to sequence a substantial fraction of the 16S rRNA genes in a community. However, there is a growing realisation that because of the large read numbers and the lack of consensus sequences it is vital to distinguish noise from true sequence diversity in this data. Otherwise this leads to inflated estimates of the number of types or operational taxonomic units (OTUs) present. Three sources of error are important: sequencing error, PCR single base substitutions and PCR chimeras. We present AmpliconNoise, a development of the PyroNoise algorithm that is capable of separately removing 454 sequencing errors and PCR single base errors. We also introduce a novel chimera removal program, Perseus, that exploits the sequence abundances associated with pyrosequencing data. We use data sets where samples of known diversity have been amplified and sequenced to quantify the effect of each of the sources of error on OTU inflation and to validate these algorithms.

18.
A protocol relying on Sanger sequencing reactions in combination with mass spectrometry (MS) for sequence confirmation of antisense phosphorothioate oligodeoxynucleotides is described. In this procedure, synthetic phosphorothioate oligodeoxynucleotides are used as reverse primers for extension of matched templates that are long enough (approximately 150-300 bp) for well-established Sanger sequencing. Because the complementary strand of the modified primer is used directly for sequencing primer extension, the base order shown in the sequencing result is the reverse complement of the phosphorothioate oligodeoxynucleotide. This sequencing method can be applied not only to phosphorothioate oligodeoxynucleotides of different lengths (13-21 mer) and base compositions but also to sequences containing base switches, deletions, or insertions. In addition, the modified primers are incorporated at the 5' end of the polymerase chain reaction (PCR) products, which therefore carry the characteristics of the phosphorothioate modification. The method requires only common reagents and instruments and so is better suited to routine sequence analysis in quality control of phosphorothioate antisense drugs.

19.
A test for nucleotide sequence homology   Cited by: 3 (self-citations: 0, citations by others: 3)
Two macromolecular sequences which have evolved from a common ancestor sequence will tend to include a large number of elements unaffected by replacement mutations in both sequences, as long as the evolutionary rate is not too high or the divergence time is not too great. The positions of corresponding elements may have changed in either daughter sequence due to deletion/insertion mutations involving other sequence elements, but their order can be expected to be the same in both sequences. These sets of correspondences, called matches, may be computed by a recursive algorithm which incorporates constraints on the number of deletion/insertion mutations hypothesized to have occurred. A test is developed which computes the significance of each deletion/insertion hypothesized, based on Monte-Carlo sampling of random sequences with the same base composition as the experimental sequences being tested. Applying the test to 5 S RNAs confirms the relation of Escherichia coli and KB carcinoma 5 S RNAs and establishes the previously undetected homology between Pseudomonas fluorescens and KB 5 S RNAs.
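The idea of Monte-Carlo sampling against random sequences of the same base composition can be sketched with a shuffle test. As a stand-in for the paper's constrained match computation, this sketch scores a pair of sequences by the length of their longest common subsequence (matches in the same order); that substitution, the function names, and the default trial count are all assumptions made for illustration.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence (order-preserving matches)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Random permutation of seq, which preserves its base composition exactly."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def monte_carlo_p_value(a, b, trials=1000, seed=0):
    """Empirical p-value: how often shuffled pairs score at least as high as observed."""
    rng = random.Random(seed)
    observed = lcs_length(a, b)
    hits = sum(
        lcs_length(shuffled(a, rng), shuffled(b, rng)) >= observed
        for _ in range(trials)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

A small p-value suggests that the observed number of ordered matches is unlikely to arise from base composition alone. The paper's test additionally evaluates each hypothesized deletion/insertion, which this sketch does not attempt.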

20.
Next-generation sequencing (NGS) technologies have transformed genomic research and have the potential to revolutionize clinical medicine. However, the background error rates of sequencing instruments and limitations in targeted read coverage have precluded the detection of rare DNA sequence variants by NGS. Here we describe a method, termed CypherSeq, which combines double-stranded barcoding error correction and rolling circle amplification (RCA)-based target enrichment to vastly improve NGS-based rare variant detection. The CypherSeq methodology involves the ligation of sample DNA into circular vectors, which contain double-stranded barcodes for computational error correction and adapters for library preparation and sequencing. CypherSeq is capable of detecting rare mutations genome-wide as well as those within specific target genes via RCA-based enrichment. We demonstrate that CypherSeq is capable of correcting errors incurred during library preparation and sequencing to reproducibly detect mutations down to a frequency of 2.4 × 10⁻⁷ per base pair, and report the frequency and spectra of spontaneous and ethyl methanesulfonate-induced mutations across the Saccharomyces cerevisiae genome.
