Similar literature
1.
Next-generation sequencing technologies can be used to analyse genetically heterogeneous samples in unprecedented detail. The high coverage achievable with these methods enables the detection of many low-frequency variants. However, sequencing errors complicate the analysis of mixed populations and result in inflated estimates of genetic diversity. We developed a probabilistic Bayesian approach to minimize the effect of errors on the detection of minority variants. We applied it to pyrosequencing data obtained from a 1.5-kb fragment of the HIV-1 gag/pol gene in two control and two clinical samples. The effect of PCR amplification was analysed. Error correction resulted in a two- and five-fold decrease of the pyrosequencing base substitution rate, from 0.05% to 0.03% and from 0.25% to 0.05% in the non-PCR and PCR-amplified samples, respectively. We were able to detect viral clones as rare as 0.1% with perfect sequence reconstruction. Probabilistic haplotype inference outperforms the counting-based calling method in both precision and recall. Genetic diversity observed within and between two clinical samples resulted in various patterns of phenotypic drug resistance and suggests a close epidemiological link. We conclude that pyrosequencing can be used to investigate genetically diverse samples with high accuracy if technical errors are properly treated.

2.
Automated correction of genome sequence errors   Cited by: 3 (self-citations: 0, citations by others: 3)
By using information from an assembly of a genome, a new program called AutoEditor significantly improves base calling accuracy over that achieved by previous algorithms. This in turn improves the overall accuracy of genome sequences and facilitates the use of these sequences for polymorphism discovery. We describe the algorithm and its application in a large set of recent genome sequencing projects. The number of erroneous base calls in these projects was reduced by 80%. In an analysis of over one million corrections, we found that AutoEditor made just one error per 8828 corrections. By substantially increasing the accuracy of base calling, AutoEditor can dramatically accelerate the process of finishing genomes, which involves closing all gaps and ensuring minimum quality standards for the final sequence. It also greatly improves our ability to discover single nucleotide polymorphisms (SNPs) between closely related strains and isolates of the same species.

3.
Removing Noise From Pyrosequenced Amplicons   Cited by: 2 (self-citations: 0, citations by others: 2)

Background  

In many environmental genomics applications a homologous region of DNA from a diverse sample is first amplified by PCR and then sequenced. The next generation sequencing technology, 454 pyrosequencing, has allowed much larger read numbers from PCR amplicons than ever before. This has revolutionised the study of microbial diversity as it is now possible to sequence a substantial fraction of the 16S rRNA genes in a community. However, there is a growing realisation that because of the large read numbers and the lack of consensus sequences it is vital to distinguish noise from true sequence diversity in this data. Otherwise this leads to inflated estimates of the number of types or operational taxonomic units (OTUs) present. Three sources of error are important: sequencing error, PCR single base substitutions and PCR chimeras. We present AmpliconNoise, a development of the PyroNoise algorithm that is capable of separately removing 454 sequencing errors and PCR single base errors. We also introduce a novel chimera removal program, Perseus, that exploits the sequence abundances associated with pyrosequencing data. We use data sets where samples of known diversity have been amplified and sequenced to quantify the effect of each of the sources of error on OTU inflation and to validate these algorithms.

4.
Insertions and deletions (indels) are important types of structural variations. Obtaining accurate genotypes of indels may facilitate further genetic study. There are a few existing methods for calling indel genotypes from sequence reads. However, none of these tools can accurately call indel genotypes for indels of all lengths, especially for low coverage sequence data. In this paper, we present GINDEL, an approach for calling genotypes of both insertions and deletions from sequence reads. GINDEL uses a machine learning approach which combines multiple features extracted from next generation sequencing data. We test our approach on both simulated and real data and compare with existing tools, including Genome STRiP, Pindel and Clever-sv. Results show that GINDEL works well for deletions larger than 50 bp on both high and low coverage data. Also, GINDEL performs well for insertion genotyping on both simulated and real data. For comparison, Genome STRiP performs less well for shorter deletions (50–200 bp) on both simulated and real sequence data from the 1000 Genomes Project. Clever-sv performs well for intermediate deletions (200–1500 bp) but is less accurate when coverage is low. Pindel only works well for high coverage data, but does not perform well at low coverage. To summarize, we show that GINDEL not only can call genotypes of insertions and deletions (both short and long) for high and low coverage population sequence data, but also is more accurate and efficient than other approaches. The program GINDEL can be downloaded at: http://sourceforge.net/p/gindel

5.

Background

Next generation sequencing (NGS) platforms are currently being utilized for targeted sequencing of candidate genes or genomic intervals to perform sequence-based association studies. To evaluate these platforms for this application, we analyzed human sequence generated by the Roche 454, Illumina GA, and the ABI SOLiD technologies for the same 260 kb in four individuals.

Results

Local sequence characteristics contribute to systematic variability in sequence coverage (>100-fold difference in per-base coverage), resulting in patterns for each NGS technology that are highly correlated between samples. A comparison of the base calls to 88 kb of overlapping ABI 3730xL Sanger sequence generated for the same samples showed that the NGS platforms all have high sensitivity, identifying >95% of variant sites. At high coverage depth, base calling errors are systematic, resulting from local sequence contexts; as the coverage is lowered, additional 'random sampling' errors in base calling occur.

Conclusions

Our study provides important insights into systematic biases and data variability that need to be considered when utilizing NGS platforms for population targeted sequencing studies.

6.
Pyrosequencing is a versatile technique for microbial genome sequencing that can be used to identify bacterial species, discriminate bacterial strains and detect genetic mutations that confer resistance to anti-microbial agents. The advantages of pyrosequencing for microbiology applications include rapid and reliable high-throughput screening and accurate identification of microbes and microbial genome mutations. Pyrosequencing involves sequencing of DNA by synthesizing the complementary strand a single base at a time, while determining the specific nucleotide being incorporated during the synthesis reaction. The reaction occurs on immobilized single-stranded template DNA, where the four deoxyribonucleotides (dNTPs) are added sequentially and the unincorporated dNTPs are enzymatically degraded before addition of the next dNTP to the synthesis reaction. Incorporation of a specific base is detected through the generation of chemiluminescent signals. The order of dNTPs that produce the chemiluminescent signals determines the DNA sequence of the template. The real-time sequencing capability of pyrosequencing technology enables rapid microbial identification in a single assay. In addition, the pyrosequencing instrument can analyze the full genetic diversity of anti-microbial drug resistance, including typing of SNPs, point mutations, insertions, and deletions, as well as quantification of multiple gene copies that may occur in some anti-microbial resistance patterns.
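The readout described in this abstract, where the order of dNTPs that produce a signal determines the sequence, can be illustrated with a minimal Python sketch. It is not taken from the paper. It assumes that the light signal of each flow is roughly proportional to the number of bases incorporated (so a homopolymer of length 2 gives a signal near 2.0), and the flow order `TACG`, the function name `decode_flowgram` and the example signal values are all hypothetical choices made for illustration.

```python
def decode_flowgram(flow_values, flow_order="TACG"):
    """Turn per-flow signal intensities into a base string.

    Assumption: each signal is approximately the number of nucleotides
    incorporated during that flow (0 = no incorporation).
    """
    bases = []
    for i, signal in enumerate(flow_values):
        # The dNTP dispensed in flow i follows the cyclic flow order.
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporations, never below 0.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical, noisy signals for two cycles of T, A, C, G flows.
flows = [1.02, 0.05, 0.98, 2.10, 0.03, 1.95, 0.07, 0.99]
print(decode_flowgram(flows))  # prints TCGGAAG
```

Simple rounding is the weakest possible caller: signals near 0.5 or 1.5 are exactly where homopolymer length errors arise, which is why several of the abstracts in this list model flow values probabilistically instead.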

7.
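Amplification primers and sequencing primers were designed from a sequence specific to Salmonella Typhimurium in order to establish a pyrosequencing method for detecting this organism. Specific amplification primers were designed for Salmonella Typhimurium, the target fragment was amplified by PCR, single-stranded template was then prepared, and pyrosequencing was carried out with the sequencing primer. The sequencing results showed that a gene fragment with the base sequence TACAACCGGA GTGCACATTA ATCCCGCAGC was amplified from all 6 Salmonella Typhimurium strains of different origins, whereas none of the 30 negative-control strains was amplified. BLAST alignment showed that this sequence is a 100% match to the Salmonella Typhimurium sequence in GenBank. Pyrosequencing is a rapid and accurate detection method that can be used for the rapid detection of Salmonella Typhimurium in food.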

8.
The recent emergence of barcoding approaches coupled to those of next‐generation sequencing (NGS) has raised new perspectives for studying environmental communities. In this framework, we tested the possibility of deriving accurate inventories of diatom communities from pyrosequencing outputs with an available DNA reference library. We used three molecular markers targeting the nuclear, chloroplast and mitochondrial genomes (SSU rDNA, rbcL and cox1) and three samples of a mock community composed of 30 known diatom strains belonging to 21 species. To detect methodological biases, one sample was constituted directly from pooled cultures, whereas the others consisted of pooled PCR products. The NGS reads obtained by pyrosequencing (Roche 454) were compared first to a DNA reference library including the sequences of all the species used to constitute the mock community, and second to a complete DNA reference library with a larger taxonomic coverage. A stringent taxonomic assignment gave inventories that were compared to the real one. We detected biases due to DNA extraction and PCR amplification that resulted in false‐negative detection. Conversely, pyrosequencing errors appeared to generate false positives, especially in the case of closely allied species. The taxonomic coverage of DNA reference libraries appears to be the most crucial factor, together with marker polymorphism, which is essential to identify taxa at the species level. RbcL offers a high resolving power together with a large DNA reference library. Although needing further optimization, pyrosequencing is suitable for identifying diatom assemblages and may find applications in the field of freshwater biomonitoring.

9.
The SFF file format produced by Roche's 454 sequencing technology is a compact, binary format that contains the flow values that are used for base and quality calling of the reads. Applications, e.g. in metagenomics, often depend on accurate sequence information, and access to flow values is important to estimate the probability of errors. Unfortunately, the programs supplied by Roche for accessing this information are not publicly available. Flower is a program that can extract the information contained in SFF files, and convert it to various textual output formats. AVAILABILITY: Flower is freely available under the General Public License.

10.
11.
Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (<100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.

12.
The X-linked dystrophin gene is well known for its involvement in Duchenne/Becker muscular dystrophies and for its exceptional megabase size. This locus at Xp21 is prone to frequent random molecular changes, including large deletions and duplications, but also smaller variations. To cope with such huge sequence analysis requirements in forthcoming diagnostic applications, we applied the power of the parallel 454 GS-FLX pyrosequencer to the dystrophin locus. We enriched the genomic region of interest by the robust amplification of 62 fragments under universal conditions by the long-PCR protocol, yielding 244,707 bp of sequence. Pooled PCR products were fragmented and used for library preparation and DNA sequencing. To evaluate the entire procedure we analyzed four male DNA samples for sequence coverage and accuracy in DNA sequence variation and for any potential bias. We identified 562 known variations and 55 additional variants not yet reported, among which we detected a causative Arg1844Stop mutation in one sample. Sanger sequencing confirmed all changes. Unexpectedly, only 3× coverage was sufficient for 99.9993% accuracy. Our results show that long PCR combined with massive pyrosequencing is very reliable for the analysis of the biggest gene of the human genome and opens the door to other demanding applications in molecular diagnostics.

13.
A genotyping by sequencing (GbS) approach is reported in blackcurrant (Ribes nigrum L.) using a de novo read assembly method developed because of the current absence of a reference genome sequence for this species. A new approach to single nucleotide polymorphism (SNP) genotype calling is described, where individual genotypes for a large number of SNPs were characterised from the GbS counts using a novel method based on a functional regression of major and minor allele read counts. The high-quality GbS SNPs were combined with SNPs and simple sequence repeats generated from other technologies to develop a linkage map with increased marker density and improved genome coverage, containing up to 204 SNPs on each linkage group. SNPs of lower quality were then located on the map using quantitative trait locus (QTL) interval mapping of the proportion of the major allele. Two QTL each for 100-berry weight and Brix scores, measured over three years, were identified using the map. The use of this approach to identify and map a significant number of novel SNPs in a woody species with hitherto limited genomic resources may have generic application to other under-resourced and minor species in the development of cost-effective and efficient high-density genetic maps.

14.
J K Bonfield, C Rada, R Staden. Nucleic Acids Research, 1998, 26(14): 3404-3409
The final step in the detection of mutations is to determine the sequence of the suspected mutant and to compare it with that of the wild-type, and for this fluorescence-based sequencing instruments are widely used. We describe some simple algorithms for comparing sequence traces which, as part of our sequence assembly and analysis package, are proving useful for the discovery of mutations and which may also help to identify misplaced readings in sequence assembly projects. The mutations can be detected automatically by a new program called TRACE_DIFF and new types of trace display in our program GAP4 greatly simplify visual checking of the assigned changes. To assess the accuracy of the automatic mutation detection algorithm we analysed 214 sequence readings from hypermutating DNA comprising a total of 108 497 bases. After the readings were assembled there were 1232 base differences, including 392 Ns and 166 alignment characters. Visual inspection of the traces established that of the 1232 differences, 353 were real mutations while the rest were due to base calling errors. The TRACE_DIFF algorithm automatically identified all but 36, with 28 false positives. Further information about the software can be obtained from http://www.mrc-lmb.cam.ac.uk/pubseq/
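The counts reported in this abstract can be turned into precision and recall with a few lines of Python. This is only a reader's calculation, not a figure from the paper. It assumes that "all but 36" means 36 of the 353 real mutations were missed, so 317 were found, and that the 28 false positives are calls that did not correspond to real mutations. The function name `precision_recall` is a hypothetical helper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


real_mutations = 353      # established by visual inspection of the traces
missed = 36               # real mutations not identified automatically
false_positives = 28      # reported calls that were not real mutations
found = real_mutations - missed  # 317 true positives under the assumption above

precision, recall = precision_recall(found, false_positives, missed)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# prints: precision = 0.919, recall = 0.898
```

Under this reading, roughly nine in ten automatic calls were correct and roughly nine in ten real mutations were found.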

15.
The success of comparative analysis in resolving RNA secondary structure and numerous tertiary interactions relies on the presence of base covariations. Although the majority of base covariations in aligned sequences are associated with Watson-Crick base pairs, many involve non-canonical or restricted base pair exchanges (e.g. only G:C/A:U), reflecting more specific structural constraints. We have developed a computer program that determines potential base pairing conformations for a given set of paired nucleotides in a sequence alignment. This program (ISOPAIR) assumes that the base pair conformation is maintained through sequence variation without significantly affecting the path of the sugar-phosphate backbone. ISOPAIR identifies such 'isomorphic' structures for any set of input base pair or base triple sequences. The program was applied to base pairs and triples with known structures and sequence exchanges. In several instances, isomorphic structures were correctly identified with ISOPAIR. Thus, ISOPAIR is useful when assessing non-canonical base pair conformations in comparative analysis. ISOPAIR applications are limited to those cases where unusual base pair exchanges indeed reflect a non-canonical conformation.

16.
MOTIVATION: Sequencing of a bi-allelic PCR product that contains an allele with a deletion/insertion mutation results in a superimposed trace file following the site of this shift mutation. A trace file of this type hampers the use of current computer programs for base calling. ShiftDetector analyses a sequencing trace file in order to discover whether it is a superimposed sequence of two molecules that differ by a shift mutation of 1 to 25 bases. The program calculates a probability score for the existence of such a shift and reconstructs the sequence of the original molecule. AVAILABILITY: ShiftDetector is available from http://cowry.agri.huji.ac.il

17.
Summary: Second‐generation sequencing (sec‐gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. The prospect of comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T, between 30 and 100 characters long), which are the result of complex processing of noisy continuous fluorescence intensity measurements known as base‐calling. The complexity of the base‐calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across‐sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec‐gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base‐calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base‐calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base‐calling performance.

18.
We describe a novel polymerase chain reaction (PCR) and deoxyribonucleic acid (DNA) sequencing-based assay for rapid genotyping of the polymorphic Sp1 binding site in the COL1A1 gene (1). A single nucleotide G→T substitution polymorphism at this GC-rich site has recently been reported to be a predictive genetic marker for low bone mineral density (BMD). To simplify screening for this marker, we optimized PCR conditions and subjected the amplicons to pyrosequencing, which is a convenient high-throughput sequence analysis technique, readily amenable to automation. The analysis of 200 deidentified convenience DNA samples extracted from blood revealed genotype frequencies in Hardy-Weinberg equilibrium (SS 68.0%, Ss 28.5%, and ss 3.5%) in agreement with other studies of European populations. This study demonstrates for the first time that pyrosequencing can be used for rapid identification of the osteoporosis-associated single nucleotide polymorphism (SNP) in the COL1A1 gene.
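The Hardy-Weinberg statement in this abstract can be checked by hand from the reported percentages. The Python sketch below is a reader's calculation, not the authors' analysis. It assumes the percentages refer to all 200 samples (giving 136 SS, 57 Ss and 7 ss), and it uses the standard chi-square goodness-of-fit test with one degree of freedom, for which the 5% critical value is 3.84. The function name `hardy_weinberg_chi2` is hypothetical.

```python
def hardy_weinberg_chi2(n_major_hom, n_het, n_minor_hom):
    """Chi-square statistic comparing observed genotype counts with
    the counts expected under Hardy-Weinberg equilibrium."""
    n = n_major_hom + n_het + n_minor_hom
    # Allele frequency of the major allele S, counted from genotypes.
    p = (2 * n_major_hom + n_het) / (2 * n)
    q = 1.0 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_major_hom, n_het, n_minor_hom]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# 68.0%, 28.5% and 3.5% of 200 samples (assumed denominators).
chi2 = hardy_weinberg_chi2(136, 57, 7)
print(f"chi-square = {chi2:.3f}")  # prints: chi-square = 0.115
print("consistent with HWE at the 5% level:", chi2 < 3.84)
```

A statistic this far below 3.84 is consistent with the abstract's claim that the genotype frequencies are in Hardy-Weinberg equilibrium.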

19.
Culturing many obligate intracellular bacteria is difficult or impossible. However, these organisms have numerous adaptations allowing for infection persistence and immune system evasion, making them some of the most interesting to study. Recent advancements in genome sequencing, namely pyrosequencing and phi29 amplification, have allowed for examination of whole-genome sequences of intracellular bacteria without culture. We have applied both techniques to the model obligate intracellular pathogen Anaplasma marginale and the human pathogen Anaplasma phagocytophilum, in order to examine the ability of phi29 amplification to determine the sequence of genes allowing for immune system evasion and long-term persistence in the host. When compared to traditional pyrosequencing, phi29-mediated genome amplification had similar genome coverage, with no additional gaps in coverage. Additionally, all msp2 functional pseudogenes from two strains of A. marginale were detected and extracted from the phi29-amplified genomes, highlighting its utility in determining the full complement of genes involved in immune evasion.

20.
MOTIVATION: Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool that would be most likely to produce the best performance under a specific set of conditions. RESULTS: We studied and compared the performance of commonly used de novo assembly tools specifically designed for next-generation sequencing data, including SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were compared using several performance criteria, including N50 length, sequence coverage and assembly accuracy. Various properties of read data, including single-end/paired-end, sequence GC content, depth of coverage and base calling error rates, were investigated for their effects on the performance of different assembly tools. We also compared the computation time and memory usage of these seven tools. Based on the results of our comparison, the relative performance of individual tools is summarized and tentative guidelines for optimal selection of different assembly tools, under different conditions, are provided.
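N50 length, one of the performance criteria named in this abstract, is usually defined as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The Python sketch below illustrates that common definition. It is not taken from the paper, which may compute N50 against a different total such as the reference genome size, and the function name `n50` and the contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical assembly: total 300 bp, half is 150 bp.
# 100 alone is not enough; 100 + 80 = 180 reaches it, so N50 is 80.
print(n50([20, 100, 40, 80, 60]))  # prints 80
```

Because N50 only measures contiguity, a misassembled genome can still score well, which is presumably why the study also reports sequence coverage and assembly accuracy.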
