Similar articles
 Found 20 similar articles (search time: 46 ms)
1.

Background

RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences. Next generation sequencing is revolutionising the study of viral populations by enabling the ultra deep sequencing of their genomes, and the subsequent identification of the full spectrum of variants within the population. Identification of low frequency variants is important for our understanding of mutational dynamics, disease progression, immune pressure, and for the detection of drug resistant or pathogenic mutations. However, the current challenge is to accurately model the errors in the sequence data and distinguish real viral variants, particularly those that exist at low frequency, from errors introduced during sequencing and sample processing, which can both be substantial.

Results

We have created a novel set of laboratory control samples that are derived from a plasmid containing a full-length viral genome with extremely limited diversity in the starting population. One sample was sequenced without PCR amplification whilst the other samples were subjected to increasing amounts of RT and PCR amplification prior to ultra-deep sequencing. This enabled the level of error introduced by the RT and PCR processes to be assessed and minimum frequency thresholds to be set for true viral variant identification. We developed a genome-scale computational model of the sample processing and NGS calling process to gain a detailed understanding of the errors at each step, which predicted that RT and PCR errors are more likely to occur at some genomic sites than others. The model can also be used to investigate whether the number of observed mutations at a given site of interest is greater than would be expected from processing errors alone in any NGS data set. After providing basic sample processing information and the site’s coverage and quality scores, the model utilises the fitted RT-PCR error distributions to simulate the number of mutations that would be observed from processing errors alone.
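As a rough illustration of the site-level question posed above (is the observed mutation count higher than processing errors alone would produce?), the sketch below computes an exact binomial tail probability. It is not the authors' model: it assumes a single fixed per-read error rate at the site, whereas the study fits RT-PCR error distributions and simulates from them. The function name and the example numbers are hypothetical.

```python
import math


def error_only_tail_probability(observed, coverage, error_rate):
    """Return P(X >= observed) for X ~ Binomial(coverage, error_rate).

    Assumption: every read at the site carries an error independently with
    the same probability `error_rate`. Terms are computed in log space so
    that deep coverage does not overflow.
    """
    if observed <= 0:
        return 1.0
    if error_rate <= 0.0:
        return 0.0
    if error_rate >= 1.0:
        return 1.0
    total = 0.0
    for i in range(observed, coverage + 1):
        log_pmf = (
            math.lgamma(coverage + 1)
            - math.lgamma(i + 1)
            - math.lgamma(coverage - i + 1)
            + i * math.log(error_rate)
            + (coverage - i) * math.log1p(-error_rate)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


if __name__ == "__main__":
    # Hypothetical site: 25 mutant reads at 10,000x coverage, 0.1% error rate.
    p_value = error_only_tail_probability(observed=25, coverage=10_000, error_rate=0.001)
    print(f"P(X >= 25 | errors only) = {p_value:.3g}")
```

A small tail probability suggests the site carries more mutations than the assumed error rate explains; the threshold for calling a variant remains a separate choice.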

Conclusions

These data sets and models provide an effective means of separating true viral mutations from those erroneously introduced during sample processing and sequencing.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1456-x) contains supplementary material, which is available to authorized users.

2.
Accurate detection of low frequency mutations from plasma cell-free DNA in blood using targeted next generation sequencing technology has shown promising benefits in clinical settings. Duplex sequencing technology is the most commonly used approach in liquid biopsies. Unique molecular identifiers are attached to each double-stranded DNA template, followed by production of low-error consensus sequences to detect low frequency variants. However, high sequencing costs have hindered application of this approach in clinical practice. Here, we have developed an improved duplex sequencing approach called Sino Duplex, which utilizes a pool of adapters containing pre-defined barcode sequences to generate far fewer barcode combinations than with random sequences, and implemented a novel computational analysis algorithm to generate duplex consensus sequences more precisely. Sino Duplex increased the output of duplex sequencing technology, making it more cost-effective. We evaluated our approach using reference standard samples and cell-free DNA samples from lung cancer patients. Our results showed that Sino Duplex has high sensitivity and specificity in detecting very low allele frequency mutations. The source code for Sino Duplex is freely available at https://github.com/SinOncology/sinoduplex.

3.
4.
Next generation sequencing technologies, like ultra-deep pyrosequencing (UDPS), allow detailed investigation of complex populations, like RNA viruses, but their utility is limited by errors introduced during sample preparation and sequencing. By tagging each individual cDNA molecule with barcodes, referred to as Primer IDs, before PCR and sequencing, these errors could theoretically be removed. Here we evaluated the Primer ID methodology on 257,846 UDPS reads generated from an HIV-1 SG3Δenv plasmid clone and plasma samples from three HIV-infected patients. The Primer ID consisted of 11 randomized nucleotides (4^11 = 4,194,304 combinations) in the primer for cDNA synthesis, which introduced a unique sequence tag into each cDNA molecule. Consensus template sequences were constructed for reads with Primer IDs that were observed three or more times. Despite high numbers of input template molecules, the number of consensus template sequences was low. With 10,000 input molecules for the clone, as few as 97 consensus template sequences were obtained due to a highly skewed frequency of resampling. Furthermore, the number of sequenced templates was overestimated due to PCR errors in the Primer IDs. Finally, some consensus template sequences were erroneous due to hotspots for UDPS errors. The Primer ID methodology has the potential to provide highly accurate deep sequencing. However, it is important to be aware that there are remaining challenges with the methodology. In particular, it is important to find ways to obtain a more even frequency of resampling of template molecules, as well as to identify and remove artefactual consensus template sequences that have been generated by PCR errors in the Primer IDs.
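The consensus step described in this abstract can be sketched as follows. This is a minimal illustration, not the study's pipeline: it assumes reads are already trimmed and aligned to equal length within each Primer ID, and it takes a simple per-position majority vote, which the abstract does not specify. The function name and toy reads are hypothetical; only the 11-nucleotide tag length and the "three or more reads" rule come from the text.

```python
from collections import Counter, defaultdict

PRIMER_ID_LENGTH = 11
# Number of distinct tags for 11 random nucleotides: 4^11 = 4,194,304.
N_PRIMER_IDS = 4 ** PRIMER_ID_LENGTH


def consensus_by_primer_id(reads, min_reads=3):
    """Build one consensus sequence per Primer ID.

    `reads` is an iterable of (primer_id, sequence) pairs. Primer IDs seen
    fewer than `min_reads` times are dropped. Assumption: sequences sharing
    a Primer ID have equal length, so columns can be compared directly.
    """
    groups = defaultdict(list)
    for primer_id, sequence in reads:
        groups[primer_id].append(sequence)

    consensus = {}
    for primer_id, sequences in groups.items():
        if len(sequences) < min_reads:
            continue
        # Majority vote per column; ties go to the base seen first.
        consensus[primer_id] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus


if __name__ == "__main__":
    toy_reads = [
        ("AAAAAAAAAAA", "ACGT"),
        ("AAAAAAAAAAA", "ACGT"),
        ("AAAAAAAAAAA", "ACTT"),  # one read with a substitution error
        ("CCCCCCCCCCC", "GGGG"),  # only two reads: below the threshold
        ("CCCCCCCCCCC", "GGGG"),
    ]
    print(consensus_by_primer_id(toy_reads))
```

A majority vote cannot correct the failure modes the abstract reports, such as PCR errors inside the Primer ID itself, which create spurious extra groups.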

5.
Zhu  Fangfang  Li  Jiang  Liu  Juan  Min  Wenwen 《BMC genetics》2021,22(1):1-10
Background

Next-generation sequencing (NGS) has profoundly changed the approach to genetic/genomic research. Particularly, the clinical utility of NGS in detecting mutations associated with disease risk has contributed to the development of effective therapeutic strategies. Recently, comprehensive analysis of somatic genetic mutations by NGS has also been used as a new approach for controlling the quality of cell substrates for manufacturing biopharmaceuticals. However, the quality evaluation of cell substrates by NGS largely depends on the limit of detection (LOD) for rare somatic mutations. The purpose of this study was to develop a simple method for evaluating the ability of whole-exome sequencing (WES) by NGS to detect mutations with low allele frequency. To estimate the LOD of WES for low-frequency somatic mutations, we repeatedly and independently performed WES of a reference genomic DNA using the same NGS platform and assay design. LOD was defined as the allele frequency with a relative standard deviation (RSD) value of 30% and was estimated by a moving average curve of the relation between RSD and allele frequency.
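The LOD definition above (the allele frequency at which the relative standard deviation reaches 30%, read off a moving average curve) can be illustrated with a short sketch. The abstract does not give the window size or the exact way the crossing point is read off, so the choices below are assumptions: both axes are smoothed with the same window, and the LOD is taken as the lowest smoothed allele frequency whose smoothed RSD is at or below the cutoff. All function names and example values are hypothetical.

```python
import statistics


def rsd_percent(values):
    """Relative standard deviation (%) of replicate measurements."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("RSD is undefined for a mean of zero")
    return 100.0 * statistics.stdev(values) / mean


def moving_average(values, window):
    """Simple moving average with a fixed window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def estimate_lod(points, window=3, rsd_cutoff=30.0):
    """Estimate the LOD from (mean allele frequency, RSD %) pairs.

    Returns the lowest smoothed allele frequency whose smoothed RSD is at
    or below `rsd_cutoff`, or None if the cutoff is never reached.
    """
    ordered = sorted(points)
    smoothed_af = moving_average([af for af, _ in ordered], window)
    smoothed_rsd = moving_average([rsd for _, rsd in ordered], window)
    for af, rsd in zip(smoothed_af, smoothed_rsd):
        if rsd <= rsd_cutoff:
            return af
    return None


if __name__ == "__main__":
    # Hypothetical allele frequencies (%) of one mutation across repeat runs.
    print(rsd_percent([4.8, 5.5, 6.1, 4.2]))
```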

Results

Allele frequencies of 20 mutations in the reference material that had been pre-validated by droplet digital PCR (ddPCR) were obtained from 5, 15, 30, or 40 G base pair (Gbp) sequencing data per run. There was a significant association between the allele frequencies measured by WES and those pre-validated by ddPCR, whose p-value decreased as the sequencing data size increased. By this method, the LOD of allele frequency in WES with the sequencing data of 15 Gbp or more was estimated to be between 5 and 10%.

Conclusions

For properly interpreting the WES data of somatic genetic mutations, it is necessary to have a cutoff threshold of low allele frequencies. The in-house LOD estimated by the simple method shown in this study provides a rationale for setting the cutoff.


6.
Keightley PD  Halligan DL 《Genetics》2011,188(4):931-940
Sequencing errors and random sampling of nucleotide types among sequencing reads at heterozygous sites present challenges for accurate, unbiased inference of single-nucleotide polymorphism genotypes from high-throughput sequence data. Here, we develop a maximum-likelihood approach to estimate the frequency distribution of the number of alleles in a sample of individuals (the site frequency spectrum), using high-throughput sequence data. Our method assumes binomial sampling of nucleotide types in heterozygotes and random sequencing error. By simulations, we show that close to unbiased estimates of the site frequency spectrum can be obtained if the error rate per base read does not exceed the population nucleotide diversity. We also show that these estimates are reasonably robust if errors are nonrandom. We then apply the method to infer site frequency spectra for zerofold degenerate, fourfold degenerate, and intronic sites of protein-coding genes using the low coverage human sequence data produced by the 1000 Genomes Project phase-one pilot. By fitting a model to the inferred site frequency spectra that estimates parameters of the distribution of fitness effects of new mutations, we find evidence for significant natural selection operating on fourfold sites. We also find that a model with variable effects of mutations at synonymous sites fits the data significantly better than a model with equal mutational effects. Under the variable effects model, we infer that 11% of synonymous mutations are subject to strong purifying selection.

7.
Next-generation sequencing (NGS) technologies have transformed genomic research and have the potential to revolutionize clinical medicine. However, the background error rates of sequencing instruments and limitations in targeted read coverage have precluded the detection of rare DNA sequence variants by NGS. Here we describe a method, termed CypherSeq, which combines double-stranded barcoding error correction and rolling circle amplification (RCA)-based target enrichment to vastly improve NGS-based rare variant detection. The CypherSeq methodology involves the ligation of sample DNA into circular vectors, which contain double-stranded barcodes for computational error correction and adapters for library preparation and sequencing. CypherSeq is capable of detecting rare mutations genome-wide as well as those within specific target genes via RCA-based enrichment. We demonstrate that CypherSeq is capable of correcting errors incurred during library preparation and sequencing to reproducibly detect mutations down to a frequency of 2.4 × 10^-7 per base pair, and report the frequency and spectra of spontaneous and ethyl methanesulfonate-induced mutations across the Saccharomyces cerevisiae genome.

8.
9.
C S Du  X Ren  L Chen  W Jiang  Y He  M Yang 《Human heredity》1999,49(3):133-138
Glucose-6-phosphate dehydrogenase (G6PD) deficiency is the most common human enzymopathy. To date more than 122 mutations in the G6PD gene have been discovered, among which 12 point mutations are found in the Chinese. The 2 most common mutations, G1388A and G1376T, account for more than 50% of mutations representing various regions and ethnic groups in China. Setting up a simple and accurate method for detecting these mutations is not only useful for studying the frequency of the G6PD genotypes, but also for finding new mutations. The purpose of this study was to find a simple, inexpensive and accurate method for detecting these common mutations. The amplification refractory mutation system (ARMS) method was used in this study. Samples from 28 G6PD-deficient males were investigated. The natural and mismatched amplification and restriction enzyme digestion method was used as a standard method to evaluate the nature of the point mutations. Sixteen cases were found carrying the G1388A mutation and 12 the G1376T mutation. Fourteen cases of G1388A and 10 cases of G1376T were confirmed by ARMS. Four cases were not in concordance with the results obtained by the mismatched amplification-restriction enzyme digestion. These 4 cases were then judged by direct PCR sequencing at exon 12. The DNA sequencing data supported the results obtained by ARMS. Thus we concluded that ARMS is a rapid, simple, inexpensive and accurate method for detecting the most common G6PD gene mutations among the Chinese.

10.
Somatic mutations identified in genes related to cancer-developing signaling pathways have drawn attention in the field of personalized medicine in recent years. Treatments developed to target a specific signaling pathway may not be effective when tumor-activating mutations occur downstream of the target and bypass the targeted mechanism. For instance, mutations detected in KRAS/BRAF/NRAS genes can lead to EGFR-independent intracellular signaling pathway activation. Most patients with these mutations do not respond well to anti-EGFR treatment. In an effort to detect various mutations in FFPE tissue samples among multiple solid tumor types for patient stratification, many mutation assays were evaluated. Since there were more than 30 specific mutations among three targeted RAS/RAF oncogenes that could activate MAPK pathway genes, a custom-designed Single Nucleotide Primer Extension (SNPE) multiplexing mutation assay was developed and analytically validated as a clinical trial assay. Throughout the process of developing and validating the assay we overcame many technical challenges, including: designing PCR primers for FFPE tumor tissue samples versus normal blood samples; designing probes for detecting consecutive nucleotide double mutations; the kinetic and thermodynamic aspects of probes competing among themselves and against target PCR templates; and validating an assay when positive control tumor tissue or cell lines with specific mutations are not available. We used next generation sequencing to resolve discordant calls between the SNPE mutation assay and Sanger sequencing. We also applied a triplicate rule to reduce potential false positives and false negatives, and proposed special considerations, including pre-defining a cut-off percentage for detecting very low mutant copies in the wild-type DNA background.

11.
PCR permits the exponential and sequence-specific amplification of DNA, even from minute starting quantities. PCR is a fundamental step in preparing DNA samples for high-throughput sequencing. However, there are errors associated with PCR-mediated amplification. Here we examine the effects of four important sources of error (bias, stochasticity, template switches and polymerase errors) on sequence representation in low-input next-generation sequencing libraries. We designed a pool of diverse PCR amplicons with a defined structure, and then used Illumina sequencing to search for signatures of each process. We further developed quantitative models for each process, and compared predictions of these models to our experimental data. We find that PCR stochasticity is the major force skewing sequence representation after amplification of a pool of unique DNA amplicons. Polymerase errors become very common in later cycles of PCR but have little impact on the overall sequence distribution as they are confined to small copy numbers. PCR template switches are rare and confined to low copy numbers. Our results provide a theoretical basis for removing distortions from high-throughput sequencing data. In addition, our findings on PCR stochasticity will have particular relevance to quantification of results from single cell sequencing, in which sequences are represented by only one or a few molecules.
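PCR stochasticity of the kind described here is often illustrated with a branching process, in which every molecule is copied with some probability in each cycle. The sketch below is such a toy model and is not taken from the paper: the per-cycle efficiency, the cycle count and the function names are all assumptions chosen for illustration.

```python
import random


def simulate_pcr(cycles, efficiency, rng):
    """Amplify one starting molecule for `cycles` rounds.

    In each cycle every existing copy is duplicated independently with
    probability `efficiency` (a Galton-Watson branching process).
    """
    copies = 1
    for _ in range(cycles):
        new_copies = sum(1 for _ in range(copies) if rng.random() < efficiency)
        copies += new_copies
    return copies


def simulate_pool(n_templates, cycles, efficiency, seed=0):
    """Final copy number for each of `n_templates` unique starting molecules."""
    rng = random.Random(seed)
    return [simulate_pcr(cycles, efficiency, rng) for _ in range(n_templates)]


if __name__ == "__main__":
    # Hypothetical run: 20 unique templates, 10 cycles, 80% efficiency.
    counts = simulate_pool(n_templates=20, cycles=10, efficiency=0.8)
    print(min(counts), max(counts))
```

Because early cycles involve very few molecules, chance successes or failures there are amplified by every later cycle, so identical templates end up with noticeably different copy numbers.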

12.
Detecting single base substitutions as heteroduplex polymorphisms.   Total citations: 25, self-citations: 0, citations by others: 25
We have developed a sensitive technique for detecting single base substitutions in polymerase chain reaction (PCR) products from individuals heterozygous for polymorphisms or new mutations. This technique takes advantage of the formation of heteroduplexes in the PCR between different alleles from heterozygous individuals. These heteroduplexes can be detected on polyacrylamide gels because they migrate slower than their corresponding homoduplexes. Using PCR, we have generated a series of point mutations in a defined region of DNA in the equine infectious anemia virus (EIAV). Each mutation is the result of a single base substitution. By mixing the PCR products amplified from these mutations with one another, as well as with wildtype PCR products, we can generate heteroduplexes in which the identity of the mismatched bases is known. We detected eight of nine point mutations using this technique. We have also modified the electrophoretic conditions to optimize the detection of these heteroduplexes. In addition, the usefulness of this technique is demonstrated by its ability to detect a mutation in the cystic fibrosis gene that is the result of a single base substitution. This technique should prove useful for rapidly screening large numbers of individuals for new mutations or polymorphisms.

13.
Efforts to detect and investigate key oncogenic mutations have proven valuable to facilitate the appropriate treatment for cancer patients. The establishment of high-throughput, massively parallel "next-generation" sequencing has aided the discovery of many such mutations. To enhance the clinical and translational utility of this technology, platforms must be high-throughput, cost-effective, and compatible with formalin-fixed paraffin embedded (FFPE) tissue samples that may yield small amounts of degraded or damaged DNA. Here, we describe the preparation of barcoded and multiplexed DNA libraries followed by hybridization-based capture of targeted exons for the detection of cancer-associated mutations in fresh frozen and FFPE tumors by massively parallel sequencing. This method enables the identification of sequence mutations, copy number alterations, and select structural rearrangements involving all targeted genes. Targeted exon sequencing offers the benefits of high throughput, low cost, and deep sequence coverage, thus conferring high sensitivity for detecting low frequency mutations.

14.

Background

Ultra-deep pyrosequencing (UDPS) is used to identify rare sequence variants. The sequence depth is influenced by several factors including the error frequency of PCR and UDPS. This study investigated the characteristics and source of errors in raw and cleaned UDPS data.

Results

UDPS of a 167-nucleotide fragment of the HIV-1 SG3Δenv plasmid was performed on the Roche/454 platform. The plasmid was diluted to one copy, PCR amplified and subjected to bidirectional UDPS on three occasions. The dataset consisted of 47,693 UDPS reads. Raw UDPS data had an average error frequency of 0.30% per nucleotide site. Most errors were insertions and deletions in homopolymeric regions. We used a cleaning strategy that removed almost all indel errors, but had little effect on substitution errors, which reduced the error frequency to 0.056% per nucleotide. In cleaned data the error frequency was similar in homopolymeric and non-homopolymeric regions, but varied considerably across sites. These site-specific error frequencies were moderately, but still significantly, correlated between runs (r = 0.15–0.65) and between forward and reverse sequencing directions within runs (r = 0.33–0.65). Furthermore, transition errors were 48-times more common than transversion errors (0.052% vs. 0.001%; p<0.0001). Collectively the results indicate that a considerable proportion of the sequencing errors that remained after data cleaning were generated during the PCR that preceded UDPS.
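The transition versus transversion comparison reported above rests on classifying each substitution error by the chemical class of the two bases involved. A minimal sketch of that classification follows; it is illustrative only, the function names are hypothetical, and the toy input is not data from the study.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_class(ref, alt):
    """Classify a single-base substitution as a transition or a transversion.

    Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine
    (C<->T); every other substitution is a transversion.
    """
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternative bases are identical")
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def count_classes(substitutions):
    """Count transitions and transversions in a list of (ref, alt) pairs."""
    return Counter(substitution_class(ref, alt) for ref, alt in substitutions)


if __name__ == "__main__":
    toy_errors = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print(count_classes(toy_errors))
```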

Conclusions

A majority of the sequencing errors that remained after data cleaning were introduced by PCR prior to sequencing, which means that they will be independent of platform used for next-generation sequencing. The transition vs. transversion error bias in cleaned UDPS data will influence the detection limits of rare mutations and sequence variants.

15.
Biochemical assays for ras mutations are capable of detecting a mutant allele only if it is present in at least 5% of cells tested. Further, ras mutation assays which utilize the polymerase chain reaction (PCR) are unable to distinguish a ras mutation in a small population of cells from mutations resulting from Taq DNA polymerase base misincorporation. We used a standard restriction fragment length polymorphism assay of PCR-amplified c-Ki-ras to detect codon 12 mutations in tumor cells and found a cumulative error frequency for Taq DNA polymerase of one codon 12 mutation per 2 × 10^4 molecules of total amplification product. The Taq polymerase-induced mutations were found to be multiple base transitions and represented a constant proportion of the amplification product at each step of the PCR. The ability to detect the in vitro generated mutation was dependent on the number of thermal cycles and the sensitivity of the detection assay. With these considerations in mind, we developed a two-step RFLP assay in which the thermal cycle number was kept low and molecules containing mutations at codon 12 were selectively amplified in the second step. We were able to detect a ras mutation occurring in 1 per 1000 cells (a two log improvement over standard RFLP methods) without detecting mutations resulting from Taq DNA polymerase infidelity.

16.
Direct sequencing remains the most widely used method for the detection of epidermal growth factor receptor (EGFR) mutations in lung cancer; however, its relatively low sensitivity limits its clinical use. The objective of this study was to investigate the sensitivity of the peptide nucleic acid-locked nucleic acid polymerase chain reaction (PNA-LNA PCR) clamp and Ion Torrent Personal Genome Machine (PGM) techniques for detecting EGFR mutations, compared with direct sequencing. Furthermore, the predictive efficacy of EGFR mutations detected by PNA-LNA PCR clamp was evaluated. EGFR mutational status was assessed by direct sequencing, PNA-LNA PCR clamp, and Ion Torrent PGM in 57 patients with non-small cell lung cancer (NSCLC). We retrospectively evaluated the predictive efficacy of PNA-LNA PCR clamp for EGFR-TKI treatment in 36 patients with advanced NSCLC. Compared to direct sequencing (16/57, 28.1%), PNA-LNA PCR clamp (27/57, 47.4%) and Ion Torrent PGM (26/57, 45.6%) detected more EGFR mutations. When tested with PNA-LNA PCR clamp, EGFR-mutant patients had significantly longer progression-free survival than EGFR wild-type patients (14.31 vs. 21.61 months, P = 0.003). However, no difference in response rate to EGFR TKIs (75.0% vs. 82.4%, P = 0.195) or overall survival (34.39 vs. 44.10 months, P = 0.422) was observed between EGFR mutations detected by direct sequencing and those detected by PNA-LNA PCR clamp. Our results demonstrate that EGFR mutations were detected more frequently by PNA-LNA PCR clamp and Ion Torrent PGM than by direct sequencing. EGFR mutations detected by PNA-LNA PCR clamp may serve as a predictive factor for EGFR TKI response in patients with NSCLC.

17.
Mitochondrial DNA (mtDNA) deletion mutations cause many human diseases and are linked to age-induced mitochondrial dysfunction. Mapping the mutation spectrum and quantifying mtDNA deletion mutation frequency is challenging with next-generation sequencing methods. We hypothesized that long-read sequencing of human mtDNA across the lifespan would detect a broader spectrum of mtDNA rearrangements and provide a more accurate measurement of their frequency. We employed nanopore Cas9-targeted sequencing (nCATS) to map and quantitate mtDNA deletion mutations and develop analyses that are fit-for-purpose. We analyzed total DNA from vastus lateralis muscle in 15 males ranging from 20 to 81 years of age and substantia nigra from three 20-year-old and three 79-year-old men. We found that mtDNA deletion mutations detected by nCATS increased exponentially with age and mapped to a wider region of the mitochondrial genome than previously reported. Using simulated data, we observed that large deletions are often reported as chimeric alignments. To address this, we developed two algorithms for deletion identification which yield consistent deletion mapping and identify both previously reported and novel mtDNA deletion breakpoints. The identified mtDNA deletion frequency measured by nCATS correlates strongly with chronological age and predicts the deletion frequency as measured by digital PCR approaches. In substantia nigra, we observed a similar frequency of age-related mtDNA deletions to those observed in muscle samples, but noted a distinct spectrum of deletion breakpoints. NCATS-mtDNA sequencing allows the identification of mtDNA deletions on a single-molecule level, characterizing the strong relationship between mtDNA deletion frequency and chronological aging.

18.
RNA viruses within infected individuals exist as a population of evolutionary-related variants. Owing to evolutionary change affecting the constitution of this population, the frequency and/or occurrence of individual viral variants can show marked or subtle fluctuations. Since the development of massively parallel sequencing platforms, such viral populations can now be investigated to unprecedented resolution. A critical problem with such analyses is the presence of sequencing-related errors that obscure the identification of true biological variants present at low frequency. Here, we report the development and assessment of the Quality Assessment of Short Read (QUASR) Pipeline (http://sourceforge.net/projects/quasr) specific for virus genome short read analysis that minimizes sequencing errors from multiple deep-sequencing platforms, and enables post-mapping analysis of the minority variants within the viral population. QUASR significantly reduces the error-related noise in deep-sequencing datasets, resulting in increased mapping accuracy and reduction of erroneous mutations. Using QUASR, we have determined influenza virus genome dynamics in sequential samples from an in vitro evolution of 2009 pandemic H1N1 (A/H1N1/09) influenza from samples sequenced on both the Roche 454 GSFLX and Illumina GAIIx platforms. Importantly, concordance between the 454 and Illumina sequencing allowed unambiguous minority-variant detection and accurate determination of virus population turnover in vitro.

19.
Little is known about the rate at which genetic variation is generated within intrahost populations of dengue virus (DENV) and what implications this diversity has for dengue pathogenesis, disease severity, and host immunity. Previous studies of intrahost DENV variation have used a low frequency of sampling and/or experimental methods that do not fully account for errors generated through amplification and sequencing of viral RNAs. We investigated the extent and pattern of genetic diversity in sequence data in domain III (DIII) of the envelope (E) gene in serial plasma samples (n = 49) taken from 17 patients infected with DENV type 1 (DENV-1), totaling some 8,458 clones. Statistically rigorous approaches were employed to account for artifactual variants resulting from amplification and sequencing, which we suggest have played a major role in previous studies of intrahost genetic variation. Accordingly, nucleotide sequence diversities of viral populations were very low, with conservative estimates of the average levels of genetic diversity ranging from 0 to 0.0013. Despite such sequence conservation, we observed clear evidence for mixed infection, with multiple phylogenetically distinct lineages present within the same host, while the presence of stop codon mutations in some samples suggests the action of complementation. In contrast to some previous studies, we observed no relationship between the extent and pattern of DENV-1 genetic diversity and disease severity, immune status, or level of viremia.

20.
Recent sequencing of the Chinese hamster ovary (CHO) cell and Chinese hamster genomes has dramatically advanced our ability to understand the biology of these mammalian cell factories. In this study, we focus on the powerhouse of the CHO cell, the mitochondrion. Utilizing a high-resolution next generation sequencing approach we sequenced the Chinese hamster mitochondrial genome for the first time and surveyed the mutational landscape of CHO cell mitochondrial DNA (mtDNA). Depths of coverage ranging from ~3,319X to 8,056X enabled accurate identification of low frequency mutations (>1%), revealing that mtDNA heteroplasmy is widespread in CHO cells. A total of 197 variants at 130 individual nucleotide positions were identified across a panel of 22 cell lines with 81% of variants occurring at an allele frequency of between 1% and 99%. 89% of the heteroplasmic mutations identified were cell line specific with the majority of shared heteroplasmic SNPs and INDELs detected in clones from 2 cell line development projects originating from the same host cell line. The frequency of common predicted loss of function mutations varied significantly amongst the clones indicating that heteroplasmic mtDNA variation could lead to a continuous range of phenotypes and play a role in cell to cell, production run to production run and indeed clone to clone variation in CHO cell metabolism. Experiments that integrate mtDNA sequencing with metabolic flux analysis and metabolomics have the potential to improve cell line selection and enhance CHO cell metabolic phenotypes for biopharmaceutical manufacturing through rational mitochondrial genome engineering.
