Similar articles
Found 20 similar articles (search time: 78 ms)
1.
When analyzing sequencing reads, it is important to distinguish between putatively correct and wrong bases. An open question is how well a PHRED quality value identifies miscalled bases and whether there is a quality cutoff that allows most errors to be mapped. Considering that a low quality value does not necessarily indicate a miscalled position, we decided to investigate whether window-based analyses of quality values might better predict errors. There are many reasons to look for a perfect window in DNA sequences, such as when using the SAGE technique, seeding BLAST searches, or clustering sequences. Thus, we set out to find a quality cutoff value that would distinguish non-perfect windows from perfect ones. We produced 846 reads of pUC18 and compared them with the published pUC consensus by local alignment. We then generated a database containing all mismatches, insertions and gaps in order to map real perfect windows. We then investigated how well perfect windows can be predicted when all bases in the window show quality values over a given cutoff. We conclude that, in window-based applications, a PHRED quality value cutoff of 7 masks most of the errors without masking real correct windows. We suggest that the putative wrong bases be indicated in lower case, increasing the information in the sequence databases without increasing the size of the files.
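
As a rough illustration of the window-based idea in this abstract, the sketch below treats a window as putatively perfect when every base in it has a quality value at or above a cutoff, and writes bases below the cutoff in lower case. The function names, the window length of 3, the toy read, and the choice to treat the cutoff as inclusive are assumptions made for illustration; only the cutoff value of 7 comes from the abstract.

```python
def mask_low_quality(seq, quals, cutoff=7):
    """Return seq with bases whose quality is below cutoff in lower case."""
    return "".join(
        base.upper() if q >= cutoff else base.lower()
        for base, q in zip(seq, quals)
    )


def perfect_window_starts(quals, window, cutoff=7):
    """Return start indices of windows in which every quality is >= cutoff."""
    return [
        i
        for i in range(len(quals) - window + 1)
        if min(quals[i:i + window]) >= cutoff
    ]


# Toy read (hypothetical data, not taken from the study).
seq = "ACGTACGTAC"
quals = [30, 28, 5, 31, 33, 35, 6, 40, 41, 39]

masked = mask_low_quality(seq, quals)            # low-quality bases in lower case
starts = perfect_window_starts(quals, window=3)  # windows with no low-quality base
print(masked, starts)
```

Lower-casing keeps the base call itself, so the masked read carries the extra information at no cost in file size, which is the point the abstract makes.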

2.
The use of post-alignment procedures has been suggested to prevent the identification of false-positives in massive DNA sequencing data. Insertions and deletions are most likely to be misinterpreted by variant calling algorithms. Using known genetic variants as references for post-processing pipelines can minimize mismatches. They allow reads to be correctly realigned and recalibrated, resulting in more parsimonious variant calling. In this work, we aim to investigate the impact of using different sets of common variants as references to facilitate variant calling from whole-exome sequencing data. We selected reference variants from common insertions and deletions available within the 1K Genomes project data and from databases from the Latin American Database of Genetic Variation (LatinGen). We used the Genome Analysis Toolkit to perform post-processing procedures like local realignment, quality recalibration procedures, and variant calling in whole exome samples. We identified an increased number of variants from the call set for all groups when no post-processing procedure was performed. We found that there was a higher concordance rate between variants called using 1K Genomes and LatinGen. Therefore, we believe that the increased number of rare variants identified in the analysis without realignment or quality recalibration indicated that they were likely false-positives.

3.
The whole-genome shotgun (WGS) assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome. WGS assembly begins with a large collection of short fragments that have been selected at random from a genome. The sequence of bases at each end of the fragment is determined, albeit imprecisely, resulting in a sequence of letters called a "read." Each letter in a read is assigned a quality value, which estimates the probability that a sequencing error occurred in determining that letter. Reads are typically cut off after about 500 letters, where sequencing errors become endemic. We report on a set of procedures that (1) corrects most of the sequencing errors, (2) changes quality values accordingly, and (3) produces a list of "overlaps," i.e., pairs of reads that plausibly come from overlapping parts of the genome. Our procedures, which we call collectively the "UMD Overlapper," can be run iteratively and as a preprocessor for other assemblers. We tested the UMD Overlapper on Celera's Drosophila reads. When we replaced Celera's overlap procedures in the front end of their assembler, it was able to produce a significantly improved genome.
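
The link between a quality value and an error probability mentioned in this abstract can be sketched as follows, assuming the quality values follow the standard Phred convention Q = -10 log10(P). The abstract does not state the scale explicitly, so that convention, along with the function names, is an assumption.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled quality value for an error probability 0 < p <= 1."""
    return -10 * math.log10(p)


def expected_errors(quals):
    """Expected number of miscalled letters in a read, summing per-letter probabilities."""
    return sum(phred_to_error_prob(q) for q in quals)


# Q20 corresponds to a 1% chance that the letter is wrong, Q30 to 0.1%.
print(phred_to_error_prob(20), phred_to_error_prob(30))
```

Summing the per-letter probabilities gives the expected number of errors in a read, which is one simple way to see why read ends, where quality drops, are usually trimmed.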

4.
Genomics, 2022, 114(3): 110372
Modifications in RNA can influence their structure, function, and stability and play essential roles in gene expression and regulation. Methods to detect RNA modifications rely on biophysical techniques such as chromatography or mass spectrometry, which are low throughput, or on high throughput short-read sequencing techniques based on selectively reactive chemical probes. Recent studies have utilized nanopore-based fourth-generation sequencing methods to detect modifications by directly sequencing RNA in its native state. However, these approaches are based on modification-associated mismatch errors that are liable to be confounded by SNPs. Also, there is a need to generate matched knockout controls for reference, which is laborious. In this work, we introduce an internal comparison strategy termed “IndoC,” where features such as ‘trace’ and ‘current signal intensity’ of potentially modified sites are compared to similar sequence contexts on the same RNA molecule within the sample, alleviating the need for matched knockout controls. We first show that in an IVT model, ‘trace’ is able to distinguish between artificially generated SNPs and true pseudouridine (Ψ) modifications, both of which display highly similar mismatch profiles. We then apply IndoC on yeast and human ribosomal RNA to demonstrate that previously reported Ψ sites show marked changes in their trace and signal intensity profiles compared with their unmodified counterparts in the same dataset. Finally, we perform direct RNA sequencing of RNA containing Ψ intact with a chemical probe adduct (N-cyclohexyl-N′-β-(4-methylmorpholinium) ethylcarbodiimide [CMC]) and show that CMC reactivity also induces changes in trace and signal intensity distributions in a Ψ specific manner, allowing their separation from high mismatch sites that display SNP-like behavior.

5.

Background

High-throughput custom-designed genotyping arrays are a valuable resource for biologically focused research studies and increasingly for validation of variation predicted by next-generation sequencing (NGS) technologies. We investigate the Illumina GoldenGate chemistry using custom-designed VeraCode and Sentrix Array Matrix (SAM) assays for each of these applications, respectively. We highlight applications for interpretation of Illumina-generated genotype cluster plots to maximise data inclusion and reduce genotyping errors.

Findings

We illustrate the dramatic effect of outliers in genotype calling and data interpretation, as well as suggest simple means to avoid genotyping errors. Furthermore we present this platform as a successful method for two-cluster rare or non-autosomal variant calling. The success of high-throughput technologies to accurately call rare variants will become an essential feature for future association studies. Finally, we highlight additional advantages of the Illumina GoldenGate chemistry in generating unusually segregated cluster plots that identify potential NGS generated sequencing error resulting from minimal coverage.

Conclusions

We demonstrate the importance of visually inspecting genotype cluster plots generated by the Illumina software and issue warnings regarding commonly accepted quality control parameters. In addition to suggesting applications to minimise data exclusion, we propose that the Illumina cluster plots may be helpful in identifying potential input sequence errors, which is particularly important for studies that validate NGS-generated variation.

6.
Few independent data are available about how well single nucleotide polymorphism (SNP) genotyping technologies perform in the typical molecular genetics laboratory. We evaluated the utility and accuracy of a widely used technology, template-directed dye-terminator incorporation with fluorescence-polarization detection (FP-TDI), in a sample of 177 SNPs selected solely on the basis of map location. Genotypes were generated without optimization using standard protocols. Overall, 81% of the SNPs we studied generated readable genotypes by FP-TDI. Thirty-two SNPs were genotyped in duplicate by PCR-RFLP or fluorescent dye-terminator sequencing. Out of a total of 631 duplicate genotypes, no true discrepancies were detected. The true error rate has a 95% chance of lying between 0 and 6 out of 1000 genotypes. We also tested for deviations from Hardy-Weinberg Equilibrium in 33 SNPs genotyped in 50 unrelated individuals, and no significant deviations were detected. Our FP-TDI data were readily adaptable to automated genotype calling using our own method of cluster analysis, which assigns a probability score to each genotype call. We conclude that FP-TDI is both efficient and accurate. The method can easily fill the needs of SNP genotyping projects at the scale typically used for regional or candidate-gene association studies.
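
The abstract does not say how the interval on the error rate was computed. As a worked check, assuming a two-sided exact binomial (Clopper-Pearson) interval, the upper limit when 0 errors are seen in n trials has the closed form 1 - (α/2)^(1/n). The function name below is hypothetical.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Upper limit of the two-sided exact binomial interval when 0 of n trials fail."""
    alpha = 1 - confidence
    return 1 - (alpha / 2) ** (1 / n)


# 631 duplicate genotypes with no discrepancies, as reported in the abstract.
upper = zero_event_upper_bound(631)
per_thousand = upper * 1000
print(f"upper bound: {per_thousand:.1f} errors per 1000 genotypes")
```

Under that assumption the bound comes out just under 6 per 1000, which is consistent with the range quoted in the abstract.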

7.
Objective: Our goal was to evaluate the influence of quality control (QC) decisions using two genotype calling algorithms, CRLMM and Birdseed, designed for the Affymetrix SNP Array 6.0. Methods: Various QC options were tried using the two algorithms and comparisons were made on subject and call rate and on association results using two data sets. Results: For Birdseed, we recommend using the contrast QC instead of QC call rate for sample QC. For CRLMM, we recommend using a signal-to-noise ratio ≥4 for sample QC and a posterior probability of 90% for genotype accuracy. For both algorithms, we recommend calling the genotype separately for each plate, and dropping SNPs with a lower call rate (<95%) before evaluating samples with lower call rates. To investigate whether the genotype calls from the two algorithms impacted the genome-wide association results, we performed association analysis using data from the GENOA cohort; we observed that the number of significant SNPs was similar using either CRLMM or Birdseed. Conclusions: Using our suggested workflow both algorithms performed similarly; however, fewer samples were removed and CRLMM took half the time to run our 854 study samples (4.2 h) compared to Birdseed (8.4 h).
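
The recommended order of filtering (drop low-call-rate SNPs first, then evaluate samples) can be sketched as below. The data layout, the function names, and the toy thresholds are assumptions for illustration; the abstract gives only the 95% SNP call-rate threshold and does not state a sample threshold.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a no-call)."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_snps_then_samples(genotypes, snp_min=0.95, sample_min=0.95):
    """genotypes: one list of calls per sample, all lists of equal length.

    Returns the indices of SNPs kept and of samples kept, with sample call
    rates computed only over the SNPs that survived the first filter.
    """
    n_snps = len(genotypes[0])
    kept_snps = [
        j for j in range(n_snps)
        if call_rate([row[j] for row in genotypes]) >= snp_min
    ]
    reduced = [[row[j] for j in kept_snps] for row in genotypes]
    kept_samples = [
        i for i, row in enumerate(reduced) if call_rate(row) >= sample_min
    ]
    return kept_snps, kept_samples


# Toy matrix (hypothetical): the third SNP fails in most samples.
genotypes = [
    ["AA", "AB", None],
    ["AA", "BB", None],
    ["AB", "AB", None],
    [None, "AA", "AB"],
]
kept_snps, kept_samples = filter_snps_then_samples(
    genotypes, snp_min=0.7, sample_min=0.7
)
print(kept_snps, kept_samples)
```

In the toy matrix, the first three samples would each have a call rate of 2/3 if the failing SNP were still counted, so they would be dropped at a 0.7 threshold. Removing the poor SNP first keeps them, which mirrors the abstract's observation that fewer samples are removed with this workflow.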

8.

Background

The processing and analysis of the large scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal.

Results

We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that the Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between mapping quality, read depth and allele balance, and SNV call accuracy. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain and accuracies of >99% are achievable.

Conclusions

Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.
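
The positive predictive value used in this abstract to compare callers is the fraction of called variants that are confirmed by the gold standard. A minimal sketch follows, assuming every called site was also assayed by the validation method. The tuple layout and the toy variants are hypothetical and are not data from the study.

```python
def positive_predictive_value(calls, truth):
    """Fraction of called variants that appear in the validated truth set."""
    calls = set(calls)
    if not calls:
        raise ValueError("no variant calls to evaluate")
    true_positives = len(calls & set(truth))
    return true_positives / len(calls)


# Hypothetical (chromosome, position, alternate allele) tuples.
called = [("chr1", 101, "A"), ("chr1", 250, "T"), ("chr2", 77, "G"), ("chr3", 5, "C")]
validated = [("chr1", 101, "A"), ("chr2", 77, "G"), ("chr3", 5, "C"), ("chr4", 9, "T")]

ppv = positive_predictive_value(called, validated)
print(f"PPV = {ppv:.2%}")
```

Note that the validated variant that was never called lowers sensitivity but does not affect the positive predictive value.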

9.
High-throughput DNA sequencing (HTS) is of increasing importance in the life sciences. One of its most prominent applications is the sequencing of whole genomes or targeted regions of the genome such as all exonic regions (i.e., the exome). Here, the objective is the identification of genetic variants such as single nucleotide polymorphisms (SNPs). The extraction of SNPs from the raw genetic sequences involves many processing steps and the application of a diverse set of tools. We review the essential building blocks for a pipeline that calls SNPs from raw HTS data. The pipeline includes quality control, mapping of short reads to the reference genome, visualization and post-processing of the alignment including base quality recalibration. The final steps of the pipeline include the SNP calling procedure along with filtering of SNP candidates. The steps of this pipeline are accompanied by an analysis of a publicly available whole-exome sequencing dataset. To this end, we employ several alignment programs and SNP calling routines for highlighting the fact that the choice of the tools significantly affects the final results.

10.
Discrimination of base mismatches from normal Watson-Crick base pairs in duplex DNA constitutes a key approach to the detection of single nucleotide polymorphisms (SNPs). We have developed a sensor for a surface plasmon resonance (SPR) assay system to detect G-G, A-A, and C-C mismatch duplexes by employing a surface upon which mismatch-binding ligands (MBLs) are immobilized. We synthesized a new MBL consisting of 2,7-diamino-1,8-naphthyridine (damND) and immobilized it onto a CM5 sensor chip to carry out the SPR assay of DNA duplexes containing a single-base mismatch. The SPR sensor with damND revealed strong responses to all C-C mismatches, and sequence-dependent C-T and T-T mismatches. Compared to ND- and naphthyridine-azaquinolone hybrid (NA)-immobilized sensor surfaces, with affinity to mismatches composed of purine nucleotide bases, the damND-immobilized surface was useful for the detection of the mismatches composed of pyrimidine nucleotide bases.

11.
Single- and multi-base (loop) mismatches can arise in DNA by replication errors, during recombination, and by chemical modification of DNA. Single-base and loop mismatches of several nucleotides are efficiently repaired in mammalian cells by a nick-directed, MSH2-dependent mechanism. Larger loop mismatches (≥12 bases) are repaired by an MSH2-independent mechanism. Prior studies have shown that 12- and 14-base palindromic loops are repaired with bias toward loop retention, and that repair bias is eliminated when five single-base mismatches flank the loop mismatch. Here we show that one single-base mismatch near a 12-base palindromic loop is sufficient to eliminate loop repair bias in wild-type, but not MSH2-defective mammalian cells. We also show that palindromic loop and single-base mismatches separated by 12 bases are repaired independently at least 10% of the time in wild-type cells, and at least 30% of the time in MSH2-defective cells. Palindromic loop and single-base mismatches separated by two bases were never repaired independently. These and other data indicate that loop repair tracts are variable in length. All tracts extend at least 2 bases, some extend <12 bases, and others >12 bases, on one side of the loop. These properties distinguish palindromic loop mismatch repair from the three known excision repair pathways: base excision repair which has one to six base tracts, nucleotide excision repair which has approximately 30 base tracts, and MSH2-dependent mismatch repair, which has tracts that extend for several hundred bases.

12.
Genetic quality and energy metabolism are expected to have an effect on the level of energetically costly sexual signaling. To explore this we manipulated genetic quality of male decorated crickets (Gryllodes sigillatus) by inbreeding and measured the resting metabolic rate and total energy budget of males. We also measured several aspects of the sexual signaling of males: probability to initiate calling, latency, number of call bouts, first call bout duration, mean call bout duration and total time spent calling. Inbreeding increased the latency and lowered the first and mean call bout duration. Moreover, the resting metabolic rate had a positive effect, and body mass a negative effect, on first call bout duration and mean call bout duration. Our results suggest that sexual signals are indicative of genetic quality but are also dependent on the physical properties of individuals.

13.

Background

As availability of primary cells can be limited for genetic studies of human disease, lymphoblastoid cell lines (LCL) are common sources of genomic DNA. LCL are created in a transformation process that entails in vitro infection of human B-lymphocytes with the Epstein-Barr Virus (EBV).

Methodology/Principal Findings

To test for genotypic errors potentially induced by the Epstein-Barr Virus transformation process, we compared single nucleotide polymorphism (SNP) genotype calls in peripheral blood mononuclear cells (PBMC) and LCL from the same individuals. The average mismatch rate across 19 comparisons was 0.12% for SNPs with a population call rate of at least 95%, and 0.03% at SNPs with a call rate of at least 99%. Mismatch rates were not correlated across genotype subarrays run on all sample pairs.

Conclusions/Significance

Genotypic discrepancies found in PBMC and LCL pairs were not significantly different from those in control pairs, and were not correlated across subarrays. These results suggest that mismatch rates are minimal with stringent quality control, and that most genotypic discrepancies are due to technical artifacts rather than the EBV transformation process. Thus, LCL likely constitute a reliable DNA source for host genotype analysis.
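
The mismatch rate reported in this abstract can be illustrated with a small sketch that compares paired genotype calls only at SNPs passing a population call-rate threshold. The function name, the data layout, and the toy values are assumptions; the study's actual pipeline is not described in the abstract.

```python
def mismatch_rate(calls_a, calls_b, snp_call_rates, min_call_rate=0.95):
    """Fraction of discordant calls among SNPs that pass the call-rate filter.

    SNPs with a missing call (None) in either sample are skipped. Returns
    None when no SNP can be compared.
    """
    compared = 0
    mismatches = 0
    for a, b, rate in zip(calls_a, calls_b, snp_call_rates):
        if rate < min_call_rate or a is None or b is None:
            continue
        compared += 1
        mismatches += a != b
    return mismatches / compared if compared else None


# Hypothetical paired calls for one individual.
pbmc = ["AA", "AB", "BB", "AB", None]
lcl = ["AA", "AB", "AB", "AA", "BB"]
call_rates = [0.99, 0.97, 0.96, 0.90, 0.99]

rate = mismatch_rate(pbmc, lcl, call_rates)
print(rate)
```

Raising `min_call_rate` from 0.95 to 0.99 in this sketch corresponds to the stricter filter in the abstract, under which the reported mismatch rate fell from 0.12% to 0.03%.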

14.
Both 454 and Ion Torrent sequencers are capable of producing large amounts of long high-quality sequencing reads. However, as both methods sequence homopolymers in one cycle, they both suffer from homopolymer uncertainty and incorporation asynchronization. In mapping, such sequencing errors could shift alignments around homopolymers and thus induce incorrect mismatches, which have become a critical barrier against the accurate detection of single nucleotide polymorphisms (SNPs). In this article, we propose a hidden Markov model (HMM) to statistically and explicitly formulate homopolymer sequencing errors by the overcall, undercall, insertion and deletion. We use a hierarchical model to describe the sequencing and base-calling processes, and we estimate parameters of the HMM from resequencing data by an expectation-maximization algorithm. Based on the HMM, we develop a realignment-based SNP-calling program, termed PyroHMMsnp, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach. Simulation experiments show that the performance of PyroHMMsnp is exceptional across various sequencing coverages in terms of sensitivity, specificity and F1 measure, compared with other tools. Analysis of the human resequencing data shows that PyroHMMsnp predicts 12.9% more SNPs than Samtools while achieving a higher specificity (http://code.google.com/p/pyrohmmsnp/).
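
The three evaluation measures named in this abstract follow from a confusion matrix of called versus true SNP sites. The sketch below uses the standard definitions. The function name and the counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true SNPs that were called
    specificity = tn / (tn + fp)   # non-SNP sites correctly left uncalled
    precision = tp / (tp + fp)     # called SNPs that are real
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hypothetical counts for one simulated coverage level.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
print(metrics)
```

F1 is the harmonic mean of precision and sensitivity, so it penalizes a caller that gains sensitivity only by emitting many false positives.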

15.
Identifying copy number variants (CNVs) can provide diagnoses to patients and provide important biological insights into human health and disease. Current exome and targeted sequencing approaches cannot detect clinically and biologically-relevant CNVs outside their target area. We present SavvyCNV, a tool which uses off-target read data from exome and targeted sequencing data to call germline CNVs genome-wide. Up to 70% of sequencing reads from exome and targeted sequencing fall outside the targeted regions. We have developed a new tool, SavvyCNV, to exploit this ‘free data’ to call CNVs across the genome. We benchmarked SavvyCNV against five state-of-the-art CNV callers using truth sets generated from genome sequencing data and Multiplex Ligation-dependent Probe Amplification assays. SavvyCNV called CNVs with high precision and recall, outperforming the five other tools at calling CNVs genome-wide, using off-target or on-target reads from targeted panel and exome sequencing. We then applied SavvyCNV to clinical samples sequenced using a targeted panel and were able to call previously undetected clinically-relevant CNVs, highlighting the utility of this tool within the diagnostic setting. SavvyCNV outperforms existing tools for calling CNVs from off-target reads. It can call CNVs genome-wide from targeted panel and exome data, increasing the utility and diagnostic yield of these tests. SavvyCNV is freely available at https://github.com/rdemolgen/SavvySuite.

16.
Mispair specificity of methyl-directed DNA mismatch correction in vitro (cited 52 times in total; self-citations: 0; citations by others: 52)
To evaluate the substrate specificity of methyl-directed mismatch repair in Escherichia coli extracts, we have constructed a set of DNA heteroduplexes, each of which contains one of the eight possible single base pair mismatches and a single hemimethylated d(GATC) site. Although all eight mismatches were located at the same position within heteroduplex molecules and were embedded within the same sequence environment, they were not corrected with equal efficiencies in vitro. G-T was corrected most efficiently, with A-C, C-T, A-A, T-T, and G-G being repaired at rates 40-80% of that of the G-T mispair. Correction of each of these six mispairs occurred in a methyl-directed manner in a reaction requiring mutH, mutL, and mutS gene products. C-C and A-G mismatches showed different behavior. C-C was an extremely poor substrate for correction while repair of A-G was anomalous. Although A-G was corrected to A-T by the mutHLS-dependent, methyl-directed pathway, repair of A-G to C-G occurred largely by a pathway that is independent of the methylation state of the heteroduplex and which does not require mutH, mutL, or mutS gene products. Similar results were obtained with a second A-G mismatch in a different sequence environment suggesting that a novel pathway may exist for processing A-G mispairs to C-G base pairs. As judged by DNase I footprint analysis, MutS protein is capable of recognizing each of the eight possible base-base mismatches. Use of this method to estimate the apparent affinity of MutS protein for each of the mispairs revealed a rough correlation between MutS affinity and efficiency of correction by the methyl-directed pathway. However, the A-C mismatch was an exception in this respect indicating that interactions other than mismatch recognition may contribute to the efficiency of repair.

17.
Over the past few years, new high-throughput DNA sequencing technologies have dramatically increased speed and reduced sequencing costs. However, the use of these sequencing technologies is often challenged by errors and biases associated with the bioinformatical methods used for analyzing the data. In particular, the use of naïve methods to identify polymorphic sites and infer genotypes can inflate downstream analyses. Recently, explicit modeling of genotype probability distributions has been proposed as a method for taking genotype call uncertainty into account. Based on this idea, we propose a novel method for quantifying population genetic differentiation from next-generation sequencing data. In addition, we present a strategy for investigating population structure via principal components analysis. Through extensive simulations, we compare the new method herein proposed to approaches based on genotype calling and demonstrate a marked improvement in estimation accuracy for a wide range of conditions. We apply the method to a large-scale genomic data set of domesticated and wild silkworms sequenced at low coverage. We find that we can infer the fine-scale genetic structure of the sampled individuals, suggesting that employing this new method is useful for investigating the genetic relationships of populations sampled at low coverage.
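
To make the contrast between hard genotype calls and probability-based approaches concrete, the sketch below compares the most likely genotype at a site with the expected allele count computed from the full genotype probability distribution. This is only a generic illustration of carrying uncertainty forward. It is not the estimator proposed in the paper, and the function names and probabilities are hypothetical.

```python
def hard_call(probs):
    """Most probable genotype, coded as 0, 1 or 2 copies of the alternate allele."""
    return max(range(3), key=lambda g: probs[g])


def expected_dosage(probs):
    """Expected alternate-allele count under the genotype probabilities."""
    return probs[1] + 2 * probs[2]


# Hypothetical posterior at a low-coverage site: P(g=0), P(g=1), P(g=2).
low_coverage_site = (0.5, 0.4, 0.1)

print(hard_call(low_coverage_site))        # 0 copies: the uncertainty is discarded
print(expected_dosage(low_coverage_site))  # about 0.6 copies: the uncertainty is kept
```

At low coverage many sites look like this one, so statistics built on hard calls can be systematically biased, while statistics built on the probabilities retain the information that a heterozygote is nearly as likely as the called genotype.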

18.
Most existing statistical methods developed for calling single nucleotide polymorphisms (SNPs) using next-generation sequencing (NGS) data are based on Bayesian frameworks, and there does not exist any SNP caller that produces p-values for calling SNPs in a frequentist framework. To fill this gap, we develop a new method, MAFsnp, a Multiple-sample based Accurate and Flexible algorithm for calling SNPs with NGS data. MAFsnp is based on an estimated likelihood ratio test (eLRT) statistic. In practical situations, the involved parameter is very close to the boundary of the parametric space, so the standard large sample property is not suitable to evaluate the finite-sample distribution of the eLRT statistic. Observing that the distribution of the test statistic is a mixture of zero and a continuous part, we propose to model the test statistic with a novel two-parameter mixture distribution. Once the parameters in the mixture distribution are estimated, p-values can be easily calculated for detecting SNPs, and the multiple-testing corrected p-values can be used to control false discovery rate (FDR) at any pre-specified level. With simulated data, MAFsnp is shown to have much better control of FDR than the existing SNP callers. Through the application to two real datasets, MAFsnp is also shown to outperform the existing SNP callers in terms of calling accuracy. An R package “MAFsnp” implementing the new SNP caller is freely available at http://homepage.fudan.edu.cn/zhangh/softwares/.
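
The abstract does not name the multiple-testing correction it uses. As one common way to control the FDR from per-site p-values, the sketch below applies the Benjamini-Hochberg adjustment. The choice of procedure, the function name, and the toy p-values are assumptions, and this is not the MAFsnp implementation.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-site p-values.
pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvalues)

# Call a site a SNP when its adjusted p-value is below the target FDR level.
fdr_level = 0.03
called_sites = [i for i, q in enumerate(adjusted) if q <= fdr_level]
print(adjusted, called_sites)
```

Thresholding the adjusted values at a chosen level is what lets a frequentist caller control the FDR at any pre-specified level, as the abstract describes.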

19.
Sokolsky T, Alani E. Genetics, 2000, 155(2): 589-599
In Saccharomyces cerevisiae, Msh2p, a central component in mismatch repair, forms a heterodimer with Msh3p to repair small insertion/deletion mismatches and with Msh6p to repair base pair mismatches and single-nucleotide insertion/deletion mismatches. In haploids, a msh2Δ mutation is synthetically lethal with pol3-01, a mutation in the Pol δ proofreading exonuclease. Six conditional alleles of msh2 were identified as those that conferred viability in pol3-01 strains at 26° but not at 35°. DNA sequencing revealed that mutations in several of the msh2(ts) alleles are located in regions with previously unidentified functions. The conditional inviability of two mutants, msh2-L560S pol3-01 and msh2-L910P pol3-01, was suppressed by overexpression of EXO1 and MSH6, respectively. Partial suppression was also observed for the temperature-sensitive mutator phenotype exhibited by msh2-L560S and msh2-L910P strains in the lys2-Bgl reversion assay. High-copy plasmids bearing mutations in the conserved EXO1 nuclease domain were unable to suppress msh2-L560S pol3-01 conditional lethality. These results, in combination with a genetic analysis of msh6Δ pol3-01 and msh3Δ pol3-01 strains, suggest that the activity of the Msh2p-Msh6p heterodimer is important for viability in the presence of the pol3-01 mutation and that Exo1p plays a catalytic role in Msh2p-mediated mismatch repair.

20.
Massively parallel sequencing (MPS), since its debut in 2005, has transformed the field of genomic studies. These new sequencing technologies have resulted in the successful identification of causal variants for several rare Mendelian disorders. They have also begun to deliver on their promise to explain some of the missing heritability from genome-wide association studies (GWAS) of complex traits. We anticipate a rapidly growing number of MPS-based studies for a diverse range of applications in the near future. One crucial and nearly inevitable step is to detect SNPs and call genotypes at the detected polymorphic sites from the sequencing data. Here, we review statistical methods that have been proposed in the past five years for this purpose. In addition, we discuss emerging issues and future directions related to SNP detection and genotype calling from MPS data.
