Similar literature
1.

Background

The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal.

Results

We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that the Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between mapping quality, read depth and allele balance, and SNV call accuracy. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable.
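The positive predictive values quoted above are the fraction of NGS calls that the reference method confirms. A minimal sketch of that calculation follows, assuming calls and validated variants are represented as sets of (chromosome, position, alternate allele) tuples. The function name and the toy data are hypothetical and are not taken from the authors' code.

```python
def positive_predictive_value(called, validated):
    """Return TP / (TP + FP): the share of calls that also appear in `validated`."""
    called = set(called)
    if not called:
        raise ValueError("no calls to evaluate")
    true_positives = len(called & set(validated))
    return true_positives / len(called)


# Toy data (hypothetical): four calls, three of them confirmed.
calls = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G"), ("chr3", 75, "C")}
truth = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G"), ("chr4", 10, "A")}

ppv = positive_predictive_value(calls, truth)
print(f"PPV = {ppv:.2%}")
```

For the figure to be meaningful, `called` should contain only calls at sites that the reference method actually assayed; a call at an untested site would otherwise be counted as a false positive.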

Conclusions

Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.

2.
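The emergence of next-generation sequencing (NGS) has driven genomics research and has been especially important in studying disease-associated genetic variation. Although most types of genetic variants can be detected with existing NGS analysis tools, limitations remain, for example for length variation in short tandem repeats (STRs). Many genetic disorders are caused by STR length expansions, notably Huntington's disease and a range of other neurological diseases. However, almost no tools can currently use NGS data to detect STR variants that are longer than the sequencing read length. To overcome this limitation, we developed a new method that identifies STR length variation from paired-end NGS data and estimates the size of the expansion. Applied to a whole-exome-sequencing-based clinical study of motor neuron disease, it successfully identified pathogenic STR expansions. The method is the first to use read coverage depth features to address STR variant detection. It is broadly applicable to research on human genetic disease and may inform the development of other NGS analysis methods.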

3.
The importance of next-generation sequencing (NGS) in cancer research is rising as this key technology becomes easier for researchers to access. The sequence data created by NGS technologies must be processed by various bioinformatics algorithms within a pipeline in order to convert raw data into meaningful information. Mapping and variant calling are the two main steps of these analysis pipelines, and many algorithms are available for these steps. Therefore, detailed benchmarking of these algorithms in different scenarios is crucial for the efficient utilization of sequencing technologies. In this study, we compared the performance of twelve pipelines (three mapping and four variant discovery algorithms) with recommended settings to capture single nucleotide variants. We observed significant discrepancies in variant calls among the tested pipelines for different heterogeneity levels in real and simulated samples, with overall high specificity and low sensitivity. In addition to evaluating the pipelines individually, we constructed pipeline combinations and tested their performance. In these analyses, we observed that certain pipelines complement each other much better than others and that such combinations outperform individual pipelines. This suggests that adhering to a single pipeline is not optimal for cancer sequencing analysis and that sample heterogeneity should be considered in algorithm optimization.
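The design described above (three mappers crossed with four callers, then combinations of pipelines) can be sketched as follows. The abstract names neither the algorithms nor the way calls were combined, so the placeholder names and the support-count voting rule below are assumptions made for illustration only.

```python
from itertools import product

# Placeholder names: the abstract does not name the algorithms.
mappers = ["mapper_A", "mapper_B", "mapper_C"]
callers = ["caller_1", "caller_2", "caller_3", "caller_4"]

# Every mapper paired with every caller: 3 x 4 = 12 pipelines.
pipelines = list(product(mappers, callers))


def combine(call_sets, min_support):
    """Keep variants reported by at least `min_support` of the given call sets."""
    counts = {}
    for calls in call_sets:
        for variant in set(calls):
            counts[variant] = counts.get(variant, 0) + 1
    return {variant for variant, n in counts.items() if n >= min_support}


# Toy call sets (hypothetical) from two pipelines.
calls_a = {("chr1", 100, "A"), ("chr1", 200, "T")}
calls_b = {("chr1", 200, "T"), ("chr2", 50, "G")}

union = combine([calls_a, calls_b], min_support=1)         # favours sensitivity
intersection = combine([calls_a, calls_b], min_support=2)  # favours specificity
print(len(pipelines), len(union), len(intersection))
```

Lowering `min_support` trades specificity for sensitivity, which is one plausible way for complementary pipelines to offset the low sensitivity reported for individual pipelines.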

4.
In disease studies, family-based designs have become an attractive approach to analyzing next-generation sequencing (NGS) data for the identification of rare mutations enriched in families. Substantial research effort has been devoted to developing pipelines for automating sequence alignment, variant calling, and annotation. However, fewer pipelines have been designed specifically for disease studies. Most of the current analysis pipelines for family-based disease studies using NGS data focus on a specific function, such as identifying variants with Mendelian inheritance or identifying shared chromosomal regions among affected family members. Consequently, some other useful family-based analysis tools, such as imputation, linkage, and association tools, have yet to be integrated and automated. We developed FamPipe, a comprehensive analysis pipeline, which includes several family-specific analysis modules, including the identification of shared chromosomal regions among affected family members, prioritizing variants assuming a disease model, imputation of untyped variants, and linkage and association tests. We used simulation studies to compare properties of some modules implemented in FamPipe, and based on the results, we provided suggestions for the selection of modules to achieve an optimal analysis strategy. The pipeline is under the GNU GPL License and can be downloaded for free at http://fampipe.sourceforge.net.
This is a PLOS Computational Biology Software article.

5.
6.
Next-generation sequencing (NGS) has been applied in fields ranging from agriculture to clinical research, and many sequencing platforms are available for obtaining accurate and consistent results. However, these platforms show amplification bias that affects variant calling in personal genomes. Here, we sequenced whole genomes and whole exomes from ten Korean individuals using Illumina and Ion Proton, respectively, to assess the vulnerability and accuracy of each NGS platform in GC-rich and GC-poor regions. Overall, a total of 1013 Gb of reads from Illumina and ~39.1 Gb of reads from Ion Proton were analyzed using the BWA-GATK variant calling pipeline. Furthermore, in conjunction with the VQSR tool and detailed filtering strategies, we obtained high-quality variants. Finally, ten variants each from the Illumina-only, Ion Proton-only, and intersection sets were selected for Sanger validation. The validation results revealed that the Illumina platform showed higher accuracy than Ion Proton. The described filtering methods are advantageous for large population-based whole genome studies designed to identify common and rare variations associated with complex diseases.

7.
All next-generation sequencing (NGS) procedures include assays performed at the laboratory bench ("wet bench") and data analyses conducted using bioinformatics pipelines ("dry bench"). Both elements are essential to produce accurate and reliable results, which are particularly critical for clinical laboratories. Targeted NGS technologies have increasingly found favor in oncology applications to help advance precision medicine objectives, yet the methods often involve disconnected and variable wet and dry bench workflows and uncoordinated reagent sets. In this report, we describe a method for sequencing challenging cancer specimens with a 21-gene panel as an example of a comprehensive targeted NGS system. The system integrates functional DNA quantification and qualification, single-tube multiplexed PCR enrichment, and library purification and normalization using analytically-verified, single-source reagents with a standalone bioinformatics suite. As a result, accurate variant calls from low-quality and low-quantity formalin-fixed, paraffin-embedded (FFPE) and fine-needle aspiration (FNA) tumor biopsies can be achieved. The method can routinely assess cancer-associated variants from an input of 400 amplifiable DNA copies, and is modular in design to accommodate new gene content. Two different types of analytically-defined controls provide quality assurance and help safeguard call accuracy with clinically-relevant samples. A flexible "tag" PCR step embeds platform-specific adaptors and index codes to allow sample barcoding and compatibility with common benchtop NGS instruments. Importantly, the protocol is streamlined and can produce 24 sequence-ready libraries in a single day. Finally, the approach links wet and dry bench processes by incorporating pre-analytical sample quality control results directly into the variant calling algorithms to improve mutation detection accuracy and differentiate false-negative and indeterminate calls. 
This targeted NGS method uses advances in both wetware and software to achieve high-depth, multiplexed sequencing and sensitive analysis of heterogeneous cancer samples for diagnostic applications.

8.
The advent of next-generation sequencing (NGS) technologies has revolutionised the way biologists produce, analyse and interpret data. Although NGS platforms provide a cost-effective way to discover genome-wide variants from a single experiment, variants discovered by NGS need follow-up validation because of the high error rates associated with various sequencing chemistries. Recently, whole exome sequencing has been proposed as an affordable alternative to whole genome runs, but it still requires follow-up validation of all the novel exomic variants. Customarily, a consensus approach is used to overcome the systematic errors inherent to the sequencing technology, alignment and post-alignment variant detection algorithms. However, this approach requires multiple sequencing chemistries, multiple alignment tools and multiple variant callers, which may not be viable in terms of time and money for individual investigators with limited informatics know-how. Biologists often lack the requisite training to deal with the huge amount of data produced by NGS runs and have difficulty choosing from the list of freely available analytical tools for NGS data analysis. Hence, there is a need to customise the NGS data analysis pipeline to preferentially retain true variants by minimising the incidence of false positives, and to make the choice of the right analytical tools easier. To this end, we have sampled different freely available tools used at the alignment and post-alignment stages, and we suggest the most suitable combination, determined by a simple framework of pre-existing metrics, to create significant datasets.

9.
As next-generation sequencing (NGS) technology has become widely used to identify genetic causal variants for various diseases and traits, a number of packages for checking NGS data quality have sprung up in public domains. In addition to the quality of sequencing data, sample quality issues, such as gender mismatch, abnormal inbreeding coefficient, cryptic relatedness, and population outliers, can also have fundamental impact on downstream analysis. However, there is a lack of tools specialized in identifying problematic samples from NGS data, often due to the limitation of sample size and variant counts. We developed SeqSQC, a Bioconductor package, to automate and accelerate sample cleaning in NGS data of any scale. SeqSQC is designed for efficient data storage and access, and equipped with interactive plots for intuitive data visualization to expedite the identification of problematic samples. SeqSQC is available at http://bioconductor.org/packages/SeqSQC.

10.
Traditional Sanger sequencing as well as next-generation sequencing (NGS) have been used for the identification of disease-causing mutations in human molecular research. The majority of currently available tools are developed for research and explorative purposes and often do not provide a complete, efficient, one-stop solution. As the focus of currently developed tools is mainly on NGS data analysis, no integrative solution for the analysis of Sanger data is provided, and consequently a one-stop solution to analyze reads from both sequencing platforms is not available. We have therefore developed a new pipeline called MutAid to analyze and interpret raw sequencing data produced by Sanger or several NGS sequencing platforms. It performs format conversion, base calling, quality trimming, filtering, read mapping, variant calling, variant annotation and analysis of Sanger and NGS data under a single platform. It is capable of analyzing reads from multiple patients in a single run to create a list of potential disease-causing base substitutions as well as insertions and deletions. MutAid has been developed for expert and non-expert users and supports four sequencing platforms including Sanger, Illumina, 454 and Ion Torrent. Furthermore, for NGS data analysis, five read mappers including BWA, TMAP, Bowtie, Bowtie2 and GSNAP and four variant callers including GATK-HaplotypeCaller, SAMTOOLS, Freebayes and VarScan2 pipelines are supported. MutAid is freely available at https://sourceforge.net/projects/mutaid.

11.
With the availability of next-generation sequencing (NGS) technology, it is expected that sequence variants may be called on a genomic scale. Here, we demonstrate that a deeper understanding of the distribution of the variant call frequencies at heterozygous loci in NGS data sets is a prerequisite for sensitive variant detection. We model the crucial steps in an NGS protocol as a stochastic branching process and derive a mathematical framework for the expected distribution of alleles at heterozygous loci before measurement, that is, before sequencing. We confirm our theoretical results by analyzing technical replicates of human exome data and demonstrate that the variance of allele frequencies at heterozygous loci is higher than expected under a simple binomial distribution. Due to this high variance, mutation callers relying on binomially distributed priors are less sensitive for heterozygous variants that deviate strongly from the expected mean frequency. Our results also indicate that error rates can be reduced to a greater degree by technical replicates than by increasing sequencing depth.
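The comparison against a simple binomial model can be illustrated with a rough check. This is not the authors' branching-process framework. It is a minimal sketch that assumes heterozygous loci are summarised as (alternate reads, depth) pairs with an expected allele fraction of 0.5; the function name and the toy counts are hypothetical.

```python
from statistics import mean, pvariance


def overdispersion_ratio(observations, p=0.5):
    """Observed variance of allele fractions divided by the binomial expectation.

    observations: list of (alt_reads, depth) pairs at heterozygous loci.
    Under a binomial model with success probability p, the variance of
    alt_reads / depth at depth n is p * (1 - p) / n.
    """
    fractions = [alt / depth for alt, depth in observations]
    observed = pvariance(fractions, mu=p)  # mean squared deviation from p
    expected = mean(p * (1 - p) / depth for _, depth in observations)
    return observed / expected


# Toy counts (hypothetical), all at depth 40.
loci = [(10, 40), (30, 40), (20, 40), (20, 40)]
ratio = overdispersion_ratio(loci)
print(f"observed / binomial variance = {ratio:.1f}")
```

A ratio well above 1 means the allele fractions scatter more widely than a binomial prior assumes, which is the situation in which the abstract reports reduced sensitivity.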

12.
The treatment paradigm of non-small cell lung cancer (NSCLC) has evolved into oncogene-directed precision medicine. Identifying actionable genomic alterations is the initial step towards precision medicine. An important advance in molecular profiling of NSCLC over the past decade is the shift from the traditional piecemeal approach to massively parallel sequencing with next-generation sequencing (NGS). Another technical advance is the development of liquid biopsy, which has great potential to provide dynamic and comprehensive genomic profiling of NSCLC in a minimally invasive manner. Increasing evidence shows that the integration of NGS with liquid biopsy plays an emerging role in genomic profiling of NSCLC. This review summarizes the potential applications of NGS-based liquid biopsy in the diagnosis and treatment of NSCLC, including identifying actionable genomic alterations, tracking spatiotemporal tumor evolution, dynamically monitoring response and resistance to targeted therapies, and diagnostic value in early-stage NSCLC, and discusses emerging challenges that must be overcome to facilitate clinical translation in the future.

13.
The objective of this study was to design and validate a next-generation sequencing (NGS) assay to detect BRCA1 and BRCA2 mutations. We developed an assay using random shearing of genomic DNA followed by RNA bait tile hybridization and sequencing on both the Illumina MiSeq and Ion Personal Genome Machine (PGM). We determined that the MiSeq Reporter software supplied with the instrument could not detect deletions greater than 9 base pairs. Therefore, we developed an alternative alignment and variant calling software, Quest Sequencing Analysis Pipeline (QSAP), that was capable of detecting large deletions and insertions. In validation studies, we used DNA from 27 stem cell lines, all with known deleterious BRCA1 or BRCA2 mutations, and DNA from 67 consented control individuals who had a total of 352 benign variants. Both the MiSeq/QSAP combination and PGM/Torrent Suite combination had 100% sensitivity for the 379 known variants in the validation series. However, the PGM/Torrent Suite combination had lower intra- and inter-assay precision (96.2% and 96.7%, respectively) than the MiSeq/QSAP combination (100% and 99.4%, respectively). All PGM/Torrent Suite inconsistencies were false-positive variant assignments. We began commercial testing using both platforms, and in the first 521 clinical samples MiSeq/QSAP had 100% sensitivity for BRCA1/2 variants, including a 64-bp deletion and a 10-bp insertion not identified by PGM/Torrent Suite, which also suffered from a high false-positive rate. Neither the MiSeq nor the PGM platform with its supplied alignment and variant calling software is appropriate for a clinical laboratory BRCA sequencing test. We have developed an NGS BRCA1/2 sequencing assay, MiSeq/QSAP, with 100% analytic sensitivity and specificity in the validation set consisting of 379 variants. The MiSeq/QSAP combination has sufficient performance for use in a clinical laboratory.

14.
15.
This article reviews basic concepts, general applications, and the potential impact of next-generation sequencing (NGS) technologies on genomics, with particular reference to currently available and possible future platforms and bioinformatics. NGS technologies have demonstrated the capacity to sequence DNA at unprecedented speed, thereby enabling previously unimaginable scientific achievements and novel biological applications. However, the massive volume of data produced by NGS also presents a significant challenge for data storage, analysis, and management solutions. Advanced bioinformatic tools are essential for the successful application of NGS technology. As evidenced throughout this review, NGS technologies will have a striking impact on genomic research and the entire biological field. With its ability to tackle challenges left unsolved by previous genomic technologies, NGS is likely to unravel the complexity of the human genome in terms of genetic variations, some of which may be confined to susceptible loci for some common human conditions. The impact of NGS technologies on genomics will be far reaching and will likely change the field for years to come.

16.
Simple sequence repeats (SSRs) are widely used genetic markers in ecology, evolution, and conservation even in the genomics era, but a general limitation to their application is the difficulty of developing polymorphic SSR markers. Next-generation sequencing (NGS) offers the opportunity for the rapid development of SSRs; however, previous studies that developed SSRs using genomic data from only one individual needed redundant experiments to test the polymorphism of the SSRs. In this study, we designed a pipeline for the rapid development of polymorphic SSR markers from multi-sample genomic data. We used bioinformatic software to genotype multiple individuals using resequencing data and detected highly polymorphic SSRs prior to experimental validation, which significantly improved efficiency and reduced the experimental effort. The pipeline was successfully applied to a globally threatened species, the brown eared-pheasant (Crossoptilon mantchuricum), which showed very low genomic diversity. The 20 newly developed SSR markers were highly polymorphic, and their average number of alleles was much higher than the genomic average. We also evaluated the effect of the number of individuals and sequencing depth on the SSR mining results, and we found that 10 individuals and ~10X sequencing data were enough to obtain a sufficient number of polymorphic SSRs, even for species with low genetic diversity. Furthermore, the genome assembly of NGS data from the optimal number of individuals and sequencing depth can be used as an alternative reference genome if a high-quality genome is not available. Our pipeline provides a paradigm for the application of NGS technology to mining and developing molecular markers for ecological and evolutionary studies.

17.
The recent FDA approval of the MiSeqDx platform provides a unique opportunity to develop targeted next-generation sequencing (NGS) panels for human disease, including cancer. We have developed a scalable, targeted panel-based assay termed UNCseq, which involves an NGS panel of over 200 cancer-associated genes and a standardized downstream bioinformatics pipeline for detection of single nucleotide variations (SNV) as well as small insertions and deletions (indel). In addition, we developed a novel algorithm, NGScopy, designed for samples with sparse sequencing coverage to detect large-scale copy number variations (CNV), similar to human SNP Array 6.0, as well as small-scale intragenic CNV. Overall, we applied this assay to 100 snap-frozen lung cancer specimens lacking same-patient germline DNA (07–0120 tissue cohort) and validated our results against Sanger sequencing, SNP Array, and our recently published integrated DNA-seq/RNA-seq assay, UNCqeR, where RNA-seq of same-patient tumor specimens confirmed SNV detected by DNA-seq if RNA-seq coverage depth was adequate. In addition, we applied the UNCseq assay to an independent lung cancer tumor tissue collection with available same-patient germline DNA (11–1115 tissue cohort) and confirmed mutations using assays performed in a CLIA-certified laboratory. We conclude that UNCseq can identify SNV, indel, and CNV in tumor specimens lacking germline DNA in a cost-efficient fashion.

18.

Background

Next-generation sequencing (NGS) technologies have rapidly advanced our understanding of human variation in cancer. To accurately translate the raw sequencing data into practical knowledge, annotation tools, algorithms and pipelines must be developed that keep pace with the rapidly evolving technology. Currently, a challenge exists in accurately annotating multi-nucleotide variants (MNVs). These tandem substitutions, when affecting multiple nucleotides within a single protein codon of a gene, result in a translated amino acid involving all nucleotides in that codon. Most existing variant callers report an MNV as individual single-nucleotide variants (SNVs), often resulting in multiple triplet codon sequences and incorrect amino acid predictions. To correct potentially misannotated MNVs among reported SNVs, a primary challenge resides in haplotype phasing, which determines whether the neighboring SNVs are co-located on the same chromosome.

Results

Here we describe MAC (Multi-Nucleotide Variant Annotation Corrector), an integrative pipeline developed to correct potentially misannotated MNVs. MAC was designed as an application that only requires an SNV file and the matching BAM file as data inputs. Using an example data set containing 3024 SNVs and the corresponding whole-genome sequencing BAM files, we show that MAC identified eight potentially misannotated SNVs, and accurately updated the amino acid predictions for seven of the variant calls.

Conclusions

MAC can identify and correct amino acid predictions that result from MNVs affecting multiple nucleotides within a single protein codon, which cannot be handled by most existing SNV-based variant pipelines. The MAC software is freely available and represents a useful tool for the accurate translation of genomic sequence to protein function.
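The annotation problem can be seen on a single codon. The sketch below is not MAC itself. It assumes two SNVs that fall in the same forward-strand codon and sit on the same haplotype, and it translates with the standard genetic code; the helper name and the example codon are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_substitutions(codon, substitutions):
    """Return the codon after applying {offset: base} substitutions (offsets 0-2)."""
    bases = list(codon)
    for offset, base in substitutions.items():
        bases[offset] = base
    return "".join(bases)


reference = "GAG"  # Glu (E)
snv_1 = {0: "A"}   # first codon position, G>A
snv_2 = {1: "C"}   # second codon position, A>C

# Annotating each SNV on its own gives two different amino acid changes.
separate = [CODON_TABLE[apply_substitutions(reference, s)] for s in (snv_1, snv_2)]
# Annotating them together, as one MNV on the same haplotype, gives a third.
joint = CODON_TABLE[apply_substitutions(reference, {**snv_1, **snv_2})]
print(separate, joint)
```

Taken separately, the two SNVs predict Lys (AAG) and Ala (GCG); taken together they produce ACG, which encodes Thr. Whether the joint annotation applies depends on phasing, which is why the matching BAM file is needed as input.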

19.
20.

Background

Next generation sequencing (NGS) methods have significantly contributed to a paradigm shift in genomic research for nearly a decade now. These methods have been useful in studying the dynamic interactions between RNA viruses and human hosts.

Scope of the review

In this review, we summarise and discuss, with examples, key applications of NGS in studying host-pathogen interactions in RNA viral infections of humans.

Major conclusions

The use of NGS to study globally relevant RNA viral infections has revolutionized our understanding of the within-host and between-host evolution of these viruses. These methods have also been useful in clinical decision-making and in guiding biomedical research on vaccine design.

General significance

NGS has been instrumental in viral genomic studies in resolving within-host viral genomic variants and the distribution of nucleotide polymorphisms along the full length of viral genomes in a high-throughput, cost-effective manner. In the future, novel advances such as long-read, single-molecule sequencing of viral genomes and simultaneous sequencing of host and pathogens may become the standard of practice in research and clinical settings. This will also bring new challenges in big data analysis.
