1.
A critical step in detecting variants from next-generation sequencing data is post hoc filtering of putative variants called or predicted by computational tools. Here, we highlight four critical parameters that could enhance the accuracy of called single nucleotide variants and insertions/deletions: quality and depth, refinement and improvement of initial mapping, allele/strand balance, and examination of spurious genes. Appropriate use of these sequence features in variant filtering could greatly improve validation rates, thereby saving time and costs in next-generation sequencing projects.
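The quality, depth, and allele/strand balance criteria named in this abstract could be combined into a simple post hoc filter. The sketch below is a minimal illustration in Python, not the authors' method: the `Call` record, the `passes_filters` function, and every threshold value are hypothetical choices made for this example. The mapping-refinement step is omitted because it happens upstream of filtering.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One putative variant call (hypothetical record layout)."""
    qual: float       # variant quality score
    depth: int        # total read depth at the site
    alt_reads: int    # reads supporting the alternate allele
    alt_forward: int  # alt-supporting reads on the forward strand
    gene: str = ""    # gene containing the site, if any


def passes_filters(
    call,
    min_qual=30.0,              # assumed threshold, tune per project
    min_depth=10,               # assumed threshold
    min_allele_balance=0.2,     # assumed minimum alt-read fraction
    min_strand_fraction=0.1,    # assumed minimum share on either strand
    spurious_genes=frozenset(), # genes known to yield recurrent false calls
):
    """Return True if the call survives all post hoc filters."""
    # Quality and depth
    if call.qual < min_qual or call.depth < min_depth:
        return False
    if call.alt_reads == 0:
        return False
    # Allele balance: fraction of reads carrying the alternate allele
    if call.alt_reads / call.depth < min_allele_balance:
        return False
    # Strand balance: alt reads should not all come from one strand
    forward_fraction = call.alt_forward / call.alt_reads
    if not (min_strand_fraction <= forward_fraction <= 1 - min_strand_fraction):
        return False
    # Spurious genes: drop calls in genes flagged as unreliable
    return call.gene not in spurious_genes
```

A call with 18 of 40 reads supporting the alternate allele, split evenly across strands, would pass these defaults, while the same call with all 18 alternate reads on the forward strand would be rejected for strand bias.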
2.
3.
4.
Gigabase-scale genome assemblies are now feasible using short-read sequencing technology, bringing the cost of such projects below the million-dollar mark.
5.
Danny Challis Jin Yu Uday S Evani Andrew R Jackson Sameer Paithankar Cristian Coarfa Aleksandar Milosavljevic Richard A Gibbs Fuli Yu 《BMC bioinformatics》2012,13(1):8
Background
Whole exome capture sequencing allows researchers to cost-effectively sequence the coding regions of the genome. Although exome capture sequencing methods have become routine and well established, tools specialized for variant calling in this type of data are currently lacking.
6.
7.
8.
9.
Muralidharan O Natsoulis G Bell J Newburger D Xu H Kela I Ji H Zhang N 《Nucleic acids research》2012,40(1):e5
Highly multiplex DNA sequencers have greatly expanded our ability to survey human genomes for previously unknown single nucleotide polymorphisms (SNPs). However, sequencing and mapping errors, though rare, contribute substantially to the number of false discoveries in current SNP callers. We demonstrate that we can significantly reduce the number of false positive SNP calls by pooling information across samples. Although many studies prepare and sequence multiple samples with the same protocol, most existing SNP callers ignore cross-sample information. In contrast, we propose an empirical Bayes method that uses cross-sample information to learn the error properties of the data. This error information lets us call SNPs with a lower false discovery rate than existing methods.
10.
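Advances in next-generation sequencing have placed heavy demands on the processing and analysis of sequencing data. Many software packages for next-generation sequencing data analysis are now available, but the vast majority perform only a single function (for example, only sequence alignment, variant calling, or functional annotation), so selecting and integrating these tools correctly and efficiently has become an urgent need. This article presents an automated pipeline, built on Perl and SGE resource management, for analyzing genome sequencing data from the Illumina platform. The pipeline takes raw sequencing reads as input, calls industry-standard data processing software (such as BWA, Samtools, GATK, and ANNOVAR), and produces a list of variant sites with functional annotations that researchers can readily analyze further. Automated parallel scripts keep the pipeline running efficiently, and results and reports are delivered in a single pass, which reduces manual intervention during data analysis and greatly improves throughput. Users need only fill in a configuration file or enter settings through a graphical interface to complete the entire process. This work provides researchers with a convenient way to analyze next-generation sequencing data.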
11.
A report on the 23rd annual meeting on 'The Biology of Genomes', 11-15 May 2010, Cold Spring Harbor, USA.
12.
Background
The discovery and mapping of genomic variants is an essential step in most analyses performed using sequencing reads. There are a number of mature software packages and associated pipelines that can identify single nucleotide polymorphisms (SNPs) with a high degree of concordance. However, the same cannot be said for tools that are used to identify the other types of variants. Indels represent the second most frequent class of variants in the human genome, after single nucleotide polymorphisms. The reliable detection of indels is still a challenging problem, especially for variants that are longer than a few bases.
Results
We have developed a set of algorithms and heuristics collectively called indelMINER to identify indels from whole genome resequencing datasets using paired-end reads. indelMINER uses a split-read approach to identify the precise breakpoints for indels of size less than a user-specified threshold, and supplements that with a paired-end approach to identify larger variants that are frequently missed with the split-read approach. We use simulated and real datasets to show that an implementation of the algorithm performs favorably when compared to several existing tools.
Conclusions
indelMINER can be used effectively to identify indels in whole-genome resequencing projects. The output is provided in the VCF format along with additional information about the variant, including information about its presence or absence in another sample. The source code and documentation for indelMINER can be freely downloaded from www.bx.psu.edu/miller_lab/indelMINER.tar.gz.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0483-6) contains supplementary material, which is available to authorized users.
13.
Mehdi Pirooznia Melissa Kramer Jennifer Parla Fernando S Goes James B Potash W Richard McCombie Peter P Zandi
Background
The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal.
Results
We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between mapping quality, read depth and allele balance, and SNV call accuracy. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain and accuracies of >99% are achievable.
Conclusions
Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.
14.
Michael Imelfort Chris Duran Jacqueline Batley David Edwards 《Plant biotechnology journal》2009,7(4):312-317
The ongoing revolution in DNA sequencing technology now enables the reading of thousands of millions of nucleotide bases in a single instrument run. However, this data quantity is often compromised by poor confidence in the read quality. The identification of genetic polymorphisms from this data is therefore problematic and, combined with the vast quantity of data, poses a major bioinformatics challenge. However, once these difficulties have been addressed, next-generation sequencing will offer a means to identify and characterize the wealth of genetic polymorphisms underlying the vast phenotypic variation in biological systems. We describe the recent advances in next-generation sequencing technology, together with preliminary approaches that can be applied for single nucleotide polymorphism discovery in plant species.
15.
16.
Kyle Bittinger Emily S Charlson Elizabeth Loy David J Shirley Andrew R Haas Alice Laughlin Yanjie Yi Gary D Wu James D Lewis Ian Frank Edward Cantu Joshua M Diamond Jason D Christie Ronald G Collman Frederic D Bushman 《Genome biology》2014,15(10)
Background
Fungi are important pathogens but challenging to enumerate using next-generation sequencing because of low absolute abundance in many samples and high levels of fungal DNA from contaminating sources.
Results
Here, we analyze fungal lineages present in the human airway using an improved method for contamination filtering. We use DNA quantification data, which are routinely acquired during DNA library preparation, to annotate output sequence data, and improve the identification and filtering of contaminants. We compare fungal communities and bacterial communities from healthy subjects, HIV+ subjects, and lung transplant recipients, providing a gradient of increasing lung impairment for comparison. We use deep sequencing to characterize ribosomal RNA gene segments from fungi and bacteria in DNA extracted from bronchiolar lavage samples and oropharyngeal wash. Comparison to clinical culture data documents improved detection after applying the filtering procedure.
Conclusions
We find increased representation of medically relevant organisms, including Candida, Cryptococcus, and Aspergillus, in subjects with increasingly severe pulmonary and immunologic deficits. We analyze covariation of fungal and bacterial taxa, and find that oropharyngeal communities rich in Candida are also rich in mitis group Streptococci, a community pattern associated with pathogenic polymicrobial biofilms. Thus, using this approach, it is possible to characterize fungal communities in the human respiratory tract more accurately and explore their interactions with bacterial communities in health and disease.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0487-y) contains supplementary material, which is available to authorized users.
17.
Xiaolu Zou Chun Shi Ryan S. Austin Daniele Merico Seth Munholland Frédéric Marsolais Alireza Navabi William L. Crosby K. Peter Pauls Kangfu Yu Yuhai Cui 《Molecular breeding : new strategies in plant improvement》2014,33(4):769-778
Single nucleotide polymorphisms (SNPs) and insertions-deletions (InDels) are valuable molecular markers for genomics and genetics studies and molecular breeding. The advent of next-generation sequencing techniques has enabled researchers to approach high-throughput and cost-effective SNP and InDel discovery on a genomic scale. In this report, 36 common bean genotypes grown in Canada were used to construct reduced representation libraries for next-generation sequencing. Using 76 million sequence reads generated by the Illumina HiSeq 2000 Sequencing System, we identified a total of 43,698 putative SNPs and 1,267 putative InDels. Of the SNPs, 43,504 were bi-allelic and 194 were tri-allelic, and the InDels comprised 574 insertions and 693 deletions. The putative bi-allelic SNPs were distributed across all 11 chromosomes, with the highest number of SNPs observed in chromosome 2 (4,788) and the lowest in chromosome 10 (2,941). With the aid of the recent release of the first chromosome-scale version of Phaseolus vulgaris, 24,907 bi-allelic SNPs, 79 tri-allelic SNPs, 315 insertions, and 377 deletions were located in 8,758, 77, 273, and 364 genes, respectively. Among these 24,907 bi-allelic SNPs, 7,168 nonsynonymous bi-allelic SNPs, located in 4,303 genes, were identified across the 36 common bean genotypes. A total of 113 putative SNPs were randomly chosen for validation using high-resolution melt analysis. Of the 113 candidate SNPs, 105 (92.9%) contained the predicted SNPs.
18.
We develop a statistical tool, SNVer, for calling common and rare variants in analysis of pooled or individual next-generation sequencing (NGS) data. We formulate variant calling as a hypothesis testing problem and employ a binomial-binomial model to test the significance of observed allele frequency against sequencing error. SNVer reports a single overall P-value for evaluating the significance of a candidate locus being a variant, based on which multiplicity control can be obtained. This is particularly desirable because tens of thousands of loci are simultaneously examined in typical NGS experiments. Each user can choose the false-positive error rate threshold he or she considers appropriate, instead of just the dichotomous decisions of whether to 'accept or reject the candidates' provided by most existing methods. We use both simulated data and real data to demonstrate the superior performance of our program in comparison with existing methods. SNVer runs very fast and can complete testing of 300 K loci within an hour. This excellent scalability makes it feasible for analysis of whole-exome sequencing data, or even whole-genome sequencing data using a high-performance computing cluster. SNVer is freely available at http://snver.sourceforge.net/.
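The idea of testing an observed allele count against sequencing error can be illustrated with a plain one-sided binomial test. This is a deliberately simplified sketch and not SNVer's binomial-binomial model: the function names, the fixed per-base `error_rate`, and the use of a Bonferroni correction for multiplicity control are all assumptions made for this example.

```python
from math import exp, lgamma, log


def binomial_tail_pvalue(alt_reads, depth, error_rate):
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    Null hypothesis: every alternate-allele read is a sequencing error.
    """
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    total = 0.0
    for i in range(alt_reads, depth + 1):
        # Work in log space so large binomial coefficients do not overflow
        log_pmf = (
            lgamma(depth + 1) - lgamma(i + 1) - lgamma(depth - i + 1)
            + i * log(error_rate)
            + (depth - i) * log(1.0 - error_rate)
        )
        total += exp(log_pmf)
    return min(1.0, total)


def is_significant(p_value, n_loci, alpha=0.05):
    """Bonferroni-style multiplicity control (an assumed choice for this sketch)."""
    return p_value < alpha / n_loci
```

Because the function returns a P-value rather than a yes/no call, the caller remains free to pick the error-rate threshold, which mirrors the design point the abstract emphasizes.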
19.
20.
Genotype and SNP calling from next-generation sequencing data
Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both reduce and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data. We review these methods and provide a guide for their use in NGS studies.
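One common way to quantify genotype uncertainty is to compute a posterior probability for each genotype from per-read error probabilities. The sketch below is a minimal illustration under stated assumptions and is not taken from the review: it handles a single biallelic diploid site, treats every sequencing error as a swap between the reference and alternate allele, and uses a flat prior by default. The function name and input layout are hypothetical.

```python
def genotype_posteriors(observations, priors=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior probabilities for genotypes with 0, 1, or 2 alternate alleles.

    observations: list of (is_alt, error_prob) pairs, one per read.
    priors: prior probability of each genotype (flat by default, an assumption).
    """
    weighted = []
    for alt_copies, prior in zip((0, 1, 2), priors):
        alt_fraction = alt_copies / 2
        likelihood = 1.0
        for is_alt, error_prob in observations:
            # Chance of observing the alt base given the genotype, allowing
            # for a ref<->alt miscall with probability error_prob
            p_alt = alt_fraction * (1 - error_prob) + (1 - alt_fraction) * error_prob
            likelihood *= p_alt if is_alt else (1 - p_alt)
        weighted.append(likelihood * prior)
    total = sum(weighted)
    return [w / total for w in weighted]
```

With only two reads, both showing the alternate allele at a 1% error rate, the homozygous-alternate genotype receives a posterior of about 0.80 rather than near certainty, which illustrates why low-coverage data benefit from probabilistic calls.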