20 similar documents found; search took 15 milliseconds
1.
Mehdi Pirooznia Melissa Kramer Jennifer Parla Fernando S Goes James B Potash W Richard McCombie Peter P Zandi
Background
The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal.
Results
We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole-exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that the Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between mapping quality, read depth, allele balance, and SNV call accuracy. However, if best practices are used in data processing, additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable.
Conclusions
Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.
2.
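The positive predictive value (PPV) compared in the abstract above is simply the fraction of called SNVs confirmed by the gold standard. A minimal sketch of the calculation; the counts used here are purely illustrative, not the study's confirmation data:

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the fraction of called variants that are real,
    as judged by a gold standard such as Sanger sequencing."""
    return true_positives / (true_positives + false_positives)

# Illustrative counts only: 93 of 100 calls confirmed by the gold standard.
ppv = positive_predictive_value(93, 7)  # 0.93
```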
Anna-Lena Volckmar Volker Endris Farastuk Bozorgmehr Clemens Lier Carlota Porcel Martina Kirchner Jonas Leichsenring Roland Penzel Michael Thomas Peter Schirmacher Arne Warth Albrecht Stenzinger
Background
Inhibition of the oncogenic fusion gene EML4-ALK is a current first-line approach for patients with stage IV non-small cell lung cancer. While FISH was established as the gold standard for identifying these patients, there is accumulating evidence that other detection methods, i.e., immunohistochemistry (IHC) and next-generation sequencing (NGS), may be equally successful. However, the concordance of these methods is still under investigation.
Case presentation
Adding to the current literature, we here report a 56-year-old female never-smoker with stage IV lung adenocarcinoma whose biopsy was inconclusive by IHC and FISH but positive by NGS. Retroactive profiling of the resection specimen corroborated the fusion reads obtained by NGS, confirmed FISH positivity, and showed weak ALK positivity by IHC. Consequently, we diagnosed the case as ALK-positive, rendering the patient eligible for crizotinib treatment.
Conclusions
With IHC on biopsy material alone, this case would have been overlooked, withholding effective therapy.
3.
4.
Rosario Carmona Macarena Arroyo María José Jiménez-Quesada Pedro Seoane Adoración Zafra Rafael Larrosa Juan de Dios Alché M. Gonzalo Claros
Background
Gene expression analyses demand appropriate reference genes (RGs) for normalization in order to obtain reliable assessments. Ideally, RG expression levels should remain constant across all cells, tissues, or experimental conditions under study. Housekeeping genes traditionally fulfilled this requirement, but they have been reported to be less invariant than expected; therefore, RGs should be tested and validated for every particular situation. Microarray data have been used to propose new RGs, but only a limited set of model species and conditions is available; by contrast, RNA-seq experiments are increasingly frequent and constitute a new source of candidate RGs.
Results
An automated workflow based on mapped NGS reads has been constructed to obtain highly and invariantly expressed RGs, based on normalized expression in reads per mapped million and the coefficient of variation. This workflow has been tested with Roche/454 reads from reproductive tissues of olive tree (Olea europaea L.), as well as with Illumina paired-end reads from two different accessions of Arabidopsis thaliana and three different human cancers (prostate, small-cell lung cancer, and lung adenocarcinoma). Candidate RGs have been proposed for each species, and many of them have previously been reported as RGs in the literature. Experimental validation of significant RGs in olive tree is provided to support the algorithm.
Conclusion
Regardless of sequencing technology, number of replicates, and library sizes, when RNA-seq experiments are designed and performed, the same datasets can be analyzed with our workflow to extract suitable RGs for subsequent PCR validation. Moreover, different subsets of experimental conditions can yield different suitable RGs.
5.
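The selection criterion described in the abstract above (high, invariant expression after per-million normalization) can be sketched as follows. The mean-expression and coefficient-of-variation cutoffs below are illustrative assumptions, not the workflow's published parameters:

```python
import statistics

def rpm_normalize(counts, library_sizes):
    """Normalize raw per-gene counts to reads per mapped million (RPM)."""
    return [[1e6 * c / n for c, n in zip(row, library_sizes)] for row in counts]

def candidate_reference_genes(genes, counts, library_sizes,
                              min_mean=50.0, max_cv=0.2):
    """Keep genes that are highly expressed (mean RPM >= min_mean) and
    invariant (coefficient of variation <= max_cv) across all samples,
    returning them most-stable first."""
    rpm = rpm_normalize(counts, library_sizes)
    candidates = []
    for gene, row in zip(genes, rpm):
        mean = statistics.mean(row)
        cv = statistics.stdev(row) / mean if mean > 0 else float("inf")
        if mean >= min_mean and cv <= max_cv:
            candidates.append((gene, mean, cv))
    return sorted(candidates, key=lambda t: t[2])
```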
6.
7.
8.
Background
With the advances in next-generation sequencing technologies, researchers can now rapidly examine the composition of samples from humans and their surroundings. To enhance the accuracy of taxonomy assignments in metagenomic samples, we developed a method that allows a different mismatch probability for each genome.
Results
We extended the algorithm for taxonomic assignment of metagenomic sequence reads (TAMER) by developing an improved method that can set a different mismatch probability for each genome, rather than imposing a single parameter on all genomes, thereby obtaining a greater degree of accuracy. This method, which we call TADIP (Taxonomic Assignment of metagenomics based on DIfferent Probabilities), was comprehensively tested on simulated and real datasets. The results show that TADIP improves on TAMER's performance, especially in large, high-complexity datasets.
Conclusions
TADIP was developed as a statistical model to improve the accuracy of taxonomy assignments. Thanks to its per-genome mismatch probabilities and correlated variance matrix, its performance on high-complexity samples was enhanced compared with TAMER.
9.
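The core idea of per-genome mismatch probabilities can be sketched as assigning each read to the genome that maximizes a binomial likelihood of its observed mismatch count. This is an illustration of the general principle only, not TADIP's actual statistical model:

```python
from math import comb

def mismatch_likelihood(read_len, mismatches, p):
    """Binomial probability of observing `mismatches` errors in a read of
    length `read_len` when the per-base mismatch probability is p."""
    return comb(read_len, mismatches) * p**mismatches * (1 - p)**(read_len - mismatches)

def assign_read(read_len, mismatches_per_genome, genome_probs):
    """Assign a read to the genome with the highest likelihood, where each
    genome g has its own mismatch probability genome_probs[g]."""
    return max(
        genome_probs,
        key=lambda g: mismatch_likelihood(read_len, mismatches_per_genome[g],
                                          genome_probs[g]),
    )
```

A read with 5 mismatches over 100 bases is far more plausible under a genome with a 5% mismatch rate than under one with a 1% rate, so a single global parameter would misrank such assignments.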
10.
Chao Zhao Yanan Chu Yanhong Li Chengfeng Yang Yuqing Chen Xumin Wang Bin Liu 《Biotechnology letters》2017,39(1):123-131
Objectives
To analyze the microbial diversity and gene content of a thermophilic cellulose-degrading consortium from hot springs in Xiamen, China, using 454 pyrosequencing, in order to discover cellulolytic enzyme resources.
Results
A thermophilic cellulose-degrading consortium, XM70, isolated from a hot spring, used sugarcane bagasse as its sole carbon and energy source. DNA sequencing of the XM70 sample yielded 349,978 reads with an average read length of 380 bases, accounting for 133,896,867 bases of sequence information. Characterization of the sequencing reads and assembled contigs revealed that most microbes belonged to four genera: Geobacillus (Firmicutes), Thermus, Bacillus, and Anoxybacillus. Twenty-eight homologous genes belonging to 15 glycoside hydrolase families were detected, including several cellulase genes. A novel thermophilic cellulase derived from the hot spring metagenome was expressed and characterized.
Conclusions
These results demonstrate the value of thermostable sugarcane bagasse-degrading enzymes for the production of cellulosic biofuel, as well as the practical power of a short-read-based metagenomic approach for harvesting novel microbial genes.
11.
Andres Benavides Juan Pablo Isaza Juan Pablo Niño-García Juan Fernando Alzate Felipe Cabarcas 《BMC genomics》2018,19(8):858
Background
Hot spring bacteria have unique biological adaptations to survive the extreme conditions of these environments, and they produce thermostable enzymes that can be used in biotechnological and industrial applications. However, sequencing these bacteria is complicated because they cannot be cultured. As an alternative, shotgun sequencing of whole microbial communities can be used. The problem is that classifying sequences within a metagenomic dataset is very challenging, particularly when the samples include unknown microorganisms that lack a genomic reference. We failed to recover a bacterial genome from a hot spring metagenome using the available software tools, so we developed a new tool that allowed us to recover most of this genome.
Results
We present a proteobacterial draft genome reconstructed from a hot spring metagenome from the Colombian Andes. The genome appears to represent a new lineage within the family Rhodanobacteraceae of the class Gammaproteobacteria, closely related to the genus Dokdonella. We were able to generate this genome thanks to CLAME (from the Spanish "CLAsificador MEtagenomico"), a tool that groups reads into bins. We show that most reads from each bin belong to a single chromosome. CLAME is very effective at recovering most of the reads belonging to the predominant species within a metagenome.
Conclusions
We developed a tool that can be used to extract genomes (or parts of them) from a complex metagenome.
12.
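Read binning of the kind described above can be illustrated with a toy scheme that merges reads sharing k-mers via union-find; reads from the same chromosome tend to overlap and thus share k-mers. This is a simplified illustration of the general idea, not CLAME's actual algorithm, and the `k` and `min_shared` values are arbitrary:

```python
from itertools import combinations

def kmers(seq, k):
    """All k-length substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bin_reads(reads, k=21, min_shared=1):
    """Group reads into bins when they share at least `min_shared` k-mers,
    using union-find with path halving."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(r, k) for r in reads]
    for i, j in combinations(range(len(reads)), 2):
        if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
            parent[find(i)] = find(j)  # union the two bins

    bins = {}
    for i in range(len(reads)):
        bins.setdefault(find(i), []).append(i)
    return list(bins.values())
```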
Background
Although single-molecule sequencing is still improving, the length of the generated sequences is an undeniable advantage in genome assembly. Prior work utilizing long reads for genome assembly has mostly focused on correcting sequencing errors and improving the contiguity of de novo assemblies.
Results
We propose a disassembling-reassembling approach for both correcting structural errors in the draft assembly and scaffolding a target assembly based on error-corrected single-molecule sequences. To achieve this goal, we formulate a maximum alternating path cover problem. We prove that this problem is NP-hard, and we solve it with a 2-approximation algorithm.
Conclusions
Our experimental results show that our approach can improve the structural correctness of target assemblies at the cost of some contiguity, even with relatively small amounts of long reads. In addition, our reassembling process can also serve as a competitive scaffolder relative to well-established assembly benchmarks.
13.
Zhiliang Hu Xing Weng Chunhua Xu Yang Lin Cong Cheng Hongxia Wei Wei Chen 《Annals of clinical microbiology and antimicrobials》2018,17(1):45
Background
More than 100 different pathogens can cause encephalitis, and testing for all neurological pathogens by conventional methods can be difficult. Metagenomic next-generation sequencing (NGS) can identify infectious agents in a target-independent manner, but the role of this novel method in clinical diagnostic microbiology still needs to be evaluated. In the present study, we used metagenomic NGS to search for an infectious etiology in a human immunodeficiency virus (HIV)-infected patient with lethal diffuse brain lesions. Sequences mapping to Toxoplasma gondii were unexpectedly detected.
Case presentation
A 31-year-old HIV-infected patient presented to hospital critically ill, with a Glasgow coma scale score of 3. Brain magnetic resonance imaging showed diffuse brain abnormalities with contrast enhancement. Metagenomic NGS was performed on DNA extracted from 300 μL of the patient's cerebrospinal fluid (CSF) with the BGISEQ-50 platform. Sequencing identified 65,357 reads uniquely aligned to the Toxoplasma gondii genome. The presence of the Toxoplasma gondii genome in the CSF was further verified by Toxoplasma gondii-specific polymerase chain reaction and Sanger sequencing. Altogether, these results confirmed the diagnosis of toxoplasmic encephalitis.
Conclusions
This study suggests that metagenomic NGS may be a useful diagnostic tool for toxoplasmic encephalitis. As metagenomic NGS is able to identify all pathogens in a single run, it may be a promising strategy for identifying causative pathogens in central nervous system infections with atypical features.
14.
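The "uniquely aligned reads" figure reported in the case above reduces to counting reads whose alignments hit exactly one taxon. A minimal sketch; the dictionary mapping read IDs to aligned taxa is a simplifying assumption about how alignment results are represented:

```python
from collections import Counter

def unique_read_counts(alignments):
    """Count reads aligned to exactly one taxon.
    `alignments` maps read id -> set of taxa the read aligned to;
    multi-mapping reads are discarded as ambiguous."""
    counts = Counter()
    for read_id, taxa in alignments.items():
        if len(taxa) == 1:
            counts[next(iter(taxa))] += 1
    return counts
```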
Velina Kozareva Clayton Stroff Maxwell Silver Jonathan F. Freidin Nigel F. Delaney 《BMC medical genomics》2018,11(1):91
Background
Detection of copy number variants (CNVs) is an important aspect of clinical testing for several disorders, including Duchenne muscular dystrophy, and is often performed using multiplex ligation-dependent probe amplification (MLPA). However, since many genetic carrier screens depend instead on next-generation sequencing (NGS) for wider discovery of small variants, they often do not include CNV analysis. Moreover, most computational techniques developed to detect CNVs from exome sequencing data are not suitable for carrier screening, as they require matched normals, very large cohorts, or extensive gene panels.
Methods
We present a computational software package, geneCNV (http://github.com/vkozareva/geneCNV), which can identify exon-level CNVs using exome sequencing data from only a few genes. The tool relies on a hierarchical parametric model trained on a small cohort of reference samples.
Results
Using geneCNV, we accurately inferred heterozygous CNVs in the DMD gene across a cohort of 15 test subjects. These results were validated against MLPA, the current standard for clinical CNV analysis in DMD. We also benchmarked the tool's performance against other computational techniques and found comparable or improved CNV detection in DMD using data from panels ranging from 4,000 genes to as few as 8 genes.
Conclusions
geneCNV enables the creation of cost-effective screening panels by allowing NGS approaches to generate results equivalent to bespoke genotyping assays such as MLPA. By using a parametric model to detect CNVs, it also fulfills regulatory requirements to define a reference range for a genetic test. It is freely available and can be incorporated into any Illumina sequencing pipeline to create clinical assays for the detection of exon duplications and deletions.
15.
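The intuition behind exon-level CNV calling from coverage can be sketched with simple depth ratios: a heterozygous deletion roughly halves an exon's coverage relative to a reference cohort, while a duplication raises it by about half. This simplified illustration is not geneCNV's hierarchical parametric model, and the cutoffs are arbitrary:

```python
def call_exon_cnvs(sample_depths, reference_depths,
                   del_cutoff=0.7, dup_cutoff=1.3):
    """Call exon-level CNVs from normalized depth ratios against a
    reference cohort mean: ratio ~0.5 suggests a heterozygous deletion,
    ~1.5 a duplication. Cutoffs here are illustrative only."""
    calls = {}
    for exon, depth in sample_depths.items():
        ratio = depth / reference_depths[exon]
        if ratio < del_cutoff:
            calls[exon] = "deletion"
        elif ratio > dup_cutoff:
            calls[exon] = "duplication"
        else:
            calls[exon] = "normal"
    return calls
```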
Background
Innumerable opportunities for new genomic research have been created by advances in high-throughput next-generation sequencing (NGS). However, the pitfall of this abundance of NGS data is the difficulty of distinguishing true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but an independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study investigates the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and provides recommendations for selecting suitable methods for specific NGS datasets.
Methods
Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 Mb). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method.
Results
Results from computational experiments indicate that Musket had the best overall performance across the spectrum of variants reflected in the six datasets. Musket's lowest accuracy (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), medium coverage (50×), and a small genome (5.4 Mb). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets.
Conclusions
This study demonstrates that factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six test datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against non-k-spectrum-based classes of error correction methods.
16.
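The evaluation metrics listed above follow directly from per-base counts; a minimal sketch of the standard definitions used in error-correction benchmarking, where gain = (TP − FP)/(TP + FN) measures the net fraction of errors removed:

```python
def correction_metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard error-correction quality metrics from per-base counts:
    TP = errors correctly fixed, FP = correct bases wrongly changed,
    FN = errors left uncorrected."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    gain = (tp - fp) / (tp + fn)  # can go negative if a tool does more harm than good
    f_score = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision,
            "gain": gain, "f_score": f_score}
```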
Elizabeth A Tindall Desiree C Petersen Stina Nikolaysen Webb Miller Stephan C Schuster Vanessa M Hayes 《BMC research notes》2010,3(1):39
Background
High-throughput custom-designed genotyping arrays are a valuable resource for biologically focused research studies and, increasingly, for validation of variation predicted by next-generation sequencing (NGS) technologies. We investigate the Illumina GoldenGate chemistry, using custom-designed VeraCode and Sentrix array matrix (SAM) assays for each of these applications, respectively. We highlight approaches to interpreting Illumina-generated genotype cluster plots that maximise data inclusion and reduce genotyping errors.
Findings
We illustrate the dramatic effect of outliers on genotype calling and data interpretation, and we suggest simple means of avoiding genotyping errors. Furthermore, we present this platform as a successful method for two-cluster rare or non-autosomal variant calling. The ability of high-throughput technologies to accurately call rare variants will become an essential feature for future association studies. Finally, we highlight an additional advantage of the Illumina GoldenGate chemistry: it generates unusually segregated cluster plots that identify potential NGS-generated sequencing errors resulting from minimal coverage.
Conclusions
We demonstrate the importance of visually inspecting genotype cluster plots generated by the Illumina software and issue warnings regarding commonly accepted quality control parameters. In addition to suggesting ways to minimise data exclusion, we propose that the Illumina cluster plots may be helpful in identifying potential input sequence errors, which is particularly important for studies validating NGS-generated variation.
17.
Background
The reconstruction of ancestral genomes must deal with the problem of resolution, necessarily involving a trade-off between trying to identify genomic details and being overwhelmed by noise at higher resolutions.
Results
We use the synteny-block-level median reconstruction of the ancestral genome of the order Gentianales, based on coffee, Rhazya stricta, and grape, to exemplify the effects of resolution (granularity) on comparative genomic analyses.
Conclusions
We show how decreased resolution blurs the differences between evolving genomes with respect to rate, mutational process, and other characteristics.
18.
Background
Pseudogenes are heritable genetic elements that show sequence similarity to functional genes but carry deleterious mutations. We describe a computational pipeline for identifying them which, in contrast to previous work, explicitly uses the intron-exon structure of parent genes to classify pseudogenes. We require alignments between duplicated pseudogenes and their parents to span intron-exon junctions, and this can be used to distinguish true duplicated pseudogenes from processed pseudogenes (with insertions).
Results
Applying our approach to the ENCODE regions, we identify about 160 pseudogenes, 10% of which have clear 'intron-exon' structure and are thus likely generated by recent duplications.
Conclusion
Detailed examination of our results and comparison of our annotation with the GENCODE reference annotation demonstrate that our computational pipeline provides a good balance between identifying all pseudogenes and delineating the precise structure of duplicated genes.
20.
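The junction-spanning criterion described above can be sketched as a simple interval test: a duplicated pseudogene's alignment to its parent crosses at least one intron-exon boundary in genomic coordinates, whereas a processed (retrotransposed) pseudogene aligns only within spliced exons. This is a simplified illustration of the classification rule, not the paper's full pipeline:

```python
def spans_junction(align_start, align_end, exon_boundaries):
    """Return True if the alignment interval on the parent gene strictly
    crosses at least one intron-exon boundary, i.e., the pseudogene
    retains intron structure."""
    return any(align_start < b < align_end for b in exon_boundaries)

def classify_pseudogene(align_start, align_end, exon_boundaries):
    """A junction-spanning alignment suggests a duplicated pseudogene;
    otherwise the pseudogene is classified as processed."""
    if spans_junction(align_start, align_end, exon_boundaries):
        return "duplicated"
    return "processed"
```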