1.

Compression of Structured High-Throughput Sequencing Data

Fabien Campagne Kevin C. Dorff Nyasha Chambwe James T. Robinson Jill P. Mesirov 《PloS one》2013,8(11)

Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size by more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays.

2.

Compression of DNA sequence reads in FASTQ format

Deorowicz S Grabowski S 《Bioinformatics (Oxford, England)》2011,27(6):860-862

3.

Generation of Artificial FASTQ Files to Evaluate the Performance of Next-Generation Sequencing Pipelines

Matthew Frampton Richard Houlston 《PloS one》2012,7(11)

Pipelines for the analysis of Next-Generation Sequencing (NGS) data are generally composed of a set of different publicly available software, configured together in order to map short reads of a genome and call variants. The fidelity of pipelines is variable. We have developed ArtificialFastqGenerator, which takes a reference genome sequence as input and outputs artificial paired-end FASTQ files containing Phred quality scores. Since these artificial FASTQs are derived from the reference genome, they provide a gold standard for read alignment and variant calling, thereby enabling the performance of any NGS pipeline to be evaluated. The user can customise DNA template/read length, the modelling of coverage based on GC content, whether to use real Phred base quality scores taken from existing FASTQ files, and whether to simulate sequencing errors. Detailed coverage and error summary statistics are output. Here we describe ArtificialFastqGenerator and illustrate its implementation in evaluating a typical bespoke NGS analysis pipeline under different experimental conditions. ArtificialFastqGenerator was released in January 2012. Source code, example files and binaries are freely available under the terms of the GNU General Public License v3.0 from https://sourceforge.net/projects/artfastqgen/.
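The idea of deriving paired-end reads from a reference can be illustrated with a minimal Python sketch. It is not ArtificialFastqGenerator's code: the function names, the uniform template sampling, the constant Phred score, and the Phred+33 encoding are all assumptions chosen for brevity, and GC-dependent coverage is not modelled.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def add_errors(read, error_rate, rng):
    """Substitute each base with a different base with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def fastq_record(name, seq, quality, offset=33):
    """Format one FASTQ record with a constant Phred score (assumed Phred+33)."""
    qual = chr(quality + offset) * len(seq)
    return f"@{name}\n{seq}\n+\n{qual}\n"

def simulate_pair(reference, template_len, read_len, rng,
                  quality=40, error_rate=0.0):
    """Sample one template uniformly and return its two paired-end records."""
    start = rng.randrange(len(reference) - template_len + 1)
    template = reference[start:start + template_len]
    # Read 1 comes from the template's 5' end; read 2 is the reverse
    # complement of its 3' end, as in paired-end sequencing.
    read1 = add_errors(template[:read_len], error_rate, rng)
    read2 = add_errors(reverse_complement(template[-read_len:]), error_rate, rng)
    name = f"sim_{start}"  # hypothetical naming scheme
    return (fastq_record(name + "/1", read1, quality),
            fastq_record(name + "/2", read2, quality))
```

Because every read's true origin is known (here it is encoded in the read name), the output of an aligner or variant caller can be checked against it directly.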

4.

Analysis of Standard Data Formats in ECG Data Management Systems

邱四海 曾永华 《上海生物医学工程》2010,(2):63-67

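This paper introduces three standardized storage formats used in electrocardiographs: SCP, HL7 aECG, and DICOM. It analyzes and compares the structure and main features of the three formats and contrasts their respective strengths and weaknesses. As grid-based networking develops, more and more manufacturers on the market will support these common ECG data storage formats.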

5.

Light-weight reference-based compression of FASTQ data

Yongpeng Zhang Linsen Li Yanli Yang Xiao Yang Shan He Zexuan Zhu 《BMC bioinformatics》2015,16(1)

Background

The exponential growth of next generation sequencing (NGS) data has posed big challenges to data storage, management and archive. Data compression is one of the effective solutions, where reference-based compression strategies can typically achieve superior compression ratios compared to the ones not relying on any reference.

Results

This paper presents a lossless, light-weight, reference-based compression algorithm, LW-FQZip, for compressing FASTQ data. The three components of any given input, i.e., metadata, short reads and quality score strings, are first parsed into three data streams in which redundant information is identified and eliminated independently. In particular, well-designed incremental and run-length-limited encoding schemes are used to compress the metadata and quality score streams, respectively. To handle the short reads, LW-FQZip uses a novel light-weight mapping model to map them quickly against external reference sequence(s) and produce concise alignment results for storage. The three processed data streams are then packed together with general-purpose compression algorithms such as LZMA. LW-FQZip was evaluated on eight real-world NGS data sets and achieved compression ratios in the range of 0.111-0.201. This is comparable or superior to other state-of-the-art lossless NGS data compression algorithms.
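The stream separation described above can be sketched in Python. This is an illustration under stated assumptions, not LW-FQZip's implementation: all function names are hypothetical, records are assumed to be well-formed four-line FASTQ entries, "incremental" encoding is interpreted as storing each header as the length of the prefix shared with the previous header plus the differing suffix, and "run-length-limited" is interpreted as run-length encoding with a capped run length. The read-mapping stage is omitted.

```python
from itertools import groupby

def split_fastq_streams(fastq_text):
    """Split FASTQ text into metadata, read, and quality-score streams."""
    lines = fastq_text.strip().splitlines()
    metadata, reads, qualities = [], [], []
    # Each record is assumed to span four lines: @header, sequence, '+', quality.
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        metadata.append(header[1:])  # drop the leading '@'
        reads.append(seq)
        qualities.append(qual)
    return metadata, reads, qualities

def incremental_encode(headers):
    """Encode each header as (shared-prefix length, remaining suffix)."""
    encoded, prev = [], ""
    for h in headers:
        k = 0
        while k < min(len(h), len(prev)) and h[k] == prev[k]:
            k += 1
        encoded.append((k, h[k:]))
        prev = h
    return encoded

def incremental_decode(encoded):
    """Invert incremental_encode."""
    headers, prev = [], ""
    for k, suffix in encoded:
        prev = prev[:k] + suffix
        headers.append(prev)
    return headers

def run_length_encode(text, max_run=255):
    """Encode text as (char, count) pairs, capping each run at max_run."""
    pairs = []
    for ch, group in groupby(text):
        n = len(list(group))
        while n > max_run:
            pairs.append((ch, max_run))
            n -= max_run
        pairs.append((ch, n))
    return pairs

def run_length_decode(pairs):
    """Invert run_length_encode."""
    return "".join(ch * n for ch, n in pairs)
```

Both encoders are lossless by construction: decoding the encoded stream reproduces the input exactly, so their output could then be handed to a general-purpose compressor such as the standard-library `lzma` module.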

Conclusions

LW-FQZip is a program that enables efficient lossless FASTQ data compression. It contributes to state-of-the-art applications for NGS data storage and transmission. LW-FQZip is freely available online at: http://csse.szu.edu.cn/staff/zhuzx/LWFQZip.

6.

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

Peter J. A. Cock Christopher J. Fields Naohisa Goto Michael L. Heuer Peter M. Rice 《Nucleic acids research》2010,38(6):1767-1771

7.

The JPEG2000 Compression Algorithm and Its Implementation in the DICOM Medical Image Format

夏新 郑西川 《上海生物医学工程》2008,29(1):18-21

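This paper describes the programming approach and methods for implementing the JPEG2000 compression algorithm in the DICOM medical image format. It provides selected VC++ code with fairly detailed annotations of the key functions, compares two image compression formats, and presents detailed experimental results.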

8.

Statistical Design and Analysis of RNA Sequencing Data

Paul L. Auer R. W. Doerge 《Genetics》2010,185(2):405-416

9.

Genotype-Frequency Estimation from High-Throughput Sequencing Data

Takahiro Maruki Michael Lynch 《Genetics》2015,201(2):473-486

Rapidly improving high-throughput sequencing technologies provide unprecedented opportunities for carrying out population-genomic studies with various organisms. To take full advantage of these methods, it is essential to correctly estimate allele and genotype frequencies, and here we present a maximum-likelihood method that accomplishes these tasks. The proposed method fully accounts for uncertainties resulting from sequencing errors and biparental chromosome sampling and yields essentially unbiased estimates with minimal sampling variances at moderately high depths of coverage, regardless of the mating system and structure of the population. Moreover, we have developed statistical tests for examining the significance of polymorphisms and their genotypic deviations from Hardy–Weinberg equilibrium. We examine the performance of the proposed method by computer simulations and apply it to low-coverage human data generated by high-throughput sequencing. The results show that the proposed method improves our ability to carry out population-genomic analyses in important ways. The software package of the proposed method is freely available from https://github.com/Takahiro-Maruki/Package-GFE.
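For orientation, the quantity being tested, deviation of genotype frequencies from Hardy–Weinberg equilibrium, can be shown with the textbook chi-square calculation on called genotype counts. This sketch is not the maximum-likelihood method of the paper: it assumes error-free genotype calls at a biallelic site, which is exactly the simplification the paper's method avoids, and the function name is hypothetical.

```python
def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return allele frequency p, expected HWE counts, and the chi-square statistic.

    Assumes a biallelic site and error-free genotype calls.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p
    # Expected counts under Hardy-Weinberg proportions p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, expected, chi2
```

With low-coverage data the observed genotype counts themselves are uncertain, which is why a likelihood-based approach working from the reads is preferred over this direct calculation.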

10.

Pneumocystis carinii erg6 Gene: Sequencing and Expression of Recombinant SAM:Sterol Methyltransferase in Heterologous Systems

EDNA S. KANESHIRO JILL A. ROSENFELD MIREILLE BASSELIN SUZANNE BRADSHAW JAMES R. STRINGER A. GEORGE SMULIAN JOSÉ-L. GINER 《The Journal of eukaryotic microbiology》2001,48(S1):144s-146s

11.

Data release in Human Genome Sequencing Projects

Sugawara H 《Tanpakushitsu kakusan koso. Protein, nucleic acid, enzyme》2003,48(13):1857-1862

12.

OTU Analysis Using Metagenomic Shotgun Sequencing Data

Xiaolin Hao Ting Chen 《PloS one》2012,7(11)

Because of technological limitations, the primer and amplification biases in targeted sequencing of 16S rRNA genes have veiled the true microbial diversity underlying environmental samples. However, metagenomic shotgun sequencing provides 16S rRNA gene fragment data that are naturally immune to the biases introduced during priming, and thus has the potential to uncover the true structure of a microbial community by giving more accurate predictions of operational taxonomic units (OTUs). Nonetheless, the lack of statistically rigorous comparison between 16S rRNA gene fragments and other data types makes it difficult to interpret previously reported results based on 16S rRNA gene fragments. Therefore, in the present work, we established a standard analysis pipeline that helps confirm whether differences in the data are real or merely due to potential technical bias. This pipeline is built by using simulated data to find optimal mapping and OTU prediction methods. Comparison of simulated datasets showed that, with our proposed pipeline, a 16S rRNA gene fragment longer than 150 bp provides the same accuracy as a full-length 16S rRNA sequence. This result could serve as a good starting point for experimental design and makes comparison between 16S rRNA gene fragment-based and targeted 16S rRNA sequencing-based surveys possible.

13.

A Review of Research on High-Throughput Sequencing Read Alignment

《生命科学研究》2014,(5):458-464

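The rapid development of high-throughput sequencing technology has brought new opportunities and challenges to bioinformatics. Second-generation sequencing produces reads that are numerous and short, so earlier sequence analysis methods no longer apply. In recent years, sequence analysis algorithms and software for high-throughput sequencing have multiplied and now number in the hundreds, which makes choosing suitable software difficult. This review summarizes the sequencing types, sequence types, and analysis algorithms of second-generation sequencing. It analyzes and compares in detail the commonly used analysis software with respect to sequence type, read length, underlying algorithm, input/output formats, characteristics, and functions, and offers recommendations. It also analyzes existing problems in current sequencing technology and sequence analysis, and predicts future directions of development.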

14.

ANGSD: Analysis of Next Generation Sequencing Data

Thorfinn Sand Korneliussen Anders Albrechtsen Rasmus Nielsen 《BMC bioinformatics》2014,15(1)

Background

High-throughput DNA sequencing technologies are generating vast amounts of data. Fast, flexible and memory efficient implementations are needed in order to facilitate analyses of thousands of samples simultaneously.

Results

We present a multithreaded program suite called ANGSD. This program can calculate various summary statistics, and perform association mapping and population genetic analyses utilizing the full information in next generation sequencing data by working directly on the raw sequencing data or by using genotype likelihoods.

Conclusions

The open source C/C++ program ANGSD is available at http://www.popgen.dk/angsd. The program is tested and validated on GNU/Linux systems. The program supports multiple input formats, including BAM and imputed beagle genotype probability files. The program allows the user to choose between combinations of existing methods and can perform analyses that are not implemented elsewhere.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0356-4) contains supplementary material, which is available to authorized users.

15.

BASiCS: Bayesian Analysis of Single-Cell Sequencing Data

Catalina A. Vallejos John C. Marioni Sylvia Richardson 《PLoS computational biology》2015,11(6)

Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of unexplained technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model where: (i) cell-specific normalisation constants are estimated as part of the model parameters, (ii) technical variability is quantified based on spike-in genes that are artificially introduced to each analysed cell’s lysate and (iii) the total variability of the expression counts is decomposed into technical and biological components. BASiCS also provides an intuitive detection criterion for highly (or lowly) variable genes within the population of cells under study. This is formalised by means of tail posterior probabilities associated with high (or low) biological cell-to-cell variance contributions, quantities that can be easily interpreted by users. We demonstrate our method using gene expression measurements from mouse Embryonic Stem Cells. Cross-validation and meaningful enrichment of gene ontology categories within genes classified as highly (or lowly) variable support the efficacy of our approach.

16.

Common themes and differences in SAM recognition among SAM riboswitches

Ian R. Price Jason C. Grigg Ailong Ke 《Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms》2014,1839(10):931-938

The recent discovery of short cis-acting RNA elements termed riboswitches has caused a paradigm shift in our understanding of genetic regulatory mechanisms. The three distinct superfamilies of S-adenosyl-l-methionine (SAM) riboswitches are the most commonly found riboswitch classes in nature. These RNAs represent three independent evolutionary solutions to achieve specific SAM recognition. This review summarizes research on 1) modes of gene regulatory mechanisms, 2) common themes and differences in ligand recognition, and 3) ligand-induced conformational dynamics among SAM riboswitch families. The body of work on the SAM riboswitch families constitutes a useful primer to the topic of gene regulatory RNAs as a whole.

17.

Quantifying Population Genetic Differentiation from Next-Generation Sequencing Data

Matteo Fumagalli Filipe G. Vieira Thorfinn Sand Korneliussen Tyler Linderoth Emilia Huerta-Sánchez Anders Albrechtsen Rasmus Nielsen 《Genetics》2013,195(3):979-992

Over the past few years, new high-throughput DNA sequencing technologies have dramatically increased speed and reduced sequencing costs. However, the use of these sequencing technologies is often challenged by errors and biases associated with the bioinformatical methods used for analyzing the data. In particular, the use of naïve methods to identify polymorphic sites and infer genotypes can inflate downstream analyses. Recently, explicit modeling of genotype probability distributions has been proposed as a method for taking genotype call uncertainty into account. Based on this idea, we propose a novel method for quantifying population genetic differentiation from next-generation sequencing data. In addition, we present a strategy for investigating population structure via principal components analysis. Through extensive simulations, we compare the new method herein proposed to approaches based on genotype calling and demonstrate a marked improvement in estimation accuracy for a wide range of conditions. We apply the method to a large-scale genomic data set of domesticated and wild silkworms sequenced at low coverage. We find that we can infer the fine-scale genetic structure of the sampled individuals, suggesting that employing this new method is useful for investigating the genetic relationships of populations sampled at low coverage.

18.

Application of Second-Generation Gene Sequencing Data Management and Big Data Platforms in Precision Medicine

武奥申 刘小娜 刘昀赫 刘刚 刘雷 《中国生物工程杂志》2019,39(2):101-111

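Precision medicine integrates many kinds of data, including omics, clinical, environmental, and behavioral data, and is the science of individualized treatment, prevention, and management of disease. With the sharp decline in the cost of gene sequencing and the leap in the understanding of diseases such as tumors from traditional pathology to the molecular level, the development and spread of related sciences have driven the birth and growth of precision medicine, which will have a far-reaching effect on human health. This article introduces the concept, aims, and applications of precision medicine and the application of second-generation DNA sequencing technology within it. It argues that genomic data, sample management, data quality control standards, and data management platforms are the foundation for realizing precision medicine, and that intelligent precision medicine will be the future direction of development. Looking ahead, it also notes that the scale of massive genomic data and the variety of health applications drive the development of data management platforms while also posing challenges to their evolution.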

19.

Estimation of Copy Number Alterations from Exome Sequencing Data

Rafael Valdés-Mas Silvia Bea Diana A. Puente Carlos López-Otín Xose S. Puente 《PloS one》2012,7(12)

Exome sequencing constitutes an important technology for the study of human hereditary diseases and cancer. However, the ability of this approach to identify copy number alterations in primary tumor samples has not been fully addressed. Here we show that somatic copy number alterations can be reliably estimated using exome sequencing data through a strategy that we have termed exome2cnv. Using data from 86 paired normal and primary tumor samples, we identified losses and gains of complete chromosomes or large genomic regions, as well as smaller regions affecting a minimum of one gene. Comparison with high-resolution comparative genomic hybridization (CGH) arrays revealed a high sensitivity and a low number of false positives in the copy number estimation between both approaches. We explore the main factors affecting sensitivity and false positives with real data, and provide a side by side comparison with CGH arrays. Together, these results underscore the utility of exome sequencing to study cancer samples by allowing not only the identification of substitutions and indels, but also the accurate estimation of copy number alterations.

20.

Optical Data Compression in Time Stretch Imaging

Claire Lifan Chen Ata Mahjoubfar Bahram Jalali 《PloS one》2015,10(4)

Time stretch imaging offers real-time image acquisition at millions of frames per second and subnanosecond shutter speed, and has enabled detection of rare cancer cells in blood with record throughput and specificity. An unintended consequence of high throughput image acquisition is the massive amount of digital data generated by the instrument. Here we report the first experimental demonstration of real-time optical image compression applied to time stretch imaging. By exploiting the sparsity of the image, we reduce the number of samples and the amount of data generated by the time stretch camera in our proof-of-concept experiments by about three times. Optical data compression addresses the big data predicament in such systems.