Similar Articles
 20 similar articles retrieved.
1.
Sakib MN  Tang J  Zheng WJ  Huang CT 《PloS one》2011,6(12):e28251
Research in bioinformatics primarily involves the collection and analysis of large volumes of genomic data, which naturally demands efficient storage and transfer. In recent years, some research has been done to find efficient compression algorithms that reduce the size of various sequencing data. One way to improve the transmission time of large files is to apply maximal lossless compression to them. In this paper, we present SAMZIP, a specialized encoding scheme for sequence alignment data in SAM (Sequence Alignment/Map) format, which improves on the compression ratios of existing compression tools. To achieve this, we exploit prior knowledge of the file format and its specification. Our experimental results show that our encoding scheme improves the compression ratio, thereby reducing overall transmission time significantly.
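The abstract does not spell out SAMZIP's encoding, so the following Python sketch only illustrates the general idea of format-aware preprocessing it alludes to: splitting tab-separated SAM fields into per-column streams so that a general-purpose compressor sees more homogeneous data. The field handling and the use of zlib are illustrative assumptions, not SAMZIP's actual scheme.

    # Illustrative sketch only: split SAM columns into per-field streams before
    # general-purpose compression, so each stream is more self-similar.
    # This is NOT the SAMZIP algorithm; it just demonstrates format-aware preprocessing.
    import zlib

    def columnwise_compress(sam_lines):
        # SAM has 11 mandatory tab-separated fields; optional tags and headers go to a 12th stream.
        streams = [[] for _ in range(12)]
        for line in sam_lines:
            if line.startswith('@'):              # header lines kept unchanged in the tag stream
                streams[11].append(line)
                continue
            fields = line.rstrip('\n').split('\t')
            for i in range(11):
                streams[i].append(fields[i] if i < len(fields) else '')
            streams[11].append('\t'.join(fields[11:]))
        # Compress each stream separately; similar values compress better together.
        return [zlib.compress('\n'.join(s).encode()) for s in streams]

    example = ["r001\t99\tchr1\t7\t30\t8M\t=\t37\t39\tTTAGATAA\tFFFFFFFF"]
    blobs = columnwise_compress(example)
    print(sum(len(b) for b in blobs), "compressed bytes across", len(blobs), "streams")

In practice, columns such as positions and mapping qualities compress far better when grouped together than when interleaved with read sequences, which is the kind of format knowledge the abstract refers to.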

2.
In the last decade, the cost of genomic sequencing has decreased so much that researchers all over the world are accumulating huge amounts of data for present and future use. These genomic data need to be stored efficiently, because storage costs are not decreasing as fast as the cost of sequencing. The most popular general-purpose compression tool, gzip, is usually used to address this problem. However, general-purpose tools were not designed to compress this kind of data and often fall short when the intention is to reduce the data size as much as possible. Several compression algorithms are available, even for genomic data, but very few have been designed to deal with Whole Genome Alignments, which contain alignments between the entire genomes of several species. In this paper, we present a lossless compression tool, MAFCO, specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain of 34% to 57%, depending on the data set. When compared to a recent dedicated method, which is not compatible with some data sets, the compression gain of MAFCO is about 9%. Both source code and binaries for several operating systems are freely available for non-commercial use at: http://bioinformatics.ua.pt/software/mafco.

3.
In this paper, we study various lossless compression techniques for electroencephalograph (EEG) signals. We discuss a computationally simple pre-processing technique in which the EEG signal is arranged in the form of a matrix (2-D) before compression. We then discuss a two-stage coder that compresses the EEG matrix with a lossy coding layer (SPIHT) and a residual coding layer (arithmetic coding). This coder is optimally tuned to exploit the source memory and the i.i.d. nature of the residual. We also investigate and compare EEG compression with other schemes, such as the JPEG2000 image compression standard, predictive-coding-based Shorten, and simple entropy coding. The compression algorithms are tested on the University of Bonn database and the PhysioBank Motor/Mental Imagery database. The 2-D compression schemes yielded higher lossless compression than standard vector-based compression, predictive coding and entropy coding. The pre-processing technique resulted in a 6% improvement, and the two-stage coder yielded a further 3% improvement in compression performance.
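As a hedged illustration of the two-stage idea described above (not the authors' SPIHT-plus-arithmetic coder), the toy sketch below uses a uniform quantizer as the lossy layer and zlib in place of the arithmetic coder; storing the small residual losslessly makes the overall scheme exactly reversible.

    # Toy illustration of two-stage lossless coding: a coarse lossy layer plus a
    # losslessly stored residual gives exact reconstruction. Uniform quantization
    # stands in for SPIHT, and zlib stands in for the arithmetic coder.
    import numpy as np, zlib

    def two_stage_encode(x, step=8):
        lossy = np.round(x / step).astype(np.int16) * step   # lossy approximation
        residual = (x - lossy).astype(np.int8)               # small, low-entropy residual
        return zlib.compress(lossy.tobytes()), zlib.compress(residual.tobytes())

    def two_stage_decode(lossy_blob, residual_blob):
        lossy = np.frombuffer(zlib.decompress(lossy_blob), dtype=np.int16)
        residual = np.frombuffer(zlib.decompress(residual_blob), dtype=np.int8)
        return lossy + residual

    rng = np.random.default_rng(0)
    eeg = rng.integers(-128, 128, size=4096).astype(np.int16)   # synthetic 16-bit samples
    a, b = two_stage_encode(eeg)
    assert np.array_equal(two_stage_decode(a, b), eeg)          # lossless round trip
    print(len(a) + len(b), "bytes vs", eeg.nbytes, "raw")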

4.
Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA sequencing outstrips the rate of increase in disk storage capacity, the storage and transfer of large genome data sets are becoming important concerns for biomedical researchers. We propose a two-pass lossless genome compression algorithm, built around a synthesis of complementary contextual models, to improve compression performance. The proposed framework handles genome compression both with and without reference sequences and demonstrates performance advantages over the best existing algorithms. Reference-free compression achieved bit rates of 1.720 and 1.838 bits per base for bacteria and yeast, approximately 3.7% and 2.6% better than the state-of-the-art algorithms. For reference-based compression, tested on the first Korean personal genome sequence data set, the proposed method achieved a 189-fold compression rate, reducing the raw file size from 2986.8 MB to 15.8 MB at a decompression cost comparable to that of existing algorithms. DNAcompact is freely available for research purposes at https://sourceforge.net/projects/dnacompact/.

5.

Background

The exponential growth of next-generation sequencing (NGS) data has posed big challenges for data storage, management and archiving. Data compression is one of the effective solutions, and reference-based compression strategies can typically achieve superior compression ratios compared to those that do not rely on a reference.

Results

This paper presents a lossless, light-weight, reference-based compression algorithm, LW-FQZip, to compress FASTQ data. The three components of any given input, i.e., the metadata, short reads and quality score strings, are first parsed into three data streams in which redundant information is identified and eliminated independently (a minimal stream-separation sketch follows this abstract). In particular, well-designed incremental and run-length-limited encoding schemes are used to compress the metadata and quality score streams, respectively. To handle the short reads, LW-FQZip uses a novel light-weight mapping model to rapidly map them against external reference sequence(s) and produce concise alignment results for storage. The three processed data streams are then packed together and compressed with a general-purpose compression algorithm such as LZMA. LW-FQZip was evaluated on eight real-world NGS data sets and achieved compression ratios in the range 0.111-0.201, comparable or superior to other state-of-the-art lossless NGS data compression algorithms.

Conclusions

LW-FQZip is a program that enables efficient lossless FASTQ data compression. It contributes to the state-of-the-art applications for NGS data storage and transmission. LW-FQZip is freely available online at: http://csse.szu.edu.cn/staff/zhuzx/LWFQZip.
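As referenced in the Results above, here is a minimal stream-separation sketch, assuming plain four-line FASTQ records, run-length coding of the qualities, and gzip as a stand-in for the final general-purpose stage; none of this is LW-FQZip's actual implementation.

    # Illustrative sketch of splitting FASTQ records into three streams (metadata,
    # reads, qualities), run-length encoding the quality stream, and handing the
    # result to a general-purpose compressor. Not the actual LW-FQZip code.
    import gzip
    from itertools import groupby

    def split_streams(fastq_text):
        lines = fastq_text.strip().split('\n')
        meta  = [lines[i]     for i in range(0, len(lines), 4)]   # '@...' headers
        reads = [lines[i + 1] for i in range(0, len(lines), 4)]
        quals = [lines[i + 3] for i in range(0, len(lines), 4)]
        return meta, reads, quals

    def rle(s):
        # Quality strings often contain long runs of identical symbols, so RLE helps.
        return ''.join(f"{ch}{sum(1 for _ in grp)}" for ch, grp in groupby(s))

    record = "@read1\nACGTACGTAC\n+\nIIIIIIHHHG\n"
    meta, reads, quals = split_streams(record)
    packed = gzip.compress('\n'.join(meta + reads + [rle(q) for q in quals]).encode())
    print(len(packed), "bytes for", len(meta), "record(s)")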

6.
An efficient and reliable software-based ECG data compression and transmission scheme is proposed. The algorithm has been applied to ECG data from all 12 leads taken from the PTB Diagnostic ECG Database (PTB-DB). First, R-peaks are detected by a differentiation-and-squaring technique and the QRS regions are located. To achieve strictly lossless compression in the QRS regions and tolerable lossy compression in the rest of the signal, two different compression algorithms are used. The scheme is designed so that the compressed file contains only ASCII characters. These characters are transmitted using the internet-based Short Message Service (SMS), and at the receiving end the original ECG signal is reconstructed by reversing the compression logic. The proposed algorithm reduces the file size significantly (compression ratio: 22.47) while preserving the ECG signal morphology.
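The abstract only names the detection principle; this generic differentiation-and-squaring sketch (the window length, threshold and synthetic test signal are assumptions) illustrates how R-peaks can be picked out, and is not the authors' implementation.

    # Generic differentiation-and-squaring R-peak detector (Pan-Tompkins style),
    # shown only to illustrate the detection principle named in the abstract.
    import numpy as np

    def detect_r_peaks(ecg, fs=1000, win_ms=150, thresh_frac=0.5):
        diff = np.diff(ecg)                       # emphasise steep QRS slopes
        squared = diff ** 2                       # rectify and accentuate large slopes
        win = max(1, int(fs * win_ms / 1000))
        energy = np.convolve(squared, np.ones(win) / win, mode='same')  # moving average
        thresh = thresh_frac * energy.max()
        above = energy > thresh
        # Take the position of the local energy maximum in each above-threshold run.
        peaks, i = [], 0
        while i < len(above):
            if above[i]:
                j = i
                while j < len(above) and above[j]:
                    j += 1
                peaks.append(i + int(np.argmax(energy[i:j])))
                i = j
            else:
                i += 1
        return np.array(peaks)

    t = np.arange(0, 2, 1 / 1000.0)
    ecg = np.sin(2 * np.pi * 1.0 * t) ** 63        # crude spiky stand-in for R-waves
    print(detect_r_peaks(ecg))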

7.
Conducting tele-3D computer-assisted operations, as well as other telemedicine procedures, often requires the highest possible quality of transmitted medical images and video. Unfortunately, these data types are associated with high telecommunication and storage costs, which sometimes prevent more frequent use of such procedures. We present a novel algorithm for lossless compression of medical images that substantially reduces these telecommunication and storage costs. The algorithm models the image properties around the current, unknown pixel and adjusts itself to the local image region. The main contribution of this work is an enhancement of the well-known predictor-blend approach through highly adaptive determination of the blending context on a pixel-by-pixel basis using a classification technique. We show that this approach is well suited to medical image data compression. Results obtained with the proposed method on medical images are very encouraging, beating several well-known lossless compression methods. The proposed predictor can also be used in other image processing applications, such as segmentation and extraction of image regions.
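The classifier-driven predictor blend itself is not described in the abstract; as background, the sketch below implements the widely known median edge detector (MED) predictor from JPEG-LS, a simple example of the pixel-wise prediction that such lossless image coders build on.

    # Median edge detector (MED) predictor, the baseline pixel-wise predictor used in
    # JPEG-LS; shown as a generic example of predictive lossless image coding, not the
    # adaptive predictor blend described in the abstract.
    import numpy as np

    def med_residuals(img):
        img = img.astype(np.int32)
        pred = np.zeros_like(img)
        h, w = img.shape
        for y in range(h):
            for x in range(w):
                a = img[y, x - 1] if x > 0 else 0                  # left neighbour
                b = img[y - 1, x] if y > 0 else 0                  # upper neighbour
                c = img[y - 1, x - 1] if x > 0 and y > 0 else 0    # upper-left neighbour
                if c >= max(a, b):
                    pred[y, x] = min(a, b)          # likely horizontal/vertical edge
                elif c <= min(a, b):
                    pred[y, x] = max(a, b)
                else:
                    pred[y, x] = a + b - c          # smooth region: planar prediction
        return img - pred                           # residuals to be entropy coded

    demo = (np.arange(64).reshape(8, 8) % 17).astype(np.uint8)
    res = med_residuals(demo)
    print("mean absolute residual (low when prediction is good):", np.abs(res).mean())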

8.
Image compression is the application of data compression to digital images. Several lossy and lossless transform coding techniques are used for image compression; the discrete cosine transform (DCT) is one widely used technique. A variation of the DCT, the warped discrete cosine transform (WDCT), is used for 2-D image compression and has been shown to outperform the DCT at high bit rates. We extend this concept and develop the 3-D WDCT, a transform that has not previously been investigated, and outline some of its important properties that make it especially suitable for image compression. We then propose a complete coding scheme for volumetric data sets based on the 3-D WDCT and show that it performs better than a comparable 3-D DCT scheme for volumetric data sets at high bit rates.
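The warped transform is not specified in the abstract; for orientation only, the sketch below applies an ordinary separable 3-D DCT to an 8x8x8 block with SciPy and a crude coefficient threshold. The WDCT would replace each 1-D DCT with a frequency-warped variant, which is not implemented here.

    # Ordinary separable 3-D DCT of a volumetric block, shown only as a reference
    # point for the 3-D WDCT discussed in the abstract (the warping is not implemented).
    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.random.default_rng(1).standard_normal((8, 8, 8))
    coeffs = dctn(block, type=2, norm='ortho')            # 3-D transform of an 8x8x8 block
    kept = np.where(np.abs(coeffs) > 0.5, coeffs, 0.0)    # crude coefficient thresholding
    recon = idctn(kept, type=2, norm='ortho')
    print("retained", int((kept != 0).sum()), "of", coeffs.size, "coefficients,",
          "max error", float(np.abs(recon - block).max()))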

9.

Background

As next-generation sequencing data become available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data, owing to their enormous size. This is, and will remain, a frequent problem encountered every day by researchers working on genetic data. Some options are available for compressing and storing such data, such as general-purpose compression software and the PBAT/PLINK binary format. However, these currently available methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data are accessed.

Results

Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundred, which potentially allows SNP data of hundreds of gigabytes to be stored in hundreds of megabytes (a generic bit-packing sketch follows this abstract). We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real data sets, we show that it gives a greater compression rate than commonly used compression methods and that the data-loading process takes less time. The C++ library also provides direct data-retrieval functions, which allow the compressed information to be accessed easily by other C++ programs.

Conclusions

The SpeedGene algorithm enables the storage and analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.
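As referenced in the Results above, a generic sketch: packing biallelic genotypes into two bits each already yields a 4x reduction before any entropy coding, which makes compression factors of this magnitude plausible. The encoding below (0/1/2 copies of the minor allele, 3 for missing) is an assumption for illustration, not the SpeedGene format.

    # Generic 2-bit packing of biallelic SNP genotypes: four genotypes per byte.
    # Illustrates the idea behind compact SNP storage; it is not the SpeedGene format.
    import numpy as np

    def pack_genotypes(g):
        g = np.asarray(g, dtype=np.uint8)
        pad = (-len(g)) % 4
        g = np.concatenate([g, np.full(pad, 3, dtype=np.uint8)])   # pad with 'missing'
        g = g.reshape(-1, 4)
        return (g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)).astype(np.uint8)

    def unpack_genotypes(packed, n):
        p = packed[:, None] >> np.array([0, 2, 4, 6])
        return (p & 3).reshape(-1)[:n].astype(np.uint8)

    geno = np.random.default_rng(2).integers(0, 3, size=1001)
    packed = pack_genotypes(geno)
    assert np.array_equal(unpack_genotypes(packed, len(geno)), geno)
    print(len(geno), "genotypes ->", packed.nbytes, "bytes")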

10.
Genome sequencing and microarray technology produce ever-increasing amounts of complex data that need analysis. Visualization is an effective analytical technique that exploits the ability of the human brain to process large amounts of data. Here, we review traditional visualization methods based on clustering and tree representation, and also describe an alternative approach that involves projecting objects onto a Euclidean space in a way that reflects their structural or functional distances. Data are visualized without preclustering and can be dynamically explored by the user using ‘virtual reality’. We illustrate this approach with two case studies from protein topology and gene expression.

11.
We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information. Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.
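Quip's probabilistic structure is only named above; as a hedged illustration of the general idea, the sketch below implements a small counting Bloom filter for k-mers, which keeps approximate k-mer counts in fixed memory at the cost of occasional over-counts. It is not Quip's data structure.

    # Minimal counting Bloom filter for k-mers: approximate counts in fixed memory,
    # at the cost of occasional over-counts. Illustrates the kind of probabilistic
    # structure referred to in the abstract; it is not Quip's actual implementation.
    import hashlib

    class CountingBloom:
        def __init__(self, size=1 << 20, num_hashes=4):
            self.size, self.num_hashes = size, num_hashes
            self.counts = bytearray(size)                       # 8-bit saturating counters

        def _indices(self, kmer):
            for i in range(self.num_hashes):
                h = hashlib.blake2b(kmer.encode(), digest_size=8, salt=bytes([i])).digest()
                yield int.from_bytes(h, 'little') % self.size

        def add(self, kmer):
            for idx in self._indices(kmer):
                if self.counts[idx] < 255:
                    self.counts[idx] += 1

        def count(self, kmer):
            return min(self.counts[idx] for idx in self._indices(kmer))  # never undercounts

    read, k = "ACGTACGTGGTACGT", 5
    cbf = CountingBloom()
    for i in range(len(read) - k + 1):
        cbf.add(read[i:i + k])
    print(cbf.count("ACGTA"), cbf.count("TTTTT"))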

12.

We revisit the surface plasmon resonances established along a planar interface between a lossless dielectric and a lossy metal. By examining the orbital and spin parts of the Poynting vector, the mechanisms behind forward and backward flows are clearly illustrated. Consequently, we are able to construct more intuitive pictures of the two-dimensional energy flows induced by the metallic losses. In addition, we recognize the importance of both the asymmetry and the symmetry hidden behind the familiar transverse-magnetic waves. Our numerical results are close to reality, since experimentally measured optical data for gold are employed for the lossy metal.
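For reference, the transverse-magnetic surface mode bound to such a planar interface obeys the standard dispersion relation (a textbook result, not derived in this abstract):

    k_{\mathrm{SPP}} \;=\; \frac{\omega}{c}\,\sqrt{\frac{\varepsilon_{d}\,\varepsilon_{m}(\omega)}{\varepsilon_{d}+\varepsilon_{m}(\omega)}}

where \varepsilon_{d} is the permittivity of the lossless dielectric and \varepsilon_{m}(\omega) the complex permittivity of the lossy metal; the imaginary part of \varepsilon_{m} is what drives the loss-induced energy flows discussed above.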


13.
Next-generation sequencing (NGS) has transformed molecular biology and contributed many seminal insights into genomic regulation and function. Apart from whole-genome sequencing, an NGS workflow involves alignment of the sequencing reads to the genome of study, after which the resulting alignments can be used for downstream analyses. However, alignment is complicated by repetitive sequences; many reads align to more than one genomic locus, and 15-30% of the genome is not uniquely mappable by short-read NGS. This problem is typically addressed by discarding reads that do not map uniquely to the genome, but this practice can lead to systematic distortion of the data. Previous methods for handling ambiguously mapped reads were often of limited applicability or computationally intensive, hindering their broader usage. In this work, we present SmartMap: an algorithm that augments industry-standard aligners to enable the use of ambiguously mapped reads by assigning weights to each alignment through Bayesian analysis of the read distribution and alignment quality. SmartMap is computationally efficient, using far fewer weighting iterations than previously thought necessary and, as such, analyzing more than a billion alignments of NGS reads in approximately one hour on a desktop PC. By applying SmartMap to peak-type NGS data, including MNase-seq, ChIP-seq and ATAC-seq in three organisms, we increase read depth by up to 53% and increase the mapped proportion of the genome by up to 18% compared with analyses using only uniquely mapped reads. We further show that SmartMap enables the analysis of more than 140,000 repetitive elements that could not be analyzed by traditional ChIP-seq workflows, and we use this method to gain insight into the epigenetic regulation of different classes of repetitive elements. These data emphasize both the dangers of discarding ambiguously mapped reads and their power for driving biological discovery.
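SmartMap's Bayesian weighting is not given in detail in this abstract; the sketch below shows a generic EM-style reweighting loop in which each multi-mapped read is split across its candidate loci in proportion to the current coverage estimate. It conveys the reweighting idea only; the flat prior and pseudocount are assumptions, and this is not the published algorithm.

    # EM-style illustration of reweighting ambiguously mapped reads: each read's weight
    # is distributed over its candidate loci in proportion to the current coverage
    # estimate. This mimics the general idea of weighted multi-mapping; it is NOT SmartMap.
    from collections import defaultdict

    def reweight(read_to_loci, iterations=5, pseudocount=0.1):
        # Start from a flat coverage estimate over all candidate loci.
        loci = {l for ls in read_to_loci.values() for l in ls}
        coverage = {l: 1.0 for l in loci}
        weights = {}
        for _ in range(iterations):
            weights = {}
            new_cov = defaultdict(float)
            for read, candidates in read_to_loci.items():
                total = sum(coverage[l] for l in candidates)
                weights[read] = {l: coverage[l] / total for l in candidates}
                for l, w in weights[read].items():
                    new_cov[l] += w
            # Pseudocount keeps every locus reachable in the next iteration.
            coverage = {l: new_cov[l] + pseudocount for l in loci}
        return weights

    reads = {"r1": ["locusA"], "r2": ["locusA", "locusB"], "r3": ["locusA", "locusB"]}
    for read, w in reweight(reads).items():
        print(read, {l: round(v, 2) for l, v in w.items()})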

14.
We have identified cDNA clones coding for the major sulphur-rich and sulphur-poor groups of barley storage proteins (the B- and C-hordeins, respectively). Hybridization studies have revealed unexpected homologies between B- and C-hordein mRNAs. Using a deletion mutant (Risø 56), we have mapped some C-hordein-related sequences within, or closely associated with, B-hordein genes at the Hor 2 locus. Nucleotide sequencing has shown that the primary structure of B-hordein polypeptides can be divided into at least two domains: domain 1 (repetitive, proline-rich, sulphur-poor), which is homologous to C-hordein sequences, and domain 2 (non-repetitive, proline-poor, sulphur-rich), which makes up two-thirds of the polypeptide and is partially homologous to a 2S globulin storage protein found in dicotyledons. The coding sequences that are homologous in B- and C-hordein mRNAs have an asymmetric base composition (>80% C-A) and are largely composed of a degenerate tandem repeat based on a 24-nucleotide consensus that encodes Pro-Gln-Gln-Pro-Phe-Pro-Gln-Gln. We discuss the evolutionary implications of the domain structure of the B-hordeins and the unusual relationship between the two groups of barley storage proteins.

15.
In high-efficiency video coding (HEVC), the coding tree contributes to excellent compression performance, but it also brings extremely high computational complexity. This paper presents improvements to the coding tree that further reduce encoding time: a novel low-complexity coding tree mechanism is proposed for fast coding unit (CU) encoding in HEVC. First, the paper makes an in-depth study of the relationship among CU distribution, quantization parameter (QP) and content change (CC). Second, a CU coding tree probability model is proposed for modeling and predicting the CU distribution. Finally, a CU coding tree probability update is proposed to address the probabilistic-model distortion caused by CC. Experimental results show that the proposed low-complexity CU coding tree mechanism significantly reduces encoding time, by 27% for lossy coding and 42% for visually lossless and lossless coding, and improves coding performance under various application conditions.

16.
The rapidly growing amount of genomic sequence data being generated and made publicly available necessitates the development of new data storage and archiving methods. The vast amount of data being shared and manipulated also creates new challenges for network resources. Thus, developing advanced data compression techniques is becoming an integral part of data production and analysis. The HapMap project is one of the largest public resources of human single-nucleotide polymorphisms (SNPs), characterizing over 3 million SNPs genotyped in over 1000 individuals. The standard format and biological properties of HapMap data suggest that a dedicated genetic compression method can outperform generic compression tools. We propose a compression methodology for genetic data, HapZipper, a lossless compression tool tailored to compress HapMap data beyond the benchmarks set by generic tools such as gzip, bzip2 and lzma. We demonstrate the usefulness of HapZipper by compressing HapMap 3 populations to <5% of their original sizes. HapZipper is freely downloadable from https://bitbucket.org/pchanda/hapzipper/downloads/HapZipper.tar.bz2.

17.
The introduction of fast CMOS detectors is moving the field of transmission electron microscopy into the computer-science domain of big data. Automated data pipelines control the instrument and the initial processing steps, which imposes more onerous data transfer and archiving requirements. Here we conduct a technical demonstration in which storage and read/write times are improved 10× at a dose rate of 1 e-/pix/frame for data from a Gatan K2 direct-detection device, by a combination of integer decimation and lossless compression. The example project is hosted at github.com/em-MRCZ and released under the BSD license.
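The actual code lives in the linked repository; the snippet below is only a generic illustration of the two steps named, with an assumed counting gain and zlib standing in for the real compression backend: divide counting-mode frames by the gain so values fit a small integer type, then compress losslessly.

    # Generic illustration of integer decimation followed by lossless compression for
    # electron-counting movie frames. The gain value and zlib backend are assumptions.
    import numpy as np, zlib

    GAIN = 16                     # assumed fixed counting gain of the detector output

    def compress_frame(frame_float):
        counts = np.rint(frame_float / GAIN).astype(np.uint8)   # integer decimation
        return zlib.compress(counts.tobytes(), 6)

    def decompress_frame(blob, shape):
        counts = np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(shape)
        return counts.astype(np.float32) * GAIN                  # back to detector units

    rng = np.random.default_rng(3)
    frame = rng.poisson(lam=1.0, size=(512, 512)).astype(np.float32) * GAIN  # sparse counts
    blob = compress_frame(frame)
    restored = decompress_frame(blob, frame.shape)
    assert np.array_equal(restored, frame)        # decimation is exact for gain multiples
    print(frame.nbytes, "raw bytes ->", len(blob), "compressed")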

18.
Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to adapt quickly to the requirements of new sequencing or analysis methods (because they do not support schema evolution) or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to storing HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared with the previous compression state of the art, these methods reduce dataset size by more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays.

19.
Recurrent deletions have been associated with numerous diseases and genomic disorders. Few, however, have been resolved at the molecular level because their breakpoints often occur in highly copy-number-polymorphic duplicated sequences. We present an approach that uses a combination of somatic cell hybrids, array comparative genomic hybridization, and the specificity of next-generation sequencing to determine breakpoints that occur within segmental duplications. Applying our technique to the 17q21.31 microdeletion syndrome, we used genome sequencing to determine copy-number-variant breakpoints in three deletion-bearing individuals at molecular resolution. For two cases, we observed breakpoints consistent with nonallelic homologous recombination involving only H2 chromosomal haplotypes, as expected. Molecular resolution revealed that the breakpoints occurred at different locations within a 145 kbp segment of >99% identity and disrupt KANSL1 (previously known as KIAA1267). In the remaining case, we found that unequal crossover occurred interchromosomally between the H1 and H2 haplotypes and that this event was mediated by a homologous sequence that was once again missing from the human reference. Interestingly, the breakpoints mapped preferentially to gaps in the current reference genome assembly, which we resolved in this study. Our method provides a strategy for the identification of breakpoints within complex regions of the genome harboring high-identity, copy-number-polymorphic segmental duplications. The approach should become particularly useful as high-quality alternate reference sequences become available and genome sequencing of individuals' DNA becomes more routine.

20.
In this study we analyze one year of anonymized telecommunications data for over one million customers of a large European cellphone operator, and we investigate the relationship between people's calls and their physical locations. We discover that more than 90% of users who have called each other have also shared the same space (cell tower), even if they live far apart. Moreover, we find that close to 70% of users who call each other frequently (at least once per month on average) have shared the same space at the same time - an instance that we call co-location. Co-locations appear indicative of coordination calls, which occur just before face-to-face meetings. Their number is highly predictable based on the number of calls between two users and the distance between their home locations - suggesting a new way to quantify the interplay between telecommunications and face-to-face interactions.
