Similar literature
1.

Background

Next-generation sequencing can determine DNA bases, and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format or its compressed binary version (BAM). SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and executes quickly. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. To keep pace with the accumulation of next-generation sequencing data, a simple parallelization program that can support cloud and PC cluster environments is required.

Results

We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., on Windows, Linux, or Mac OS X) with Clojure.

Conclusions

Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.

2.

Background

Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications, such as duplicate marking or conversion to the FastQ format, that require access to the full information of the pairs.

Results

In this paper we introduce biobambam, a set of tools based on the efficient collation of alignments in BAM files by read name. The employed collation algorithm avoids time- and space-consuming sorting of alignments by read name where this is possible without using more than a specified amount of main memory. Using this algorithm, tasks like duplicate marking in BAM files and conversion of BAM files to the FastQ format can be performed very efficiently with limited resources. We also make the collation algorithm available in the form of an API for other projects. This API is part of the libmaus package.
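
The idea of collating mates by read name without a full sort can be illustrated with a minimal Python sketch. This is not the biobambam or libmaus implementation or API: the simplified `Record` tuples, the `collate_pairs` function, and the `max_pending` limit are hypothetical names chosen for illustration, and the fallback used here (sorting only the leftover records by name) is an assumed simplification.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical simplified alignment record: (read name, payload).
Record = Tuple[str, str]


def collate_pairs(
    records: Iterable[Record], max_pending: int = 100_000
) -> Iterator[Tuple[Record, Record]]:
    """Yield mate pairs grouped by read name without sorting the whole input."""
    pending: Dict[str, Record] = {}  # read name -> mate seen first
    overflow: List[Record] = []      # records set aside once the table is full

    for rec in records:
        mate = pending.pop(rec[0], None)
        if mate is not None:
            # Both mates are now known, so the pair can be emitted immediately.
            yield (mate, rec)
        elif len(pending) < max_pending:
            pending[rec[0]] = rec
        else:
            overflow.append(rec)

    # Only the records that could not be matched in the table are sorted.
    leftover = sorted(list(pending.values()) + overflow, key=lambda r: r[0])
    i = 0
    while i + 1 < len(leftover):
        if leftover[i][0] == leftover[i + 1][0]:
            yield (leftover[i], leftover[i + 1])
            i += 2
        else:
            i += 1  # unpaired record (orphan); skipped in this sketch


reads = [("a", "1"), ("b", "1"), ("a", "2"), ("c", "1"), ("b", "2")]
pairs = list(collate_pairs(reads, max_pending=1))
```

With `max_pending=1`, the pair for read `a` is matched directly in the table, while both mates of read `b` are set aside and paired in the final sorted pass; read `c` has no mate and is dropped.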

Conclusions

In comparison with previous approaches to problems involving the collation of alignments by read name, such as the BAM to FastQ or duplicate marking utilities, our approach can often perform an equivalent task more efficiently in terms of the required main memory and run-time. Our BAM to FastQ conversion is faster than all widely known alternatives, including Picard and bamUtil. Our duplicate marking is about as fast as the closest competitor, bamUtil, for small data sets and faster than all known alternatives on large and complex data sets.

3.
SAMtools is a widely-used genomics application for post-processing high-throughput sequence alignment data. Such sequence alignment data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally- and I/O-intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.
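
The difference between an internal and an external sort can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAMtools code: the `Alignment` tuple, `sorted_runs`, `coordinate_sort`, and `run_size` are hypothetical, and the sorted runs are kept in memory here, whereas a real out-of-core sort would write each run to a temporary file before merging.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical simplified record: (reference id, position, read name).
Alignment = Tuple[int, int, str]


def sorted_runs(records: Iterable[Alignment], run_size: int) -> Iterator[List[Alignment]]:
    """Split the input into runs of at most run_size records, each sorted by coordinate."""
    it = iter(records)
    while True:
        run = sorted(islice(it, run_size))
        if not run:
            return
        yield run


def coordinate_sort(records: Iterable[Alignment], run_size: int = 1_000_000) -> Iterator[Alignment]:
    """Sort by (reference id, position) by merging independently sorted runs."""
    runs = list(sorted_runs(records, run_size))
    # heapq.merge performs a k-way merge of the already sorted runs.
    return heapq.merge(*runs)


reads = [(1, 50, "r3"), (0, 200, "r1"), (0, 10, "r2"), (1, 5, "r4"), (0, 10, "r0")]
ordered = list(coordinate_sort(reads, run_size=2))
```

Because each run is sorted independently, the runs are natural units of parallel work, which is the kind of structure that task parallelism can exploit.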

4.
MOTIVATION: MethylCoder is a software program that generates per-base methylation data given a set of bisulfite-treated reads. It provides the option to use either of two existing short-read aligners, each with different strengths. It accounts for soft-masked alignments and overlapping paired-end reads. MethylCoder outputs data in text and binary formats in addition to the final alignment in SAM format, so that common high-throughput sequencing tools can be used on the resulting output. It is more flexible than existing software and competitive in terms of speed and memory use. AVAILABILITY: MethylCoder requires only a Python interpreter and a C compiler to run. Extensive documentation and the full source code are available under the MIT license at: https://github.com/brentp/methylcode. CONTACT: bpederse@gmail.com.

5.
SUMMARY: Bambino is a variant detector and graphical alignment viewer for next-generation sequencing data in the SAM/BAM format, which is capable of pooling data from multiple source files. The variant detector takes advantage of SAM-specific annotations, and produces detailed output suitable for genotyping and identification of somatic mutations. The assembly viewer can display reads in the context of either a user-provided or automatically generated reference sequence, retrieve genome annotation features from a UCSC genome annotation database, display histograms of non-reference allele frequencies, and predict protein-coding changes caused by SNPs. AVAILABILITY: Bambino is written in platform-independent Java and available from https://cgwb.nci.nih.gov/goldenPath/bamview/documentation/index.html, along with documentation and example data. Bambino may be launched online via Java Web Start or downloaded and run locally.

6.

Background  

Next Generation Sequencing (NGS) technology generates tens of millions of short reads for each DNA/RNA sample. A key step in NGS data analysis is the short read alignment of the generated sequences to a reference genome. Although storing alignment information in the Sequence Alignment/Map (SAM) or Binary SAM (BAM) format is now standard, biomedical researchers still have difficulty accessing this information.

7.
With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap, a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in SAM/BAM format. DistMap supports both paired-end and single-end reads, thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/

8.
SUMMARY: Nexplorer is a web-based program for interactive browsing and manipulation of character data in NEXUS format, well suited for use with alignments and trees representing families of homologous genes or proteins. Users may upload a sequence family dataset, or choose from one of several thousand already available. Nexplorer provides a flexible means to develop customized views that combine a tree and a data matrix or alignment, to create subsets of data, and to output data files or publication-quality graphics. AVAILABILITY: Web access is from http://www.molevol.org/nexplorer

9.
10.
11.
The NEXUS Class Library (NCL) is a collection of C++ classes designed to simplify interpreting data files written in the NEXUS format used by many computer programs for phylogenetic analyses. The NEXUS format allows different programs to share the same data files, even though none of the programs can interpret all of the data stored therein. Because users are not required to reformat the data file for each program, use of the NEXUS format prevents cut-and-paste errors as well as the proliferation of copies of the original data file. The purpose of making the NCL available is to encourage the use of the NEXUS format by making it relatively easy for programmers to add the ability to interpret NEXUS files in newly developed software. AVAILABILITY: The NCL is freely available under the GNU General Public License from http://hydrodictyon.eeb.uconn.edu/ncl/ Supplementary information: Documentation for the NCL (general information and source code documentation) is available in HTML format at http://hydrodictyon.eeb.uconn.edu/ncl/

12.
13.
The SFF file format produced by Roche's 454 sequencing technology is a compact, binary format that contains the flow values that are used for base and quality calling of the reads. Applications, e.g. in metagenomics, often depend on accurate sequence information, and access to flow values is important to estimate the probability of errors. Unfortunately, the programs supplied by Roche for accessing this information are not publicly available. Flower is a program that can extract the information contained in SFF files, and convert it to various textual output formats. AVAILABILITY: Flower is freely available under the General Public License.

14.
15.
convert is a user-friendly, 32-bit Windows program that facilitates ready transfer of codominant, diploid genotypic data amongst commonly used population genetic software packages. convert reads input files in its own ‘standard’ data format, easily produced from an Excel file of diploid, codominant marker data, and can convert these to the input formats of the following programs: gda, genepop, arlequin, popgene, microsat, phylip, and structure. convert can also read input files in genepop format. In addition, convert can produce a summary table of allele frequencies in which private alleles and the sample sizes at each locus are indicated.

16.
Second-generation sequencing is increasingly being used in combination with genome-enrichment techniques to amplify a large number of loci in many individuals for the purpose of population genetic and phylogeographic analysis. Compiling all the necessary tools to analyse these data is complex and time-consuming. Here, we assemble a set of programs and pipe them together with Perl, enabling research laboratories without a dedicated bioinformatician to utilize second-generation sequencing. User input is a folder of the second-generation sequencing reads sorted by individual (in FASTA format) and pipeline output is a folder of multi-FASTA files that correspond to loci (with 2 alleles called per individual). Additional output includes a summary file of the number of individuals per locus, observed and expected heterozygosity for each locus, distribution of multiple hits and summary statistics (θ, Tajima's D, etc.). This user-friendly, open source pipeline, which requires no a priori reference genome because it constructs its own, allows the user to set various parameters (e.g. minimum coverage) in the dependent programs (CAP3, BWA, SAMtools and VarScan) and facilitates evaluation of the nature and quality of data collected prior to analysis in software packages.
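
The per-locus observed and expected heterozygosity mentioned above can be illustrated with a short Python sketch. The abstract does not say how the pipeline computes these values, so this uses the standard textbook definitions as an assumption: observed heterozygosity is the fraction of heterozygous individuals, and expected heterozygosity is 1 minus the sum of squared allele frequencies (without the small-sample correction). The function names and the `Genotype` type are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical representation: the two alleles called for one individual at one locus.
Genotype = Tuple[str, str]


def observed_heterozygosity(genotypes: List[Genotype]) -> float:
    """Fraction of individuals whose two called alleles differ."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def expected_heterozygosity(genotypes: List[Genotype]) -> float:
    """1 - sum(p_i^2) over allele frequencies p_i (uncorrected form)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


locus = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G")]
h_obs = observed_heterozygosity(locus)
h_exp = expected_heterozygosity(locus)
```

In this example two of the four individuals are heterozygous and both alleles have frequency 0.5, so the observed and expected values are both 0.5.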

17.
Lee W, Chen SL. BioTechniques 2002, 33(6): 1334-1341
Genome-tools is a Perl module, a set of programs, and a user interface that facilitates access to genome sequence information. The package is flexible, extensible, and designed to be accessible and useful to both nonprogrammers and programmers. Any relatively well-annotated genome available with standard GenBank genome files may be used with genome-tools. A simple Web-based front end permits searching any available genome with an intuitive interface. Flexible design choices also make it simple to handle revised versions of genome annotation files as they change. In addition, programmers can develop cross-genomic tools and analyses with minimal additional overhead by combining genome-tools modules with newly written modules. Genome-tools runs on any computer platform for which Perl is available, including Unix, Microsoft Windows, and Mac OS. By simplifying the access to large amounts of genomic data, genome-tools may be especially useful for molecular biologists looking at newly sequenced genomes, for which few informatics tools are available. The genome-tools Web interface is accessible at http://genome-tools.sourceforge.net, and the source code is available at http://sourceforge.net/projects/genome-tools.

18.
Published genomes frequently contain erroneous gene models that represent issues associated with identification of open reading frames, start sites, splice sites, and related structural features. The source of these inconsistencies is often traced back to integration across text file formats designed to describe long read alignments and predicted gene structures. In addition, the majority of gene prediction frameworks do not provide robust downstream filtering to remove problematic gene annotations, nor do they represent these annotations in a format consistent with current file standards. These frameworks also lack consideration for functional attributes, such as the presence or absence of protein domains that can be used for gene model validation. To provide oversight to the increasing number of published genome annotations, we present a software package, the Gene Filtering, Analysis, and Conversion (gFACs), to filter, analyze, and convert predicted gene models and alignments. The software operates across a wide range of alignment, analysis, and gene prediction files with a flexible framework for defining gene models with reliable structural and functional attributes. gFACs supports common downstream applications, including genome browsers, and generates extensive details on the filtering process, including distributions that can be visualized to further assess the proposed gene space. gFACs is freely available and implemented in Perl with support from BioPerl libraries at https://gitlab.com/PlantGenomicsLab/gFACs.

19.
20.
The Conserved Key Amino Acid Positions DataBase (CKAAPs DB) provides access to an analysis of structurally similar proteins with dissimilar sequences where key residues within a common fold are identified. The derivation and significance of CKAAPs starting from pairwise structure alignments is described fully in Reddy et al. [Reddy,B.V.B., Li,W.W., Shindyalov,I.N. and Bourne,P.E. (2000) PROTEINS:, in press]. The CKAAPs identified from this theoretical analysis are provided to experimentalists and theoreticians for potential use in protein engineering and modeling. It has been suggested that CKAAPs may be crucial features for protein folding, structural stability and function. Over 170 substructures, as defined by the Combinatorial Extension (CE) database, which are found in approximately 3000 representative polypeptide chains have been analyzed and are available in the CKAAPs DB. CKAAPs DB also provides CKAAPs of the representative set of proteins derived from the CE and FSSP databases. Thus the database contains over 5000 representative polypeptide chains, covering all known structures in the PDB. A web interface to a relational database permits fast retrieval of structure-sequence alignments, CKAAPs and associated statistics. Users may query by PDB ID, protein name, function and Enzyme Classification number. Users may also submit protein alignments of their own to obtain CKAAPs. An interface to display CKAAPs on each structure from a web browser is also being implemented. CKAAPs DB is maintained by the San Diego Supercomputer Center and accessible at the URL http://ckaaps.sdsc.edu.
