Similar articles
 Found 20 similar articles (search time: 78 ms)
1.
2.
The rapid development of high-throughput sequencing technologies has led to a dramatic decrease in the money and time required for de novo genome sequencing or genome resequencing projects, with new genome sequences released every week. Among such projects, the plethora of updated genome assemblies creates a need for version-dependent annotation files and other compatible public datasets for downstream analysis. To handle these tasks in an efficient manner, we developed the reference-based genome assembly and annotation tool (RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. RGAAT can detect sequence variants with comparable precision, specificity, and sensitivity to GATK and with higher precision and specificity than Freebayes and SAMtools on four DNA-seq datasets tested in this study. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool (RATT), RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification, genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/ and https://github.com/wushyer/RGAAT_v2 at no cost.

3.
Storage and transmission of the data produced by modern DNA sequencing instruments have become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference-based compression (CRAM, Goby) and non-reference-based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state-of-the-art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.

4.
Pipelines for the analysis of Next-Generation Sequencing (NGS) data are generally composed of a set of publicly available software tools, configured together in order to map short reads of a genome and call variants. The fidelity of pipelines is variable. We have developed ArtificialFastqGenerator, which takes a reference genome sequence as input and outputs artificial paired-end FASTQ files containing Phred quality scores. Since these artificial FASTQs are derived from the reference genome, they provide a gold standard for read alignment and variant calling, thereby enabling the performance of any NGS pipeline to be evaluated. The user can customise DNA template/read length, the modelling of coverage based on GC content, whether to use real Phred base quality scores taken from existing FASTQ files, and whether to simulate sequencing errors. Detailed coverage and error summary statistics are output. Here we describe ArtificialFastqGenerator and illustrate its implementation in evaluating a typical bespoke NGS analysis pipeline under different experimental conditions. ArtificialFastqGenerator was released in January 2012. Source code, example files and binaries are freely available under the terms of the GNU General Public License v3.0 from https://sourceforge.net/projects/artfastqgen/.
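
The Phred quality scores mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not taken from ArtificialFastqGenerator; the function names are hypothetical, and the ASCII offset of 33 (the common Sanger-style encoding) is an assumption, since the abstract does not state which encoding the tool writes.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 is an assumption (Sanger-style encoding).
    """
    return [ord(ch) - offset for ch in qual]


def encode_quality(scores, offset: int = 33) -> str:
    """Encode integer Phred scores as a FASTQ quality string."""
    return "".join(chr(q + offset) for q in scores)


if __name__ == "__main__":
    scores = decode_quality("II5!")
    print(scores)
    print([phred_to_error_prob(q) for q in scores])
```

Under this encoding, a simulator that wants to inject sequencing errors could draw a substitution at each base with the probability returned by `phred_to_error_prob`.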

5.
6.
Big-data storage is a challenge in the post-genome era. Hence, there is a need for high-performance computing solutions for managing large genomic data. Therefore, it is of interest to describe a parallel-computing approach that uses a message-passing library to distribute the different compression stages across clusters. Genomic compression helps to reduce the on-disk "footprint" of large volumes of sequence data, which supports the computational infrastructure for more efficient archiving. In this report, the approach is shown to be useful on 21 eukaryotic genomes using stratified sampling. The method achieves an average 6-fold reduction in disk space, with compression three times faster than COMRAD.

Availability

The source code is written in C using message-passing libraries and is available at https://sourceforge.net/projects/comradmpi/files/COMRADMPI/

7.
8.
9.
10.
11.
Protein carbonylation is one of the most pervasive oxidative stress-induced post-translational modifications (PTMs), which plays a significant role in the etiology and progression of several human diseases. It has been regarded as a biomarker of oxidative stress due to its relatively early formation and stability compared with other oxidative PTMs. Only a subset of proteins is prone to carbonylation and most carbonyl groups are formed from lysine (K), arginine (R), threonine (T) and proline (P) residues. Recent advancements in analysis of the PTM by mass spectrometry provided new insights into the mechanisms of protein carbonylation, such as protein susceptibility and exact modification sites. However, the experimental approaches to identifying carbonylation sites are costly, time-consuming and capable of processing only a limited number of proteins, and so far there is no bioinformatics method or tool devoted to predicting carbonylation sites of human proteins. In this paper, a computational method is proposed to identify carbonylation sites of human proteins. The method extracted four kinds of features and combined the minimum Redundancy Maximum Relevance (mRMR) feature selection criterion with weighted support vector machine (WSVM) to achieve total accuracies of 85.72%, 85.95%, 83.92% and 85.72% for K, R, T and P carbonylation site predictions respectively using 10-fold cross-validation. The final optimal feature sets were analysed, and the position-specific composition and hydrophobicity environment of flanking residues of modification sites were discussed. In addition, a software tool named CarSPred has been developed to facilitate the application of the method. Datasets and the software involved in the paper are available at https://sourceforge.net/projects/hqlstudio/files/CarSPred-1.0/.

12.
Epistasis is a ubiquitous phenomenon in genetics, and is considered to be one of the main factors in current efforts to detect missing heritability for complex diseases. Simulation is a critical tool in developing methodologies that can more effectively detect and study epistasis. Here we present a simulator, epiSIM (epistasis SIMulator), that can simulate some of the statistical properties of genetic data. EpiSIM is capable of expanding the range of the epistasis models that current simulators offer, including epistasis models that display marginal effects and those that display no marginal effects. One or more of these epistasis models can be embedded simultaneously into a single simulation data set, jointly determining the phenotype. In addition, epiSIM is independent of any outside data source in generating linkage disequilibrium patterns and haplotype blocks. We demonstrate the wide applicability of epiSIM by performing several data simulations, and examine its properties by comparing it with current representative simulators and by comparing the data that it generates with real data. Our experiments demonstrate that epiSIM is a valuable addition and a nice complement to the existing epistasis simulators. The software package is available online at https://sourceforge.net/projects/episimsimulator/files/.

13.
Next-generation sequencing technologies have increased the amount of biological data generated. Thus, bioinformatics has become important because new methods and algorithms are necessary to manipulate and process such data. However, certain challenges have emerged, such as genome assembly using short reads and high-throughput platforms. In this context, several algorithms have been developed, such as Velvet, Abyss, Euler-SR, Mira, Edna, Maq, SHRiMP, Newbler, ALLPATHS, Bowtie and BWA. However, most such assemblers do not have a graphical interface, which makes their use difficult for users without computing experience given the complexity of the assembler syntax. Thus, to make the operation of such assemblers accessible to users without a computing background, we developed AutoAssemblyD, which is a graphical tool for genome assembly submission and remote management by multiple assemblers through XML templates.

Availability

AutoAssemblyD is freely available at https://sourceforge.net/projects/autoassemblyd. It requires Sun JDK 6 or higher.

14.
Proteogenomic approaches have gained increasing popularity; however, it is still difficult to integrate mass spectrometry identifications with genomic data due to differing data formats. To address this difficulty, we introduce iPiG as a tool for the integration of peptide identifications from mass spectrometry experiments into existing genome browser visualizations. This simplifies the concurrent analysis of proteomic and genomic data and allows proteomic results to be compared directly with genomic data. iPiG is freely available from https://sourceforge.net/projects/ipig/. It is implemented in Java and can be run as a stand-alone tool with a graphical user interface or integrated into existing workflows. Supplementary data are available at PLOS ONE online.

15.
RNase H (RNH) is a pivotal domain in retroviruses that cleaves the DNA-RNA hybrid so that retroviral replication can continue. This crucial role indicates that RNH is a promising drug target for therapeutic intervention. However, the annotated RNHs in the UniProtKB database are still too few for a good understanding of their statistical characteristics. In this work, a computational RNH model was proposed to annotate new putative RNHs (np-RNHs) in the retroviruses. It predicts RNH domains by recognizing their start and end sites separately with a support vector machine (SVM). The classification accuracy rates are 100%, 99.01% and 97.52% for the jack-knife, 10-fold cross-validation and 5-fold cross-validation tests, respectively. Subsequently, this model discovered 14,033 np-RNHs after scanning sequences without RNH annotations. All these predicted np-RNHs and annotated RNHs were employed to analyze the length, hydrophobicity and evolutionary relationship of RNH domains. They are all related to retroviral genera, which validates the classification of retroviruses to a certain degree. Finally, a software tool was designed for the application of our prediction model. The software, together with the datasets involved in this paper, is available for free download at https://sourceforge.net/projects/rhtool/files/?source=navbar.

16.
We present an integrated stand-alone software package named KaKs_Calculator 2.0 as an updated version. It incorporates 17 methods for the calculation of nonsynonymous and synonymous substitution rates; among them, we added our modified versions of several widely used methods as the gamma series, including γ-NG, γ-LWL, γ-MLWL, γ-LPB, γ-MLPB, γ-YN and γ-MYN, which have been demonstrated to perform better under certain conditions than their original forms and are not implemented in the previous version. The package is readily used for the identification of positively selected sites based on a sliding window across the sequences of interest in the 5' to 3' direction of protein-coding sequences, and has improved the overall performance of sequence analysis for evolution studies. A toolbox, including C++ and Java source code and executable files for both Windows and Linux platforms together with user instructions, is downloadable for academic purposes at https://sourceforge.net/projects/kakscalculator2/.

17.
Epistasis has been receiving increasing attention in understanding the mechanism underlying susceptibility to complex diseases. Although much work has been done on epistasis detection, it remains a challenging task in genome-wide association studies: the search space is extremely large while high solution quality is demanded. In this study, we introduce an ant colony optimization based algorithm, AntMiner, which incorporates heuristic information into the ant-decision rules. The heuristic information is used to direct ants in the search process, improving computational efficiency and solution accuracy. During iterations, a chi-squared test is conducted to measure the association between an interaction and the phenotype. At the completion of the iteration process, statistically significant epistatic interactions are ordered and then screened by a post-procedure. AntMiner is compared with the existing algorithms epiMODE, TEAM and AntEpiSeeker on both simulated data sets and a real age-related macular degeneration data set, under the criteria of detection power and sensitivity. Results demonstrate that AntMiner is promising for epistasis detection. In terms of detection power, AntMiner outperforms all the other algorithms in all cases, regardless of epistasis model and single nucleotide polymorphism size; compared with AntEpiSeeker, AntMiner obtains better detection power with fewer ants and iterations. In terms of sensitivity, AntMiner is better than AntEpiSeeker in detecting epistasis models displaying marginal effects, but it has moderate sensitivity on epistasis models displaying no marginal effects. The study may provide clues on heuristics for further epistasis detection. The software package is available online at https://sourceforge.net/projects/antminer/files/.
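
The chi-squared association measure mentioned in the abstract above can be sketched in Python as follows. This is a generic Pearson chi-squared statistic on a genotype-combination by phenotype contingency table, not AntMiner's code; the function names and the 0/1/2 genotype and 0/1 phenotype coding are assumptions made for illustration.

```python
from collections import Counter


def interaction_table(snp_a, snp_b, phenotype):
    """Build a contingency table for one SNP pair.

    Rows are the observed (genotype_a, genotype_b) combinations,
    columns are phenotype 0 (control) and phenotype 1 (case).
    """
    counts = Counter(zip(snp_a, snp_b, phenotype))
    combos = sorted({(a, b) for a, b, _ in counts})
    return [[counts[(a, b, 0)], counts[(a, b, 1)]] for a, b in combos]


def chi_squared(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty margins to avoid division by zero
                stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Toy data: genotypes coded 0/1/2, phenotype coded 0/1 (hypothetical values).
    snp_a = [0, 0, 1, 1, 2, 2, 0, 1]
    snp_b = [0, 1, 1, 2, 2, 0, 0, 1]
    pheno = [0, 0, 1, 1, 1, 0, 0, 1]
    print(chi_squared(interaction_table(snp_a, snp_b, pheno)))
```

A larger statistic indicates a stronger association between the SNP pair and the phenotype; turning it into a p-value additionally requires the degrees of freedom, (rows - 1) × (columns - 1).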

18.
Simple sequence repeats (SSRs), also called microsatellites, are very useful for genetic marker development and genome applications. The growing number of whole-genome sequences of large genomes provides a source for in silico SSR mining. However, existing SSR mining tools cannot process large genomes efficiently and generate poor statistics or none at all. Genome-wide Microsatellite Analyzing Tool (GMATo) is a novel tool for SSR mining and statistics at the genome level. It is faster and more accurate than the existing tools SSR Locator and MISA. If a DNA sequence is too long, it is split into shorter segments of several Mb, followed by motif generation and searching using Perl's powerful pattern-matching functions. Matched loci data from each chunk are then merged to produce the final SSR loci information. Only one input file is required, containing raw FASTA DNA sequences, and the output files, in tabular format, list all SSR loci information and the statistical distribution in four classifications. GMATo was programmed in Java and Perl with both graphical and command-line interfaces, either of which can be run on its own in a platform-independent manner with full parameter control. GMATo is a powerful tool for complete SSR characterization in genomes of any size.

Availability

GMATo is freely available at http://sourceforge.net/projects/gmato/files/?source=navbar or on request.
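
The pattern-matching idea described in the abstract above can be sketched with a regular-expression backreference. The sketch below uses Python rather than the Perl and Java of GMATo, is not GMATo's code, and omits the chunking and merging steps; the function names and the defaults (motif length up to 6, at least 5 repeats) are assumptions chosen for illustration.

```python
import re


def is_primitive(motif: str) -> bool:
    """True when the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq: str, max_motif_len: int = 6, min_repeats: int = 5):
    """Return (start, end, motif, repeat_count) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif_len + 1):
        # A k-mer captured in group 1, followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                repeats = (match.end() - match.start()) // k
                hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)


if __name__ == "__main__":
    example = "GG" + "AT" * 6 + "CC" + "A" * 7 + "G"
    for hit in find_ssrs(example):
        print(hit)
```

The `is_primitive` filter keeps a run such as `ATATATAT...` from being reported a second time under the longer motif `ATAT`. A chunked version would also need overlapping segment boundaries so that repeats spanning two chunks are not lost.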

19.

Background

Superpositioning is an important problem in structural biology. Determining an optimal superposition requires a one-to-one correspondence between the atoms of two protein structures. However, in practice, some atoms are missing from their original structures. Current superposition implementations address the missing data crudely, by omitting such atoms from the structures.

Results

In this paper, we propose an effective method for superpositioning pairwise and multiple structures without sequence alignment. It is a two-stage procedure including data reduction and data registration.

Conclusions

Numerical experiments demonstrated that our method is effective and efficient. The code package of the protein structure superposition method for handling cases with missing data is implemented in MATLAB, and it is freely available from: http://sourceforge.net/projects/pssm123/files/?source=navbar

20.
Plant and animal genomes are replete with large gene families, making the task of ortholog identification difficult and labor intensive. OrthoRBH is an automated reciprocal blast pipeline tool enabling the rapid identification of specific gene families of interest in related species, streamlining the collection of homologs prior to downstream molecular evolutionary analysis. The efficacy of OrthoRBH is demonstrated with the identification of the 13-member PYR/PYL/RCAR gene family in Hordeum vulgare using Oryza sativa query sequences. OrthoRBH runs on the Linux command line and is freely available at SourceForge.

Availability

http://sourceforge.net/projects/orthorbh/
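
The reciprocal-best-hit criterion that the abstract above refers to can be sketched in Python as follows. This is a generic illustration, not OrthoRBH's implementation: the input is assumed to be already-parsed (query, subject, bit score) tuples from two BLAST runs, one in each direction, and the gene identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical identifiers: "os*" for query genes, "hv*" for target genes.
    a_vs_b = [("osA", "hvX", 200.0), ("osA", "hvY", 150.0), ("osB", "hvY", 90.0)]
    b_vs_a = [("hvX", "osA", 198.0), ("hvY", "osA", 140.0), ("hvY", "osB", 80.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data, `osB` is dropped because its best hit `hvY` points back to `osA`, which is the case a one-directional search would miss.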