Similar articles
 20 similar articles found
1.

Motivation

In mass spectrometry-based proteomics, XML formats such as mzML and mzXML provide an open and standardized way to store and exchange the raw data (spectra and chromatograms) of mass spectrometric experiments. These file formats are being used by a multitude of open-source and cross-platform tools which allow the proteomics community to access algorithms in a vendor-independent fashion and perform transparent and reproducible data analysis. Recent improvements in mass spectrometry instrumentation have increased the data size produced in a single LC-MS/MS measurement and put substantial strain on open-source tools, particularly those that are not equipped to deal with XML data files that reach dozens of gigabytes in size.

Results

Here we present a fast and versatile parsing library for mass spectrometric XML formats available in C++ and Python, based on the mature OpenMS software framework. Our library implements an API for obtaining spectra and chromatograms under memory constraints using random access or sequential access functions, allowing users to process datasets that are much larger than system memory. For fast access to the raw data structures, small XML files can also be completely loaded into memory. In addition, we have improved the parsing speed of the core mzML module by over 4-fold (compared to OpenMS 1.11), making our library suitable for a wide variety of algorithms that need fast access to dozens of gigabytes of raw mass spectrometric data.
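
The sequential-access idea described above can be illustrated with a minimal sketch that uses only the Python standard library. This is not the OpenMS or pyOpenMS API: the helper names `local_name` and `iter_spectrum_ids` and the tiny sample document are hypothetical, and a real mzML file carries far more metadata. The sketch only shows how a parser can visit one spectrum at a time and release it before moving on.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://psi.hupo.org/ms/mzml}spectrum'."""
    return tag.rsplit("}", 1)[-1]


def iter_spectrum_ids(source):
    """Yield the 'id' attribute of each <spectrum> element, one at a time.

    Each element is cleared once it has been handled, so the bulky
    per-spectrum payload is released instead of accumulating in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "spectrum":
            yield elem.get("id")
            elem.clear()  # drop children and attributes we no longer need


# Tiny hypothetical document standing in for a multi-gigabyte mzML file.
SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo">
    <spectrumList count="2">
      <spectrum id="scan=1" index="0"/>
      <spectrum id="scan=2" index="1"/>
    </spectrumList>
  </run>
</mzML>"""

if __name__ == "__main__":
    for spectrum_id in iter_spectrum_ids(io.BytesIO(SAMPLE)):
        print(spectrum_id)
```

Random access, the other mode mentioned in the abstract, would additionally need an index of byte offsets for each spectrum; that part is not sketched here.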

Availability

Our C++ and Python implementations are available for the Linux, Mac, and Windows operating systems. All proposed modifications to the OpenMS code have been merged into the OpenMS mainline codebase and are available to the community at https://github.com/OpenMS/OpenMS.

2.
Shotgun proteomics workflows for database protein identification typically include a combination of search engines and postsearch validation software based mostly on machine learning algorithms. Here, a new postsearch validation tool called Scavager is presented. It employs CatBoost, an open‐source gradient boosting library, and shows improved efficiency compared with other popular algorithms such as Percolator, PeptideProphet, and Q‐ranker. The comparison is done using multiple data sets and search engines, including MSGF+, MSFragger, X!Tandem, Comet, and the recently introduced IdentiPy. Implemented in the Python programming language, Scavager is open‐source and freely available at https://bitbucket.org/markmipt/scavager .

3.
4.
Open source clustering software   Total citations: 20, self-citations: 0, citations by others: 20

5.
Genomics, 2022, 114(2): 110264
Cancer is one of the leading causes of human death each year. In recent years, cancer identification and classification using machine learning have gained momentum due to the availability of high-throughput sequencing data. RNA-seq is driving rapid progress in cancer research, and new insights into cancer and related treatments are coming to light. In this paper, we propose PanClassif, a method that requires only a small number of effective genes to detect cancer from RNA-seq data and is able to provide a performance gain across a wide range of machine learning classifiers. We have taken 22 types of cancer samples from The Cancer Genome Atlas (TCGA), comprising 8287 cancer samples and 680 normal samples. First, PanClassif uses k-Nearest Neighbour (k-NN) smoothing to handle noise in the data. Then effective genes are selected by an ANOVA-based test. To balance the training data, PanClassif applies an oversampling method, SMOTE. We have performed comprehensive experiments on the datasets using several classification algorithms. Experimental results show that PanClassif outperforms existing state-of-the-art methods and shows consistent performance on two single-cell RNA-seq datasets taken from Gene Expression Omnibus (GEO). PanClassif improves the performance of a wide variety of classifiers for both binary cancer prediction and multi-class cancer classification. PanClassif is available as a Python package (https://pypi.org/project/panclassif/). All the source code and materials of PanClassif are available at https://github.com/Zwei-inc/panclassif.
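
The ANOVA-based gene selection step mentioned in this abstract can be sketched as follows. This is a minimal illustration in plain Python, not PanClassif's own code: the function names `anova_f` and `rank_genes`, the gene names, and the toy expression values are all hypothetical. It computes the one-way ANOVA F statistic per gene (between-class mean square divided by within-class mean square) and ranks genes by it.

```python
from statistics import fmean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values.

    Assumes at least two groups and non-zero within-group variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = fmean([v for g in groups for v in g])
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_genes(expression, labels):
    """Return gene names sorted by decreasing F statistic.

    expression: dict mapping gene name -> list of values, one per sample
    labels: list of class labels, one per sample
    """
    classes = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c] for c in classes]
        scores[gene] = anova_f(groups)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy data: three "normal" and three "tumor" samples, two made-up genes.
    labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
    expression = {
        "geneA": [1, 2, 3, 4, 5, 6],  # separates the classes well
        "geneB": [1, 5, 3, 2, 4, 6],  # barely separates them
    }
    print(rank_genes(expression, labels))
```

Keeping only the top-ranked genes gives the reduced feature set; the k-NN smoothing and SMOTE steps named in the abstract are separate stages and are not shown.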

6.
The continued evolution of modern mass spectrometry instrumentation and associated methods represents a critical component in efforts to decipher the molecular mechanisms which underlie normal physiology and understand how dysregulation of biological pathways contributes to human disease. The increasing scale of these experiments combined with the technological diversity of mass spectrometers presents several challenges for community‐wide data access, analysis, and distribution. Here we detail a redesigned version of multiplierz, our Python software library which leverages our common application programming interface (mzAPI) for analysis and distribution of proteomic data. New features include support for a wider range of native mass spectrometry file types, interfaces to additional database search engines, compatibility with new reporting formats, and high‐level tools to perform post‐search proteomic analyses. A GUI desktop environment, mzDesktop, provides access to multiplierz functionality through a user friendly interface. multiplierz is available for download from: https://github.com/BlaisProteomics/multiplierz ; and mzDesktop is available for download from: https://sourceforge.net/projects/multiplierz/

7.
A C++ class library is available to facilitate the implementation of software for genomics and sequence polymorphism analysis. The library implements methods for data manipulation and the calculation of several statistics commonly used to analyze SNP data. The object-oriented design of the library is intended to be extensible, allowing users to design custom classes for their own needs. In addition, routines are provided to process samples generated by a widely used coalescent simulation. AVAILABILITY: The source code (in C++) is available from http://www.molpopgen.org

8.
Protein identification by MS is an important technique in both gel‐based and gel‐free proteome studies. The Open Mass Spectrometry Search Algorithm (OMSSA) ( http://pubchem.ncbi.nlm.nih.gov/omssa ) is an open‐source search engine that can be used to identify MS/MS spectra acquired in these experiments. Here we present a lightweight, open‐source Java software library, OMSSA Parser ( http://code.google.com/p/omssa-parser ), which parses OMSSA omx result files into easily accessible and fully functional object models. In addition, we provide examples illustrating the usage of our library.

9.
Genetic samples can be used to understand and predict the behaviour of species living in a fragmented and temporally changing environment. In this regard, models of coalescence conditioned to an environment through an explicit modelling of population growth and migration have been developed in recent years, and simulators implementing these models have been developed, enabling biologists to estimate parameters of interest with Approximate Bayesian Computation techniques. However, model choice remains limited, and developing new coalescence simulators is extremely time consuming because code re‐use is limited. We present Quetzal, a C++ library composed of re‐usable components, which is sufficiently general to efficiently implement a wide range of spatially explicit coalescence‐based environmental models of population genetics and to embed the simulation in an Approximate Bayesian Computation framework. Quetzal is not a simulation program, but a toolbox for programming simulators aimed at the community of scientific coders and research software engineers in molecular ecology and phylogeography. This new code resource is open‐source and available at https://becheler.github.io/pages/quetzal.html along with other documentation resources.

10.
SUMMARY: ESS++ is a C++ implementation of a fully Bayesian variable selection approach for single and multiple response linear regression. ESS++ works well both when the number of observations is larger than the number of predictors and in the 'large p, small n' case. In the current version, ESS++ can handle several hundred observations, thousands of predictors and a few responses simultaneously. The core engine of ESS++ for the selection of relevant predictors is based on Evolutionary Monte Carlo. Our implementation is open source, allowing community-based alterations and improvements. AVAILABILITY: C++ source code and documentation including compilation instructions are available under GNU licence at http://bgx.org.uk/software/ESS.html.

11.
SIMLR (Single‐cell Interpretation via Multi‐kernel LeaRning), an open‐source tool that implements a novel framework to learn a sample‐to‐sample similarity measure from expression data observed for heterogeneous samples, is presented here. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmarked against state‐of‐the‐art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better visualization. SIMLR is available on GitHub ( https://github.com/BatzoglouLabSU/SIMLR ) in both R and MATLAB implementations. It is also available as an R package on http://bioconductor.org

12.
TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders   Total citations: 5, self-citations: 0, citations by others: 5
We describe two new Generalized Hidden Markov Model implementations for ab initio eukaryotic gene prediction. The C/C++ source code for both is available as open source and is highly reusable due to their modular and extensible architectures. Unlike most of the currently available gene-finders, the programs are re-trainable by the end user. They are also re-configurable and include several types of probabilistic submodels which can be independently combined, such as Maximal Dependence Decomposition trees and interpolated Markov models. Both programs have been used at TIGR for the annotation of the Aspergillus fumigatus and Toxoplasma gondii genomes. AVAILABILITY: Source code and documentation are available under the open source Artistic License from http://www.tigr.org/software/pirate

13.
Identification of proteins by MS plays an important role in proteomics. A crucial step concerns the identification of peptides from MS/MS spectra. The X!Tandem Project ( http://www.thegpm.org/tandem ) supplies an open‐source search engine for this purpose. In this study, we present an open‐source Java library called XTandem Parser that parses X!Tandem XML result files into an easily accessible and fully functional object model ( http://xtandem-parser.googlecode.com ). In addition, a graphical user interface is provided that functions as a usage example and an end‐user visualization tool.

14.
15.
High quality species distribution data must be easy and efficient to obtain when predicting the potential distribution of species using species distribution models (SDMs). There is a need for a powerful software tool to automatically or semi-automatically assist in identifying and correcting errors. Here, we use Python to develop a web-based software tool (SDMdata) to easily collect occurrence data from the Global Biodiversity Information Facility (GBIF) and check species names and the accuracy of coordinates (latitude and longitude). It is open source software (GNU Affero General Public License/AGPL licensed), allowing anyone to access and manipulate the source code. SDMdata is available online free of charge from <http://www.sdmserialsoftware.org/sdmdata/>.
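
As a rough illustration of the coordinate check mentioned in this abstract, the following Python sketch flags obviously implausible latitude/longitude pairs. The abstract does not say which checks SDMdata actually performs, so the rules here (range limits, missing values, and the 0/0 placeholder) are assumptions, and the function name `check_coordinates` and the sample records are hypothetical.

```python
def check_coordinates(lat, lon):
    """Return a list of problems found for one occurrence record.

    An empty list means no problem was detected by these simple rules.
    """
    if lat is None or lon is None:
        return ["missing coordinate"]
    problems = []
    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    if lat == 0 and lon == 0:
        # (0, 0) lies in the Gulf of Guinea and is often a data-entry placeholder.
        problems.append("zero coordinates (often a placeholder)")
    return problems


if __name__ == "__main__":
    # Hypothetical occurrence records: (species name, latitude, longitude).
    records = [
        ("Panthera tigris", 27.5, 84.3),
        ("Panthera tigris", 127.5, 84.3),
        ("Panthera tigris", 0, 0),
        ("Panthera tigris", None, 84.3),
    ]
    for name, lat, lon in records:
        print(name, lat, lon, check_coordinates(lat, lon))
```

Records with a non-empty problem list could then be set aside for manual or semi-automatic correction.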

16.
Various types of unwanted and uncontrollable signal variations in MS‐based metabolomics and proteomics datasets severely impair the accuracy of metabolite and protein profiling. Therefore, pooled quality control (QC) samples are often employed in quality management processes, which are indispensable to the success of metabolomics and proteomics experiments, especially in high‐throughput cases and long‐term projects. However, data consistency and QC sample stability are still difficult to guarantee because of the complexity of experimental operations and differences between experimenters. To make things worse, numerous proteomics projects do not take QC samples into consideration at the beginning of experimental design. Herein, a powerful and interactive web‐based software tool, named pseudoQC, is presented to simulate QC sample data for actual metabolomics and proteomics datasets using four different machine learning‐based regression methods. The simulated data are used for correction and normalization of two published datasets, and the obtained results suggest that nonlinear regression methods perform better than linear ones. Additionally, the software is available as a web‐based graphical user interface and can be utilized by scientists without a bioinformatics background. pseudoQC is open‐source software and freely available at https://www.omicsolution.org/wukong/pseudoQC/ .

17.
18.
19.
Retraction: Kamath PR, Joseph MM, Ajees AA, et al. Bisindole‐oxadiazole hybrids, T3P mediated synthesis and appraisal of their apoptotic, antimetastatic and computational Bcl‐2 binding potential. J Biochem Mol Toxicol. 2017;31:e21962. https://doi.org/10.1002/jbt.21962 The above article from the Journal of Biochemical and Molecular Toxicology, published online on 19 July 2017 in Wiley Online Library ( https://onlinelibrary.wiley.com/doi/abs/10.1002/jbt.21962 ) and in Volume 31, Issue 11, has been retracted by agreement of the Journal Editor‐in‐Chief, Dr Hari Bhat, and Wiley Periodicals, Inc. The retraction has been agreed due to the absence of access to the original data needed to answer questions about the reliability of some of the findings presented in the paper.

20.
The availability of user‐friendly software to annotate biological datasets and experimental details is becoming essential in data management practices, both in local storage systems and in public databases. The Ontology Lookup Service (OLS, http://www.ebi.ac.uk/ols ) is a popular centralized service to query, browse and navigate biomedical ontologies and controlled vocabularies. Recently, the OLS framework has been completely redeveloped (version 3.0), including enhancements in the data model, like the added support for Web Ontology Language based ontologies, among many other improvements. However, the new OLS is not backwards compatible and new software tools are needed to enable access to this widely used framework now that the previous version is no longer available. Here we present the OLS Client as a free, open‐source Java library to retrieve information from the new version of the OLS. It enables rapid tool creation by providing a robust, pluggable programming interface and common data model to programmatically access the OLS. The library has already been integrated and is routinely used by several bioinformatics resources and related data annotation tools. We also introduce an updated version of the OLS Dialog (version 2.0), a Java graphical user interface that can be easily plugged into Java desktop applications to access the OLS. The software and related documentation are freely available at https://github.com/PRIDE-Utilities/ols-client and https://github.com/PRIDE-Toolsuite/ols-dialog .
