The HUPO Proteomics Standards Initiative has developed several standardized data formats to facilitate data sharing in mass spectrometry (MS)-based proteomics. These allow researchers to report their complete results in a unified way. However, at present, there is no format to describe the final qualitative and quantitative results for proteomics and metabolomics experiments in a simple tabular format. Many downstream analysis use cases are only concerned with the final results of an experiment and require an easily accessible format, compatible with tools such as Microsoft Excel or R.We developed the mzTab file format for MS-based proteomics and metabolomics results to meet this need. mzTab is intended as a lightweight supplement to the existing standard XML-based file formats (mzML, mzIdentML, mzQuantML), providing a comprehensive summary, similar in concept to the supplemental material of a scientific publication. mzTab files can contain protein, peptide, and small molecule identifications together with experimental metadata and basic quantitative information. The format is not intended to store the complete experimental evidence but provides mechanisms to report results at different levels of detail. These range from a simple summary of the final results to a representation of the results including the experimental design. This format is ideally suited to make MS-based proteomics and metabolomics results available to a wider biological community outside the field of MS. Several software tools for proteomics and metabolomics have already adapted the format as an output format. The comprehensive mzTab specification document and extensive additional documentation can be found online.Mass spectrometry (MS)
1 has become a major analysis tool in the life sciences (
1). It is currently used in different modes for several “omics” approaches, proteomics and metabolomics being the most prominent. In both disciplines, one major burden in the exchange, communication, and large-scale (re-) analysis of MS-based data is the significant number of software pipelines and, consequently, heterogeneous file formats used to process, analyze, and store these experimental results, including both identification and quantification data. Publication guidelines from scientific journals and funding agencies'' requirements for public data availability have led to an increasing amount of MS-based proteomics and metabolomics data being submitted to public repositories, such as those of the ProteomeXchange consortium (
2) or, in the case of metabolomics, the resources from the nascent COSMOS (Coordination of Standards in Metabolomics) initiative (
3).In the past few years, the Human Proteome Organization Proteomics Standards Initiative (PSI) has developed several vendor-neutral standard data formats to overcome the representation heterogeneity. The Human Proteome Organization PSI promotes the usage of three XML file formats to fully report the data coming from MS-based proteomics experiments (including related metadata): mzML (
4) to store the “primary” MS data (the spectra and chromatograms), mzIdentML (
5) to report peptide identifications and inferred protein identifications, and mzQuantML (
6) to store quantitative information associated with these results.Even though the existence of the PSI standard data formats represents a huge step forward, these formats cannot address all use cases related to proteomics and metabolomics data exchange and sharing equally well. During the development of mzML, mzIdentML, and mzQuantML, the main focus lay on providing an exact and comprehensive representation of the gathered results. All three formats can be used within analysis pipelines and as interchange formats between independent analysis tools. It is thus vital that these formats be capable of storing the full data and analysis that led to the results. Therefore, all three formats result in relatively complex schemas, a clear necessity for adequate representation of the complexity found in MS-based data.An inevitable drawback of this approach is that data consumers can find it difficult to quickly retrieve the required information. Several application programming interfaces (APIs) have been developed to simplify software development based on these formats (
7–
9), but profound proteomics and bioinformatics knowledge still is required in order to use them efficiently and take full advantage of the comprehensive information contained.The new file format presented here, mzTab, aims to describe the qualitative and quantitative results for MS-based proteomics and metabolomics experiments in a consistent, simpler tabular format, abstracting from the mass spectrometry details. The format contains identifications, basic quantitative information, and related metadata. With mzTab''s flexible design, it is possible to report results at different levels ranging from a simple summary or subset of the complete information (
e.g. the final results) to fairly comprehensive representation of the results including the experimental design. Many downstream analysis use cases are only concerned with the final results of an experiment in an easily accessible format that is compatible with tools such as Microsoft Excel® or R (
10) and can easily be adapted by existing bioinformatics tools. Therefore, mzTab is ideally suited to make MS proteomics and metabolomics results available to the wider biological community, beyond the field of MS.mzTab follows a similar philosophy as the other tab-delimited format recently developed by the PSI to represent molecular interaction data, MITAB (
11). MITAB is a simpler tab-delimited format, whereas PSI-MI XML (
12), the more detailed XML-based format, holds the complete evidence. The microarray community makes wide use of the format MAGE-TAB (
13), another example of such a solution that can cover the main use cases and, for the sake of simplicity, is often preferred to the XML standard format MAGE-ML (
14). Additionally, in MS-based proteomics, several software packages, such as Mascot (
15), OMSSA (
16), MaxQuant (
17), OpenMS/TOPP (
18,
19), and SpectraST (
20), also support the export of their results in a tab-delimited format next to a more complete and complex default format. These simple formats do not contain the complete information but are nevertheless sufficient for the most frequent use cases.mzTab has been designed with the same purpose in mind. It can be used alone or in conjunction with mzML (or other related MS data formats such as mzXML (
21) or text-based peak list formats such as MGF), mzIdentML, and/or mzQuantML. Several highly successful concepts taken from the development process of mzIdentML and mzQuantML were adapted to the text-based nature of mzTab.In addition, there is a trend to perform more integrated experimental workflows involving both proteomics and metabolomics data. Thus, we developed a standard format that can represent both types of information in a single file.
相似文献