PREreview of Seeking community input for: mzPeak - a modern, scalable, and interoperable mass spectrometry data format for the future

Published
DOI
10.5281/zenodo.14934154
License
CC BY 4.0

I appreciate the intent and effort involved in developing this document and proposed approaches. However, it seems that new file formats are being created often, and some of the proposed features of this format could be counterproductive for certain aims. To note my bias, I am a big fan of the mzMLb format.

The biggest difference between this format and mzMLb appears to be the desire to collect as much metadata as possible. For a Thermo LC and MS, a variety of LC parameters are collected and encoded in Thermo raw files (pump block pressures, flow meter pressure, column oven temperature, and more), as well as MS metadata (at least 257 chromatogram traces). But if I use a non-Thermo LC with a Thermo MS, those LC metadata are likely missing from the MS raw files, which complicates extraction. I think extracting all the desired metadata from generated files across all MS vendors would be almost impossible, which weakens the argument for moving away from “future-proof” mzMLb, since some metadata are already encoded in mzMLb.

The proposed storage of data in multiple files reminds me a lot of the Sciex instruments generating .wiff, .wiff2, and .wiff.scan files for a single run. Often, user error results in data repositories missing one or more of these file types, so the metadata or scan data are lost. I think the existing single-file approach of mzML, mzXML, mzMLb, and other formats is vastly superior to splitting a run across separate files for the sake of faster processing.

Just my two cents. I would be happy to be proven wrong and see this file format become very popular, but those two issues are sizeable.

Cheers,

Chris

Competing interests

The author declares that they have no competing interests.

Comments

  1. Comment by Samuel Wein

    Published
    License
    CC BY 4.0

    Hi Chris,

    Thanks for the review. I am also a big fan of mzMLb, and one of the technical implementations we are looking at is simply extending that format to encompass the additional metadata we want to collect, specifically by giving instrument vendors an explicit section to store vendor-specific data that doesn’t map cleanly onto our controlled vocabulary.

    I’d also prefer to find a solution that works in a single file (or at the very least transparently packs multiple files into an archive on close). Your comments on user error mirror what I have seen when trying to grab raw data out of PRIDE depositions: more files mean more chances for user error. We will keep that in mind when trying to balance speed against complexity.

    Competing interests

    I am an author of this preprint.