PREreview of Seeking community input for: mzPeak - a modern, scalable, and interoperable mass spectrometry data format for the future
- Published
- DOI: 10.5281/zenodo.17297593
- License: CC BY 4.0
I appreciate the vision laid out in this preprint and the five comments made to date, which demonstrate thoughtful progress toward mzPeak. As a member of LECO Corporation, which designs and manufactures time-of-flight mass spectrometers, I welcome the opportunity to contribute from a specific instrument-vendor perspective. I’ve captured some of my thoughts below, several of which overlap with previous comments.
Perhaps we can consider a repository where actual acquisition data files can be uploaded along with testing criteria. The diversity and complexity of the testing criteria could grow over time. An early test might verify the summed intensity over a “rectangle” (an m/z range crossed with a chromatographic retention-time range). A report could show the test results by reader version. This would both support development of the reader’s ability to handle vendor-specific files and let vendors understand how compatible various versions of their own software are.
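A minimal sketch of such a rectangle check, using toy in-memory tuples in place of a real mzPeak reader (the function name and data shape are illustrative assumptions, not part of the proposed format):

```python
def rectangle_sum(spectra, mz_lo, mz_hi, rt_lo, rt_hi):
    """Sum intensity inside an m/z window crossed with a retention-time window.

    spectra: iterable of (retention_time, mz_list, intensity_list) tuples,
    standing in for whatever a conformant reader would yield.
    """
    total = 0.0
    for rt, mzs, intensities in spectra:
        if rt_lo <= rt <= rt_hi:
            total += sum(i for m, i in zip(mzs, intensities)
                         if mz_lo <= m <= mz_hi)
    return total

# Toy data standing in for a reader's output:
spectra = [
    (10.0, [100.0, 200.0, 300.0], [5.0, 7.0, 1.0]),
    (11.0, [100.5, 250.0], [2.0, 4.0]),
    (99.0, [100.0], [50.0]),  # outside the retention-time window
]
observed = rectangle_sum(spectra, 100.0, 260.0, 9.0, 12.0)
assert abs(observed - 18.0) < 1e-9  # 5 + 7 + 2 + 4
```

A reference file in the repository would carry the expected sum alongside the window bounds, so every reader version can be scored against the same target.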
The testing suite could also contain tests for file writing. Each test could synthesize data and use the file writer to create a data file, followed by a reciprocal test to ensure the file contains the synthesized data as expected.
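A sketch of the reciprocal write/read test, with JSON standing in for the actual mzPeak writer and reader (both stand-ins are assumptions for illustration only):

```python
import json
import os
import random
import tempfile

def synthesize_spectrum(seed, n=100):
    """Deterministically synthesize a toy spectrum for round-trip testing."""
    rng = random.Random(seed)
    mzs = sorted(rng.uniform(50.0, 1200.0) for _ in range(n))
    intensities = [rng.uniform(0.0, 1e6) for _ in range(n)]
    return {"mz": mzs, "intensity": intensities}

def write_file(path, spectrum):   # stand-in for the mzPeak writer
    with open(path, "w") as f:
        json.dump(spectrum, f)

def read_file(path):              # stand-in for the mzPeak reader
    with open(path) as f:
        return json.load(f)

original = synthesize_spectrum(seed=42)
path = os.path.join(tempfile.mkdtemp(), "roundtrip.json")
write_file(path, original)
assert read_file(path) == original  # reciprocal check: read equals written
```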
With respect to metadata, there is great value in designing a system where an older version of mzPeak can read, modify, and then write a data file originally created with a newer version. Tests could reveal how well this is supported. For example, the older writer should preserve data it does not understand (because it was created by a newer version).
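One way to sketch the preserve-unknown-fields behavior, assuming a hypothetical v1 reader that splits metadata into a known portion and an opaque remainder (all field names here are invented for illustration):

```python
# Fields that "version 1" understands; anything else is carried opaquely.
KNOWN_FIELDS = {"instrument", "operator"}

def load_metadata(record):
    """Split a metadata record into known fields and opaque extras."""
    known = {k: v for k, v in record.items() if k in KNOWN_FIELDS}
    unknown = {k: v for k, v in record.items() if k not in KNOWN_FIELDS}
    return known, unknown

def save_metadata(known, unknown):
    """Write back both what we understood and what we did not."""
    out = dict(known)
    out.update(unknown)
    return out

# A record created by a newer version, with a field v1 has never seen:
newer_file = {"instrument": "TOF-X", "operator": "A",
              "annotation.v2.color": "red"}
known, unknown = load_metadata(newer_file)
known["operator"] = "B"                        # v1 edits what it knows
result = save_metadata(known, unknown)
assert result["annotation.v2.color"] == "red"  # newer data survives the edit
```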
The justification for this ability can be illustrated with a not-uncommon story: A user upgrades from version 1 to version 2 of a vendor’s software and processes data, taking advantage of version 2’s new annotation features. A month later, a discovered defect in version 2 forces the user to revert to version 1. Without hesitation, the user employs version 1 to modify files collected with version 2. Later, the vendor releases version 3, resolving the defect. The user upgrades from version 1 to version 3 and expects the annotations created in version 2 to still be intact.
Since performance will be an important aspect of any reader/writer system, tests could also include metrics for file size and wall-clock time on a standard configuration.
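A minimal sketch of a harness that records both metrics for a single writer call (the writer here is a stand-in; a real suite would run on a standard reference configuration):

```python
import os
import tempfile
import time

def benchmark_write(write_fn, payload):
    """Report wall-clock time and resulting file size for one writer call."""
    path = os.path.join(tempfile.mkdtemp(), "bench.out")
    t0 = time.perf_counter()
    write_fn(path, payload)
    elapsed = time.perf_counter() - t0
    return {"seconds": elapsed, "bytes": os.path.getsize(path)}

def naive_writer(path, payload):  # stand-in for a candidate file writer
    with open(path, "wb") as f:
        f.write(payload)

report = benchmark_write(naive_writer, b"\x00" * 1_000_000)
assert report["bytes"] == 1_000_000
assert report["seconds"] >= 0.0
```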
Beyond testing, there are complications that may apply specifically to LECO’s situation. For example, each spectrum in a TOF mass spectrometer data file can be represented as an array of intensities indexed by time-of-flight. Translating TOF to m/z is done with a calibration function that may or may not change during acquisition. Sometimes the relationship is more complex; for instance, multiple overlapping spectra may intentionally be sent to the detector to improve duty cycle and dynamic range (see: https://patents.google.com/patent/US20130048852A1/en). How should this be delivered to third-party systems that do not understand such complexity?
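For the simple case, a common first-order TOF calibration is t = t0 + k·sqrt(m/z). A sketch with illustrative (not real) coefficients shows why per-segment calibration metadata matters when coefficients are re-fit during acquisition:

```python
import math

def tof_to_mz(tof, t0, k):
    """Invert t = t0 + k * sqrt(m/z): m/z = ((t - t0) / k) ** 2."""
    return ((tof - t0) / k) ** 2

def mz_to_tof(mz, t0, k):
    return t0 + k * math.sqrt(mz)

# Illustrative coefficients only; a real file would store one set per
# calibration segment, since (t0, k) may change during acquisition.
t0, k = 1.5e-7, 2.0e-6
t = mz_to_tof(500.0, t0, k)
assert abs(tof_to_mz(t, t0, k) - 500.0) < 1e-9  # round-trip recovers m/z
```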
Much TOF-domain data falls below a noise threshold and can be handled with proven compression techniques. However, the empty space can be vast, especially in high-resolution instruments, and uncompressing hordes of zeros in memory seems archaic. Some LECO data systems skip archiving TOF profile data entirely and instead retain only significant ion events as tuples of (TOF, intensity, quality metrics). How can third-party systems interpret this when they assume profile spectra?
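The event-tuple idea can be sketched as a sparse representation that is expanded to a dense profile only on demand, so the archive never stores runs of zeros (data shapes here are assumptions for illustration):

```python
def events_to_profile(events, n_bins):
    """Expand sparse (tof_bin, intensity) events into a dense profile array."""
    profile = [0.0] * n_bins
    for idx, intensity in events:
        profile[idx] += intensity  # events in the same bin accumulate
    return profile

# Toy (TOF bin, intensity) pairs; quality metrics omitted for brevity.
events = [(3, 10.0), (7, 2.5), (3, 1.0)]
profile = events_to_profile(events, n_bins=10)
assert profile[3] == 11.0 and profile[7] == 2.5
assert sum(1 for v in profile if v == 0.0) == 8  # the rest stays empty
```

A reader could expose both views, letting profile-assuming consumers call the expansion while sparse-aware consumers read the events directly.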
Most LECO mass spectrometers are coupled to chromatographs, often with a modulator for multi-dimensional chromatography and sometimes with additional detectors such as flame ionization or spectral detectors. Metadata must capture chromatographic conditions, modulation details (trap time and duty cycle period, which may be irregular), additional detector data, and ambient data such as oven temperature and power supply readings.
Returning to compression, there are advantages when algorithms understand the texture of multi-dimensional chromatography data. Even in single-dimensional chromatography, a compression algorithm that leverages the nature of chromatographic peaks (e.g., relatively stable ion ratios) is useful.
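As a toy illustration of structure-aware encoding, delta-encoding an ion trace across adjacent scans exploits the smoothness of chromatographic peaks; small deltas compress well downstream. This is a generic technique, not a claim about mzPeak’s compressor:

```python
def delta_encode(trace):
    """Store the first value, then successive differences between scans."""
    return [trace[0]] + [b - a for a, b in zip(trace, trace[1:])]

def delta_decode(deltas):
    """Reverse delta encoding by cumulative summation."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

trace = [100, 120, 180, 250, 240, 150, 90]  # toy ion trace over scans
encoded = delta_encode(trace)
assert delta_decode(encoded) == trace              # lossless round trip
assert max(map(abs, encoded[1:])) < max(trace)     # deltas are smaller
```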
Regarding acquisition-time metadata, we should consider how to consistently store information such as ionization type, autosampler sequences, chromatographic parameters, and auxiliary detector configurations. Should this reside within the same file, or should relationships be defined across multiple files? Do we take an informal approach to metadata, leaving each vendor free to define field names and schemas, or do we want something more than “it must be a valid Parquet file”?
Suppose we create a registry to establish a unified schema and naming convention. This could help us avoid unwieldy “coalesce” code. Beyond that, we could be intentional about how we describe our data. Seeing an existing name that represents “mass spectrometer acquisition completion date time UTC” would help us know that this field does not describe when the auto sampler injection took place.
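A minimal sketch of what registry lookup could feel like; the field names and descriptions below are hypothetical, invented only to show how registered semantics would replace guesswork:

```python
# Hypothetical registry mapping registered field names to their semantics.
REGISTRY = {
    "ms.acquisition.completed_utc":
        "Mass spectrometer acquisition completion date-time (UTC)",
    "autosampler.injection.started_utc":
        "Autosampler injection start date-time (UTC)",
}

def describe(field):
    """Resolve a field name to its registered meaning, or fail loudly."""
    try:
        return REGISTRY[field]
    except KeyError:
        raise KeyError(f"unregistered field: {field!r}")

# The two timestamps are now unambiguous rather than vendor folklore:
assert "completion" in describe("ms.acquisition.completed_utc")
assert "injection" in describe("autosampler.injection.started_utc")
```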
Annotations raise further questions. How do we store identified compounds, deconvolution proposals, library search results (and their parameters), quantification results, and so forth? Should these reside in related files or a single file? Is there value in keeping acquisition files immutable? We also need to wrestle with relationships between related samples, such as those differing only in ionization method, dilution factor, lot number, or replicate. How should supporting files for quantification or other studies be associated? Parquet alone does not solve this. Will we adopt Apache Iceberg or another approach?
I’ll conclude here, though many other considerations remain. Clearly, this is a beast of a never-ending endeavor. What can I do to help?
Competing interests
The author declares that they have no competing interests.
Use of Artificial Intelligence (AI)
The author declares that they did not use generative AI to come up with new ideas for their review.