Comments
Comment by Steffen Heuckeroth
- Published
- License
- CC BY 4.0
Thanks to the authors for coming up with this suggestion. I appreciate the effort to harmonize the data storage across instrument vendors in a compressed format. I have some thoughts on the topic that may be useful to think about:
I am wondering whether you already have confirmed support from the instrument vendors. I think proprietary formats appeal to vendors because they can change and adapt them easily when they introduce a new technique, analyser, and so on. In an open format that is controlled by CV params, like mzML, a new technique may require new vocabulary to be introduced. This would happen well ahead of the market release of the new instrument and could thus give away their innovation to others. If I were a vendor, I would be reluctant to accept that.
I agree with Yasin’s comment that having metadata as free text would not be beneficial. I know that the preprint did not make any specific statement about how the metadata would be included, but it should be noted that undefined text provides no real benefit, because everybody would define, write, and parse it differently (e.g. the ‘comments’ field in msp libraries or any community-generated msp/mgf file).
Spectra and chromatograms from other detectors should be included in the mzPeak data (as stated by the authors). In addition, it might be useful to optionally add further traces (e.g. pump pressures, charged aerosol detector, ELSD, …) from external files, such as .csv files. That way, the concern about non-matching instrument setups (e.g. non-Thermo LC with Orbitrap) may be addressed.
Anyway, I am looking forward to any future news on this topic and thank you for the effort!
Best
Steffen
Competing interests
The author of this comment declares that they have no competing interests.
Comment by Samuel Wein
- Published
- License
- CC BY 4.0
Hi Yasin,
Thank you for your review. Here’s where we currently stand with regard to your questions:
“I don’t see any discussion of if/how the diversity of vendor formats is addressed. For example, while Thermo numbers scans sequentially Sciex applies a different scan numbering system utilizing 3 different numbers.”
“Will there be the same flexibility as in mzML files to account for this?”
We plan on supporting at minimum the same flexibility as mzML for scans. The idea is to support an arbitrary number of dimensions in addition to an m/z-intensity pair. This will allow both ion mobility and imaging, and has the ancillary benefit of supporting scans numbered along an arbitrary number of coordinates.
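As a rough illustration only (not the mzPeak design), scans addressed by an arbitrary set of named coordinates could be sketched as follows. The `Scan` class, the `select` helper, and the coordinate names (`scan`, `cycle`, `experiment`, `period`, `x`, `y`) are hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Scan:
    # Named coordinates identifying the scan; the names are illustrative only.
    coords: Dict[str, float]
    # (m/z, intensity) pairs.
    peaks: List[Tuple[float, float]]


def select(scans, **wanted):
    """Return the scans whose coordinates match every requested value."""
    return [
        s for s in scans
        if all(s.coords.get(key) == value for key, value in wanted.items())
    ]


scans = [
    # A single sequential scan number.
    Scan({"scan": 1}, [(445.12, 1.0e5)]),
    # A three-number scheme.
    Scan({"cycle": 1, "experiment": 2, "period": 1}, [(300.1, 2.0e4)]),
    # An imaging pixel.
    Scan({"x": 10, "y": 4}, [(760.58, 3.5e3)]),
]

# Lookup works the same way regardless of how many coordinates a scan has.
matches = select(scans, cycle=1, experiment=2)
```

Under these assumptions, a vendor-specific numbering scheme is simply another set of coordinates rather than a special case.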
“Will there be a way to have some superimposed identifier, allowing to retrieve scans based on a scan order irrespective of the numbering system native to the original Vendor?”
That’s a good idea. I think we should get that out of storing arbitrary dimensions: one of those can be actual acquisition time. I’d need to discuss with the technical folks how that would work.
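One way such a superimposed identifier might be derived, sketched under the assumption that every scan carries an acquisition time: sort by that time and assign a 1-based ordinal. The function name and the `native_id` and `time` keys are hypothetical, not part of any existing format.

```python
def assign_acquisition_order(scans):
    """Map each native scan ID to a 1-based index ordered by acquisition time."""
    ordered = sorted(scans, key=lambda s: s["time"])
    return {s["native_id"]: i for i, s in enumerate(ordered, start=1)}


# Illustrative records; IDs and times are made up for the example.
scans = [
    {"native_id": "cycle=1 experiment=2", "time": 0.52},
    {"native_id": "cycle=1 experiment=1", "time": 0.31},
    {"native_id": "cycle=2 experiment=1", "time": 0.74},
]

order = assign_acquisition_order(scans)
```

The resulting index is independent of the vendor's native numbering, so scans could be retrieved by acquisition order alone.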
“It is great that this format aims to support metadata relating to hyphenated instruments such as LC pump pressures as well as experimental metadata. One concern I have here is that it sounds like this information will be stored as free text. Will there be dedicated keys to communicate used instruments and/or experimental setups via controlled ontologies?”
Yup. We are going to continue to use the PSI-MS CV (https://github.com/HUPO-PSI/psi-ms-CV).
“One of the main challenges when working with mass spectral raw data as a data scientist is that such raw files have to be viewed in relationship to each other (for example samples are run within a sequence and can then be aligned using RT and mz). However such information is often lost once raw data have been stored in a repository. Will there be a key that serves for a sequence identifier?”
This is one of the main areas where we want to add metadata beyond what is already typically stored for mzML. We are looking to SDRF-Proteomics as a blueprint for this (https://github.com/bigbio/proteomics-sample-metadata/blob/master/sdrf-proteomics/README.adoc).
“Will there be a dedicated key to store a hashed unique identifier - that can be used to deduplicate different open formats of the same raw vendor file?”
This is a good idea. Storing a Universal Spectrum Identifier (https://www.psidev.info/usi) is already supported in the PSI ontology in mzML. We can definitely discuss the possibility of including a hash as well.
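A minimal sketch of how such a hash could be computed, assuming the vendor raw data is a single file (some vendor formats are directories, which would need a defined traversal order). The choice of SHA-256 and the function name `file_sha256` are assumptions for illustration only.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two converted files that record the same digest of their source raw file could then be recognised as duplicates without comparing their contents.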
“While all mass spectral raw data are originally acquired in profile mode the first thing that usually happens during conversion is to centroid mass peaks to simplify processing and optimize storage space. While this choice is practical — and I don’t see this changing in the near future — I think it would be great if the average mass resolution (which can be easily derived from profile raw data) could be stored — maybe per scan in the scan header in a dedicated key.”
I think this is definitely something that we could add as an optional annotation for a peak. I’ll bring it up with the group.
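For reference, resolution is commonly expressed as m/z divided by the peak's full width at half maximum (FWHM). A rough sketch of deriving it from profile data follows, assuming each peak is isolated, its apex is not at the edge of the window, and both flanks fall below half of the apex intensity. The function names are hypothetical.

```python
from statistics import mean


def fwhm(mz, intensity):
    """Estimate the FWHM of one profile peak by linear interpolation."""
    apex = max(range(len(intensity)), key=intensity.__getitem__)
    half = intensity[apex] / 2.0

    def crossing(i, j):
        # m/z at which the line between points i and j reaches `half`.
        frac = (half - intensity[i]) / (intensity[j] - intensity[i])
        return mz[i] + frac * (mz[j] - mz[i])

    # Walk outwards from the apex until each flank drops to half maximum.
    left = apex
    while left > 0 and intensity[left] > half:
        left -= 1
    right = apex
    while right < len(intensity) - 1 and intensity[right] > half:
        right += 1
    return crossing(right - 1, right) - crossing(left, left + 1)


def resolution(mz, intensity):
    """Resolution of one peak: apex m/z divided by its FWHM."""
    apex = max(range(len(intensity)), key=intensity.__getitem__)
    return mz[apex] / fwhm(mz, intensity)


def mean_resolution(peaks):
    """Average resolution over (mz_array, intensity_array) pairs of one scan."""
    return mean(resolution(m, i) for m, i in peaks)


# Illustrative profile peak; the values are made up for the example.
example_mz = [100.00, 100.01, 100.02, 100.03, 100.04]
example_intensity = [0.0, 40.0, 100.0, 40.0, 0.0]
example_r = resolution(example_mz, example_intensity)
```

A per-scan value computed this way before centroiding could be stored as a single number, which is far cheaper than keeping the profile data itself.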
Please let me know if you have any other questions or suggestions.
Sam
Competing interests
I am one of the authors on the preprint.