PREreview of PreprintToPaper dataset: connecting bioRxiv preprints with journal publications
- Published
- DOI: 10.5281/zenodo.17625492
- License: CC BY 4.0
This paper presents a data set of links between preprints on the bioRxiv preprint server and the corresponding articles published in journals. The paper also presents a small data set of links that have been manually checked by two human annotators. All data is openly available on Zenodo.
I have two important comments on the paper.
First, the authors claim that “this dataset is the first systematic effort to automatically collect and link metadata from bioRxiv with publication records”. I don’t think this is correct. Europe PMC provides similar data. See https://doi.org/10.1371/journal.pone.0303005. Dominika Tkaczyk at Crossref has also created a similar data set. See https://doi.org/10.64000/dpcc9-k4564 and https://community.crossref.org/t/new-version-of-the-dataset-of-relationships-between-preprints-and-journal-articles/13536. Work done by Peter Eckmann and Anita Bandrowski also appears to be similar. See https://doi.org/10.1371/journal.pone.0281659.
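To make the comparison with Crossref concrete, here is a minimal Python sketch of how a preprint-to-article link can be read from Crossref metadata. It assumes a work record shaped like the `message` object returned by the Crossref REST API (`https://api.crossref.org/works/{doi}`), where such links appear under `relation` and `is-preprint-of`. The function name and the sample record are hypothetical, and the DOIs are placeholders rather than real records.

```python
from typing import Any


def journal_dois_for_preprint(work: dict[str, Any]) -> list[str]:
    """Return the DOIs that a Crossref work record lists under 'is-preprint-of'.

    Assumes `work` has the shape of the `message` object in a Crossref
    REST API response. Returns an empty list when no link is recorded.
    """
    relations = work.get("relation", {}).get("is-preprint-of", [])
    # Keep only DOI-typed identifiers; lowercase them because DOIs are
    # case-insensitive and this makes later comparisons simpler.
    return [
        rel["id"].lower()
        for rel in relations
        if rel.get("id-type") == "doi" and "id" in rel
    ]


# Hypothetical record for illustration only: both DOIs are placeholders.
sample_work = {
    "DOI": "10.1101/000000",
    "type": "posted-content",
    "relation": {
        "is-preprint-of": [
            {"id-type": "doi", "id": "10.1234/Example.5678", "asserted-by": "subject"}
        ]
    },
}

print(journal_dois_for_preprint(sample_work))
```

Running the same function over the bioRxiv DOIs in the authors' dataset would be one way to measure how many of their links are already recorded by Crossref.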
Second, the authors use the adjective ‘published’ for articles published in a journal and the adjective ‘unpublished’ for articles posted on a preprint server. While some other authors also use this terminology, I find it highly confusing. Preprints are publicly available articles, so by definition a preprint is a published article, not an unpublished one. Preprints are increasingly recognized as published works, for instance in the publish-review-curate model, which is gaining popularity. In this model, an article is first published on a preprint server and then peer reviewed. My recommendation is to adopt less confusing terminology. For instance, ‘published’ can be replaced by ‘published in a journal’, and ‘unpublished’ can be replaced by ‘posted on a preprint server’, or something along these lines.
A minor comment relates to the two criteria in step 1 of the authors’ methodology. Consider a preprint that has three versions, one posted in 2021, one posted in 2022, and one posted in 2023. According to the two criteria in step 1, this preprint is not retained. However, a preprint with two versions, one posted in 2021 and one posted in 2023, is retained under the second criterion in step 1. I don’t understand why these two preprints are handled differently.
Competing interests
The author declares that they have no competing interests.
Use of Artificial Intelligence (AI)
The author declares that they did not use generative AI to come up with new ideas for their review.