In this paper, the authors present a dataset composed of TSV documents and a codebase that implements the methods used to create it, representing a reduced version of the OpenAIRE Graph. The goal of creating this dataset is to provide potential users with minimal bibliographic information about the publications included in OpenAIRE and their associated citations.
My comments, organised around the primary points to consider when reviewing this kind of paper, are presented below.
The paper presents a brief methodological section describing the overall activity undertaken to create the dataset and the related outputs. The dataset is reproducible because the authors have published the codebase used to create it, which allows the entire dataset-creation procedure to be rerun. In addition, the codebase permits the generation of new versions of the dataset from future dumps of the OpenAIRE Graph.
One aspect that should be described in more detail, though, is how the paper links to the codebase. In particular, the description of each step should also mention the Python file that implements it in the codebase. Similarly, the description of the Quality control phase should report the Python file used to conduct it and the type of report it produces, so that users can determine whether everything went well.
The dataset is correctly described.
This section includes a brief paragraph outlining the communities potentially interested in reusing the dataset for several distinct purposes. However, it would be important to showcase, in the paper and in more detail in the codebase, a basic set of instructions for loading and working with the dataset using Pandas, the primary Python library the authors recommend for working with the data. In particular, it would be essential to show how to load it using the int32 representation, which reduces the memory needed to process it.
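To illustrate the kind of instructions meant here, the following is a minimal sketch of loading a TSV table with 32-bit integer columns in Pandas. The column names ("citing", "cited") and the inline sample are hypothetical stand-ins, since the actual file names and column names are those defined in the authors' Zenodo record.

```python
import io

import pandas as pd

# Hypothetical two-column citation table; the real column names may differ.
SAMPLE_TSV = "citing\tcited\n1\t2\n1\t3\n2\t3\n"


def load_citations(source):
    """Read a tab-separated citation table, storing both columns as int32."""
    return pd.read_csv(
        source,
        sep="\t",
        dtype={"citing": "int32", "cited": "int32"},
    )


# Load once with explicit int32 columns and once with the inferred default (int64).
citations = load_citations(io.StringIO(SAMPLE_TSV))
default = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")

int32_bytes = citations.memory_usage(index=False).sum()
default_bytes = default.memory_usage(index=False).sum()

print(citations.dtypes)
print(f"int32: {int32_bytes} bytes, default: {default_bytes} bytes")
```

With a real file, the path to the TSV would replace the `io.StringIO` object. Declaring the type at load time matters because Pandas otherwise infers 64-bit integers, which doubles the memory used by those columns.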
The repository used (i.e. Zenodo) is appropriate.
The data are deposited under a CC-BY license. However, the software, which is included in the Zenodo dataset, should be made available under an appropriate open source software license (see https://opensource.org/licenses) to enable its reuse. Thus, the Zenodo dataset should state the license for the software, and the Git repository should also include a "LICENSE" file with the related license description.
The formats used are all open.
The description provided in the Zenodo record is complete, with sufficient detail to understand the content of the data. However, it is not clear whether specific columns can contain more than one value in their cells, which is possible for certain dimensions (e.g. authors). In those cases, it would be important to also specify the format that such cells may have.
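To show why the cell format needs to be documented, here is a minimal sketch of how a user would have to parse a multi-valued column. It assumes, purely for illustration, that values are separated by a semicolon; the separator, the column names, and the sample rows are all hypothetical and not taken from the dataset.

```python
import io

import pandas as pd

# Invented sample: "authors" may hold zero, one, or several values per cell.
SAMPLE_TSV = "id\tauthors\n1\tRossi, M.; Bianchi, L.\n2\tSmith, J.\n3\t\n"


def split_multivalued(frame, column, sep=";"):
    """Return a copy of frame where each cell of column is a list of values."""
    out = frame.copy()
    out[column] = [
        [part.strip() for part in cell.split(sep) if part.strip()]
        for cell in out[column].fillna("")
    ]
    return out


records = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t", dtype={"id": "int32"})
parsed = split_multivalued(records, "authors")

# One row per (record, author) pair; records without authors keep a single empty row.
exploded = parsed.explode("authors")

print(parsed["authors"].tolist())
print(exploded)
```

Without a documented separator, a user cannot tell whether the comma in "Rossi, M." separates two authors or belongs to a single name, which is why the format should be stated explicitly.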
All software developed for creating the datasets has been included in the Zenodo record.
Not applicable.
There are some passages in the text and in the dataset that need to be clarified appropriately:
The claim that there are no works explaining the publication coverage between OpenAIRE and the other open datasets is not entirely true, since there are works (e.g. https://doi.org/10.1007/978-3-031-65794-8_11) that analysed this aspect. I would suggest extending the paper's literature review in this respect.
In the "Example of reduction of a relation" section, the authors show which information from the original data is preserved. However, they do not explain what information is discarded. This description is necessary to enable potential users to determine whether the discarded information may be useful for their purposes (in which case they need to use the original OpenAIRE Graph) or not (in which case the reduced version proposed by the authors is sufficient).
The field "doi" of the final table is delicate since, in the OpenAIRE Graph, it is entirely possible that more than one DOI is assigned to the same record. In these cases, what approach is adopted to choose which DOI to keep? Otherwise, if all DOIs are actually preserved, what is the format of that column used to specify more than one DOI?
What is the rationale for not including other identifiers in the final table (PMID, arXiv ID, etc.)? These identifiers may be of fundamental importance for the dataset, in particular when we want to mash it up with information from other sources and need to run deduplication mechanisms based on persistent identifiers to recognise duplicate records. Using only the DOI seems to privilege the records that have been registered with DOI agencies, which is not the case for all the publication records included in OpenAIRE (and in other open data providers), such as books, to give one example. This is particularly relevant if we consider that the "openaireId" is not a persistent identifier and that, in subsequent OpenAIRE Graph dumps, the "openaireId" for a publication may change.
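The effect on deduplication can be sketched as follows. This is only an illustration of identifier-based matching between two sources, not a description of any mechanism used by the authors or by OpenAIRE; the function name, the record keys, and all identifier values are invented.

```python
def match_records(left, right, id_fields=("doi", "pmid")):
    """Map each left record key to a right record key sharing an identifier."""
    # Index the right-hand records by (identifier type, normalised value).
    index = {}
    for rec in right:
        for field in id_fields:
            value = rec.get(field)
            if value:
                index[(field, value.lower())] = rec["key"]

    # Look up each left-hand record, stopping at its first shared identifier.
    matches = {}
    for rec in left:
        for field in id_fields:
            value = rec.get(field)
            if value and (field, value.lower()) in index:
                matches[rec["key"]] = index[(field, value.lower())]
                break
    return matches


# Invented records: the second one on each side has a PMID but no DOI.
reduced = [
    {"key": "oa::1", "doi": "10.1234/example.1", "pmid": None},
    {"key": "oa::2", "doi": None, "pmid": "900001"},
]
external = [
    {"key": "ext::A", "doi": "10.1234/EXAMPLE.1", "pmid": None},
    {"key": "ext::B", "doi": None, "pmid": "900001"},
]

doi_only = match_records(reduced, external, id_fields=("doi",))
doi_and_pmid = match_records(reduced, external)

print(doi_only)
print(doi_and_pmid)
```

In this toy case, the record without a DOI can only be recognised as a duplicate when a second identifier is available, which is the situation the reduced table would currently leave unresolved.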
The author declares that they have no competing interests.
The author declares that they did not use generative AI to come up with new ideas for their review.