PREreview of Making the complete OpenAIRE citation graph easily accessible through compact data representation

by Silvio Peroni

Published: 4 March 2026
DOI: 10.5281/zenodo.18869049
License: CC BY 4.0

In this paper, the authors present a dataset composed of TSV documents and a codebase that implements the methods used to create it, representing a reduced version of the OpenAIRE Graph. The goal of creating this dataset is to provide potential users with minimal bibliographic information about the publications included in OpenAIRE and their associated citations.

The comments below address the primary points to keep in mind when reviewing this kind of paper.

The paper contents

1. The methods section of the paper must provide sufficient detail that a reader can understand how the dataset was created, and would within reason be able to recreate it.

The paper presents a brief methodological section describing the overall activity undertaken to create the dataset and the related outputs. The dataset is reproducible because the authors have published the codebase used to create it, which allows the entire dataset-creation procedure to be rerun. The same codebase also makes it possible to generate new versions of the dataset from future dumps of the OpenAIRE Graph.

One aspect that should be detailed further in the paper, though, is its interlinking with the codebase. In particular, the description of each step should also mention the Python file that implements it in the codebase. Similarly, the description of the Quality control phase should report the Python file used to conduct it and the type of report it produces, so that a reader can determine whether everything went well.

2. The dataset must be correctly described.

The dataset is correctly described.

3. The reuse section must provide concrete and useful suggestions for reuse of the data.

This section includes a brief paragraph outlining the communities that may be interested in reusing the dataset for several distinct purposes. However, it would be important to showcase, in the paper and in more detail in the codebase, a basic set of instructions for loading and working with the dataset using Pandas, the primary Python library the authors recommend for working with the created data. In particular, it is essential to show how to load the data using the int32 representation, which reduces the memory needed to process it.
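As an illustration of the kind of instructions I have in mind, here is a minimal sketch of loading a TSV file with Pandas while forcing 32-bit integers. The column names `citing` and `cited` and the inline sample data are assumptions made for this example; they are not taken from the dataset, whose actual file names and headers should be used instead.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-column citation table; the real files and headers may differ.
sample_tsv = "citing\tcited\n1\t2\n1\t3\n4\t2\n"

# Declaring the dtype up front avoids pandas' default 64-bit integer columns.
citations = pd.read_csv(
    io.StringIO(sample_tsv),  # replace with the path to the actual TSV file
    sep="\t",
    dtype={"citing": np.int32, "cited": np.int32},
)

# The same table loaded without a dtype hint, for comparison.
default = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

int32_bytes = citations.memory_usage(index=False).sum()
default_bytes = default.memory_usage(index=False).sum()
print(citations.dtypes)
print(f"int32 columns: {int32_bytes} bytes, default columns: {default_bytes} bytes")
```

Passing `dtype` to `read_csv` applies the narrower type while parsing, so the 64-bit version of the columns does not have to be created first and then converted.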

The deposited data

1. The repository in which the data is deposited must be suitable for this subject and have a sustainability model (see our list of recommended repositories).

The repository used (i.e. Zenodo) is appropriate.

2. The data must be deposited under an open license that permits unrestricted access (e.g. CC0, CC-BY).

The data are deposited under CC-BY. However, the software included in the Zenodo dataset should be released under an appropriate open source software license (see https://opensource.org/licenses) to enable its reuse. Thus, the Zenodo dataset should state the license for the software, and the Git repository should include a "LICENSE" file with the text of that license.

3. The deposited data must include a version that is in an open, non-proprietary format.

The formats used are all open.

4. The deposited data must have been labelled in such a way that a 3rd party can make sense of it (e.g. sensible column headers, descriptions in a readme text file).

The description provided in the Zenodo record is complete, with sufficient detail to understand the content of the data. However, it is not clear whether specific columns can contain more than one value per cell, which is possible for certain dimensions (e.g. authors). In those cases, it would be important to specify the format that such cells may have.

5. The deposited data must be actionable - i.e. if a specific script or software is needed to interpret it, this should also be archived and accessible.

All software developed for creating the datasets has been included in the Zenodo record.

6. Research involving human subjects, human material, or human data, must have been performed in accordance with the Declaration of Helsinki. Where applicable, the studies must have been approved by an appropriate ethics committee. The identity of the research subject should be anonymised whenever possible. For research involving human subjects informed consent to participate in the study must be obtained from participants (or their legal guardian).

Not applicable.

Additional comment

There are some passages in the text and in the dataset that need to be clarified:

The claim that there are no works explaining the publication coverage between OpenAIRE and the other open datasets is not entirely accurate, since there are works (e.g. https://doi.org/10.1007/978-3-031-65794-8_11) that analysed this aspect. I would suggest extending the paper's literature review in this respect.
In the "Example of reduction of a relation" section, the authors show which information from the original data is preserved. However, they do not explain what information is discarded. This description is necessary to enable potential users to determine whether the discarded information may be useful for their purposes (in which case they need to use the original OpenAIRE Graph) or not (in which case the reduced version proposed by the authors is sufficient).
The field "doi" of the final table needs care since, in the OpenAIRE Graph, it is entirely possible that more than one DOI is assigned to the same record. In these cases, what approach is adopted to choose which DOI to keep? Alternatively, if all DOIs are preserved, what format does that column use to specify more than one DOI?
What is the rationale for not including other identifiers (PMID, arXiv Id, etc.) in the final table? These identifiers may be of fundamental importance for the dataset, in particular when users want to combine it with information from other sources and need to run deduplication mechanisms based on persistent identifiers to recognise duplicate records. Using only DOIs privileges those records that have been registered with DOI agencies, which is not the case for all the publication records included in OpenAIRE (and in other open data providers), such as books, to give one example. This is particularly relevant if we consider that the "openaireId" is not a persistent identifier and that, in subsequent OpenAIRE Graph dumps, the "openaireId" for a publication may change.
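
To make the deduplication concern concrete, here is a minimal sketch of grouping records that share any persistent identifier. The records, their identifier values, and the function names are invented for illustration and do not come from the dataset or from OpenAIRE; the approach (a simple union-find over shared identifiers) is only one possible way to do this.

```python
from collections import defaultdict

# Hypothetical records: all identifier values are made up for this example.
records = [
    {"id": "a", "doi": "10.1234/x", "pmid": None},
    {"id": "b", "doi": None, "pmid": "111"},
    {"id": "c", "doi": "10.1234/X", "pmid": "111"},
    {"id": "d", "doi": None, "pmid": None},
]


def find(parent, x):
    """Return the representative of x's group, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_duplicates(records, schemes=("doi", "pmid")):
    """Group record ids that share at least one identifier value."""
    parent = {r["id"]: r["id"] for r in records}
    first_seen = {}  # (scheme, normalised value) -> first record id seen
    for r in records:
        for scheme in schemes:
            value = r.get(scheme)
            if value is None:
                continue
            key = (scheme, value.lower())  # DOIs are case-insensitive
            if key in first_seen:
                # Merge this record's group with the earlier record's group.
                parent[find(parent, r["id"])] = find(parent, first_seen[key])
            else:
                first_seen[key] = r["id"]
    groups = defaultdict(list)
    for r in records:
        groups[find(parent, r["id"])].append(r["id"])
    return sorted(groups.values())


all_ids = group_duplicates(records)
doi_only = group_duplicates(records, schemes=("doi",))
print(all_ids)
print(doi_only)
```

In this toy example, record "b" has no DOI, so it can be linked to "a" and "c" only when the PMID is also available; with the DOI alone it remains a separate record.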

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they did not use generative AI to come up with new ideas for their review.
