PREreview of Exploring the dynamics of external and self-citations and their role in shaping scientific impact

by Silvio Peroni

Published: May 31, 2025
DOI: 10.5281/zenodo.15564746
License: CC BY 4.0

In this article, the authors analyse the citation network extracted from DBLP to measure the preferential attachment rule (PAR) and understand its influence on citation distributions. In addition, they conduct a similar analysis considering the self-citations in the dataset and demonstrate that, in this context, PAR does not appear to be the primary criterion justifying the self-citation distribution. Some comments aimed at improving the article's content and clarifying certain passages are listed below.

Section Introduction

There is a mention about self-citations without explaining directly what they are. It would be appropriate to make some examples (that are present in the following part of the paper) since the beginning (author self-citation, journal self-citation, etc.), and clarify what kind of self-citation networks the present study (as well as those cited) are actually focused on (e.g. author self-citation).
The authors say that the citation network is usually scale-free. It would be good, for readers who do not already know what it means, to briefly explain what a scale-free network is.
This is where DBLP is first mentioned. DBLP is a well-known and curated dataset which contains information about Computer Science literature. However, the fact that it is mono-disciplinary is an aspect that should be highlighted in the paper from the beginning. Indeed, Computer Science may be governed by specific rules that may differ from those of other disciplines. In this respect, the title should also be revised by adding a particular reference to this domain, otherwise it may seem that the result defined here holds for the entire scholarly ecosystem, even if they have been analysed in a specific discipline.

Section Methodology

In the methodology section, the authors explicitly mention measuring when publications receive citations, and this is measured by looking at the publication date of the various citing and cited entities involved in citations. It is important to clarify, in particular in the dataset considered, what the publication date is. Indeed, in the current scholarly system, we do have several publication dates available (accepted, first online, published in issue, preprint publication, etc.), and it would be essential to understand to what dates the authors refer to, also to explain some of the situations that may happen (e.g. starting to accumulate more citation in the same year of the publication).
There is a clear explanation about what self-citations are, but it is too late in the paper, since the authors have already discussed that topic a lot previously. It would be crucial to move such an explanation before. In addition, among the possible kinds of self-citations, there are listed the coerced self-citations, which are indeed an essential issue in the scholarly ecosystem, and they are difficult to catch because often the reviewers' names are not disclosed. However, there are forms of peer review (e.g. non-anonymous peer reviews or, more generally, open peer review systems) that would decrease the issue in principle. It would be important to discuss a bit about this aspect where appropriate (e.g. in the discussion of the paper).
The DBLP dataset used, which focuses exclusively on Computer Science research, seems to be a bit old compared to the current status of the collection. In particular, currently (as of 31 May 2025), DBLP contains more than 7 million articles authored by more than 3.7 million authors that are manually curated (and disambiguated) by the DBLP team. For citation links, one could use a bunch of additional services (e.g. OpenCitations [disclosure: I am director of this infrastructure], OpenAIRE, OpenAlex) to retrieve them updated to date. Thus, my question is: why did the authors not use a more recent dataset?
It is not clear whether the citations used are only between citing and cited entities included in DBLP or if there are entities that are not included in DBLP for some reason. In this case, how did the author's disambiguation process work?
The authors mention the concept of external citation several times, but it is not entirely clear what it is. Is it a citation that is not a self-citation? Is it a citation that involves (either as citing or cited entity) a publication which is not in DBLP?

Section Results

It would be good to understand how much the papers/citations considered are distributed within the publication years of the dataset. In practice, can we have some basic overview of the dataset, e.g. which is the time frame of the publications represented there, which kinds of publications are considered (currently DBLP contains books, conference proceedings, editorials, informal publications such as preprints, journal articles, book chapters, and reference works).
The authors claim that many bibliometric indicators, while controversial, remain important factors in research assessments (e.g. promotions and funding allocation). However, currently the community around Open Science and research assessment is moving towards a more peer-review-oriented mechanism for research evaluation, which enables one to use metrics and indicators, but as a support for subjective decision-making (e.g. see CoARA, https://coara.eu/). It would be good to have some passages to contextualise the original authors' claim within this environment.
The claim that the 3DSI model appears to underestimate the influence of preferential attachment is not extremely evident from reading the related picture. Could the authors add a few lines of explanation?
The authors mention that the study was made possible through the use of the DBLP dataset, which has also been used in other studies. However, the dataset is clearly mono-disciplinary, being related to Computer Science research. Thus, I would avoid making general claims (i.e. associated with any scholarly discipline) based on it, since the way these dynamics work may differ significantly depending on the community of reference used. The results obtained surely apply to Computer Science, but they do not necessarily apply to, e.g., Social Science research or Philological research. Do the other relevant studies using it also make such general claims?
Informal passages such as "Let those without sin cast the first stone..." should be removed from the article.
In the additional experiment, the authors said that they have assumed that external citations are distributed according to the preferential attachment rule, based on the result obtained in their study with the DBLP dataset. However, such a hypothesis is applied to a dataset that includes publications in Nature that, even if it contains a few Computer Science papers, mainly focuses on different kinds of scholarly disciplines. Thus, the authors should provide a convincing argument to use such a hypothesis.

Sections Code Availability and Data Availability

Both code and data should be appropriately cited, not only mentioned. In particular, it would be essential to have accurate records in the reference section, possibly with a DOI or SWHID specified for the software developed for the study.

PREreview of Exploring the dynamics of external and self-citations and their role in shaping scientific impact

Section Introduction

Section Methodology

Section Results

Sections Code Availability and Data Availability

Competing interests

Comments