PREreview of The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models

Published: May 5, 2025
DOI: 10.5281/zenodo.15339016
License: CC BY 4.0

This study investigates topological properties of prominent biomedical knowledge graphs and explores how these properties affect the edge prediction performance of embedding models. The paper is well written, although its heavy reliance on mathematical notation to explain key concepts, with limited accompanying prose, makes it hard to approach.

Several aspects of the paper stand out as laudable, including the sharing of a preprint; the availability of an open-source Python toolkit, with documentation that includes example analysis code and an API reference; and the deposition of data and pre-processing code for each knowledge graph. The source code and datasets are well documented. The presentation of data and findings through figures and tables is thorough, aesthetic, and plentiful.

I expect this study to serve as a popular reference on topological metrics and analyses of common biomedical knowledge graphs. It also provides some interesting findings on the effect of topological properties and how they influence edge prediction, although more work is needed to produce actionable recommendations on how to address these issues and whether they reflect confounding, causality, or other correlation. However, these questions can be left to future work.

As an author of Hetionet, I am delighted to see the thorough third-party evaluation and comparison of Hetionet with other knowledge graphs.

Feedback

Is the complete analysis code available, i.e., the code to run all experiments and produce all visualizations? If not, why not, and would it be possible to publish it to further strengthen this work?

The "Knowledge Graph Topological Properties" section is perhaps the most critical section readers must understand to appreciate the methodology and findings. Currently, the mathematical notation is precise but difficult for many biomedical researchers to understand. Figures 1 and 2 are much more approachable, although the caption of Figure 1 could be more verbose. How should the reader conceptualize r versus r'?

Also helpful in this section would be more of a prose-based description of the edge topological patterns. The introduction touches on symmetry, inference, and composition, but not inverse. Without dictating exactly where this should appear in the manuscript, I am interested in a more in-depth discussion of the four edge patterns. Why and when do they occur in biomedical knowledge graphs? The symmetry and inverse patterns seem to primarily arise from data modeling decisions, while the inference and composition patterns appear to arise more from biological association. Are all symmetric relations inherently symmetric? Further discussion of several relation types that exhibit each of these patterns and what the relations mean in that context would help clear up confusion.
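To illustrate the kind of plain-language framing I have in mind, here is a minimal sketch of how the four patterns could be detected for each triple. The definitions encoded below are my own simplified reading (symmetric: the same relation type also runs from tail to head; inverse: a different relation type r' runs from tail to head; inference: a different relation type r' runs from head to tail; composition: a two-hop path from head to tail exists through a third node). They are assumptions and may differ from the manuscript's formal definitions, which take precedence. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def edge_patterns(triples):
    """Map each (head, relation type, tail) triple to the set of patterns it shows.

    Uses simplified, assumed definitions; not the manuscript's implementation.
    """
    triples = set(triples)
    rels_between = defaultdict(set)  # (head, tail) -> relation types on that ordered pair
    successors = defaultdict(set)    # head -> nodes reachable in one hop
    for h, r, t in triples:
        rels_between[(h, t)].add(r)
        successors[h].add(t)

    patterns = {}
    for h, r, t in triples:
        found = set()
        reverse = rels_between.get((t, h), set())
        if r in reverse and h != t:
            found.add("symmetric")    # (t, r, h) also exists
        if reverse - {r}:
            found.add("inverse")      # (t, r', h) exists with r' != r
        if rels_between[(h, t)] - {r}:
            found.add("inference")    # (h, r', t) exists with r' != r
        if any(x not in (h, t) and t in successors.get(x, ())
               for x in successors[h]):
            found.add("composition")  # (h, r1, x) and (x, r2, t) exist
        patterns[(h, r, t)] = found
    return patterns


# Hypothetical toy graph; node and relation type names are invented.
toy_triples = [
    ("geneA", "interacts", "geneB"),
    ("geneB", "interacts", "geneA"),
    ("drugX", "treats", "diseaseY"),
    ("diseaseY", "treated_by", "drugX"),
    ("drugX", "binds", "geneA"),
    ("geneA", "associates", "diseaseY"),
    ("drugX", "palliates", "diseaseY"),
]
patterns = edge_patterns(toy_triples)
for triple in toy_triples:
    print(triple, sorted(patterns[triple]))
```

In this toy example, the symmetric and inverse patterns exist only because the same relationship was stored twice, whereas the inference and composition patterns reflect distinct relationships among the same entities. That distinction is what I would like the manuscript to discuss in prose.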

The study is mostly consistent in using "relation type" to refer to edge type and "relation" to refer to a specific edge. However, "head out-degree of same relation" and "tail in-degree of same relation" omit "type".
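To show why the missing "type" matters, here is a minimal sketch of how I read these two quantities: for a triple (h, r, t), the number of edges of relation type r leaving h, and the number of edges of relation type r entering t. This reading is an assumption on my part, and the function name and toy data are hypothetical.

```python
from collections import Counter


def same_relation_type_degrees(triples):
    """Return {triple: (head out-degree, tail in-degree)}, counting only
    edges that share the triple's relation type (assumed reading)."""
    head_out = Counter((h, r) for h, r, _ in triples)
    tail_in = Counter((t, r) for _, r, t in triples)
    return {(h, r, t): (head_out[(h, r)], tail_in[(t, r)])
            for h, r, t in triples}


# Hypothetical toy graph; names are invented for illustration.
degree_triples = [
    ("drugX", "treats", "disease1"),
    ("drugX", "treats", "disease2"),
    ("drugX", "binds", "gene1"),
    ("drugZ", "treats", "disease1"),
]
degrees = same_relation_type_degrees(degree_triples)
print(degrees[("drugX", "treats", "disease1")])
```

Here drugX has three outgoing edges overall but only two of type "treats", so the count is tied to a relation type rather than to a single edge.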

The manuscript states: "we observe an improved accuracy when the counterpart edge (e.g., the reverse edge for symmetric triples) has been seen during training."

If the presence of symmetric edges is purely a data modeling decision (i.e., two directed edges are used to represent a single underlying relationship that is undirected in nature), then isn't this indicative of improper train-test partitioning? I would imagine proper partitioning should include both directions of the symmetric edge in the same partition. Am I missing something?
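To make the partitioning concern concrete, here is a minimal sketch of a split that assigns both directions of a same-type edge to the same partition. It is not the authors' procedure. It assumes triples are (head, relation type, tail) tuples with mutually comparable node identifiers such as strings, and the function name, parameters, and toy data are hypothetical.

```python
import random
from collections import defaultdict


def split_keeping_reverse_pairs(triples, test_fraction=0.2, seed=0):
    """Split triples so that (h, r, t) and (t, r, h) never straddle train/test.

    The test fraction applies to groups of edges, not to individual triples.
    """
    groups = defaultdict(list)
    for h, r, t in triples:
        # Same key for both directions: relation type plus the unordered node pair.
        groups[(r, *sorted((h, t)))].append((h, r, t))

    keys = sorted(groups)              # fixed order before the seeded shuffle
    random.Random(seed).shuffle(keys)
    n_test = round(len(keys) * test_fraction)
    test = [tr for k in keys[:n_test] for tr in groups[k]]
    train = [tr for k in keys[n_test:] for tr in groups[k]]
    return train, test


# Hypothetical toy graph; names are invented for illustration.
split_triples = [
    ("geneA", "interacts", "geneB"),
    ("geneB", "interacts", "geneA"),
    ("geneC", "interacts", "geneD"),
    ("geneD", "interacts", "geneC"),
    ("drugX", "treats", "diseaseY"),
    ("drugX", "binds", "geneA"),
    ("drugZ", "treats", "diseaseY"),
]
train_triples, test_triples = split_keeping_reverse_pairs(
    split_triples, test_fraction=0.4
)
print(len(train_triples), len(test_triples))
```

This sketch only groups reverse edges of the same relation type. Inverse relation types (r versus r') would need an explicit mapping between types to be grouped in the same way.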

A recent study, on which I am a coauthor, explores the probability of edge existence based on node degree and how this relates to edge prediction performance:

The probability of edge existence due to node degree: a baseline for network-based predictions. Michael Zietz, Daniel Himmelstein, Kyle Kloster, Christopher Williams, Michael Nagle, Casey Greene. GigaScience (2024-02-07). https://doi.org/gtcbks. DOI: 10.1093/gigascience/giae001 · PMID: 38323677 · PMCID: PMC10848215

This study does not apply embedding-model-based predictions, nor does it evaluate the same topological properties as the current work. However, I do see potential relevance to the discussion of the drastic effect of node degree on prediction performance. I draw the authors' attention to this study but, given my obvious conflict of interest, offer this comment only in case the authors have not already considered it for citation. The authors' ultimate decision on whether to incorporate any aspect of this reference in their own study will have no influence on any additional review or recommendation I make for the current work.

In conclusion, this study is an important work that combines topological patterns and embedding-based edge predictions into a novel framework and analysis with substantial reporting of helpful graphics for practitioners.

Signed, Dr. Daniel Himmelstein

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they used generative AI to come up with new ideas for their review.
