Instruction-tuned extraction of virus-host interactions from integrated scientific evidence
- Posted
- Server: bioRxiv
- DOI: 10.1101/2025.09.02.673691
Motivation
Viral infectious diseases continue to pose a major threat to global health. Understanding protein-protein interactions (PPIs) and RNA-protein interactions (RPIs) between viruses and their hosts is essential for elucidating infection mechanisms. However, manual curation of these interactions from the biomedical literature is inefficient, creating a pressing need for automated, scalable extraction methods. Large language models (LLMs), such as the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), offer promising solutions. Yet most existing datasets focus on abstracts, overlooking other information-rich sections. We aim to develop a data-efficient approach to extract virus-host interaction (VHI) entities from full-text biomedical articles, including the Results and Methods sections and tables. To our knowledge, this is the first study to apply instruction tuning to full-text VHI extraction.
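To make the task concrete, the sketch below shows what a single instruction-tuning example for VHI entity extraction could look like. The field names, prompt wording, and the interaction itself are illustrative assumptions, not the paper's actual template or data.

```python
# A minimal sketch of one instruction-tuning example for VHI extraction.
# The schema (instruction/input/output) and the prompt wording are
# assumptions for illustration; the paper's template is not reproduced here.
import json

example = {
    "instruction": (
        "Extract all virus-host protein-protein interaction (PPI) pairs "
        "mentioned in the passage. Return a JSON list of "
        "[viral_protein, host_protein] pairs."
    ),
    "input": (
        "Co-immunoprecipitation showed that the viral nucleocapsid "
        "protein N binds the host RNA helicase DDX1."  # illustrative passage
    ),
    "output": json.dumps([["N", "DDX1"]]),
}

print(json.dumps(example, indent=2))
```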
Results
We curated a dataset containing 3,395 PPI and 674 RPI entities from the Results and Materials and Methods sections, along with 566 PPIs and 793 RPIs from tables. Under low-resource conditions (<500 training examples), our instruction-tuned ChatMed-VHI model achieved the best overall performance (F1: 89.7%, Precision: 95.3%), outperforming PubMedBERT (F1: 74.6%, Precision: 75.1%). When scaled to the full dataset (>4,000 training examples), ChatMed-VHI maintained the highest overall performance, although PubMedBERT achieved slightly higher precision (92.3% vs. 91.3%). Notably, with more training data ChatMed-VHI improved F1 and recall by 2.79% while its precision dropped by 4.20%, whereas PubMedBERT improved consistently across all metrics. These results demonstrate the effectiveness of instruction-tuned LLMs for full-text biomedical extraction tasks and position ChatMed-VHI as a scalable, domain-adaptable solution for VHI mining.
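For readers unfamiliar with how such scores are computed for extraction tasks, the sketch below shows a strict set-level evaluation over predicted versus gold interaction pairs. The exact-match criterion and the toy pairs are assumptions for illustration; the paper's precise matching rules may differ.

```python
# A minimal sketch of strict set-level precision/recall/F1 for extracted
# interaction pairs; exact string matching is an assumption here.
def prf1(gold: set[tuple[str, str]], pred: set[tuple[str, str]]):
    """Return (precision, recall, F1) for exact-match pair extraction."""
    tp = len(gold & pred)                        # correctly extracted pairs
    precision = tp / len(pred) if pred else 0.0  # fraction of predictions correct
    recall = tp / len(gold) if gold else 0.0     # fraction of gold pairs found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted pairs, for illustration only.
gold = {("N", "DDX1"), ("NSP1", "RPL18")}
pred = {("N", "DDX1"), ("NSP1", "RPS3")}
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```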