
Instruction-tuned extraction of virus-host interactions from integrated scientific evidence

Server: bioRxiv
DOI: 10.1101/2025.09.02.673691

Motivation

Viral infectious diseases continue to pose a major threat to global health. Understanding protein-protein interactions (PPIs) and RNA-protein interactions (RPIs) between viruses and hosts is essential for elucidating infection mechanisms. However, manual curation of these interactions from the biomedical literature is inefficient, creating a pressing need for automated, scalable extraction methods. Large language models (LLMs), such as the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), offer promising solutions. Yet most existing datasets focus on abstracts, overlooking other information-rich sections. We aim to develop a data-efficient approach to extract virus-host interaction (VHI) entities from full-text biomedical articles, including the Results and Methods sections and tables. To our knowledge, this is the first study to apply instruction tuning to full-text VHI extraction.
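Concretely, instruction tuning recasts entity extraction as a prompted text-to-text task: each training example pairs a natural-language instruction and a passage with the expected structured output. The sketch below illustrates what such a record might look like; the schema, prompt wording, and example passage are illustrative assumptions, not the actual ChatMed-VHI training format.

```python
# A minimal, hypothetical instruction-tuning record for VHI extraction.
# Field names, prompt wording, and the example passage are illustrative
# assumptions; they are not taken from the ChatMed-VHI dataset.
vhi_record = {
    "instruction": (
        "Extract every virus-host protein-protein interaction (PPI) "
        "mentioned in the passage. Report each pair as "
        "'virus protein -- host protein'."
    ),
    "input": (
        "Co-immunoprecipitation showed that the SARS-CoV-2 nucleocapsid "
        "protein binds the host RNA helicase DDX1 in infected cells."
    ),
    "output": "nucleocapsid protein -- DDX1",
}
```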

Results

We curated a dataset containing 3,395 PPI and 674 RPI entities from the Results and the Materials and Methods sections, along with 566 PPIs and 793 RPIs from tables. Under low-resource conditions (<500 training examples), our instruction-tuned ChatMed-VHI model achieved the best overall performance (F1: 89.7%, Precision: 95.3%), outperforming PubMedBERT (F1: 74.6%, Precision: 75.1%). When scaled to the full dataset (>4,000 training examples), ChatMed-VHI maintained the highest overall performance, while PubMedBERT achieved slightly higher precision (92.3% vs. 91.3%). Notably, with more training data, ChatMed-VHI's F1 and recall improved by 2.79% while its precision dropped by 4.20%, whereas PubMedBERT improved consistently across all metrics. These results demonstrate the effectiveness of instruction-tuned LLMs for full-text biomedical extraction tasks and position ChatMed-VHI as a scalable, domain-adaptable solution for VHI mining.
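For reference, the reported scores follow the standard precision/recall/F1 definitions, so the recall implied by any (precision, F1) pair can be recovered from the harmonic-mean identity F1 = 2PR / (P + R). Below is a minimal sketch using the low-resource ChatMed-VHI numbers above; it uses the standard formulas only and is not the authors' evaluation code.

```python
def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for recall R, given precision P."""
    return precision * f1 / (2 * precision - f1)

# Low-resource ChatMed-VHI scores reported above: P = 95.3%, F1 = 89.7%.
print(f"Implied recall: {implied_recall(0.953, 0.897):.1%}")  # ~84.7%
```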
