Instruction-tuned extraction of virus-host interactions from integrated scientific evidence
- Posted
- Server: bioRxiv
- DOI: 10.1101/2025.09.02.673691
Motivation
Viral infectious diseases continue to pose a major threat to global health. Understanding protein-protein interactions (PPIs) and RNA-protein interactions (RPIs) between viruses and their hosts is essential for elucidating infection mechanisms. However, manual curation of these interactions from the biomedical literature is inefficient, creating a pressing need for automated, scalable extraction methods. Large language models (LLMs), such as the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), offer promising solutions. Yet most existing datasets focus on abstracts, overlooking other information-rich sections. We aim to develop a data-efficient approach to extract virus-host interaction (VHI) entities from full-text biomedical articles, including the Results and Methods sections and tables. To our knowledge, this is the first study to apply instruction tuning to full-text VHI extraction.
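To make the task concrete, the sketch below shows what a single instruction-tuning example for VHI entity extraction could look like. The field names, prompt wording, and the interaction itself are illustrative assumptions, not the paper's actual template or data.

```python
# A minimal sketch of one instruction-tuning example for VHI extraction.
# The schema (instruction/input/output) and the prompt wording are
# assumptions for illustration; the paper's template is not reproduced here.
import json

example = {
    "instruction": (
        "Extract all virus-host protein-protein interaction (PPI) pairs "
        "mentioned in the passage. Return a JSON list of "
        "[viral_protein, host_protein] pairs."
    ),
    "input": (
        "Co-immunoprecipitation showed that the viral nucleocapsid "
        "protein N binds the host RNA helicase DDX1."  # illustrative passage
    ),
    "output": json.dumps([["N", "DDX1"]]),
}

print(json.dumps(example, indent=2))
```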
Results
We curated a dataset containing 3,395 PPI and 674 RPI entities from the Results and Materials and Methods sections, along with 566 PPIs and 793 RPIs from tables. Under low-resource conditions (<500 training examples), our instruction-tuned ChatMed-VHI model achieved the best overall performance (F1: 89.7%, Precision: 95.3%), outperforming PubMedBERT (F1: 74.6%, Precision: 75.1%). When scaled to the full dataset (>4,000 training examples), ChatMed-VHI maintained the highest overall performance, although PubMedBERT achieved slightly higher precision (92.3% vs. 91.3%). Notably, with more training data ChatMed-VHI improved F1 and recall by 2.79% while its precision dropped by 4.20%, whereas PubMedBERT improved consistently across all metrics. These results demonstrate the effectiveness of instruction-tuned LLMs for full-text biomedical extraction tasks and position ChatMed-VHI as a scalable, domain-adaptable solution for VHI mining.
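For readers unfamiliar with how such scores are computed for extraction tasks, the sketch below shows a strict set-level evaluation over predicted versus gold interaction pairs. The exact-match criterion and the toy pairs are assumptions for illustration; the paper's precise matching rules may differ.

```python
# A minimal sketch of strict set-level precision/recall/F1 for extracted
# interaction pairs; exact string matching is an assumption here.
def prf1(gold: set[tuple[str, str]], pred: set[tuple[str, str]]):
    """Return (precision, recall, F1) for exact-match pair extraction."""
    tp = len(gold & pred)                        # correctly extracted pairs
    precision = tp / len(pred) if pred else 0.0  # fraction of predictions correct
    recall = tp / len(gold) if gold else 0.0     # fraction of gold pairs found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted pairs, for illustration only.
gold = {("N", "DDX1"), ("NSP1", "RPL18")}
pred = {("N", "DDX1"), ("NSP1", "RPS3")}
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```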