
Instruction-tuned extraction of virus-host interactions from integrated scientific evidence

Published
Server: bioRxiv
DOI: 10.1101/2025.09.02.673691

Motivation

Viral infectious diseases continue to pose a major threat to global health. Understanding protein-protein interactions (PPIs) and RNA-protein interactions (RPIs) between viruses and their hosts is essential for elucidating infection mechanisms. However, manual curation of these interactions from the biomedical literature is inefficient, creating a pressing need for automated, scalable extraction methods. Large language models (LLMs), such as the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), offer promising solutions. Yet most existing datasets focus on abstracts, overlooking other information-rich sections. We aim to develop a data-efficient approach to extracting virus-host interaction (VHI) entities from full-text biomedical articles, including the Results and Methods sections and tables. To our knowledge, this is the first study to apply instruction tuning to full-text VHI extraction.
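To make the instruction-tuning setup concrete, the sketch below shows what a single training record for VHI entity extraction could look like. The field names, prompt wording, and example sentence are hypothetical placeholders, assumed for illustration; the preprint does not specify its exact prompt schema here.

```python
# A minimal, hypothetical instruction-tuning record for VHI entity extraction.
# The instruction/input/output schema and the example sentence are assumptions,
# not the authors' actual data format.
import json

record = {
    "instruction": (
        "Extract all virus-host protein-protein interaction (PPI) and "
        "RNA-protein interaction (RPI) entity pairs from the passage. "
        "Return a JSON list of {virus_entity, host_entity, type} objects."
    ),
    # Hypothetical full-text sentence (e.g., from a Results section).
    "input": (
        "Co-immunoprecipitation showed that the viral NS1 protein "
        "binds the host protein TRIM25 in infected cells."
    ),
    # Target completion: the structured entities the model should emit.
    "output": json.dumps(
        [{"virus_entity": "NS1", "host_entity": "TRIM25", "type": "PPI"}]
    ),
}

print(json.dumps(record, indent=2))
```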

Results

We curated a dataset containing 3,395 PPI and 674 RPI entities from the Results and Materials and Methods sections, along with 566 PPIs and 793 RPIs from tables. Under low-resource conditions (<500 training examples), our instruction-tuned ChatMed-VHI model achieved the best overall performance (F1: 89.7%, precision: 95.3%), outperforming PubMedBERT (F1: 74.6%, precision: 75.1%). When scaled to the full dataset (>4,000 training examples), ChatMed-VHI maintained the highest overall performance, while PubMedBERT achieved slightly higher precision (92.3% vs. 91.3%). Notably, with more training data ChatMed-VHI improved F1 and recall by 2.79% while its precision dropped by 4.20%, whereas PubMedBERT improved consistently across all metrics. These results demonstrate the effectiveness of instruction-tuned LLMs for full-text biomedical extraction tasks and position ChatMed-VHI as a scalable, domain-adaptable solution for VHI mining.
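For reference, the precision, recall, and F1 figures quoted above are the standard set-based metrics for entity extraction. A minimal sketch of how they are computed over gold and predicted entity sets follows; the entity identifiers are hypothetical placeholders.

```python
# Standard set-based precision/recall/F1 over extracted entities.
def prf1(gold: set[str], pred: set[str]) -> tuple[float, float, float]:
    tp = len(gold & pred)                               # true positives
    precision = tp / len(pred) if pred else 0.0         # TP / predicted
    recall = tp / len(gold) if gold else 0.0            # TP / gold
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)               # harmonic mean
    return precision, recall, f1

# Hypothetical entity pairs encoded as "virus|host|type" strings.
gold = {"NS1|TRIM25|PPI", "N|G3BP1|RPI"}
pred = {"NS1|TRIM25|PPI"}
print(prf1(gold, pred))  # (1.0, 0.5, 0.666...)
```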

