PREreview of LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations
- Published
- License: CC BY 4.0
Overview
The authors assemble the LSD600 corpus of 600 abstracts, of which 324 describe relations between lifestyle factors and diseases (LS–D, hence LSD relations). The remaining 276 abstracts were selected because they initially appeared to contain LSD relations, based on their presence in LSF200 and on selection by an automated named entity recognition system, "Tagger". Lifestyle factor and disease mentions and their relations were manually annotated according to eight predefined relation types.
The major contributions of the work include the manual annotation of these 600 abstracts, the definition of an LSD relation type hierarchy, and the training of a RoBERTa-based language model for relation extraction.
The LSD600 dataset will likely be most useful as a resource to train and evaluate more scalable approaches. The authors share their annotations and relation extraction model under permissive open licenses. This work is a timely contribution to a burgeoning field. The obvious next steps are applying the model to all relevant abstracts or accessible full texts, as well as grounding diseases and lifestyle factors to controlled vocabularies.
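Applying the model at scale would need a consistent way to present candidate entity pairs to the classifier. If the released model follows the common entity-marker convention for RoBERTa-style relation extraction (an assumption on my part; the marker tokens and the paper's exact input format may differ), the preprocessing could look like this sketch:

```python
def mark_entities(text, ls_span, disease_span):
    """Insert entity markers around a lifestyle-factor span and a disease
    span so a sequence classifier can locate the candidate pair.
    Spans are (start, end) character offsets; insertion proceeds
    right-to-left so earlier offsets stay valid."""
    spans = sorted(
        [(ls_span, "[LS]", "[/LS]"), (disease_span, "[DIS]", "[/DIS]")],
        key=lambda item: item[0][0],
        reverse=True,  # rightmost span first
    )
    for (start, end), open_tok, close_tok in spans:
        text = text[:start] + open_tok + text[start:end] + close_tok + text[end:]
    return text

sentence = "Cocaine use increases liver fibrosis risk."
marked = mark_entities(sentence, (0, 7), (22, 36))
print(marked)  # [LS]Cocaine[/LS] use increases [DIS]liver fibrosis[/DIS] risk.
```

The marked sentence would then be fed to the fine-tuned classifier, which predicts one of the eight relation types (or no relation) for that pair.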
Suggestions
I opened GitHub issues for any suggestions that could involve code or data revisions and additions. The authors have begun addressing some of these requests. I note them below for completeness.
EsmaeilNourani/lifestylefactors-annotation-docs#2 requests a table of the 600 abstracts included in the corpus, with several metadata fields. Such a table lets readers easily browse which abstracts are included, along with the number of annotated lifestyle factors, diseases, and relations in each.
EsmaeilNourani/lifestylefactors-annotation-docs#1 requests a table of the 1,900 manually annotated relations. This table would be the best resource for readers to familiarize themselves with the set of relations that comprise the resource.
EsmaeilNourani/lifestylefactors-annotation-docs#3 notes some small but glaring inconsistencies in relation type labels and capitalization.
The "Manual annotation process and corpus evaluation" section discusses some details of the manual curation, including the inter-annotator agreement experiment. Since the curation is a major part of the study, further details on the entire curation task would be helpful. For example, which authors performed the annotations, and how many did each do? Were annotators assigned at the abstract level?
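Stating the exact agreement metric would also help. If label-level agreement were computed with Cohen's kappa (purely an illustration; the labels below are invented and the authors may have used a different, e.g. F1-based, measure for spans), a minimal sketch:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' equal-length label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[k] / n) * (counts_b[k] / n)
        for k in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)

ann1 = ["positive", "positive", "negative", "negative"]
ann2 = ["positive", "negative", "negative", "negative"]
print(cohen_kappa(ann1, ann2))  # 0.5
```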
If I close the above-referenced GitHub issues, the editorial staff can consider that an acknowledgment that the suggestions have been adequately addressed.
Comment
For pubmed:32004098, cocaine has 3 different relations with liver fibrosis: Statistically_associated, positive_statistical_association, and NO_statistical_association. I believe these entity mentions come from the following snippet:
> No significant association was noted among HIV seronegative participants for liver fibrosis by sex differences or cocaine use. Among African Americans living with HIV, cocaine users were 1.68 times more likely to have liver fibrosis than cocaine nonusers (p = 0.044). Conclusions: Sex differences and cocaine use appear to affect liver disease among African Americans living with HIV pointing to the importance of identifying at-risk individuals to improve outcomes of liver disease.
I believe the annotation is correct, and no action is needed here. I point this out just as an interesting occurrence that highlights the challenge of aggregating textual relations into knowledge/facts.
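The aggregation challenge above can be sketched concretely: grouping extracted relations by entity pair immediately surfaces pairs that carry several, seemingly conflicting relation types. The tuples below are illustrative (mimicking the cocaine/liver fibrosis case); a real pipeline would also need to track provenance and the subgroup context that reconciles such conflicts.

```python
from collections import defaultdict

# (lifestyle factor, disease, relation type) tuples; values are illustrative
relations = [
    ("cocaine", "liver fibrosis", "Statistically_associated"),
    ("cocaine", "liver fibrosis", "positive_statistical_association"),
    ("cocaine", "liver fibrosis", "NO_statistical_association"),
    ("smoking", "lung cancer", "positive_statistical_association"),
]

by_pair = defaultdict(set)
for factor, disease, rel_type in relations:
    by_pair[(factor, disease)].add(rel_type)

# Pairs annotated with more than one relation type need reconciliation
# (e.g. by population subgroup) before they can become a single "fact".
conflicts = {pair: types for pair, types in by_pair.items() if len(types) > 1}
print(conflicts)
```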
Competing interests
The author declares that they have no competing interests.