
PREreview of All You Need Is Context: Clinician Evaluations of various iterations of a Large Language Model-Based First Aid Decision Support Tool in Ghana

DOI: 10.5281/zenodo.12958884
License: CC BY 4.0

RESEARCH QUESTIONS 

This study attempts to answer the following research questions:

  1. How well do five (5) distinct generalised Large Language Models (LLMs), when combined with Prompt Engineering and Retrieval Augmented Generation (RAG) techniques, perform in providing first aid advice for medical emergencies in resource-constrained settings?

  2. Can context-specific prompts improve the relevance and suitability of LLM responses in clinical scenarios in Low-and Middle-Income Countries (LMICs) like Ghana?

  3. How do the evaluations of LLM-generated medical advice by clinicians (experts in managing medical emergencies in resource-constrained settings) compare with machine evaluations?

RESEARCH MAIN GOAL AND ITS IMPORTANCE

The main goal of this study was to evaluate the suitability and usefulness of distinct generalised LLMs, combined with Prompt Engineering and RAG techniques, for clinical decision support during medical emergencies in resource-constrained settings in Ghana.

This study provides insights into the potential usefulness of LLMs in improving healthcare delivery systems by augmenting the limited financial, logistical, and human resources available in LMICs. It also helps us understand how simple, context-specific prompts can affect the performance of generalised LLMs, ensuring that the medical advice generated is not only accurate but also practical and relevant to the specific circumstances of patients. The study also highlights discrepancies between human and machine evaluations, emphasising the need for human input both when assessing LLM outputs and when designing more sophisticated prompts for generalised LLM tools that could outperform state-of-the-art medical LLMs.

RESEARCH MAIN APPROACH AND WHAT THE AUTHORS DID TO ADDRESS RESEARCH QUESTIONS

The authors selected and tested top-performing LLMs (OpenAI's GPT-4 Turbo Preview, accessed through both the Assistants Application Programming Interface (API) and the Chat Completions API; Gemini 1.5 Pro; and Claude Sonnet), based on their rankings on the LMSYS Chatbot Arena Leaderboard, alongside a combination of Prompt Engineering and Retrieval Augmented Generation (RAG) techniques, to produce first aid responses for medical emergencies in resource-constrained settings of Ghana. They tuned the model parameters for quick response generation and designed context-specific prompts using two chunking approaches: both the "CharacterTextSplitter" tool from LangChain and the "all-mpnet-base-v2" transformer model, sourced from the HuggingFace model hub, were used to divide the reference text into chunks. Thirteen clinician evaluators from Ghana, with an average of 3 years' working experience in resource-constrained settings, then rated the 50 responses generated by the 5 LLM configurations across 10 clinical scenarios. The responses were split between the two RAG approaches and evaluated by two groups drawn from the same pool of physicians, as follows:

  Approach 1: the 30 responses generated with RAG in group 1 were ranked by all 13 physicians and given an "Overall score" on a 10-point Likert scale, with 0 representing "Totally Unsatisfactory" and 10 "Totally Satisfactory".

  Approach 2: the other 20 responses, generated with RAG in group 2, were evaluated by 8 of the 13 physicians using a more robust approach based on accuracy, conciseness, safety, and helpfulness, in addition to the 10-point Likert scale.
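
As an illustration of the chunking-and-retrieval step described above, the following is a minimal sketch (not the authors' code; the source file, query text, and chunk sizes are hypothetical) combining LangChain's CharacterTextSplitter with all-mpnet-base-v2 embeddings from the HuggingFace model hub:

```python
# Minimal sketch of the chunking-and-retrieval step (not the authors' code).
# The source file, query text, and chunk sizes below are hypothetical.
from langchain.text_splitter import CharacterTextSplitter
from sentence_transformers import SentenceTransformer, util

# Split the reference first aid text into overlapping chunks.
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(open("first_aid_manual.txt").read())

# Embed the chunks and the clinical query with the same transformer model.
model = SentenceTransformer("all-mpnet-base-v2")
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
query = "Severe bleeding from a deep arm laceration at a rural clinic."
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the most relevant chunks to place in the LLM's prompt context.
hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=3)[0]
context = "\n\n".join(chunks[hit["corpus_id"]] for hit in hits)
```

In a setup like this, the retrieved context would be prepended to the context-specific prompt before it is sent to the LLM.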

RESEARCH MAIN FINDINGS

The results of the quantitative analysis show that Gemini 1.5 Pro combined with Prompt Engineering (Response B) outperformed the other Large Language Models (LLMs) and their various combinations in terms of accuracy, safety, and helpfulness, achieving the best overall score of 7.8 on the 10-point Likert scale and a mean ranking of 7.4. However, the more sophisticated RAG approach (a combination of Prompt Engineering and RAG) improved the performance of the other two models (GPT-4 Turbo and Claude Sonnet), although they surpassed Gemini 1.5 Pro only in terms of conciseness.
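
For reference, summary statistics like the means and rankings above are typically aggregated per model across physician ratings; a minimal sketch (with made-up scores, not the study's data) follows:

```python
# Minimal sketch (made-up scores, not the study's data) of aggregating
# per-model Likert ratings into the kind of summary reported above.
import pandas as pd

# One row per (physician, response): the model and its 0-10 overall score.
ratings = pd.DataFrame({
    "model": ["Gemini 1.5 Pro", "GPT-4 Turbo", "Claude Sonnet"] * 2,
    "overall_score": [8, 7, 6, 8, 6, 7],
})

# Mean and standard deviation of the overall score for each model.
print(ratings.groupby("model")["overall_score"].agg(["mean", "std"]).round(1))
```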

The qualitative analysis indicates that clinicians mostly valued Large Language Model (LLM) responses that were considered "satisfactory", while "concise" and "QuickTransfer" responses also occurred frequently, suggesting their importance. However, there were occasional concerns about the accuracy of diagnosis in some areas, as well as responses that were not considered concise enough. In general, these results emphasise the importance of developing LLM-based tools that can provide first aid advice to physicians, enabling them to communicate and respond effectively to clinical scenarios such as medical emergencies, especially in resource-constrained settings of Low- and Middle-Income Countries.

WHAT IS MOST INTERESTING ABOUT THE RESEARCH

What stands out as most interesting from this research is that it encourages and sensitises application developers in resource-constrained settings of Low- and Middle-Income Countries (LMICs) to build cost-effective tools on generalised LLMs, which are often more accessible to people in these settings and which, using simpler techniques, can perform on par with or better than specialised medical LLMs. An example provided in the study is the SnooCODE Red application being developed in Ghana.

RELATIONSHIP OF THE MANUSCRIPT TO PUBLISHED LITERATURE AND FUTURE RESEARCH

The study builds on existing literature demonstrating the potential and applications of generalised LLMs in improving healthcare delivery, supporting clinical decision-making, and acting as virtual health assistants, among other roles. The results of this study also pave the way for several future research directions:

  1. It shows that while RAG can enhance the performance of generalised LLMs, improper implementation can nullify those benefits. Future research could explore advanced RAG techniques and more sophisticated Prompt Engineering to further improve the accuracy and usefulness of generalised LLMs.

  2. It also paves the way for the development of cost-effective LLM-based tools for LMICs, designed to meet the specific needs of people living in these settings.

  3. The integration of LLMs into healthcare delivery systems is undoubtedly crucial; however, this study reveals discrepancies between human and machine evaluations of the performance of generalised LLMs, especially in context-specific scenarios. For example, clinicians familiar with working in rural settings were not satisfied with LLM responses that did not demonstrate a strong sense of urgency about the quick transfer of patients to nearby hospitals. Future research could therefore develop better evaluation metrics that improve the ability of LLM evaluations to capture the context-specific scenarios applicable to rural areas with limited resources (a minimal sketch of how such a human-machine gap can be quantified follows this list).
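
To make point 3 concrete, one simple way to quantify disagreement between clinician and automated evaluations is a rank correlation over the same set of responses. A minimal sketch with made-up scores (not the study's data):

```python
# Minimal sketch (made-up scores, not the study's data) of quantifying the
# human-machine evaluation gap: rank-correlate clinician Likert scores with
# automated scores assigned to the same ten responses.
from scipy.stats import spearmanr

clinician_scores = [8, 7, 9, 4, 6, 7, 5, 8, 3, 6]  # one score per response
machine_scores = [7, 8, 8, 7, 6, 8, 7, 9, 6, 7]    # e.g. an LLM-as-judge score

rho, p_value = spearmanr(clinician_scores, machine_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A low or negative rho would flag the kind of divergence seen on the
# urgency-of-transfer scenarios described above.
```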

RESEARCH MAIN STRENGTH AND WEAKNESS

The main strength of this study lies in its context-specific evaluation of generalised LLMs in resource-constrained settings. By using physicians familiar with local medical challenges in rural areas, the study provides a realistic assessment of the suitability and effectiveness of generalised LLMs in practical scenarios.

The main weakness is the limited scope and relatively small sample size, particularly the number of physicians (13) involved and the number of clinical scenarios (10) evaluated. This may limit the generalisability of the findings, because a larger sample size and a more diverse group of evaluators might yield different results.

MAJOR ISSUES

Research Methodology

  1. In the abstract, the distinct generalised LLMs used could have simply been stated as 3 (GPT-4 Turbo, Gemini 1.5 Pro, and Claude Sonnet), alongside moderate Prompt Engineering and RAG, instead of the 5 mentioned, to align with the research methodology, results, and interpretation.

  2. Although the study clearly states that the data generated from Gemini 1.5 Pro was not compatible with RAG retrieval, owing to the particular model tools and RAG approaches used, I have reason to believe there are specific embedding models on Google's AI developer site that are compatible with Gemini for retrieval and were not considered in this study (https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding); see the sketch after this list. Adopting RAG implementations designed for each specific, individual LLM should be paramount in this type of research. Testing Gemini 1.5 Pro with a suitable RAG model would have provided more insight into why and how the model outperformed the other LLMs. It could also have justified why RAG is touted, as this study clearly opines, as a highly promising approach to improving the factuality, reasoning, and interpretability of LLM outputs.

  3. Response evaluation and ranking by physicians could have been distributed more evenly across all experimental data, without favouring one RAG approach or LLM over another, because the explanations the authors provide for their choice of methodology appear biased towards certain LLMs. This is also possibly why RAG approach 2 had a higher mean ranking score than RAG approach 1. The chosen design does not address one of the questions this research is trying to answer, while at the same time influencing the outcome of the study. To eliminate this kind of bias, since 50 responses were generated for the two RAG approaches, it would have been better to divide the responses equally into two groups of 25 per RAG approach, instead of 30 and 20. All 13 physicians should also have evaluated the data from both RAG approaches to provide a more robust and unbiased output. Finally, the evaluators should have ranked both RAG approaches using the Likert scale together with the other parameters.

  4. In the methodology, it was mentioned that 20 responses were ranked based on "accuracy", "conciseness", and "helpfulness", whereas an additional parameter, "safety", appears in the results. The authors should clarify this.

  5. A link to, or the source code for, the analyses should be provided to ensure the reproducibility and validity of this study.

  6. The presentation and interpretation of the results need to be checked and properly screened for errors to avoid misconceptions, wrong conclusions, and breaches of research integrity.
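
Regarding point 2 above, the following is a minimal sketch (assuming the google-generativeai Python package, a valid API key, and the embedding model named in the linked Google AI documentation; the texts are hypothetical) of how Gemini-native embeddings could back a RAG pipeline:

```python
# Minimal sketch of Gemini-native embeddings for RAG retrieval, as suggested
# in point 2 above. Assumes the google-generativeai package and a valid API
# key; the model name follows the linked Google AI documentation, and the
# texts are hypothetical.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Embed a document chunk with the document-retrieval task type.
doc_embedding = genai.embed_content(
    model="models/text-embedding-004",
    content="Apply firm, direct pressure to the wound with a clean cloth.",
    task_type="retrieval_document",
)["embedding"]

# Embed the clinician's query with the matching query task type.
query_embedding = genai.embed_content(
    model="models/text-embedding-004",
    content="How do I control severe bleeding before transferring the patient?",
    task_type="retrieval_query",
)["embedding"]
```

Vectors produced this way could feed the same similarity search used for the other models, keeping the retrieval stack compatible with Gemini 1.5 Pro.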

MINOR ISSUES

  1. In the introduction, "Low-and Low-Middle-Income countries (LMICs)" [2] should be "Low- and Middle-Income Countries (LMICs)".

Results

  1. In the results, under quantitative analysis, the sentence "Table 3 shows the 8 codes and their descriptions." should refer to Table 4, not Table 3.

  2. Could specific names or examples of the resource-constrained settings in Ghana where these LLMs were tested be mentioned? This would add credibility and context to the research.

  3. The title of Figure 3 contains typographical errors, which made the results difficult to interpret.

  4. In Table 1, the overall mean ranking score and standard deviation, when computed across the 3 models, are approximately 7.0 and 1.0 respectively, as opposed to the 7.1 and 1.4 stated above the table. Also, keeping the scores in the table as decimals preserves the meaningful differences among the models.

  5. The parameter "Overall Score" is missing from the title of Table 3.

  6. Use Arabic numerals rather than Roman numerals to number tables; for example, Table 4, not Table IV.

  7. Data in Table 3 are not recorded to the same number of decimal places: most are given to 1 decimal place, while others are whole numbers.

  8. The way the data in Figure 1 are presented makes the figure difficult to understand.

  9. In section G of the methodology, under response evaluation and ranking, there is a typographical error in the definition of the 10-point Likert scale: 0 should represent "Totally Unsatisfactory" and 10 "Totally Satisfactory".

RESEARCH LIMITATIONS

  1. Limited availability of, and difficulty accessing, Application Programming Interfaces (APIs) when selecting suitable LLMs.

  2. Lack of computational resources, such as advanced GPUs, to run high-ranking open-source medical LLMs.

  3. Lack of an appropriate RAG model and approach that could have maximised the performance of the best-performing Gemini 1.5 Pro model.

  4. The authors acknowledged that they could have evaluated a larger cohort of responses and utilised a more comprehensive evaluation framework.

RECOMMENDATION

I would recommend this interesting manuscript for publication and for others to read, provided that most, if not all, of the major issues and the minor issues raised are addressed. The authors may find the references listed below helpful.

No conflict of interest.

REFERENCES

  • Fogel, A.L., Kvedar, J.C. Artificial intelligence powers digital medicine. npj Digital Med 1, 5 (2018). https://doi.org/10.1038/s41746-017-0012-2

  • Topol, E.J. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 25, 44–56 (2019). https://doi.org/10.1038/s41591-018-0300-7

  • Esteva, A., Robicquet, A., Ramsundar, B. et al. A guide to deep learning in healthcare. Nat Med 25, 24–29 (2019). https://doi.org/10.1038/s41591-018-0316-z

  • Google AI for Developers. Gemini API model documentation: text embedding. https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding

  • Bajwa, J., Munir, U., Nori, A., Williams, B. Artificial intelligence in healthcare: transforming the practice of medicine. Future Healthc J 8(2), e188–e194 (2021). https://doi.org/10.7861/fhj.2021-0095

Competing interests

The author declares that they have no competing interests.