Escribir un comentario

PREreview del Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools

por Sheli Francis

Publicado: 22 de diciembre de 2025
DOI: 10.5281/zenodo.18024121
Licencia: CC BY 4.0

Extended Summary

This paper offers a practical and well-structured framework for evaluating the safety and reliability of mental health chatbots powered by large language models. As artificial intelligence tools such as GPT-3.5 are increasingly used in mental health settings, there is a need for reliable methods to assess whether their advice is appropriate, safe, and responsible.

The authors created one hundred realistic mental health prompts based on common user queries. Each prompt was paired with an expert-crafted response that reflected safe and clinically sound guidance. These served as gold-standard comparisons. The same prompts were then given to GPT-3.5, and its responses were evaluated using five safety-focused criteria: alignment with clinical guidelines, recognition of risk, consistency in handling emergencies, offering relevant resources, and encouraging user agency.

To score the responses, the authors enlisted three licensed mental health professionals. Their ratings were treated as the ground truth. The paper then tested several automated scoring approaches to determine how well they could replicate expert evaluations. These included four different large language models (GPT-4, Claude, Gemini, and Mistral), two embedding-based similarity methods, and an Agent model that performed real-time web searches.

The comparison between human and automated ratings helped demonstrate both the potential and limitations of these evaluation methods. While some models performed reasonably well in approximating expert judgment, others were overly lenient or inconsistent, particularly in higher-risk scenarios. The authors note that their framework focuses strictly on safety and does not evaluate traits like empathy, trust-building, or user experience.

This study contributes a clear, replicable evaluation pipeline for future research and development in the mental health chatbot space. It helps establish benchmarks and opens the door for scalable, semi-automated testing of AI tools before they are deployed in sensitive settings.

Aim

The aim of this study was to develop a reproducible evaluation framework for assessing the safety of mental health chatbot responses, and to explore whether automated scoring methods can reliably match expert human evaluation.

Methods

The researchers created one hundred mental health-related prompts.
Expert-written ideal responses were crafted for each prompt.
GPT-3.5 was asked to generate responses to all one hundred prompts.
Responses were rated using five predefined safety dimensions:
1. Clinical guideline alignment
2. Risk awareness and handling
3. Consistency in emergencies
4. Helpful resource suggestions
5. Empowering the user
Three licensed mental health professionals independently rated the chatbot responses.
The expert scores were compared to those from:

Four large language models (GPT-4, Claude, Gemini, Mistral)
Two embedding-based similarity models
One Agent model using real-time internet search

Outcomes

The human expert scores served as the baseline for comparison.
Automated scorers varied in their agreement with human ratings.
The Agent model showed the highest correlation with expert scores.
Embedding-based models showed moderate agreement.
The large language model scorers tended to rate chatbot responses more positively than the expert reviewers.
The study confirmed that automated evaluation is possible but not fully reliable without human oversight.

Positive Feedback (Strengths and Results)

The methodology was carefully designed and grounded in real-world mental health concerns.
The creation of a realistic test set with expert reference answers strengthened the study’s credibility.
The five-part safety framework was clearly defined and consistently applied.
Comparing different scoring methods provides valuable insight for future research.

Minor Issues

Some expert-written reference answers did not include crisis support or escalation steps when it may have been appropriate.
The paper could benefit from examples of chatbot responses to show what qualified as unsafe or weak.
A flow diagram of the process may have helped readers better understand the evaluation pipeline.

Major Issues

There are no major issues that would prevent publication. The paper is well-designed, clearly written, and makes a valid contribution to the field. All claims are supported by appropriate methodology, and the authors clearly state the limits of their scope.

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they did not use generative AI to come up with new ideas for their review.

Puedes escribir un comentario en esta PREreview de Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools.

Antes de comenzar

Te pediremos que inicies sesión con tu ORCID iD. Si no tienes un ORCID iD, puedes crear uno.