Write a PREreview

Evaluating Open-Source LLMs for Automated Essay Scoring: The Critical Role of Prompt Design

Published
Server: Preprints.org
DOI: 10.20944/preprints202511.1429.v1

This paper evaluates the Automated Essay Scoring (AES) performance of five open-source Large Language Models (LLMs)—LLaMA 3.2 3B, DeepSeek-R1 7B, Mistral 8×7B, Qwen2 7B, and Qwen2.5 7B—on the PERSUADE 2.0 dataset. We assess each model under three distinct prompting strategies: (1) rubric-aligned prompting, which embeds detailed, human-readable definitions of each scoring dimension; (2) instruction-based prompting, which names the criteria and assigns a grading role without elaboration; and (3) a minimal instruction-based variant, which omits role priming and provides only a concise directive. All prompts constrain the output to a single numerical score (1–6) to ensure comparability.

Performance is measured using standard AES metrics, including Exact Match, F1 Score, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Pearson and Spearman correlation coefficients, and Cohen's kappa. Results demonstrate that prompt design critically influences scoring accuracy and alignment with human judgments—with rubric-aligned prompting consistently outperforming instruction-based alternatives. Among the models, DeepSeek-R1 7B and Mistral 8×7B achieve the strongest overall results: DeepSeek-R1 attains the highest F1 Score (0.93), while Mistral 8×7B leads in correlation with human scores (Pearson = 0.863, Spearman = 0.831). Human comparison experiments further confirm that rubric-aligned prompting yields the closest alignment with expert graders.

These findings underscore the potential of lightweight, open-source LLMs for reliable and equitable educational assessment, while highlighting explicit rubric integration—not model scale—as the key driver of human-aligned AES performance.
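The abstract names the full set of AES metrics used to compare model scores with human grades. The sketch below shows one way those metrics could be computed for a single model/prompt configuration using scikit-learn and SciPy; the toy score arrays, variable names, weighted F1 averaging, and the choice of unweighted Cohen's kappa are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: AES metrics for one model/prompt configuration, given human
# scores and model-predicted scores on the 1-6 scale described in the abstract.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             mean_absolute_error, mean_squared_error)

human = np.array([4, 3, 5, 2, 4, 6, 3])   # expert grader scores (illustrative)
model = np.array([4, 3, 4, 2, 5, 6, 3])   # scores parsed from the LLM output

metrics = {
    "exact_match": accuracy_score(human, model),
    "f1_weighted": f1_score(human, model, average="weighted"),
    "mae": mean_absolute_error(human, model),
    "rmse": np.sqrt(mean_squared_error(human, model)),
    "pearson": pearsonr(human, model)[0],
    "spearman": spearmanr(human, model)[0],
    # The abstract reports Cohen's kappa; whether it is weighted is not stated,
    # so plain (unweighted) agreement is used here.
    "cohen_kappa": cohen_kappa_score(human, model),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```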

You can write a PREreview of Evaluating Open-Source LLMs for Automated Essay Scoring: The Critical Role of Prompt Design. A PREreview is a review of a preprint and can range from a few sentences to a lengthy report, similar to a peer review report organized by a journal.
