
Write a PREreview

Evaluating Open-Source LLMs for Automated Essay Scoring: The Critical Role of Prompt Design

Posted
Server: Preprints.org
DOI: 10.20944/preprints202511.1429.v1

This paper evaluates the Automated Essay Scoring (AES) performance of five open-source Large Language Models (LLMs)—LLaMA 3.2 3B, DeepSeek-R1 7B, Mistral 8×7B, Qwen2 7B, and Qwen2.5 7B—on the PERSUADE 2.0 dataset. We assess each model under three distinct prompting strategies: (1) rubric-aligned prompting, which embeds detailed, human-readable definitions of each scoring dimension; (2) instruction-based prompting, which names the criteria and assigns a grading role without elaboration; and (3) a minimal instruction-based variant, which omits role priming and provides only a concise directive. All prompts constrain the output to a single numerical score (1–6) to ensure comparability.

Performance is measured using standard AES metrics, including Exact Match, F1 Score, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Pearson and Spearman correlation coefficients, and Cohen’s κ. Results demonstrate that prompt design critically influences scoring accuracy and alignment with human judgments—with rubric-aligned prompting consistently outperforming instruction-based alternatives. Among the models, DeepSeek-R1 7B and Mistral 8×7B achieve the strongest overall results: DeepSeek-R1 attains the highest F1 Score (0.93), while Mistral 8×7B leads in correlation with human scores (Pearson = 0.863, Spearman = 0.831). Human comparison experiments further confirm that rubric-aligned prompting yields the closest alignment with expert graders.

These findings underscore the potential of lightweight, open-source LLMs for reliable and equitable educational assessment, while highlighting explicit rubric integration—not model scale—as the key driver of human-aligned AES performance.
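To make the evaluation protocol concrete, the Python sketch below shows one way the setup described in the abstract could be reproduced: illustrative skeletons of the three prompting strategies (the wording is assumed, not the authors' exact prompts) and the reported agreement metrics computed with NumPy, SciPy, and scikit-learn. The F1 averaging scheme and the unweighted Cohen's kappa are assumptions; the paper may use different settings.

# Minimal sketch of the evaluation described in the abstract.
# Prompt wording, F1 averaging, and kappa weighting are assumptions for
# illustration; the paper's exact settings may differ.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score, f1_score, mean_absolute_error

# Illustrative skeletons of the three prompting strategies (not the authors' exact prompts).
RUBRIC_ALIGNED = (
    "You are an experienced essay grader. Score the essay from 1 to 6 "
    "using this rubric:\n{rubric_text}\n\nEssay:\n{essay}\n\n"
    "Return only the numerical score."
)
INSTRUCTION_BASED = (
    "You are an essay grader. Score the essay from 1 to 6 on the named "
    "criteria: {criteria}.\n\nEssay:\n{essay}\n\nReturn only the numerical score."
)
MINIMAL_INSTRUCTION = "Score this essay from 1 to 6:\n\n{essay}\n\nReturn only the number."

def aes_metrics(human_scores, model_scores):
    """Agreement between human scores and LLM scores (both integers in 1-6)."""
    h = np.asarray(human_scores)
    m = np.asarray(model_scores)
    pearson_r, _ = pearsonr(h, m)
    spearman_r, _ = spearmanr(h, m)
    return {
        "exact_match": float(np.mean(h == m)),
        "f1": f1_score(h, m, average="weighted"),    # averaging choice assumed
        "mae": mean_absolute_error(h, m),
        "rmse": float(np.sqrt(np.mean((h - m) ** 2))),
        "pearson": pearson_r,
        "spearman": spearman_r,
        "cohens_kappa": cohen_kappa_score(h, m),     # unweighted; the paper may weight it
    }

# Toy usage with made-up scores; in the study these would be PERSUADE 2.0
# human scores paired with each model's output under each prompt.
print(aes_metrics([4, 3, 5, 2, 6], [4, 3, 4, 2, 5]))

Running the function on paired human and model scores returns all seven metrics at once, mirroring the per-model, per-prompt comparison the abstract reports.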

You can write a PREreview of Evaluating Open-Source LLMs for Automated Essay Scoring: The Critical Role of Prompt Design. A PREreview is a review of a preprint and can vary from a few sentences to a lengthy report, similar to a journal-organized peer-review report.
