Write a PREreview

Beyond BLEU: GPT-5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation

Server: Preprints.org
DOI: 10.20944/preprints202511.1292.v1

Recent progress in large language models (LLMs) has rekindled the promise of high-quality machine translation (MT), yet evaluation remains a bottleneck. Traditional automatic metrics (e.g., BLEU) are fast but fail to capture semantic and pragmatic nuances reflected in human judgments. We present a multidimensional framework—inspired by MQM—that augments standard metrics (Adequacy, Fluency) with three linguistic dimensions: Morphosyntactic, Semantic, and Pragmatic. We compare three Small Language Models for English→Indonesian: Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B). Two controlled experiments are conducted: (i) Preliminary (1,000 translations, GPT-5-only scoring of Adequacy/Fluency + BLEU), and (ii) Final (100 translations, three human experts + GPT-5) on all five metrics. We compute inter-annotator reliability (Krippendorff’s α, weighted κ) and annotator competence (MACE). Results show consistent model ranking (Gemma 3 (1B) > LLaMA 3.2 (3B) > Qwen 3 (0.6B)) and strong GPT-5–human correlation (r = 0.822). To validate practical applicability, a classroom study with 26 translation students tested the metrics in real learning settings. Using the same multidimensional rubric, students rated MT outputs across pre-, post-, and final-test phases. Their mean absolute error (MAE) decreased from 0.97 to 0.83, while Exact Match Rate increased from 0.30 to 0.50 after rubric calibration, demonstrating that the proposed framework and GPT-5 evaluation can be effectively transferred to educational contexts for evaluator training and feedback alignment.
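The abstract cites several agreement and error statistics (weighted κ, Pearson r, MAE, Exact Match Rate). As a rough, non-authoritative illustration of how such quantities are typically computed, the sketch below uses hypothetical 1–5 rubric ratings (not data from the study) and standard scipy/scikit-learn routines; Krippendorff's α and MACE require dedicated tooling and are omitted here.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

# Hypothetical 1-5 ordinal ratings for ten segments: GPT-5 vs. a human
# reference score. These values are illustrative only, not the paper's data.
gpt5_scores  = np.array([4, 3, 5, 2, 4, 3, 5, 4, 2, 3])
human_scores = np.array([4, 3, 4, 2, 5, 3, 5, 4, 3, 3])

# GPT-5–human agreement: Pearson correlation (the paper reports r = 0.822
# on its own data).
r, p_value = pearsonr(gpt5_scores, human_scores)

# Quadratically weighted Cohen's kappa, a chance-corrected agreement
# statistic suited to ordinal rubric scores.
kappa_w = cohen_kappa_score(gpt5_scores, human_scores, weights="quadratic")

# Classroom-style error metrics: mean absolute error and exact match rate
# of one set of ratings against the reference ratings.
mae = mean_absolute_error(human_scores, gpt5_scores)
exact_match_rate = np.mean(gpt5_scores == human_scores)

print(f"Pearson r        : {r:.3f} (p = {p_value:.3f})")
print(f"Weighted kappa   : {kappa_w:.3f}")
print(f"MAE              : {mae:.3f}")
print(f"Exact match rate : {exact_match_rate:.2f}")
```

In the study's setup, the same computations would be applied per metric (Adequacy, Fluency, Morphosyntactic, Semantic, Pragmatic), comparing GPT-5 scores against expert means and student scores against the calibrated rubric reference.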

You can write a PREreview of Beyond BLEU: GPT-5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation. A PREreview is a review of a preprint and can vary from a few sentences to a lengthy report, similar to a journal-organized peer-review report.
