Assessing Translation Quality in Large Language Models and Machine Translation Systems: The BLEU Metric and Language Pair Effects
DOI: https://doi.org/10.55549/epstem.1280
Keywords: Translation, Large language models, Machine translation systems
Abstract
The rapid advancement of translation technologies has transformed global communication and created new opportunities for cross-linguistic interaction. However, their relative quality compared to human translation remains contested. This paper provides a comparative evaluation of translations produced by large language models (ChatGPT, Claude, Copilot) and machine translation systems (DeepL, Google Translate, Yandex Translate). The study focuses on English–German and English–Russian translation tasks across four domains (literary, news, social, speech). The evaluation relies on the BLEU metric. Results demonstrate that translation quality depends not only on the system itself but also on the language pair and the text domain. Google Translate achieved the highest average BLEU score for German, while Claude led for Russian. Findings emphasize the need for multimethod evaluation approaches and highlight the growing competitiveness of AI-based systems depending on the type of text and language used in the translation process. This article is based on the master’s thesis “Evaluation of translation methods: Large Language Models and Machine Translation Systems versus Human Translation” (Kolchanka, 2025).
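Since the evaluation rests on the BLEU metric, a minimal sketch of how a BLEU score is computed may be useful. The implementation below is illustrative only (not the evaluation code used in the study): it computes single-reference BLEU from modified n-gram precisions with uniform weights and a brevity penalty, with a small smoothing constant as an assumption to avoid log(0) when higher-order n-grams find no match.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Minimal single-reference BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Whitespace tokenization; 1e-9 smoothing is an illustrative choice."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_prec_sum += math.log(max(clipped, 1e-9) / total)
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # → 1.0
```

Corpus-level BLEU, as typically reported in comparative studies like this one, aggregates clipped counts and lengths over all segments before taking the geometric mean, rather than averaging per-sentence scores.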
Copyright (c) 2026 The Eurasia Proceedings of Science, Technology, Engineering and Mathematics

This work is licensed under a Creative Commons Attribution 4.0 International License.