Assessing Translation Quality in Large Language Models and Machine Translation Systems: The BLEU Metric and Language Pair Effects

Authors

  • Philipp Rosenberger, University of Applied Sciences Campus Vienna
  • Natallia Kolchanka, University of Applied Sciences Campus Vienna

DOI:

https://doi.org/10.55549/epstem.1280

Keywords:

Translation, Large language models, Machine translation systems

Abstract

The rapid advancement of translation technologies has transformed global communication and created new opportunities for cross-linguistic interaction. However, their quality relative to human translation remains contested. This paper provides a comparative evaluation of translations produced by large language models (ChatGPT, Claude, Copilot) and machine translation systems (DeepL, Google Translate, Yandex Translate). The study covers English–German and English–Russian translation tasks across four text domains (literary, news, social, speech), with translation quality evaluated using the BLEU metric. Results demonstrate that translation quality depends not only on the system itself but also on the language pair and the text domain: Google Translate achieved the highest average BLEU score for German, while Claude led for Russian. The findings emphasize the need for multimethod evaluation approaches and highlight the growing competitiveness of AI-based systems, whose performance varies with the type of text and the language involved. This article is based on the master’s thesis “Evaluation of translation methods: Large Language Models and Machine Translation Systems versus Human Translation” (Kolchanka, 2025).
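
To illustrate the kind of corpus-level BLEU evaluation the abstract describes, the following minimal Python sketch scores hypothetical system outputs against human reference translations. The use of the sacrebleu library and the example sentences are assumptions for illustration; the paper does not specify its tooling.

    # Minimal sketch of corpus-level BLEU scoring (illustrative only).
    # Assumes sacrebleu is installed: pip install sacrebleu
    import sacrebleu

    # Hypothetical system outputs, e.g. from an MT system or an LLM.
    hypotheses = [
        "The cat sits on the mat.",
        "Global communication has changed rapidly.",
    ]

    # One set of human reference translations, aligned segment by segment.
    # sacrebleu accepts multiple reference sets; a single set is used here.
    references = [[
        "The cat is sitting on the mat.",
        "Global communication has changed quickly.",
    ]]

    # corpus_bleu aggregates n-gram precision over the whole corpus and
    # applies the brevity penalty; .score is on a 0-100 scale.
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")

In a comparison like the one in the paper, each system's outputs for a given language pair and text domain would be scored this way against the same references, and the resulting BLEU scores averaged per system.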

Published

2025-12-30

How to Cite

Assessing Translation Quality in Large Language Models and Machine Translation Systems: The BLEU Metric and Language Pair Effects. (2025). The Eurasia Proceedings of Science, Technology, Engineering and Mathematics, 38, 824-829. https://doi.org/10.55549/epstem.1280