LLMs are not good for translation, "yet"

In the past year, LLMs have taken huge strides in language generation and translation tasks. And yet, when it comes to translating literature (poetry, novels, short stories), they’re still struggling to match human translators. I recently read the paper “How Good Are LLMs for Literary Translation, Really?” by Ran Zhang, Wei Zhao, and Steffen Eger, which dives deep into this issue. It’s a highly detailed and rigorous study that goes beyond flashy benchmarks and really asks: how good are LLMs actually at literary translation?
Here’s what stood out to me, especially from the perspective of someone building AI systems and thinking critically about where LLMs shine and where we still need human judgment.
The Unique Challenge of Literary Translation
Translating literature isn’t just about converting words, it’s about capturing emotion, tone, cultural nuance, and the author’s voice. You’re not translating instructions or FAQs. You’re translating metaphors, ambiguity, rhythm, style.
This is why standard sentence-level metrics or simple word overlap scores fall flat. You might be able to match a sentence syntactically and still butcher the meaning, mood, or artistry.
The paper takes this challenge head-on and proposes a better way to evaluate literary machine translation (MT): a paragraph-level corpus called LITEVAL-CORPUS.
LITEVAL-CORPUS: A Serious Dataset for Serious Evaluation
One of the big contributions in this paper is the introduction of LITEVAL-CORPUS, a new benchmark corpus specifically for literary translation evaluation. Here’s why it matters:
✅ Paragraph-level focus (instead of sentence-level), because that’s the smallest unit where literary style really emerges.
✅ Includes verified human translations from published works, both classic and contemporary.
✅ Covers 4 language pairs, with over 13,000 evaluated sentences from 9 different MT systems.
This isn’t just a toy dataset. It’s serious work designed to challenge LLMs in the area they’re weakest, creativity and nuance.
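To make that concrete, here’s a rough sketch of what a single paragraph-level record in a corpus like this might look like. The field names are my own guesses for illustration, not the actual LITEVAL-CORPUS schema:

```python
from dataclasses import dataclass, field

@dataclass
class ParagraphRecord:
    """One evaluation unit: a source paragraph with its candidate translations.

    Field names are illustrative, not the real LITEVAL-CORPUS schema.
    """
    source_lang: str                     # e.g. "de"
    target_lang: str                     # e.g. "en"
    source_paragraph: str                # the original literary paragraph
    human_translation: str               # verified published human translation
    mt_outputs: dict[str, str] = field(default_factory=dict)  # system name -> translation

# A toy record (placeholders instead of real text)
record = ParagraphRecord(
    source_lang="de",
    target_lang="en",
    source_paragraph="…",
    human_translation="…",
    mt_outputs={"gpt-4o": "…", "deepl": "…"},
)
```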
Evaluation: Not All Human Judgments Are Created Equal
One of the most important sections in the paper explores how to evaluate translations properly. Not all human evaluation schemes are equal, and depending on how you test them, even good human translations can look bad.
The authors compared three schemes:
MQM (Multidimensional Quality Metrics), the industry standard for technical translations.
SQM (Scalar Quality Metric), a simple 0-100 rating scale.
BWS (Best-Worst Scaling), a direct comparison where evaluators choose the best and worst among options (a toy scoring sketch for this one appears at the end of this section).
Here’s the key takeaway:
👉 Complex doesn’t mean better.
MQM, despite being detailed and widely used, failed hard in the literary context, especially with student evaluators. It misjudged human translations 60% of the time. On the other hand, BWS, a much simpler method, aligned far more closely with professional opinions and consistently picked human translations as better.
Evaluator expertise also played a huge role. Professional translators made far more accurate judgments than students, regardless of the evaluation scheme.
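Part of the appeal of BWS is that the aggregation is dead simple: the standard counting approach gives each translation a score of (times picked best - times picked worst) divided by times shown. Here’s a minimal sketch of that counting step; the tuple layout for the judgments is just an assumption for illustration:

```python
from collections import defaultdict

def bws_scores(judgments):
    """Aggregate Best-Worst Scaling picks by simple counting.

    `judgments` is a list of (shown_items, best_item, worst_item) tuples,
    one tuple per annotation. Returns {item: (#best - #worst) / #appearances}.
    """
    best, worst, shown = defaultdict(int), defaultdict(int), defaultdict(int)
    for items, best_item, worst_item in judgments:
        for item in items:
            shown[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Toy example: three annotators compare a human translation with two MT outputs
judgments = [
    (["human", "gpt-4o", "deepl"], "human", "deepl"),
    (["human", "gpt-4o", "deepl"], "human", "gpt-4o"),
    (["human", "gpt-4o", "deepl"], "gpt-4o", "deepl"),
]
print(bws_scores(judgments))  # {'human': 0.67, 'gpt-4o': 0.0, 'deepl': -0.67} (rounded)
```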
So, How Do LLMs Actually Perform?
The paper compared outputs from a wide range of systems:
🧠 Modern LLMs: GPT-4o, LLaMA 3, Qwen2, Gemma 1.1, TowerInstruct
📦 Traditional MT: Google Translate, DeepL
🧪 Older SOTA: M2M_100, NLLB
Across the board, human translations won, not just barely, but clearly. Here's why:
LLMs tend to produce more literal translations.
They show higher lexical overlap with the source text and with each other (a rough way to probe this is sketched right after this list).
Their outputs are less diverse in structure and style.
Human translations diverge more, indicating creativity and independent interpretation.
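The lexical-overlap point is easy to probe on your own outputs. A crude proxy, and my own approximation rather than the paper’s exact measure, is pairwise token overlap (Jaccard similarity) between what different systems produce for the same paragraph:

```python
from itertools import combinations

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercased token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def mean_pairwise_overlap(translations: dict[str, str]) -> float:
    """Average Jaccard overlap across all pairs of system outputs."""
    pairs = list(combinations(translations.values(), 2))
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy usage: if the MT outputs overlap heavily with each other but far less
# with the human translation, that hints at literal, low-diversity output.
outputs = {
    "gpt-4o": "The night lay heavy over the silent town.",
    "deepl": "The night lay heavy over the quiet town.",
    "human": "Night pressed down on the town, and nothing stirred.",
}
print(mean_pairwise_overlap(outputs))
```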
Interestingly, even GPT-4o, the current flagship LLM, still trails human performance when judged properly. The more we expect artistic value and creative interpretation, the more obvious the gap becomes.
Automatic Metrics? Still Not There.
The paper also evaluates state-of-the-art automatic metrics, including:
Prometheus 2
XCOMET-XL/XXL
GEMBA-MQM (Original & Literary versions)
All of them struggle to match human evaluators. In fact, some are worse than random guessing when it comes to ranking human translations above machine ones.
Only GEMBA-MQM (Literary) showed decent correlation with human evaluation, and even that correlation was mostly on accuracy, not on style, fluency, or terminology, which, as anyone who's read translated poetry knows, are exactly what matters most.
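If you want to sanity-check a metric yourself, two numbers are worth computing, roughly in the spirit of the paper’s analysis: how often the metric ranks the human translation above the machine one, and how well its scores correlate with human judgments. A minimal sketch with made-up scores (not the paper’s data):

```python
from scipy.stats import kendalltau

def human_win_rate(paired_scores):
    """Fraction of paragraphs where a metric scores the human translation
    above the MT output. `paired_scores` is a list of (human_score, mt_score).
    Random guessing sits around 0.5."""
    return sum(1 for h, m in paired_scores if h > m) / len(paired_scores)

# Made-up illustration values, not the paper's data
paired_scores = [(0.82, 0.88), (0.91, 0.86), (0.74, 0.79), (0.88, 0.90)]
print("human ranked first:", human_win_rate(paired_scores))  # 0.25 here

# Segment-level agreement between an automatic metric and human judgments
metric_scores = [0.82, 0.88, 0.91, 0.86, 0.74, 0.79]
human_ratings = [78, 70, 92, 85, 60, 73]  # e.g. SQM ratings on a 0-100 scale
tau, p_value = kendalltau(metric_scores, human_ratings)
print("Kendall tau:", round(tau, 2))
```

A win rate near 0.5 means the metric is no better than a coin flip at spotting the human translation, which is exactly the failure mode described above.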
A Few Observations From a Developer’s Perspective
This study reinforced a few things I’ve been noticing when playing with translation tasks using LLMs:
🧪 Literalness is not fluency. LLMs tend to play it safe. They’re great at giving something that’s “technically correct”, but literary translation is about taking risks with language.
🤖 Automatic metrics lag behind model capabilities. We need new metrics that go beyond BLEU, ROUGE, or even learned scoring models. Otherwise, we’re measuring Shakespeare with a spelling test (a tiny illustration follows after this list).
💡 Evaluator design is everything. If you’re running your own experiments, pick your evaluation method carefully. Simple schemes like BWS often give you the cleanest insights.
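To see why surface-overlap metrics are a poor fit, compare two defensible renderings of the same line: a BLEU-style n-gram count can land near zero even though both are faithful. A tiny self-contained sketch, using plain bigram precision as a stand-in for one component of BLEU and invented example sentences:

```python
def bigrams(text: str):
    """Lowercased word bigrams of a sentence."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def bigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate bigrams that also appear in the reference,
    a rough stand-in for one component of BLEU."""
    cand, ref = bigrams(candidate), set(bigrams(reference))
    return sum(1 for bg in cand if bg in ref) / len(cand) if cand else 0.0

# Two reasonable renderings of the same (invented) source line
reference = "The sea kept its secrets, as it always had."
candidate = "As ever, the ocean held on to what it knew."

print(bigram_precision(candidate, reference))  # 0.0, despite both being valid
```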
This paper isn’t just a critique, it’s a guidebook for how we should be evaluating LLMs in areas that aren’t just about information, but about art.
Final Thoughts
The hype around LLMs is real, but so is the nuance. When it comes to literary translation, we’re still firmly in the era where human creativity and context win. That doesn’t mean LLMs have no place, but it does mean we need to evaluate them with the right tools, and not assume they’ve solved a problem just because they can pass a benchmark.
And remember, we are not there "yet." With better models, future translations will surely get better. But for now, human creativity still leads the way when it comes to capturing the essence of literature.
"Even the greatest AI starts with a print statement, keep coding."
Link to the paper: https://arxiv.org/pdf/2410.18697