European Multilingual Evaluation
The first rigorous benchmark evaluating LLM quality beyond English, across all 24 official EU languages in professional and institutional contexts.
2,880
Total tasks
148
Native authors
24
Languages
4
Task genres
Leaderboard
Ranked by overall score · last updated March 2026
Languages covered
All 24 official EU languages · native-authored tasks
About
KAROKAN-LANG systematically evaluates large language models across all 24 official EU languages using tasks sourced from real professional and administrative settings. It exposes performance gaps hidden by English-only benchmarks and reveals how model quality degrades — or holds — as context shifts to non-English European languages. Tasks are native-language originals, not translations, authored by professional linguists and domain experts who are native speakers.
Methodology
Each of the 24 EU languages is represented by at least 120 native-authored tasks. Tasks span four genres: legislative text comprehension, administrative correspondence drafting, professional document summarization, and cross-lingual knowledge retrieval. Scoring uses a combination of automated metrics (BLEU, BERTScore) and human evaluation for a representative 15% sample per language.
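The scoring described above blends automated metrics with human ratings on a sample. As a minimal sketch of how such a pipeline might aggregate scores — the 50/50 weighting and equal per-language averaging are illustrative assumptions, not published details of the benchmark:

```python
# Hypothetical aggregation sketch for a KAROKAN-LANG-style pipeline:
# blend automated metric scores with human ratings collected on a
# 15% sample, then macro-average across languages.

LANGUAGES = 24
TASKS_PER_LANGUAGE = 120   # 24 x 120 = 2,880 total tasks
HUMAN_SAMPLE_RATE = 0.15   # share of tasks scored by human evaluators
AUTO_WEIGHT = 0.5          # illustrative weighting, not from the benchmark
HUMAN_WEIGHT = 0.5

def language_score(auto_scores, human_scores):
    """Blend the mean automated score (e.g. BLEU/BERTScore, scaled 0-1)
    with the mean human rating for one language."""
    auto_mean = sum(auto_scores) / len(auto_scores)
    human_mean = sum(human_scores) / len(human_scores)
    return AUTO_WEIGHT * auto_mean + HUMAN_WEIGHT * human_mean

def overall_score(per_language_scores):
    """Unweighted macro-average across languages, so every language
    counts equally regardless of speaker population."""
    return sum(per_language_scores) / len(per_language_scores)
```

Macro-averaging is the natural choice here: it prevents strong performance in a few high-resource languages from masking degradation in the rest.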
Cite
@misc{karokan2026lang,
title={KAROKAN-LANG: European Multilingual Evaluation},
author={Karokan Research Team},
year={2026},
url={https://karokan.com/research/karokan-lang}
}

Get involved
Access the dataset, submit a model for evaluation, or collaborate with our research team.
Contact research →

Other benchmarks
The European AI Productivity Index
Assesses whether frontier AI models can perform economically valuable professional tasks in European contexts — EU law, multi-country taxation, industrial standards, and cross-border regulatory analysis.
AI Act Compliance Benchmark
The first benchmark evaluating whether AI systems satisfy EU AI Act requirements: risk classification, documentation, transparency, and human oversight at model and system level.