European Multilingual Evaluation
The first rigorous benchmark evaluating LLM quality beyond English, across all 24 official EU languages in professional and institutional contexts.
2,880
Total tasks
148
Native authors
24
Languages
4
Task genres
Leaderboard
Ranked by overall score · last updated March 2026
Languages covered
All 24 official EU languages · native-authored tasks
About
KAROKAN-LANG systematically evaluates large language models across all 24 official EU languages using tasks sourced from real professional and administrative settings. It exposes performance gaps hidden by English-only benchmarks and reveals how model quality degrades — or holds — as context shifts to non-English European languages. Tasks are native-language originals, not translations, authored by professional linguists and domain experts who are native speakers.
Methodology
Each of the 24 EU languages is represented by at least 120 native-authored tasks. Tasks span four genres: legislative text comprehension, administrative correspondence drafting, professional document summarization, and cross-lingual knowledge retrieval. Scoring uses a combination of automated metrics (BLEU, BERTScore) and human evaluation for a representative 15% sample per language.
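The scoring described above blends automated metrics with human ratings on a sample. As a minimal sketch of how such a pipeline might aggregate scores — the 50/50 weighting and equal per-language averaging are illustrative assumptions, not published details of the benchmark:

```python
# Hypothetical aggregation sketch for a KAROKAN-LANG-style pipeline:
# blend automated metric scores with human ratings collected on a
# 15% sample, then macro-average across languages.

LANGUAGES = 24
TASKS_PER_LANGUAGE = 120   # 24 x 120 = 2,880 total tasks
HUMAN_SAMPLE_RATE = 0.15   # share of tasks scored by human evaluators
AUTO_WEIGHT = 0.5          # illustrative weighting, not from the benchmark
HUMAN_WEIGHT = 0.5

def language_score(auto_scores, human_scores):
    """Blend the mean automated score (e.g. BLEU/BERTScore, scaled 0-1)
    with the mean human rating for one language."""
    auto_mean = sum(auto_scores) / len(auto_scores)
    human_mean = sum(human_scores) / len(human_scores)
    return AUTO_WEIGHT * auto_mean + HUMAN_WEIGHT * human_mean

def overall_score(per_language_scores):
    """Unweighted macro-average across languages, so every language
    counts equally regardless of speaker population."""
    return sum(per_language_scores) / len(per_language_scores)
```

Macro-averaging is the natural choice here: it prevents strong performance in a few high-resource languages from masking degradation in the rest.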
Cite
@misc{karokan2026lang,
title={KAROKAN-LANG: European Multilingual Evaluation},
author={Karokan Research Team},
year={2026},
url={https://karokan.com/research/karokan-lang}
}

Get involved
Access the dataset, submit a model for evaluation, or collaborate with our research team.
Contact research →

Other benchmarks
The European AI Productivity Index
Assesses whether frontier AI models can perform economically valuable professional tasks in European contexts — EU law, multi-country taxation, industrial standards, and cross-border regulatory analysis.
AI Act Compliance Benchmark
The first benchmark evaluating whether AI systems satisfy EU AI Act requirements: risk classification, documentation, transparency, and human oversight at model and system level.