Karokan Research
Frontier evaluation for European AI.
Benchmarks, evaluation environments, and open datasets to advance AI capabilities and accountability across Europe.
1M+
Annotations produced
500+
Expert contributors
6+
Domains covered
3
Published benchmarks
Data, evals, and post-training
Used by leading European AI labs and enterprises.
Frontier data
When model capabilities reach their limits, progress depends on data quality. Karokan mobilizes subject-matter experts with deep domain knowledge across professional and institutional fields to produce specialized data at scale.
Model evaluation
Frontier-grade evaluation unlocks advanced reasoning, regulatory compliance, and reliable behavior. We deliver rigorous, diverse assessments grounded in European professional contexts.
RL environments
We build reinforcement learning environments in structured steps: realistic data-rich worlds capturing professional behavior, tools for agent interaction, and rigorous tasks with verifiers.
Benchmarks
Open evaluation frameworks for European AI systems.
Assesses whether frontier AI models can perform economically valuable professional tasks in European contexts — EU law, multi-country taxation, industrial standards, and cross-border regulatory analysis.
Leaderboard · top 3
Full leaderboard →
The first rigorous benchmark evaluating LLM quality beyond English, across all 24 official EU languages in professional and institutional contexts.
Leaderboard · top 3
Full leaderboard →
The first benchmark evaluating whether AI systems satisfy EU AI Act requirements: risk classification, documentation, transparency, and human oversight at model and system level.
First release scheduled for Q4 2026. Contact the research team to review the benchmark design and pilot evaluation criteria.
Research notes
Recent articles from the Karokan Research team.
Why English-only benchmarks fail European AI
English-first evaluation obscures domain failures that emerge under legal, administrative, and multilingual European contexts.
Feb 25, 2026
Evaluating Mistral Large 3 on EU legal reasoning
A structured review of performance across risk classification, regulatory interpretation, and institutional document analysis.
Jan 30, 2026
Toward the first AI Act compliance benchmark
Design principles for testing traceability, transparency, and human oversight requirements at model and system level.
Collaborate
Work with our research team
Labs, enterprises, and public institutions use Karokan Research to design benchmarks, validate multilingual performance, and define evaluation protocols grounded in European constraints.
Contact research →