Karokan Research
Frontier evaluation for European AI.
Benchmarks, evaluation environments, and open datasets to advance AI capabilities and accountability across Europe.
1M+
Annotations produced
500+
Expert contributors
6+
Domains covered
3
Published benchmarks
Data, evals, and post-training
Used by leading European AI labs and enterprises.
Frontier data
When model capabilities reach their limits, progress depends on data quality. Karokan mobilizes subject-matter experts with deep domain knowledge across professional and institutional fields to produce specialized data at scale.
Model evaluation
Frontier-grade evaluation unlocks advanced reasoning, regulatory compliance, and reliable behavior. We deliver rigorous, diverse assessments grounded in European professional contexts.
RL environments
We build reinforcement learning environments in structured steps: realistic data-rich worlds capturing professional behavior, tools for agent interaction, and rigorous tasks with verifiers.
Benchmarks
Open evaluation frameworks for European AI systems.
Assesses whether frontier AI models can perform economically valuable professional tasks in European contexts — EU law, multi-country taxation, industrial standards, and cross-border regulatory analysis.
Leaderboard · top 3
Full leaderboard →
The first rigorous benchmark evaluating LLM quality beyond English, across all 24 official EU languages in professional and institutional contexts.
Leaderboard · top 3
Full leaderboard →
The first benchmark evaluating whether AI systems satisfy EU AI Act requirements: risk classification, documentation, transparency, and human oversight at model and system level.
First release scheduled for Q4 2026. Contact the research team to review the benchmark design and pilot evaluation criteria.
Research notes
Recent articles from the Karokan Research team.
Why English-only benchmarks fail European AI
English-first evaluation obscures domain failures that emerge under legal, administrative, and multilingual European contexts.
Feb 25, 2026
Evaluating Mistral Large 3 on EU legal reasoning
A structured review of performance across risk classification, regulatory interpretation, and institutional document analysis.
Jan 30, 2026
Toward the first AI Act compliance benchmark
Design principles for testing traceability, transparency, and human oversight requirements at model and system level.
Collaborate
Work with our research team
Labs, enterprises, and public institutions use Karokan Research to design benchmarks, validate multilingual performance, and define evaluation protocols grounded in European constraints.
Contact research →