Leaderboard

Live Leaderboard

Visit the interactive leaderboard at: https://contextbench.github.io/

The live leaderboard features:

  • Real-time rankings of all evaluated models and agents

  • Interactive filtering by benchmark variant (Verified, Pro, Poly, Multi)

  • Detailed breakdowns by granularity (File, Symbol, Span, EditLoc)

  • Trajectory efficiency metrics (AUC-Coverage, Redundancy)

  • Cost analysis and performance comparisons

Current Rankings

Main Board (Verified Split)

Top-performing systems on the ContextBench Verified benchmark (500 instances):

| Agent | File Cov. ↑ | File Prec. ↑ | Symbol Cov. ↑ | Symbol Prec. ↑ | AUC-Cov ↑ | Redun. ↓ |
|-------|-------------|--------------|---------------|----------------|-----------|----------|
| Prometheus | 0.799 | 0.346 | 0.716 | 0.255 | 0.598 | 0.422 |
| Agentless | 0.656 | 0.398 | 0.357 | 0.393 | 0.056 | 0.000 |
| SWE-Agent | 0.576 | 0.496 | 0.436 | 0.233 | 0.563 | 0.094 |
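
File-level coverage and precision in this table are recall- and precision-style set metrics over the context an agent retrieved versus the gold context implied by the reference patch. The sketch below is only a minimal illustration of metrics of this form, assuming both sides are sets of file paths; it is not ContextBench's official scoring code.

```python
def file_coverage_and_precision(retrieved_files, gold_files):
    """Set-based file-level coverage (recall) and precision.

    retrieved_files: files the agent viewed or edited during its trajectory.
    gold_files:      files touched by the reference (gold) patch.
    Both are assumed to be iterables of repo-relative path strings.
    """
    retrieved, gold = set(retrieved_files), set(gold_files)
    if not retrieved or not gold:
        return 0.0, 0.0
    hits = retrieved & gold
    coverage = len(hits) / len(gold)        # File Cov.: share of gold files the agent found
    precision = len(hits) / len(retrieved)  # File Prec.: share of retrieved files that are gold
    return coverage, precision


# Example: the agent opened three files; the gold patch touches two, one of which it missed.
cov, prec = file_coverage_and_precision(
    ["src/core.py", "src/utils.py", "tests/test_core.py"],
    ["src/core.py", "src/io.py"],
)
print(f"coverage={cov:.3f} precision={prec:.3f}")  # coverage=0.500 precision=0.333
```

Symbol-level coverage and precision would follow the same pattern over qualified symbol names rather than file paths.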

Backbone Model Comparison

Performance of different LLM backbones (using Mini SWE-agent):

| Backbone | Pass@1 ↑ | Context F1 ↑ | Efficiency ↑ | File F1 ↑ | Symbol F1 ↑ | Span F1 ↑ |
|----------|----------|--------------|--------------|-----------|-------------|-----------|
| Claude Sonnet 4.5 | 53.0% | 0.344 | 0.658 | 0.468 | 0.496 | 0.468 |
| GPT-5 | 47.2% | 0.312 | 0.591 | 0.468 | 0.496 | 0.468 |
| Devstral 2 | 40.2% | 0.332 | 0.616 | 0.384 | 0.489 | 0.456 |
| Gemini 2.5 Pro | 36.4% | 0.311 | 0.529 | 0.460 | 0.433 | 0.362 |

Key Findings

The Bitter Lesson of Coding Agents

Sophisticated agent scaffolding yields only marginal gains in context retrieval compared to raw LLM capabilities; the choice of backbone model has a much larger impact than the agent architecture.

Recall vs. Precision Trade-off

LLMs consistently favor recall over precision, retrieving broad context at the cost of including irrelevant code. This suggests a need for better filtering mechanisms.

Explored vs. Utilized Context Gap

Substantial gaps exist between context that agents explore (view) and context they actually utilize (reference in edits). This indicates inefficient context usage.

Submitting Results

To submit your agent’s results to the leaderboard:

  1. Run evaluation on all benchmark splits

  2. Generate results in the standard JSON Lines format

  3. Submit via GitHub by opening a pull request with your results

  4. Include metadata: Model name, version, cost, and configuration

See the submission guidelines for details.
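
The authoritative schema lives in the submission guidelines; the snippet below merely sketches what one JSON Lines record with the requested metadata might look like. Every field name and value here is a hypothetical illustration, not the required format.

```python
import json

# Hypothetical result record; consult the submission guidelines for the actual required fields.
record = {
    "instance_id": "astropy__astropy-12907",     # SWE-bench-style instance id (illustrative)
    "model_name": "my-coding-agent",
    "model_version": "2025.06",
    "retrieved_files": ["astropy/modeling/separable.py"],
    "retrieved_symbols": ["astropy/modeling/separable.py::_cstack"],
    "patch": "diff --git a/astropy/modeling/separable.py b/astropy/modeling/separable.py\n...",
    "cost_usd": 0.42,
    "config": {"max_steps": 50, "temperature": 0.0},
}

# JSON Lines format: one JSON object per line, one line per benchmark instance.
with open("results.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```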

Evaluation Criteria

Submissions are ranked by:

  1. Primary metric: Pass@1 (task success rate)

  2. Context F1: Balanced file/symbol/span F1 score

  3. Efficiency: AUC-Coverage (how quickly relevant context is found)

  4. Cost: Average inference cost per instance
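
As a rough sketch of how the two context-quality metrics above could be computed (the macro-averaging for Context F1 and the step-wise averaging for AUC-Coverage are assumptions made for illustration, not the benchmark's published definitions):

```python
def f1(precision, recall):
    """Standard F1 from a precision/recall pair."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


def context_f1(file_pr, symbol_pr, span_pr):
    """Balanced Context F1, taken here as the macro average of the
    file-, symbol-, and span-level F1 scores (assumed aggregation)."""
    return sum(f1(p, r) for p, r in (file_pr, symbol_pr, span_pr)) / 3


def auc_coverage(coverage_per_step):
    """AUC-Coverage: area under the cumulative-coverage-vs.-step curve,
    normalized to [0, 1]; higher means relevant context was found earlier.

    coverage_per_step: cumulative gold-context coverage after each agent step,
    e.g. [0.0, 0.5, 0.5, 1.0].
    """
    if not coverage_per_step:
        return 0.0
    return sum(coverage_per_step) / len(coverage_per_step)


print(round(context_f1((0.35, 0.80), (0.26, 0.72), (0.20, 0.60)), 3))  # ~0.39
print(auc_coverage([0.0, 0.5, 0.5, 1.0]))                              # 0.5
```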

Benchmark Variants

Verified

  • Size: 500 instances

  • Source: SWE-bench Verified

  • Difficulty: Moderate

  • Languages: Primarily Python

Pro

  • Size: 2,294 instances

  • Source: SWE-bench Pro

  • Difficulty: High

  • Languages: Python, Java, JavaScript

Poly

  • Size: 640 instances

  • Source: SWE-PolyBench

  • Difficulty: Moderate-High

  • Languages: Python, Java, JavaScript, TypeScript, Go, Rust, C, C++

Multi

  • Size: 1,000+ instances

  • Source: Multi-SWE-bench

  • Difficulty: High

  • Languages: Multiple

Next Steps