Leaderboard

Live Leaderboard

Visit the interactive leaderboard at: https://contextbench.github.io/

The live leaderboard features:

  • Real-time rankings of all evaluated models and agents

  • Interactive filtering by benchmark variant (Verified, Pro, Poly, Multi)

  • Detailed breakdowns by granularity (File, Symbol, Span, EditLoc)

  • Trajectory efficiency metrics (AUC-Coverage, Redundancy)

  • Cost analysis and performance comparisons

Current Rankings

Main Board (Verified Split)

Top-performing systems on the ContextBench Verified benchmark (500 instances):

| Agent | File Cov. ↑ | File Prec. ↑ | Symbol Cov. ↑ | Symbol Prec. ↑ | AUC-Cov ↑ | Redun. ↓ |
|-------|-------------|--------------|---------------|----------------|-----------|----------|
| Prometheus | 0.799 | 0.346 | 0.716 | 0.255 | 0.598 | 0.422 |
| Agentless | 0.656 | 0.398 | 0.357 | 0.393 | 0.056 | 0.000 |
| SWE-Agent | 0.576 | 0.496 | 0.436 | 0.233 | 0.563 | 0.094 |
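
File-level coverage and precision in this table are recall- and precision-style set metrics over the context an agent retrieved versus the gold context implied by the reference patch. The sketch below is only a minimal illustration of metrics of this form, assuming both sides are sets of file paths; it is not ContextBench's official scoring code.

```python
def file_coverage_and_precision(retrieved_files, gold_files):
    """Set-based file-level coverage (recall) and precision.

    retrieved_files: files the agent viewed or edited during its trajectory.
    gold_files:      files touched by the reference (gold) patch.
    Both are assumed to be iterables of repo-relative path strings.
    """
    retrieved, gold = set(retrieved_files), set(gold_files)
    if not retrieved or not gold:
        return 0.0, 0.0
    hits = retrieved & gold
    coverage = len(hits) / len(gold)        # File Cov.: share of gold files the agent found
    precision = len(hits) / len(retrieved)  # File Prec.: share of retrieved files that are gold
    return coverage, precision


# Example: the agent opened three files; the gold patch touches two, one of which it missed.
cov, prec = file_coverage_and_precision(
    ["src/core.py", "src/utils.py", "tests/test_core.py"],
    ["src/core.py", "src/io.py"],
)
print(f"coverage={cov:.3f} precision={prec:.3f}")  # coverage=0.500 precision=0.333
```

Symbol-level coverage and precision would follow the same pattern over qualified symbol names rather than file paths.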

Backbone Model Comparison

Performance of different LLM backbones (using Mini SWE-agent):

| Backbone | Pass@1 ↑ | Context F1 ↑ | Efficiency ↑ | File F1 ↑ | Symbol F1 ↑ | Span F1 ↑ |
|----------|----------|--------------|--------------|-----------|-------------|-----------|
| Claude Sonnet 4.5 | 53.0% | 0.344 | 0.658 | 0.468 | 0.496 | 0.468 |
| GPT-5 | 47.2% | 0.312 | 0.591 | 0.468 | 0.496 | 0.468 |
| Devstral 2 | 40.2% | 0.332 | 0.616 | 0.384 | 0.489 | 0.456 |
| Gemini 2.5 Pro | 36.4% | 0.311 | 0.529 | 0.460 | 0.433 | 0.362 |

Key Findings

The Bitter Lesson of Coding Agents

Sophisticated agent scaffolding yields only marginal gains in context retrieval compared to raw LLM capabilities; the choice of backbone model has a much larger impact than the agent architecture.

Recall vs. Precision Trade-off

LLMs consistently favor recall over precision, retrieving broad context at the cost of including irrelevant code. This suggests a need for better filtering mechanisms.

Explored vs. Utilized Context Gap

Substantial gaps exist between context that agents explore (view) and context they actually utilize (reference in edits). This indicates inefficient context usage.

Submitting Results

To submit your agent’s results to the leaderboard:

  1. Run evaluation on all benchmark splits

  2. Generate results in the standard JSON Lines format

  3. Submit via GitHub by opening a pull request with your results

  4. Include metadata: Model name, version, cost, and configuration

See the submission guidelines for details.
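
The authoritative schema lives in the submission guidelines; the snippet below merely sketches what one JSON Lines record with the requested metadata might look like. Every field name and value here is a hypothetical illustration, not the required format.

```python
import json

# Hypothetical result record; consult the submission guidelines for the actual required fields.
record = {
    "instance_id": "astropy__astropy-12907",     # SWE-bench-style instance id (illustrative)
    "model_name": "my-coding-agent",
    "model_version": "2025.06",
    "retrieved_files": ["astropy/modeling/separable.py"],
    "retrieved_symbols": ["astropy/modeling/separable.py::_cstack"],
    "patch": "diff --git a/astropy/modeling/separable.py b/astropy/modeling/separable.py\n...",
    "cost_usd": 0.42,
    "config": {"max_steps": 50, "temperature": 0.0},
}

# JSON Lines format: one JSON object per line, one line per benchmark instance.
with open("results.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```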

Evaluation Criteria

Submissions are ranked by:

  1. Primary metric: Pass@1 (task success rate)

  2. Context F1: Balanced file/symbol/span F1 score

  3. Efficiency: AUC-Coverage (how quickly relevant context is found)

  4. Cost: Average inference cost per instance
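
As a rough sketch of how the two context-quality metrics above could be computed (the macro-averaging for Context F1 and the step-wise averaging for AUC-Coverage are assumptions made for illustration, not the benchmark's published definitions):

```python
def f1(precision, recall):
    """Standard F1 from a precision/recall pair."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


def context_f1(file_pr, symbol_pr, span_pr):
    """Balanced Context F1, taken here as the macro average of the
    file-, symbol-, and span-level F1 scores (assumed aggregation)."""
    return sum(f1(p, r) for p, r in (file_pr, symbol_pr, span_pr)) / 3


def auc_coverage(coverage_per_step):
    """AUC-Coverage: area under the cumulative-coverage-vs.-step curve,
    normalized to [0, 1]; higher means relevant context was found earlier.

    coverage_per_step: cumulative gold-context coverage after each agent step,
    e.g. [0.0, 0.5, 0.5, 1.0].
    """
    if not coverage_per_step:
        return 0.0
    return sum(coverage_per_step) / len(coverage_per_step)


print(round(context_f1((0.35, 0.80), (0.26, 0.72), (0.20, 0.60)), 3))  # ~0.39
print(auc_coverage([0.0, 0.5, 0.5, 1.0]))                              # 0.5
```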

Benchmark Variants

Verified

  • Size: 500 instances

  • Source: SWE-bench Verified

  • Difficulty: Moderate

  • Languages: Primarily Python

Pro

  • Size: 2,294 instances

  • Source: SWE-bench Pro

  • Difficulty: High

  • Languages: Python, Java, JavaScript

Poly

  • Size: 640 instances

  • Source: SWE-PolyBench

  • Difficulty: Moderate-High

  • Languages: Python, Java, JavaScript, TypeScript, Go, Rust, C, C++

Multi

  • Size: 1,000+ instances

  • Source: Multi-SWE-bench

  • Difficulty: High

  • Languages: Multiple

Next Steps