Leaderboard
===========

Live Leaderboard
----------------

Visit the **interactive leaderboard** at: https://contextbench.github.io/

The live leaderboard features:

- Real-time rankings of all evaluated models and agents
- Interactive filtering by benchmark variant (Verified, Pro, Poly, Multi)
- Detailed breakdowns by granularity (File, Symbol, Span, EditLoc)
- Trajectory efficiency metrics (AUC-Coverage, Redundancy)
- Cost analysis and performance comparisons

Current Rankings
----------------

Main Board (Verified Split)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Top-performing systems on the ContextBench Verified benchmark (500 instances):

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 15 15 15 15

   * - Agent
     - File Cov. ↑
     - File Prec. ↑
     - Symbol Cov. ↑
     - Symbol Prec. ↑
     - AUC-Cov ↑
     - Redun. ↓
   * - Prometheus
     - 0.799
     - 0.346
     - 0.716
     - 0.255
     - 0.598
     - 0.422
   * - Agentless
     - 0.656
     - 0.398
     - 0.357
     - 0.393
     - 0.056
     - 0.000
   * - SWE-Agent
     - 0.576
     - 0.496
     - 0.436
     - 0.233
     - 0.563
     - 0.094

Backbone Model Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~

Performance of different LLM backbones (using Mini SWE-agent):

.. list-table::
   :header-rows: 1
   :widths: 25 15 15 15 15 15 15

   * - Backbone
     - Pass@1 ↑
     - Context F1 ↑
     - Efficiency ↑
     - File F1 ↑
     - Symbol F1 ↑
     - Span F1 ↑
   * - Claude Sonnet 4.5
     - 53.0%
     - 0.344
     - 0.658
     - 0.468
     - 0.496
     - 0.468
   * - GPT-5
     - 47.2%
     - 0.312
     - 0.591
     - 0.468
     - 0.496
     - 0.468
   * - Devstral 2
     - 40.2%
     - 0.332
     - 0.616
     - 0.384
     - 0.489
     - 0.456
   * - Gemini 2.5 Pro
     - 36.4%
     - 0.311
     - 0.529
     - 0.460
     - 0.433
     - 0.362

Key Findings
------------

The Bitter Lesson of Coding Agents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sophisticated agent scaffolding yields only **marginal gains** in context
retrieval compared to raw LLM capabilities. The choice of backbone model has a
far larger impact on performance than the agent architecture.

Recall vs. Precision Trade-off
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

LLMs consistently favor **recall over precision**, retrieving broad context at
the cost of including irrelevant code. This suggests a need for better
filtering mechanisms.

Explored vs. Utilized Context Gap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A substantial gap exists between the context that agents **explore** (view) and
the context they actually **utilize** (reference in their edits), indicating
inefficient context usage.

Submitting Results
------------------

To submit your agent's results to the leaderboard:

1. **Run evaluation** on all benchmark splits
2. **Generate results** in the standard JSON Lines format (see the sketch below)
3. **Submit via GitHub** by opening a pull request with your results
4. **Include metadata**: model name, version, cost, and configuration

See the `submission guidelines`_ for details.
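The sketch below illustrates what a JSON Lines results file (step 2) might look
like and how to write one. The exact schema is defined in the submission
guidelines; the field names used here (``instance_id``, ``model_name_or_path``,
``model_patch``, ``retrieved_files``) and the file name ``results.jsonl`` are
assumptions made purely for illustration.

.. code-block:: python

   import json

   # Hypothetical per-instance records. The required fields are specified in
   # the submission guidelines; the ones below are illustrative assumptions.
   results = [
       {
           "instance_id": "example__repo-1234",      # benchmark instance id (made up)
           "model_name_or_path": "my-agent-v1",      # agent/model identifier
           "model_patch": "diff --git a/pkg/mod.py b/pkg/mod.py\n...",
           "retrieved_files": ["pkg/mod.py"],        # context the agent used
       },
   ]

   # JSON Lines format: one JSON object per line, no enclosing array.
   with open("results.jsonl", "w", encoding="utf-8") as fh:
       for record in results:
           fh.write(json.dumps(record) + "\n")

Because each line is an independent JSON object, the evaluation harness can
stream large result sets without loading the whole file into memory.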
Evaluation Criteria
-------------------

Submissions are ranked by:

1. **Primary metric**: Pass@1 (task success rate)
2. **Context F1**: Balanced file/symbol/span F1 score
3. **Efficiency**: AUC-Coverage (how quickly relevant context is found)
4. **Cost**: Average inference cost per instance

Benchmark Variants
------------------

Verified
~~~~~~~~

- **Size**: 500 instances
- **Source**: SWE-bench Verified
- **Difficulty**: Moderate
- **Languages**: Primarily Python

Pro
~~~

- **Size**: 2,294 instances
- **Source**: SWE-bench Pro
- **Difficulty**: High
- **Languages**: Python, Java, JavaScript

Poly
~~~~

- **Size**: 640 instances
- **Source**: SWE-PolyBench
- **Difficulty**: Moderate-High
- **Languages**: Python, Java, JavaScript, TypeScript, Go, Rust, C, C++

Multi
~~~~~

- **Size**: 1,000+ instances
- **Source**: Multi-SWE-bench
- **Difficulty**: High
- **Languages**: Multiple

Next Steps
----------

- Visit the **live leaderboard**: https://contextbench.github.io/
- Learn how to :doc:`run_agent_on_contextbench`
- Understand the :doc:`metrics` used for ranking
- See :doc:`citation` to reference ContextBench in your work
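As a closing illustration of the **Efficiency** criterion listed under
Evaluation Criteria above, the sketch below computes an AUC-Coverage-style
score from a trajectory's per-step coverage values. The function name
``auc_coverage``, the trapezoidal rule, and the normalization of the step axis
to ``[0, 1]`` are assumptions made for this example; the authoritative
definition of the metric is given on the :doc:`metrics` page.

.. code-block:: python

   def auc_coverage(coverage_per_step: list[float]) -> float:
       """Illustrative AUC-Coverage: area under the coverage curve with the
       step axis normalized to [0, 1], so trajectories of different lengths
       are comparable. The official metric may normalize differently.
       """
       n = len(coverage_per_step)
       if n == 0:
           return 0.0
       if n == 1:
           return coverage_per_step[0]
       width = 1.0 / (n - 1)  # spacing on the normalized step axis
       area = 0.0
       for left, right in zip(coverage_per_step, coverage_per_step[1:]):
           area += (left + right) / 2.0 * width  # trapezoidal rule
       return area

   # An agent that surfaces relevant context early scores higher than one
   # that only reaches the same final coverage at its last step.
   print(auc_coverage([0.6, 0.8, 0.9, 0.9, 0.9]))  # ~0.84 (early discovery)
   print(auc_coverage([0.0, 0.1, 0.2, 0.5, 0.9]))  # ~0.31 (late discovery)

Higher values mean the relevant context was recovered earlier in the
trajectory, which is what the Efficiency ranking rewards.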