Evaluation Pipeline

Figure: ContextBench pipeline (the full pipeline diagram is available as a PDF download).


Overview

ContextBench evaluates agents through a systematic pipeline that compares predicted context (from agent trajectories) against gold annotations (human-verified context) at multiple granularities.

Pipeline Steps

1. Trajectory Extraction

Extract file views and spans from agent trajectories:

  • Parse trajectory format (.traj.json, .checkpoints.jsonl, etc.)

  • Identify file access commands (cat, view, grep, etc.)

  • Extract line ranges and byte spans for each viewed file

  • Track per-step context evolution
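
A minimal sketch of this step is shown below. The trajectory schema (a top-level "steps" list whose entries carry an "action" string) and the command pattern are illustrative assumptions, not the exact schema of any particular agent format; the real parsing is handled by the dedicated extractors.

```python
import json
import re
from pathlib import Path

# Hypothetical command pattern; restricting to .py paths and a "start-end"
# suffix is a simplification for illustration only.
VIEW_CMD = re.compile(r"\b(?:cat|view|grep)\b.*?(\S+\.py)(?::(\d+)-(\d+))?")

def extract_file_views(traj_path):
    """Return {file_path: [(start_line, end_line), ...]} for each viewed file."""
    steps = json.loads(Path(traj_path).read_text()).get("steps", [])
    views = {}
    for step in steps:
        match = VIEW_CMD.search(step.get("action", ""))
        if not match:
            continue
        path, start, end = match.group(1), match.group(2), match.group(3)
        # Default to "whole file" when no explicit line range is given.
        span = (int(start), int(end)) if start and end else (1, None)
        views.setdefault(path, []).append(span)
    return views
```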

2. Repository Checkout

Clone and checkout the target repository:

  • Use cached clones when available

  • Checkout the specific commit/version for the task

  • Verify repository state matches gold annotations
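
A hedged sketch of the checkout step using plain git via subprocess; the cache directory and its layout are assumptions for illustration, not ContextBench's actual cache.

```python
import subprocess
from pathlib import Path

def checkout_repo(repo_url, commit, cache_dir="~/.cache/contextbench/repos"):
    """Clone (or reuse a cached clone of) repo_url and check out `commit`."""
    # Cache location and naming scheme are assumptions for this sketch.
    dest = Path(cache_dir).expanduser() / Path(repo_url).stem
    if not dest.exists():
        subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
    # Detached checkout of the task's pinned commit.
    subprocess.run(["git", "-C", str(dest), "checkout", "--force", commit],
                   check=True)
    return dest
```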

3. Symbol Extraction

Use tree-sitter to extract code symbols:

  • Parse source files to build abstract syntax trees (ASTs)

  • Identify definitions: classes, functions, methods

  • Map byte spans to symbol boundaries

  • Handle 8 programming languages: Python, Java, JavaScript, TypeScript, Go, Rust, C, C++
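
The sketch below shows the core idea for Python using the py-tree-sitter bindings. The exact binding API differs across tree-sitter versions, and the node type names are grammar-specific (other grammars use names such as method_declaration or function_item), so treat this as an illustration rather than the benchmark's actual extractor.

```python
import tree_sitter_python           # grammar package; one exists per language
from tree_sitter import Language, Parser

# Node types specific to the Python grammar.
DEF_TYPES = {"function_definition", "class_definition"}

def extract_symbols(source: bytes):
    """Yield (symbol_kind, start_byte, end_byte) for each definition node."""
    parser = Parser(Language(tree_sitter_python.language()))
    tree = parser.parse(source)
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type in DEF_TYPES:
            yield node.type, node.start_byte, node.end_byte
        stack.extend(node.children)

src = b"class Foo:\n    def bar(self):\n        return 42\n"
print(list(extract_symbols(src)))
```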

4. Gold Context Loading

Load human-annotated gold context:

  • File-level: Set of relevant file paths

  • Symbol-level: Set of relevant class/function definitions

  • Span-level: Byte intervals of relevant code regions

  • EditLoc-level: Locations where edits were made
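
A hypothetical in-memory shape for these annotations, mirroring the four granularities above; the actual on-disk schema used by ContextBench may differ.

```python
from dataclasses import dataclass, field

@dataclass
class GoldContext:
    # Illustrative container only, one field per granularity.
    files: set[str] = field(default_factory=set)                  # file paths
    symbols: set[tuple[str, str]] = field(default_factory=set)    # (file, symbol_name)
    spans: dict[str, list[tuple[int, int]]] = field(default_factory=dict)  # byte intervals
    edit_lines: dict[str, set[int]] = field(default_factory=dict)          # edited lines

gold = GoldContext(
    files={"src/app.py"},
    symbols={("src/app.py", "handle_request")},
    spans={"src/app.py": [(120, 480)]},
    edit_lines={"src/app.py": {17, 18}},
)
```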

5. Metric Computation

Compute metrics at multiple granularities:

Set-based metrics (File, Symbol):

\[
\text{Coverage} = \frac{|\text{Gold} \cap \text{Pred}|}{|\text{Gold}|}, \qquad
\text{Precision} = \frac{|\text{Gold} \cap \text{Pred}|}{|\text{Pred}|}, \qquad
\text{F1} = 2 \cdot \frac{\text{Coverage} \cdot \text{Precision}}{\text{Coverage} + \text{Precision}}
\]
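
A direct translation of these formulas into code (the handling of empty sets is an assumption):

```python
def set_metrics(gold: set, pred: set) -> dict:
    """Coverage (recall), precision, and F1 over two sets of items."""
    hit = len(gold & pred)
    coverage = hit / len(gold) if gold else 0.0
    precision = hit / len(pred) if pred else 0.0
    f1 = (2 * coverage * precision / (coverage + precision)
          if coverage + precision else 0.0)
    return {"coverage": coverage, "precision": precision, "f1": f1}

print(set_metrics({"a.py", "b.py"}, {"a.py", "c.py"}))
```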

Interval-based metrics (Span):

\[
\text{Coverage} = \frac{\text{bytes}(\text{Gold} \cap \text{Pred})}{\text{bytes}(\text{Gold})}, \qquad
\text{Precision} = \frac{\text{bytes}(\text{Gold} \cap \text{Pred})}{\text{bytes}(\text{Pred})}
\]
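
The same computation over byte intervals; this sketch assumes each side has already been merged into non-overlapping (start, end) half-open intervals.

```python
def _total(intervals):
    """Total bytes covered by a list of disjoint (start, end) intervals."""
    return sum(end - start for start, end in intervals)

def _intersection(gold, pred):
    """Bytes covered by both interval lists (each assumed disjoint)."""
    return sum(max(0, min(ge, pe) - max(gs, ps))
               for gs, ge in gold for ps, pe in pred)

def span_metrics(gold, pred):
    inter = _intersection(gold, pred)
    return {
        "coverage": inter / _total(gold) if gold else 0.0,
        "precision": inter / _total(pred) if pred else 0.0,
    }

print(span_metrics([(0, 100)], [(50, 150)]))   # coverage 0.5, precision 0.5
```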

Trajectory metrics:

  • AUC-Coverage: Area under the per-step coverage curve (measures efficiency)

  • Redundancy: Fraction of context re-examined across steps
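
One plausible implementation of these two definitions; the exact normalization ContextBench uses is not spelled out here, so treat both functions as assumptions.

```python
def auc_coverage(per_step_coverage):
    """Area under the per-step coverage curve, step axis normalized to [0, 1].

    Implemented here as the mean coverage across steps, so reaching full
    coverage early scores higher than reaching it only at the last step.
    """
    if not per_step_coverage:
        return 0.0
    return sum(per_step_coverage) / len(per_step_coverage)

def redundancy(per_step_items):
    """Fraction of viewed items already viewed in an earlier step."""
    seen, repeated, total = set(), 0, 0
    for items in per_step_items:
        for item in items:
            total += 1
            if item in seen:
                repeated += 1
            seen.add(item)
    return repeated / total if total else 0.0

print(auc_coverage([0.2, 0.6, 1.0]))               # 0.6
print(redundancy([{"a.py"}, {"a.py", "b.py"}]))    # 1 repeat out of 3 views
```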

6. Result Aggregation

Output comprehensive evaluation:

  • Per-instance metrics (JSON Lines format)

  • Aggregated statistics (macro/micro averages)

  • Trajectory visualization data
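
Macro averaging, for instance, is the mean of the per-instance scores read back from the JSON Lines output (micro averaging would pool counts before dividing). The field name below is illustrative only.

```python
import json

def macro_average(jsonl_path, metric="file_f1"):
    """Mean of a per-instance metric across all records in a JSONL file."""
    # "file_f1" is a hypothetical field name used for illustration.
    with open(jsonl_path) as fh:
        scores = [json.loads(line)[metric] for line in fh if line.strip()]
    return sum(scores) / len(scores) if scores else 0.0
```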

Granularity Levels

File-Level

Evaluates which files were retrieved:

  • Gold: Set of file paths that contain relevant context

  • Pred: Set of files the agent viewed

  • Use case: Coarse-grained context retrieval

Symbol-Level

Evaluates which code symbols (functions, classes) were identified:

  • Gold: Set of (file, symbol_name) tuples with relevant definitions

  • Pred: Symbols covered by viewed spans

  • Use case: Fine-grained context understanding
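
A sketch of how predicted symbols can be derived from viewed spans. Treating any overlap between a viewed span and a definition as coverage is an assumption; a stricter containment rule may be used in practice.

```python
def covered_symbols(symbols, viewed_spans):
    """Map viewed byte spans onto symbol definitions.

    `symbols` is [(name, start_byte, end_byte)], e.g. from the tree-sitter
    step; a symbol counts as predicted if any viewed span overlaps it.
    """
    covered = set()
    for name, s_start, s_end in symbols:
        for v_start, v_end in viewed_spans:
            if max(s_start, v_start) < min(s_end, v_end):
                covered.add(name)
                break
    return covered

print(covered_symbols([("handle_request", 120, 480)], [(200, 260)]))
```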

Span-Level

Evaluates exact byte ranges of viewed code:

  • Gold: Union of byte intervals containing relevant code

  • Pred: Union of byte intervals the agent examined

  • Use case: Most precise retrieval measurement

EditLoc-Level

Evaluates edit location prediction:

  • Gold: Lines where the ground-truth patch made changes

  • Pred: Lines the agent identified for editing

  • Use case: Measures localization accuracy

Supported Formats

ContextBench automatically detects and parses multiple trajectory formats:

  • MiniSWE-agent: .traj.json files

  • SWE-agent: .checkpoints.jsonl files

  • Agentless: Custom JSON format

  • OpenHands: Trajectory logs

  • Prometheus: Agent-specific format

See Agent Trajectory Extractors for details on each format.
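
A minimal illustration of suffix-based detection; the mapping and the fallback name are hypothetical, and the real registry lives in the extractor modules referenced above.

```python
from pathlib import Path

# Illustrative mapping only; format names here do not mirror the real registry.
EXTRACTOR_BY_SUFFIX = {
    ".traj.json": "mini-swe-agent",
    ".checkpoints.jsonl": "swe-agent",
}

def detect_format(path):
    """Pick an extractor from the trajectory file name, else fall back."""
    name = Path(path).name
    for suffix, extractor in EXTRACTOR_BY_SUFFIX.items():
        if name.endswith(suffix):
            return extractor
    return "generic-json"   # hypothetical fallback for custom formats
```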

Next Steps