Quick Start
This guide will help you get started with ContextBench evaluation in minutes.
Basic Usage
The simplest way to evaluate an agent trajectory:
python -m contextbench.evaluate \
--gold data/full.parquet \
--pred path/to/trajectory.traj.json \
--out results.jsonl
This command will:
Load the gold annotations from data/full.parquet
Parse the trajectory file (auto-detects format)
Clone the target repository (if not cached)
Extract symbols using tree-sitter
Compute all metrics
Save results to results.jsonl
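If you prefer to drive this from a script rather than the shell, the same command can be launched with Python's standard subprocess module. This is a minimal sketch, not part of the ContextBench API; the paths are placeholders and the flags are exactly those shown above.
import subprocess
import sys

# Invoke the ContextBench evaluator with the same flags as the shell command above.
# The gold and trajectory paths are placeholders; replace them with your own.
subprocess.run(
    [
        sys.executable, "-m", "contextbench.evaluate",
        "--gold", "data/full.parquet",
        "--pred", "path/to/trajectory.traj.json",
        "--out", "results.jsonl",
    ],
    check=True,
)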
Example with Trajectory
Evaluate a single trajectory on the verified benchmark:
cd /path/to/contextbench/ContextBench
# Example with MiniSWE-agent trajectory
python -m contextbench.evaluate \
--gold data/full.parquet \
--pred traj_verified/psf__requests-1142/psf__requests-1142.traj.json \
--out result.jsonl
Understanding the Output
The evaluation produces a JSON Lines file with detailed metrics for each instance:
{
"instance_id": "psf__requests-1142",
"num_steps": 6,
"final": {
"file": {"coverage": 1.0, "precision": 0.5, "f1": 0.667},
"symbol": {"coverage": 1.0, "precision": 0.125, "f1": 0.222},
"span": {"coverage": 1.0, "precision": 0.126, "f1": 0.224}
},
"trajectory": {
"steps": [
{"step": 1, "coverage": {"file": 0.5, "symbol": 0.3, "span": 0.2}},
{"step": 2, "coverage": {"file": 0.8, "symbol": 0.6, "span": 0.5}}
],
"auc_coverage": {"file": 1.0, "symbol": 1.0, "span": 1.0},
"redundancy": {"file": 0.5, "symbol": 0.58, "span": 0.12}
},
"editloc": {"recall": 0.782, "precision": 1.0, "f1": 0.878}
}
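Because the output is JSON Lines, each line is an independent JSON record and can be post-processed with a few lines of Python. The sketch below assumes the schema shown above and simply prints the final F1 scores per instance.
import json

# Print the final file/symbol/span F1 for every evaluated instance.
# Field names follow the example record above.
with open("results.jsonl") as fh:
    for line in fh:
        record = json.loads(line)
        final = record["final"]
        print(
            record["instance_id"],
            f"file={final['file']['f1']:.3f}",
            f"symbol={final['symbol']['f1']:.3f}",
            f"span={final['span']['f1']:.3f}",
        )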
Key Metrics
Coverage: What fraction of gold context was retrieved?
Precision: What fraction of retrieved context was relevant?
F1: Harmonic mean of coverage and precision (see the worked example after this list)
AUC-Coverage: Area under the coverage curve (trajectory efficiency)
Redundancy: How much context was re-examined?
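As a quick sanity check, F1 is the harmonic mean of coverage and precision, so the values in the example output can be reproduced by hand:
# F1 as the harmonic mean of coverage and precision.
def f1(coverage: float, precision: float) -> float:
    if coverage + precision == 0:
        return 0.0
    return 2 * coverage * precision / (coverage + precision)

# File-level numbers from the example output: coverage 1.0, precision 0.5.
print(round(f1(1.0, 0.5), 3))    # 0.667
# Symbol-level numbers: coverage 1.0, precision 0.125.
print(round(f1(1.0, 0.125), 3))  # 0.222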
Environment Variables
You can set default paths to avoid repeating arguments:
export CONTEXTBENCH_GOLD=/path/to/full.parquet
export CONTEXTBENCH_CACHE=/path/to/repos_cache
export CONTEXTBENCH_SELECTED_CSV=/path/to/selected_500_instances.csv
Then run the evaluation with fewer arguments:
python -m contextbench.evaluate \
--pred path/to/trajectory.traj.json \
--out result.jsonl
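With the environment variables exported, batch evaluation (described further in Run Agents on ContextBench) reduces to looping over trajectory files. The sketch below is illustrative only: the traj_verified layout and the per-instance output locations are assumptions based on the earlier example, not requirements of the CLI.
import pathlib
import subprocess
import sys

# Evaluate every trajectory under traj_verified/, writing one result file per instance.
# Environment variables such as CONTEXTBENCH_GOLD are inherited by the subprocess,
# so only --pred and --out need to be passed.
for traj in sorted(pathlib.Path("traj_verified").glob("*/*.traj.json")):
    out_path = traj.parent / "result.jsonl"
    subprocess.run(
        [
            sys.executable, "-m", "contextbench.evaluate",
            "--pred", str(traj),
            "--out", str(out_path),
        ],
        check=True,
    )
    print("evaluated", traj.parent.name)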
Next Steps
Read about Metrics to understand the evaluation framework
Learn how to Run Agents on ContextBench for batch evaluation
Explore Agent Trajectory Extractors to see supported agent frameworks