Quick Start

This guide will help you get started with ContextBench evaluation in minutes.

Basic Usage

The simplest way to evaluate an agent trajectory:

python -m contextbench.evaluate \
    --gold data/full.parquet \
    --pred path/to/trajectory.traj.json \
    --out results.jsonl

This command will:

  1. Load the gold annotations from data/full.parquet (see the sketch after this list)

  2. Parse the trajectory file (auto-detects format)

  3. Clone the target repository (if not cached)

  4. Extract symbols using tree-sitter

  5. Compute all metrics

  6. Save results to results.jsonl
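
If you want to inspect the gold annotations from step 1 yourself, a short pandas snippet is enough. This is only a sketch: the exact column names depend on the data release, so it just prints the schema and the first few rows rather than assuming specific fields.

import pandas as pd

# Peek at the gold annotation file; column names vary by release,
# so print the schema instead of hard-coding field names.
gold = pd.read_parquet("data/full.parquet")
print(gold.columns.tolist())
print(gold.head())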

Example with Trajectory

Evaluate a single trajectory on the verified benchmark:

cd /path/to/contextbench/ContextBench

# Example with MiniSWE-agent trajectory
python -m contextbench.evaluate \
    --gold data/full.parquet \
    --pred traj_verified/psf__requests-1142/psf__requests-1142.traj.json \
    --out result.jsonl

Understanding the Output

The evaluation produces a JSON Lines file with detailed metrics for each instance:

{
  "instance_id": "psf__requests-1142",
  "num_steps": 6,
  "final": {
    "file": {"coverage": 1.0, "precision": 0.5, "f1": 0.667},
    "symbol": {"coverage": 1.0, "precision": 0.125, "f1": 0.222},
    "span": {"coverage": 1.0, "precision": 0.126, "f1": 0.224}
  },
  "trajectory": {
   "steps": [
     {"step": 1, "coverage": {"file": 0.5, "symbol": 0.3, "span": 0.2}},
     {"step": 2, "coverage": {"file": 0.8, "symbol": 0.6, "span": 0.5}}
   ],
    "auc_coverage": {"file": 1.0, "symbol": 1.0, "span": 1.0},
    "redundancy": {"file": 0.5, "symbol": 0.58, "span": 0.12}
    },
  "editloc": {"recall": 0.782, "precision": 1.0, "f1": 0.878}
}
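
Because the output is one JSON object per line, post-processing takes only a few lines of Python. Here is a minimal sketch that summarizes the final F1 scores, using only the field names shown in the record above:

import json

# Each line of the output file is one self-contained JSON record.
with open("results.jsonl") as f:
    results = [json.loads(line) for line in f]

for r in results:
    final = r["final"]
    print(f"{r['instance_id']}: "
          f"file F1={final['file']['f1']:.3f}, "
          f"symbol F1={final['symbol']['f1']:.3f}, "
          f"span F1={final['span']['f1']:.3f}")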

Key Metrics

  • Coverage: What fraction of gold context was retrieved?

  • Precision: What fraction of retrieved context was relevant?

  • F1: Harmonic mean of coverage and precision (see the sketch after this list)

  • AUC-Coverage: Area under the coverage curve (trajectory efficiency)

  • Redundancy: How much context was re-examined?
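
F1 is the standard harmonic mean, so you can sanity-check any record by hand. This small sketch reproduces the file-level numbers from the sample output above:

def f1(coverage: float, precision: float) -> float:
    """Harmonic mean of coverage and precision."""
    if coverage + precision == 0:
        return 0.0
    return 2 * coverage * precision / (coverage + precision)

# File-level sample above: coverage 1.0, precision 0.5 -> f1 0.667
print(round(f1(1.0, 0.5), 3))  # 0.667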

Environment Variables

You can set default paths to avoid repeating arguments:

export CONTEXTBENCH_GOLD=/path/to/full.parquet
export CONTEXTBENCH_CACHE=/path/to/repos_cache
export CONTEXTBENCH_SELECTED_CSV=/path/to/selected_500_instances.csv

Then run the evaluation with fewer arguments:

python -m contextbench.evaluate \
    --pred path/to/trajectory.traj.json \
    --out result.jsonl
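
For context, the usual pattern behind this behavior is an environment-variable fallback on the argument parser. The sketch below is hypothetical, not ContextBench's actual code; it only illustrates how an omitted --gold flag can default to CONTEXTBENCH_GOLD.

import argparse
import os

# Hypothetical sketch of env-var fallback (not ContextBench's implementation):
# when --gold is omitted, fall back to the CONTEXTBENCH_GOLD variable.
parser = argparse.ArgumentParser()
parser.add_argument("--gold", default=os.environ.get("CONTEXTBENCH_GOLD"))
parser.add_argument("--pred", required=True)
parser.add_argument("--out", default="results.jsonl")
args = parser.parse_args()
print(args.gold)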

Next Steps