Evaluation

This page describes how to run ContextBench evaluations and interpret results.

Run a single evaluation

python -m contextbench.evaluate \
    --gold data/full.parquet \
    --pred path/to/trajectory.traj.json \
    --out results.jsonl
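
If you prefer to launch the evaluator from Python, for example inside a sweep script, the subprocess call below is a minimal sketch equivalent to the shell command above; it assumes only the CLI shown here and nothing else about contextbench internals.

import subprocess

# Equivalent to the shell command above, launched from Python.
cmd = [
    "python", "-m", "contextbench.evaluate",
    "--gold", "data/full.parquet",
    "--pred", "path/to/trajectory.traj.json",  # placeholder path from the example above
    "--out", "results.jsonl",
]
subprocess.run(cmd, check=True)  # check=True raises CalledProcessError on a non-zero exit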

Common options

  • --gold: Path to a Parquet or JSONL file containing the gold contexts.

  • --pred: Path to an agent trajectory file.

  • --cache: Repository cache directory (optional).

  • --out: Path where results are written as JSONL (see the sketch after this list).
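
Once a run finishes, the file named by --out contains one JSON record per line. The snippet below is a minimal sketch of loading and summarizing it; the "score" field name is an assumption for illustration, not a documented schema, so inspect one record to find the real keys.

import json

# Load one JSON record per line from the evaluator's output.
with open("results.jsonl") as fh:
    records = [json.loads(line) for line in fh if line.strip()]

# NOTE: "score" is an assumed field name; print(records[0]) to see the
# actual schema your contextbench version emits.
scores = [r["score"] for r in records if "score" in r]
if scores:
    print(f"records: {len(records)}, mean score: {sum(scores) / len(scores):.3f}")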

Next steps