# Evaluation
This page describes how to run ContextBench evaluations and interpret results.
## Run a single evaluation
```bash
python -m contextbench.evaluate \
  --gold data/full.parquet \
  --pred path/to/trajectory.traj.json \
  --out results.jsonl
```
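The command writes its results to the JSONL file given by `--out`. The snippet below is a minimal sketch of how you might inspect that output; it only assumes one JSON object per line and does not depend on any particular field names, since the exact schema may vary by ContextBench version.

```python
import json

# Inspect the evaluation output produced by the command above.
# Assumes results.jsonl contains one JSON object per line; we print
# the keys of the first record rather than guessing at specific fields.
with open("results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))  # which fields this record carries
        break  # remove this to iterate over every record
```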
## Common options
- `--gold`: Parquet/JSONL path containing gold contexts.
- `--pred`: Path to an agent trajectory file.
- `--cache`: Repository cache directory (optional).
- `--out`: Output JSONL path.
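For reference, an invocation combining all of these flags might look like the following; the cache directory shown is a placeholder, so substitute your own paths.

```bash
# Same command as above, plus the optional repository cache.
# The cache path (.cache/repos) is a hypothetical example.
python -m contextbench.evaluate \
  --gold data/full.parquet \
  --pred path/to/trajectory.traj.json \
  --cache .cache/repos \
  --out results.jsonl
```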
## Next steps
- See Metrics for metric definitions.
- See Run Agents on ContextBench for batch runs.