Evaluation
==========

This page describes how to run ContextBench evaluations and interpret the results.

Run a single evaluation
-----------------------

.. code-block:: bash

   python -m contextbench.evaluate \
       --gold data/full.parquet \
       --pred path/to/trajectory.traj.json \
       --out results.jsonl

Common options
--------------

- ``--gold``: Parquet/JSONL path containing the gold contexts.
- ``--pred``: Path to an agent trajectory file.
- ``--cache``: Repository cache directory (optional).
- ``--out``: Output JSONL path (a sketch for loading this file appears at the end of this page).

Next steps
----------

- See :doc:`metrics` for metric definitions.
- See :doc:`run_agent_on_contextbench` for batch runs.
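Inspect the output
------------------

Each line of the file written via ``--out`` is a standalone JSON object. The sketch below is a minimal way to load it with the Python standard library; the ``instance_id`` and ``score`` keys are assumptions for illustration, not a documented schema, so check the keys in your own ``results.jsonl``.

.. code-block:: python

   import json

   def load_results(path: str) -> list[dict]:
       """Read one JSON object per line from the evaluation output."""
       records = []
       with open(path, encoding="utf-8") as fh:
           for line in fh:
               line = line.strip()
               if line:  # skip blank lines
                   records.append(json.loads(line))
       return records

   records = load_results("results.jsonl")
   print(f"Loaded {len(records)} evaluated instances")

   # Field names below are illustrative assumptions; inspect your file for
   # the actual keys emitted by contextbench.evaluate.
   for rec in records[:5]:
       print(rec.get("instance_id"), rec.get("score"))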