Quick Start
===========

This guide will help you get started with ContextBench evaluation in minutes.

Basic Usage
-----------

The simplest way to evaluate an agent trajectory:

.. code-block:: bash

   python -m contextbench.evaluate \
       --gold data/full.parquet \
       --pred path/to/trajectory.traj.json \
       --out results.jsonl

This command will:

1. Load the gold annotations from ``data/full.parquet``
2. Parse the trajectory file (auto-detects format)
3. Clone the target repository (if not cached)
4. Extract symbols using tree-sitter
5. Compute all metrics
6. Save results to ``results.jsonl``

Example with Trajectory
-----------------------

Evaluate a single trajectory on the verified benchmark:

.. code-block:: bash

   cd /path/to/contextbench/ContextBench

   # Example with MiniSWE-agent trajectory
   python -m contextbench.evaluate \
       --gold data/full.parquet \
       --pred traj_verified/psf__requests-1142/psf__requests-1142.traj.json \
       --out result.jsonl

Understanding the Output
------------------------

The evaluation produces a JSON Lines file with detailed metrics for each instance:

.. code-block:: json

   {
     "instance_id": "psf__requests-1142",
     "num_steps": 6,
     "final": {
       "file": {"coverage": 1.0, "precision": 0.5, "f1": 0.667},
       "symbol": {"coverage": 1.0, "precision": 0.125, "f1": 0.222},
       "span": {"coverage": 1.0, "precision": 0.126, "f1": 0.224}
     },
     "trajectory": {
       "steps": [
         {"step": 1, "coverage": {"file": 0.5, "symbol": 0.3, "span": 0.2}},
         {"step": 2, "coverage": {"file": 0.8, "symbol": 0.6, "span": 0.5}}
       ],
       "auc_coverage": {"file": 1.0, "symbol": 1.0, "span": 1.0},
       "redundancy": {"file": 0.5, "symbol": 0.58, "span": 0.12}
     },
     "editloc": {"recall": 0.782, "precision": 1.0, "f1": 0.878}
   }

Key Metrics
~~~~~~~~~~~

- **Coverage**: What fraction of gold context was retrieved?
- **Precision**: What fraction of retrieved context was relevant?
- **F1**: Harmonic mean of coverage and precision
- **AUC-Coverage**: Area under the coverage curve (trajectory efficiency)
- **Redundancy**: How much context was re-examined?

Environment Variables
---------------------

You can set default paths to avoid repeating arguments:

.. code-block:: bash

   export CONTEXTBENCH_GOLD=/path/to/full.parquet
   export CONTEXTBENCH_CACHE=/path/to/repos_cache
   export CONTEXTBENCH_SELECTED_CSV=/path/to/selected_500_instances.csv

Then run evaluation with fewer arguments:

.. code-block:: bash

   python -m contextbench.evaluate \
       --pred path/to/trajectory.traj.json \
       --out result.jsonl

Next Steps
----------

- Read about :doc:`metrics` to understand the evaluation framework
- Learn how to :doc:`run_agent_on_contextbench` for batch evaluation
- Explore :doc:`agents` to see supported agent frameworks
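
Inspecting Results in Python
----------------------------

Because the output is plain JSON Lines, you can post-process it with the standard library alone. The snippet below is a minimal sketch, not part of the ContextBench API, and assumes a ``results.jsonl`` produced by the command in Basic Usage with the record layout shown above:

.. code-block:: python

   import json
   from statistics import mean

   # Read one JSON record per line from the evaluation output.
   with open("results.jsonl") as fh:
       records = [json.loads(line) for line in fh if line.strip()]

   for record in records:
       final = record["final"]
       # Each granularity reports coverage, precision, and f1
       # (f1 is the harmonic mean of coverage and precision).
       print(
           record["instance_id"],
           f"file_f1={final['file']['f1']:.3f}",
           f"symbol_f1={final['symbol']['f1']:.3f}",
           f"span_f1={final['span']['f1']:.3f}",
       )

   # Aggregate a single headline number across all evaluated instances.
   print("mean file F1:", mean(r["final"]["file"]["f1"] for r in records))

The same pattern works for the trajectory-level fields (``auc_coverage``, ``redundancy``) and for ``editloc`` if you want to aggregate those instead.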