Metrics
=======

ContextBench provides comprehensive metrics at multiple granularities to evaluate context retrieval in coding agents.

Granularity Levels
------------------

ContextBench evaluates context retrieval at four granularities:

1. **File-level**: Which files contain relevant context?
2. **Symbol-level**: Which functions/classes are relevant?
3. **Span-level**: What exact code regions are relevant?
4. **EditLoc-level**: Where should edits be made?

Core Metrics
------------

For each granularity, we compute:

Coverage (Recall)
~~~~~~~~~~~~~~~~~

What fraction of the gold context was retrieved?

.. math::

   \text{Coverage} = \frac{|\text{Gold} \cap \text{Pred}|}{|\text{Gold}|}

- **Range**: [0, 1]
- **Higher is better**
- **Interpretation**: Did the agent find all relevant context?

Precision
~~~~~~~~~

What fraction of the retrieved context was relevant?

.. math::

   \text{Precision} = \frac{|\text{Gold} \cap \text{Pred}|}{|\text{Pred}|}

- **Range**: [0, 1]
- **Higher is better**
- **Interpretation**: Did the agent avoid irrelevant context?

F1 Score
~~~~~~~~

Harmonic mean of coverage and precision:

.. math::

   \text{F1} = 2 \cdot \frac{\text{Coverage} \cdot \text{Precision}}{\text{Coverage} + \text{Precision}}

- **Range**: [0, 1]
- **Higher is better**
- **Interpretation**: Balanced measure of retrieval quality

Trajectory Metrics
------------------

In addition to the final context, ContextBench tracks **how** agents retrieve context over time:

AUC-Coverage
~~~~~~~~~~~~

Area under the per-step coverage curve:

.. math::

   \text{AUC-Coverage} = \frac{1}{N} \sum_{i=1}^{N} \text{Coverage}_i

where :math:`N` is the number of steps and :math:`\text{Coverage}_i` is the coverage at step :math:`i`.

- **Range**: [0, 1]
- **Higher is better**
- **Interpretation**: How quickly did the agent find relevant context?

A higher AUC-Coverage means the agent found relevant context earlier in the trajectory.

Redundancy
~~~~~~~~~~

Fraction of context that was re-examined:

.. math::

   \text{Redundancy} = \frac{\sum_{i=2}^{N} |\text{Context}_i \cap \bigcup_{j=1}^{i-1} \text{Context}_j|}{\sum_{i=1}^{N} |\text{Context}_i|}

- **Range**: [0, 1]
- **Lower is better**
- **Interpretation**: How much duplicate examination occurred?

High redundancy indicates the agent repeatedly viewed the same context.

Per-Step Coverage
~~~~~~~~~~~~~~~~~

Coverage at each trajectory step:

.. code-block:: python

   "trajectory": {
       "steps": [
           {"step": 1, "coverage": {"file": 0.5, "symbol": 0.3, "span": 0.2}},
           {"step": 2, "coverage": {"file": 0.8, "symbol": 0.6, "span": 0.5}},
           ...
       ]
   }

This enables analyzing the agent's exploration strategy over time.

Granularity-Specific Details
----------------------------

File-Level Metrics
~~~~~~~~~~~~~~~~~~

**Set operations**: Treat file paths as sets

.. code-block:: python

   gold_files = {"src/utils.py", "src/main.py"}
   pred_files = {"src/utils.py", "src/config.py", "tests/test.py"}

   coverage = len(gold_files & pred_files) / len(gold_files)   # 0.5
   precision = len(gold_files & pred_files) / len(pred_files)  # 0.33

Symbol-Level Metrics
~~~~~~~~~~~~~~~~~~~~

**Symbol extraction**: Use tree-sitter to identify definitions

.. code-block:: python

   # Gold symbols: (file, symbol_name) pairs
   gold_symbols = {
       ("src/utils.py", "parse_config"),
       ("src/utils.py", "Config"),
       ("src/main.py", "main"),
   }

   # Pred symbols: extracted from viewed spans
   pred_symbols = extract_symbols_from_spans(pred_spans)

Only definition nodes (classes, functions, methods) are considered symbols.
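Once both sides are reduced to ``(file, symbol_name)`` pairs, the symbol-level scores follow the same set arithmetic as the file-level case. The snippet below is a minimal illustrative sketch rather than the library's implementation; the ``pred_symbols`` entries (such as ``load_env``) are made-up example values.

.. code-block:: python

   # Illustrative sketch: symbol-level coverage/precision/F1 via set arithmetic.
   # In practice, gold_symbols comes from the benchmark annotations and
   # pred_symbols from the symbols extracted out of the agent's viewed spans.
   gold_symbols = {
       ("src/utils.py", "parse_config"),
       ("src/utils.py", "Config"),
       ("src/main.py", "main"),
   }
   pred_symbols = {
       ("src/utils.py", "parse_config"),
       ("src/utils.py", "load_env"),   # viewed, but not part of the gold context
       ("src/main.py", "main"),
   }

   hits = gold_symbols & pred_symbols                     # symbols in both sets
   coverage = len(hits) / len(gold_symbols)               # 2/3 ~= 0.67
   precision = len(hits) / len(pred_symbols)              # 2/3 ~= 0.67
   f1 = (
       2 * coverage * precision / (coverage + precision)  # ~= 0.67
       if coverage + precision > 0
       else 0.0
   )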
Span-Level Metrics
~~~~~~~~~~~~~~~~~~

**Interval operations**: Compute union and intersection of byte ranges

.. code-block:: python

   from contextbench.core.intervals import union, intersection, measure

   gold_spans = {"file.py": [(0, 100), (200, 300)]}
   pred_spans = {"file.py": [(50, 150), (250, 350)]}

   gold_union = union(gold_spans["file.py"])       # Total gold bytes
   pred_union = union(pred_spans["file.py"])       # Total pred bytes
   overlap = intersection(gold_union, pred_union)  # Overlapping bytes

   coverage = measure(overlap) / measure(gold_union)
   precision = measure(overlap) / measure(pred_union)

EditLoc-Level Metrics
~~~~~~~~~~~~~~~~~~~~~

**Line-based comparison**: Compare sets of edited line numbers

.. code-block:: python

   # Gold: lines where the ground-truth patch made changes
   gold_edit_lines = {15, 16, 17, 42, 43}

   # Pred: lines the agent identified for editing
   pred_edit_lines = {16, 17, 18, 42, 100}

   coverage = len(gold_edit_lines & pred_edit_lines) / len(gold_edit_lines)
   precision = len(gold_edit_lines & pred_edit_lines) / len(pred_edit_lines)

Aggregation Methods
-------------------

Macro Average
~~~~~~~~~~~~~

Mean over instances (each instance counts equally):

.. math::

   \text{Macro-Avg}(\text{metric}) = \frac{1}{N} \sum_{i=1}^{N} \text{metric}_i

**Use case**: Standard benchmark reporting

Micro Average
~~~~~~~~~~~~~

Aggregate sizes first, then compute the metric:

.. math::

   \text{Micro-Coverage} = \frac{\sum_{i=1}^{N} |\text{Gold}_i \cap \text{Pred}_i|}{\sum_{i=1}^{N} |\text{Gold}_i|}

**Use case**: Size-weighted performance (larger tasks have more influence)

Interpreting Results
--------------------

High Coverage, Low Precision
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The agent retrieved most of the relevant context but also retrieved a lot of irrelevant context.

**Example**: ``coverage=0.9, precision=0.2``

**Interpretation**: The agent uses a broad search strategy (recall-focused)

Low Coverage, High Precision
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The agent retrieved little context, but most of it was relevant.

**Example**: ``coverage=0.3, precision=0.8``

**Interpretation**: The agent uses a narrow search strategy (precision-focused)

Balanced F1
~~~~~~~~~~~

The agent achieves a good balance between coverage and precision.

**Example**: ``coverage=0.7, precision=0.6, f1=0.65``

**Interpretation**: The agent effectively retrieves relevant context without excessive noise

High Redundancy
~~~~~~~~~~~~~~~

The agent repeatedly examines the same context.

**Example**: ``redundancy=0.6``

**Interpretation**: Inefficient exploration; 60% of viewed context was already seen

Low AUC-Coverage
~~~~~~~~~~~~~~~~

The agent takes many steps to find relevant context.

**Example**: ``auc_coverage=0.3``

**Interpretation**: Relevant context is found late in the trajectory

Next Steps
----------

- See :doc:`evaluation` for running evaluations
- Understand the :doc:`pipeline` for how metrics are computed
- Explore :doc:`api/metrics` for implementation details