Metrics
ContextBench provides comprehensive metrics at multiple granularities to evaluate context retrieval in coding agents.
Granularity Levels
ContextBench evaluates context retrieval at four granularities:
File-level: Which files contain relevant context?
Symbol-level: Which functions/classes are relevant?
Span-level: What exact code regions are relevant?
EditLoc-level: Where should edits be made?
Core Metrics
For each granularity, we compute:
Coverage (Recall)
What fraction of the gold context was retrieved?
Range: [0, 1]
Higher is better
Interpretation: Did the agent find all relevant context?
Precision
What fraction of the retrieved context was relevant?
Range: [0, 1]
Higher is better
Interpretation: Did the agent avoid irrelevant context?
F1 Score
Harmonic mean of coverage and precision:
\[ \text{F1} = \frac{2 \cdot \text{Coverage} \cdot \text{Precision}}{\text{Coverage} + \text{Precision}} \]
Range: [0, 1]
Higher is better
Interpretation: Balanced measure of retrieval quality
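As a minimal illustration (not ContextBench's internal implementation), all three metrics can be computed from a gold set and a retrieved set of context items; the sets below are hypothetical:
def retrieval_metrics(gold: set, retrieved: set) -> dict:
    """Coverage, precision, and F1 over two sets of context items."""
    overlap = len(gold & retrieved)
    coverage = overlap / len(gold) if gold else 0.0
    precision = overlap / len(retrieved) if retrieved else 0.0
    f1 = 2 * coverage * precision / (coverage + precision) if coverage + precision else 0.0
    return {"coverage": coverage, "precision": precision, "f1": f1}

retrieval_metrics({"src/utils.py", "src/main.py"}, {"src/utils.py", "src/config.py"})
# {'coverage': 0.5, 'precision': 0.5, 'f1': 0.5}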
Trajectory Metrics
In addition to final context, ContextBench tracks how agents retrieve context over time:
AUC-Coverage
Area under the per-step coverage curve:
\[ \text{AUC-Coverage} = \frac{1}{N} \sum_{i=1}^{N} \text{Coverage}_i \]
where \(N\) is the number of steps and \(\text{Coverage}_i\) is coverage at step \(i\).
Range: [0, 1]
Higher is better
Interpretation: How quickly did the agent find relevant context?
A higher AUC-Coverage means the agent found relevant context earlier in the trajectory.
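Assuming AUC-Coverage is the mean of the per-step coverage values as in the formula above, a sketch of the computation (with hypothetical coverage values) is:
def auc_coverage(per_step_coverage: list[float]) -> float:
    """Mean coverage across trajectory steps (area under the step-wise coverage curve)."""
    if not per_step_coverage:
        return 0.0
    return sum(per_step_coverage) / len(per_step_coverage)

auc_coverage([0.2, 0.5, 0.8, 0.8])  # 0.575 -- relevant context found early
auc_coverage([0.0, 0.0, 0.2, 0.8])  # 0.25  -- relevant context found late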
Redundancy
Fraction of context that was re-examined.
Range: [0, 1]
Lower is better
Interpretation: How much duplicate examination occurred?
High redundancy indicates the agent repeatedly viewed the same context.
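One way to compute such a score, assuming redundancy counts every view of an item already seen at an earlier step (ContextBench's exact definition, e.g. over byte spans, may differ), is sketched below:
def redundancy(viewed_per_step: list[set]) -> float:
    """Fraction of views that target context already seen at an earlier step.

    Illustrative definition only; the benchmark may compute this over byte spans.
    """
    seen: set = set()
    total_views = 0
    repeated_views = 0
    for viewed in viewed_per_step:
        for item in viewed:
            total_views += 1
            if item in seen:
                repeated_views += 1
        seen |= viewed
    return repeated_views / total_views if total_views else 0.0

# "src/utils.py" is re-viewed at steps 2 and 3: 2 of 4 views are repeats.
redundancy([{"src/utils.py"}, {"src/utils.py", "src/main.py"}, {"src/utils.py"}])  # 0.5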
Per-Step Coverage
Coverage at each trajectory step:
"trajectory": {
"steps": [
{"step": 1, "coverage": {"file": 0.5, "symbol": 0.3, "span": 0.2}},
{"step": 2, "coverage": {"file": 0.8, "symbol": 0.6, "span": 0.5}},
...
]
}
This enables analyzing the agent’s exploration strategy over time.
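For example, a hypothetical helper can read this record and report how many steps the agent needed to reach a target coverage (field names follow the JSON above):
trajectory = {
    "steps": [
        {"step": 1, "coverage": {"file": 0.5, "symbol": 0.3, "span": 0.2}},
        {"step": 2, "coverage": {"file": 0.8, "symbol": 0.6, "span": 0.5}},
    ]
}

def steps_to_reach(trajectory: dict, granularity: str, threshold: float):
    """Return the first step at which coverage for `granularity` reaches `threshold`."""
    for step in trajectory["steps"]:
        if step["coverage"][granularity] >= threshold:
            return step["step"]
    return None  # threshold never reached

steps_to_reach(trajectory, "file", 0.8)  # 2
steps_to_reach(trajectory, "span", 0.8)  # None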
Granularity-Specific Details
File-Level Metrics
Set operations: Treat file paths as sets
gold_files = {"src/utils.py", "src/main.py"}
pred_files = {"src/utils.py", "src/config.py", "tests/test.py"}
coverage = len(gold_files & pred_files) / len(gold_files) # 0.5
precision = len(gold_files & pred_files) / len(pred_files) # 0.33
Symbol-Level Metrics
Symbol extraction: Use tree-sitter to identify definitions
# Gold symbols (file, symbol_name)
gold_symbols = {
("src/utils.py", "parse_config"),
("src/utils.py", "Config"),
("src/main.py", "main")
}
# Pred symbols: extracted from viewed spans
pred_symbols = extract_symbols_from_spans(pred_spans)
Only definition nodes (classes, functions, methods) are considered symbols.
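Once both sides are represented as (file, symbol_name) pairs, the metrics reduce to the same set arithmetic as the file level; the predicted symbols below are hypothetical stand-ins for the extraction output:
gold_symbols = {
    ("src/utils.py", "parse_config"),
    ("src/utils.py", "Config"),
    ("src/main.py", "main"),
}
# Hypothetical symbols extracted from the agent's viewed spans
pred_symbols = {
    ("src/utils.py", "parse_config"),
    ("src/utils.py", "load_env"),
}

overlap = gold_symbols & pred_symbols
coverage = len(overlap) / len(gold_symbols)   # 1/3 ≈ 0.33
precision = len(overlap) / len(pred_symbols)  # 1/2 = 0.5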
Span-Level Metrics
Interval operations: Compute union and intersection of byte ranges
from contextbench.core.intervals import union, intersection, measure
gold_spans = {"file.py": [(0, 100), (200, 300)]}
pred_spans = {"file.py": [(50, 150), (250, 350)]}
gold_union = union(gold_spans["file.py"])  # Merged gold intervals
pred_union = union(pred_spans["file.py"])  # Merged pred intervals
overlap = intersection(gold_union, pred_union)  # Overlapping intervals
coverage = measure(overlap) / measure(gold_union)
precision = measure(overlap) / measure(pred_union)
EditLoc-Level Metrics
Line-based comparison: Compare sets of edited line numbers
# Gold: lines where the ground-truth patch made changes
gold_edit_lines = {15, 16, 17, 42, 43}
# Pred: lines the agent identified for editing
pred_edit_lines = {16, 17, 18, 42, 100}
coverage = len(gold_edit_lines & pred_edit_lines) / len(gold_edit_lines)   # 3/5 = 0.6
precision = len(gold_edit_lines & pred_edit_lines) / len(pred_edit_lines)  # 3/5 = 0.6
Aggregation Methods
Macro Average
Mean over instances (each instance counts equally):
\[ \text{Macro-Metric} = \frac{1}{M} \sum_{j=1}^{M} \text{Metric}_j \]
where \(M\) is the number of instances.
Use case: Standard benchmark reporting
Micro Average
Aggregate sizes across instances first, then compute the metric, e.g. for coverage:
\[ \text{Micro-Coverage} = \frac{\sum_{j=1}^{M} |\text{Retrieved}_j \cap \text{Gold}_j|}{\sum_{j=1}^{M} |\text{Gold}_j|} \]
Use case: Size-weighted performance (larger tasks have more influence)
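The difference between the two aggregations is easiest to see with two hypothetical instances of very different size:
# Per-instance overlap and gold sizes (hypothetical numbers)
instances = [
    {"overlap": 4, "gold": 5},    # small task, coverage 0.8
    {"overlap": 10, "gold": 50},  # large task, coverage 0.2
]

# Macro: average the per-instance coverage values
macro_coverage = sum(i["overlap"] / i["gold"] for i in instances) / len(instances)  # 0.5

# Micro: pool the counts first, then divide (the large task dominates)
micro_coverage = sum(i["overlap"] for i in instances) / sum(i["gold"] for i in instances)  # 14/55 ≈ 0.25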
Interpreting Results
High Coverage, Low Precision
The agent retrieved most relevant context but also retrieved much irrelevant context.
Example: coverage=0.9, precision=0.2
Interpretation: Agent uses a broad search strategy (recall-focused)
Low Coverage, High Precision
The agent retrieved little context, but most of it was relevant.
Example: coverage=0.3, precision=0.8
Interpretation: Agent uses a narrow search strategy (precision-focused)
Balanced F1
The agent achieves a good balance between coverage and precision.
Example: coverage=0.7, precision=0.6, f1=0.65
Interpretation: Agent effectively retrieves relevant context without excessive noise
High Redundancy
The agent repeatedly examines the same context.
Example: redundancy=0.6
Interpretation: Inefficient exploration; 60% of viewed context was already seen
Low AUC-Coverage
The agent takes many steps to find relevant context.
Example: auc_coverage=0.3
Interpretation: Relevant context is found late in the trajectory
Next Steps
See Evaluation for running evaluations
Understand the Evaluation Pipeline for how metrics are computed
Explore Metrics API for implementation details