Metrics
ContextBench provides comprehensive metrics at multiple granularities to evaluate context retrieval in coding agents.
Granularity Levels
ContextBench evaluates context retrieval at four granularities:
File-level: Which files contain relevant context?
Symbol-level: Which functions/classes are relevant?
Span-level: What exact code regions are relevant?
EditLoc-level: Where should edits be made?
Core Metrics
For each granularity, we compute:
Coverage (Recall)
What fraction of the gold context was retrieved?
Range: [0, 1]
Higher is better
Interpretation: Did the agent find all relevant context?
Precision
What fraction of the retrieved context was relevant?
Range: [0, 1]
Higher is better
Interpretation: Did the agent avoid irrelevant context?
F1 Score
Harmonic mean of coverage and precision:
\[ \text{F1} = \frac{2 \cdot \text{Coverage} \cdot \text{Precision}}{\text{Coverage} + \text{Precision}} \]
Range: [0, 1]
Higher is better
Interpretation: Balanced measure of retrieval quality
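As a minimal illustration (not ContextBench's internal implementation), all three metrics can be computed from a gold set and a retrieved set of context items; the sets below are hypothetical:
def retrieval_metrics(gold: set, retrieved: set) -> dict:
    """Coverage, precision, and F1 over two sets of context items."""
    overlap = len(gold & retrieved)
    coverage = overlap / len(gold) if gold else 0.0
    precision = overlap / len(retrieved) if retrieved else 0.0
    f1 = 2 * coverage * precision / (coverage + precision) if coverage + precision else 0.0
    return {"coverage": coverage, "precision": precision, "f1": f1}

retrieval_metrics({"src/utils.py", "src/main.py"}, {"src/utils.py", "src/config.py"})
# {'coverage': 0.5, 'precision': 0.5, 'f1': 0.5}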
Trajectory Metrics
In addition to final context, ContextBench tracks how agents retrieve context over time:
AUC-Coverage
Area under the per-step coverage curve:
\[ \text{AUC-Coverage} = \frac{1}{N} \sum_{i=1}^{N} \text{Coverage}_i \]
where \(N\) is the number of steps and \(\text{Coverage}_i\) is coverage at step \(i\).
Range: [0, 1]
Higher is better
Interpretation: How quickly did the agent find relevant context?
A higher AUC-Coverage means the agent found relevant context earlier in the trajectory.
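Assuming AUC-Coverage is the mean of the per-step coverage values as in the formula above, a sketch of the computation (with hypothetical coverage values) is:
def auc_coverage(per_step_coverage: list[float]) -> float:
    """Mean coverage across trajectory steps (area under the step-wise coverage curve)."""
    if not per_step_coverage:
        return 0.0
    return sum(per_step_coverage) / len(per_step_coverage)

auc_coverage([0.2, 0.5, 0.8, 0.8])  # 0.575 -- relevant context found early
auc_coverage([0.0, 0.0, 0.2, 0.8])  # 0.25  -- relevant context found late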
Redundancy
Fraction of context that was re-examined.
Range: [0, 1]
Lower is better
Interpretation: How much duplicate examination occurred?
High redundancy indicates the agent repeatedly viewed the same context.
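One way to compute such a score, assuming redundancy counts every view of an item already seen at an earlier step (ContextBench's exact definition, e.g. over byte spans, may differ), is sketched below:
def redundancy(viewed_per_step: list[set]) -> float:
    """Fraction of views that target context already seen at an earlier step.

    Illustrative definition only; the benchmark may compute this over byte spans.
    """
    seen: set = set()
    total_views = 0
    repeated_views = 0
    for viewed in viewed_per_step:
        for item in viewed:
            total_views += 1
            if item in seen:
                repeated_views += 1
        seen |= viewed
    return repeated_views / total_views if total_views else 0.0

# "src/utils.py" is re-viewed at steps 2 and 3: 2 of 4 views are repeats.
redundancy([{"src/utils.py"}, {"src/utils.py", "src/main.py"}, {"src/utils.py"}])  # 0.5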
Per-Step Coverage
Coverage at each trajectory step:
"trajectory": {
"steps": [
{"step": 1, "coverage": {"file": 0.5, "symbol": 0.3, "span": 0.2}},
{"step": 2, "coverage": {"file": 0.8, "symbol": 0.6, "span": 0.5}},
...
]
}
This enables analyzing the agent’s exploration strategy over time.
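For example, a hypothetical helper can read this record and report how many steps the agent needed to reach a target coverage (field names follow the JSON above):
trajectory = {
    "steps": [
        {"step": 1, "coverage": {"file": 0.5, "symbol": 0.3, "span": 0.2}},
        {"step": 2, "coverage": {"file": 0.8, "symbol": 0.6, "span": 0.5}},
    ]
}

def steps_to_reach(trajectory: dict, granularity: str, threshold: float):
    """Return the first step at which coverage for `granularity` reaches `threshold`."""
    for step in trajectory["steps"]:
        if step["coverage"][granularity] >= threshold:
            return step["step"]
    return None  # threshold never reached

steps_to_reach(trajectory, "file", 0.8)  # 2
steps_to_reach(trajectory, "span", 0.8)  # None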
Granularity-Specific Details
File-Level Metrics
Set operations: Treat file paths as sets
gold_files = {"src/utils.py", "src/main.py"}
pred_files = {"src/utils.py", "src/config.py", "tests/test.py"}
coverage = len(gold_files & pred_files) / len(gold_files) # 0.5
precision = len(gold_files & pred_files) / len(pred_files) # 0.33
Symbol-Level Metrics
Symbol extraction: Use tree-sitter to identify definitions
# Gold symbols (file, symbol_name)
gold_symbols = {
("src/utils.py", "parse_config"),
("src/utils.py", "Config"),
("src/main.py", "main")
}
# Pred symbols: extracted from viewed spans
pred_symbols = extract_symbols_from_spans(pred_spans)
Only definition nodes (classes, functions, methods) are considered symbols.
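Once both sides are represented as (file, symbol_name) pairs, the metrics reduce to the same set arithmetic as the file level; the predicted symbols below are hypothetical stand-ins for the extraction output:
gold_symbols = {
    ("src/utils.py", "parse_config"),
    ("src/utils.py", "Config"),
    ("src/main.py", "main"),
}
# Hypothetical symbols extracted from the agent's viewed spans
pred_symbols = {
    ("src/utils.py", "parse_config"),
    ("src/utils.py", "load_env"),
}

overlap = gold_symbols & pred_symbols
coverage = len(overlap) / len(gold_symbols)   # 1/3 ≈ 0.33
precision = len(overlap) / len(pred_symbols)  # 1/2 = 0.5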
Span-Level Metrics
Interval operations: Compute union and intersection of byte ranges
from contextbench.core.intervals import union, intersection, measure
gold_spans = {"file.py": [(0, 100), (200, 300)]}
pred_spans = {"file.py": [(50, 150), (250, 350)]}
gold_union = union(gold_spans["file.py"])  # Merged gold intervals
pred_union = union(pred_spans["file.py"])  # Merged pred intervals
overlap = intersection(gold_union, pred_union)  # Overlapping intervals
coverage = measure(overlap) / measure(gold_union)
precision = measure(overlap) / measure(pred_union)
EditLoc-Level Metrics
Line-based comparison: Compare sets of edited line numbers
# Gold: lines where the ground-truth patch made changes
gold_edit_lines = {15, 16, 17, 42, 43}
# Pred: lines the agent identified for editing
pred_edit_lines = {16, 17, 18, 42, 100}
coverage = len(gold_edit_lines & pred_edit_lines) / len(gold_edit_lines)   # 3/5 = 0.6
precision = len(gold_edit_lines & pred_edit_lines) / len(pred_edit_lines)  # 3/5 = 0.6
Aggregation Methods
Macro Average
Mean over instances (each instance counts equally):
\[ \text{Macro-Metric} = \frac{1}{M} \sum_{j=1}^{M} \text{Metric}_j \]
where \(M\) is the number of instances.
Use case: Standard benchmark reporting
Micro Average
Aggregate sizes across instances first, then compute the metric, e.g. for coverage:
\[ \text{Micro-Coverage} = \frac{\sum_{j=1}^{M} |\text{Retrieved}_j \cap \text{Gold}_j|}{\sum_{j=1}^{M} |\text{Gold}_j|} \]
Use case: Size-weighted performance (larger tasks have more influence)
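The difference between the two aggregations is easiest to see with two hypothetical instances of very different size:
# Per-instance overlap and gold sizes (hypothetical numbers)
instances = [
    {"overlap": 4, "gold": 5},    # small task, coverage 0.8
    {"overlap": 10, "gold": 50},  # large task, coverage 0.2
]

# Macro: average the per-instance coverage values
macro_coverage = sum(i["overlap"] / i["gold"] for i in instances) / len(instances)  # 0.5

# Micro: pool the counts first, then divide (the large task dominates)
micro_coverage = sum(i["overlap"] for i in instances) / sum(i["gold"] for i in instances)  # 14/55 ≈ 0.25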
Interpreting Results
High Coverage, Low Precision
The agent retrieved most relevant context but also retrieved much irrelevant context.
Example: coverage=0.9, precision=0.2
Interpretation: Agent uses a broad search strategy (recall-focused)
Low Coverage, High Precision
The agent retrieved little context, but most of it was relevant.
Example: coverage=0.3, precision=0.8
Interpretation: Agent uses a narrow search strategy (precision-focused)
Balanced F1
The agent achieves a good balance between coverage and precision.
Example: coverage=0.7, precision=0.6, f1=0.65
Interpretation: Agent effectively retrieves relevant context without excessive noise
High Redundancy
The agent repeatedly examines the same context.
Example: redundancy=0.6
Interpretation: Inefficient exploration; 60% of viewed context was already seen
Low AUC-Coverage
The agent takes many steps to find relevant context.
Example: auc_coverage=0.3
Interpretation: Relevant context is found late in the trajectory
Next Steps
See Evaluation for running evaluations
Understand the Evaluation Pipeline for how metrics are computed
Explore Metrics API for implementation details