Evaluation Pipeline
===================

.. image:: ../assets/figures/pipeline.png
   :align: center
   :width: 100%
   :alt: ContextBench Pipeline

`Download full pipeline diagram (PDF) <../assets/figures/pipeline.pdf>`_

----

Overview
--------

ContextBench evaluates agents through a systematic pipeline that compares **predicted context** (from agent trajectories) against **gold annotations** (human-verified context) at multiple granularities.

Pipeline Steps
--------------

1. Trajectory Extraction
~~~~~~~~~~~~~~~~~~~~~~~~

Extract file views and spans from agent trajectories:

- Parse the trajectory format (``.traj.json``, ``.checkpoints.jsonl``, etc.)
- Identify file access commands (``cat``, ``view``, ``grep``, etc.)
- Extract line ranges and byte spans for each viewed file
- Track per-step context evolution

2. Repository Checkout
~~~~~~~~~~~~~~~~~~~~~~

Clone and check out the target repository:

- Use cached clones when available
- Check out the specific commit/version for the task
- Verify that the repository state matches the gold annotations

3. Symbol Extraction
~~~~~~~~~~~~~~~~~~~~

Use tree-sitter to extract code symbols:

- Parse files to build abstract syntax trees (ASTs)
- Identify definitions: classes, functions, methods
- Map byte spans to symbol boundaries
- Handle 8 programming languages: Python, Java, JavaScript, TypeScript, Go, Rust, C, C++

4. Gold Context Loading
~~~~~~~~~~~~~~~~~~~~~~~

Load human-annotated gold context:

- File-level: Set of relevant file paths
- Symbol-level: Set of relevant class/function definitions
- Span-level: Byte intervals of relevant code regions
- EditLoc-level: Locations where edits were made

5. Metric Computation
~~~~~~~~~~~~~~~~~~~~~

Compute metrics at multiple granularities:

**Set-based metrics** (File, Symbol):

.. math::

   \text{Coverage} = \frac{|\text{Gold} \cap \text{Pred}|}{|\text{Gold}|}

   \text{Precision} = \frac{|\text{Gold} \cap \text{Pred}|}{|\text{Pred}|}

   \text{F1} = 2 \cdot \frac{\text{Coverage} \cdot \text{Precision}}{\text{Coverage} + \text{Precision}}

**Interval-based metrics** (Span):

.. math::

   \text{Coverage} = \frac{\text{bytes}(\text{Gold} \cap \text{Pred})}{\text{bytes}(\text{Gold})}

   \text{Precision} = \frac{\text{bytes}(\text{Gold} \cap \text{Pred})}{\text{bytes}(\text{Pred})}

**Trajectory metrics**:

- **AUC-Coverage**: Area under the per-step coverage curve (measures how quickly coverage accumulates over steps)
- **Redundancy**: Fraction of context re-examined across steps

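
The sketch below illustrates how the set-based and interval-based formulas above can be computed. It is a minimal example for clarity, not the ContextBench implementation; the function names and the half-open ``(start, end)`` byte-interval convention are assumptions made here.

.. code-block:: python

   def set_metrics(gold: set, pred: set) -> dict:
       """Set-based Coverage / Precision / F1 (file- and symbol-level)."""
       hits = len(gold & pred)
       coverage = hits / len(gold) if gold else 0.0
       precision = hits / len(pred) if pred else 0.0
       f1 = (2 * coverage * precision / (coverage + precision)
             if coverage + precision else 0.0)
       return {"coverage": coverage, "precision": precision, "f1": f1}


   def _union_bytes(intervals):
       """Total byte length of a union of half-open (start, end) intervals."""
       total, last_end = 0, None
       for start, end in sorted(intervals):
           if last_end is None or start > last_end:
               # Interval starts a new, non-overlapping region.
               total += end - start
               last_end = end
           elif end > last_end:
               # Interval extends the current region; count only the new bytes.
               total += end - last_end
               last_end = end
       return total


   def _intersections(a, b):
       """Pairwise overlaps between two lists of half-open intervals."""
       return [(max(s1, s2), min(e1, e2))
               for s1, e1 in a for s2, e2 in b
               if max(s1, s2) < min(e1, e2)]


   def interval_metrics(gold, pred):
       """Interval-based Coverage / Precision (span-level)."""
       overlap = _union_bytes(_intersections(gold, pred))
       gold_total, pred_total = _union_bytes(gold), _union_bytes(pred)
       coverage = overlap / gold_total if gold_total else 0.0
       precision = overlap / pred_total if pred_total else 0.0
       return {"coverage": coverage, "precision": precision}

For example, ``set_metrics({"a.py", "b.py"}, {"a.py", "c.py"})`` returns a coverage and precision of 0.5 and an F1 of 0.5.
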
6. Result Aggregation
~~~~~~~~~~~~~~~~~~~~~

Output a comprehensive evaluation report:

- Per-instance metrics (JSON Lines format)
- Aggregated statistics (macro/micro averages); a minimal aggregation sketch appears at the end of this page
- Trajectory visualization data

Granularity Levels
------------------

File-Level
~~~~~~~~~~

Evaluates which files were retrieved:

- **Gold**: Set of file paths that contain relevant context
- **Pred**: Set of files the agent viewed
- **Use case**: Coarse-grained context retrieval

Symbol-Level
~~~~~~~~~~~~

Evaluates which code symbols (functions, classes) were identified:

- **Gold**: Set of (file, symbol_name) tuples with relevant definitions
- **Pred**: Symbols covered by the viewed spans
- **Use case**: Fine-grained context understanding

Span-Level
~~~~~~~~~~

Evaluates the exact byte ranges of viewed code:

- **Gold**: Union of byte intervals containing relevant code
- **Pred**: Union of byte intervals the agent examined
- **Use case**: Most precise retrieval measurement

EditLoc-Level
~~~~~~~~~~~~~

Evaluates edit location prediction:

- **Gold**: Lines where the ground-truth patch made changes
- **Pred**: Lines the agent identified for editing
- **Use case**: Measures localization accuracy

Supported Formats
-----------------

ContextBench automatically detects and parses multiple trajectory formats:

- **MiniSWE-agent**: ``.traj.json`` files
- **SWE-agent**: ``.checkpoints.jsonl`` files
- **Agentless**: Custom JSON format
- **OpenHands**: Trajectory logs
- **Prometheus**: Agent-specific format

See :doc:`agents` for details on each format.

Next Steps
----------

- Understand the :doc:`metrics` in detail
- Learn about :doc:`agents` trajectory formats
- See :doc:`evaluation` for advanced usage
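
Example: Aggregating Per-Instance Results
------------------------------------------

The sketch below illustrates the macro/micro distinction mentioned under *Result Aggregation* above. It is a hypothetical example: the per-instance JSON Lines field names (``gold_files`` and ``pred_files``) are assumptions made for illustration, not the actual ContextBench output schema.

.. code-block:: python

   import json


   def load_results(path):
       """Read one per-instance record per line from a JSON Lines file."""
       with open(path) as f:
           return [json.loads(line) for line in f if line.strip()]


   def macro_coverage(records):
       """Macro average: mean of the per-instance file-level coverage values."""
       scores = [
           len(set(r["gold_files"]) & set(r["pred_files"])) / len(set(r["gold_files"]))
           for r in records
           if r["gold_files"]  # skip instances with no gold files
       ]
       return sum(scores) / len(scores) if scores else 0.0


   def micro_coverage(records):
       """Micro average: pool hit and gold counts across all instances."""
       hits = sum(len(set(r["gold_files"]) & set(r["pred_files"])) for r in records)
       total = sum(len(set(r["gold_files"])) for r in records)
       return hits / total if total else 0.0

The macro average weights every task instance equally, whereas the micro average weights instances in proportion to the size of their gold context.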