Research

Papers, benchmarks, and articles that shaped the patterns on this site. Each entry is annotated with why it matters.

Foundations

  1. NoLiMa: Long-Context Benchmark

    Why it matters

    The benchmark that put hard numbers on context rot. 11 of 13 models dropped to half their baseline performance at just 32k tokens. Not at the edge of their window. At a fraction of it. Undermines the “bigger window, better results” assumption that most context strategies rely on.

  2. Context Rot

    Why it matters

    Documents how increasing input tokens impacts LLM performance through systematic “Needle in a Haystack” testing. The data behind the term “context rot” that the patterns on this site address.

  3. Effective Context Engineering for AI Agents

    Why it matters

    Introduces the pyramid approach, the principle of finding the smallest set of high-signal tokens, and the concept of “right altitude” for instructions. Most of the patterns on this site trace back to principles articulated here.

  4. Context Engineering for Agents

    Why it matters

    Defines the four-strategy framework: Write (persist outside the window), Select (pull relevant context in), Compress (summarize and trim), Isolate (separate contexts per agent). Clean taxonomy that maps directly to the patterns here.

  5. How We Built a Multi-Agent Research System

    Why it matters

    Sub-agents with isolated contexts outperformed a single agent, using 15x more tokens total but producing higher quality output. Structured note-taking to NOTES.md for persistence across agent boundaries. The primary case study behind the Isolate pattern.

  6. Code Execution with MCP: Building More Efficient Agents

    Why it matters

    Explains why direct tool calling scales poorly as the number of connected MCP servers grows. Two failure modes: loading all tool definitions upfront bloats the context window, and passing intermediate results back through the model doubles token usage for operations like “fetch this document and attach it elsewhere.” The proposed solution is to expose MCP servers as code APIs in an execution environment rather than as direct tools. The agent writes code to interact with them, processes intermediate data outside the model, and only surfaces the final result. Concrete illustration of why Progressive Disclosure and selective context matter even at the tool-use layer.

  7. ContextBench: A Benchmark for Context Retrieval in Coding Agents

    Why it matters

    Existing coding benchmarks measure whether agents solve the task. ContextBench measures whether they retrieved the right code context along the way. With human-annotated gold contexts and per-step metrics for recall, precision, and efficiency, it exposes a consistent failure mode: agents explore far more context than they actually put to work, and sophisticated scaffolding produces only marginal improvement in retrieval quality. That gap between explored and used context is where selective context strategies pay off.

  8. Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

    Why it matters

    Controlled study isolating where memory-augmented agent failures actually happen. Crossing three write strategies against three retrieval methods, retrieval dominates: accuracy swings 20 points across retrieval methods but only 3-8 points across write strategies. Raw chunking with zero LLM calls matches or beats expensive fact extraction and summarization. Retrieval failure accounts for 11-46% of errors depending on config; utilization failure sits stable at 4-8% regardless. The implication for memory system design: fix retrieval before adding write-time complexity.

Context Degradation

Long Context, Less Focus: A Scaling Gap in LLMs

Large-scale benchmark (PAPerBench, ~29,000 instances across 1k to 256k tokens) with theoretical analysis of attention dilution under context scaling. Finds consistent performance degradation in both personalization and privacy as context length increases. The theoretical contribution matters: proves this is an inherent limitation of soft attention in fixed-capacity Transformers, independent of training data. Reinforces Context Rot with a mechanistic explanation for why it happens.

LOCA-bench: Long-Running Agent Context Rot

First benchmark to test context degradation in long-running agentic scenarios specifically. NoLiMa tests single-step retrieval; LOCA-bench tests agents that explore, act, and accumulate context over time. Confirmed that advanced context management strategies substantially improve success rates in these scenarios.

Mitigating Conversational Inertia in Multi-Turn Agents

Attention analysis reveals “conversational inertia”: models develop strong diagonal attention to previous responses as conversation histories grow, causing them to over-weight the most recent turn at the expense of integrating the full context. Directly explains why long-running conversations degrade and why rolling summaries outperform verbatim history at scale. Proposed mitigation introduces contrastive demonstrations that reduce inertia without retraining.

Context-Bench

Benchmark specifically for agentic context engineering. Evaluates how well models chain file operations, trace entity relationships, and manage multi-step retrieval. See the live leaderboard for current rankings.

How Contexts Fail and How to Fix Them

Taxonomy of context failure modes: poisoning (hallucination enters context), distraction (context overwhelms training), confusion (superfluous context influences response), clash (parts disagree). Useful diagnostic framework when things go wrong.

Agent Memory & Retrieval

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

An indexed memory mechanism that compresses context without discarding evidence. Rather than lossy summarization, Memex maintains a compact working context of structured summaries with stable indices, while full-fidelity artifacts live in an external store. The agent learns when to dereference an index to recover exact past evidence. Trained via reinforcement learning with reward shaping for memory usage under a context budget. Directly implements several patterns at once: Write Outside the Window for the external store, Compress for the working context, and Progressive Disclosure for the index-then-retrieve loop.

Contextual Retrieval

The just-in-time approach to context: maintain lightweight identifiers, load data dynamically at runtime. Progressive disclosure rather than pre-loading. The basis for the Progressive Disclosure pattern.

MemGPT / Letta

Memory-first agent framework. Treats the LLM like an operating system kernel with managed memory blocks. The architecture that inspired the Write Outside the Window pattern: persistent context with size limits, labels, and access patterns, managed through system calls.

RECOR: Reasoning-Focused Multi-Turn Conversational Retrieval

Benchmark combining multi-turn conversation with reasoning-intensive retrieval, closer to real-world usage than benchmarks that treat the two separately. Highlights a gap in existing evaluation: systems tuned on single-turn retrieval benchmarks perform significantly worse when the conversation history must inform what to retrieve. Useful reference when designing context pipelines for dialogue-heavy applications.

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Memory-focused survey organized around a write-manage-read loop and a three-dimensional taxonomy: temporal scope, representational substrate, and control policy. Covers five mechanism families from context-resident compression to policy-learned management. The evaluation section is the most useful part: it traces the shift from static recall benchmarks to multi-session agentic tests that interleave memory with decision-making, and identifies gaps current systems haven’t closed.

Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

Treats compaction as a failure point inside the agent loop. Slipstream runs summary generation asynchronously while the original agent keeps working, then validates the candidate summary against the agent’s actual continued reasoning. That gives compaction an independent signal instead of trusting the summary to preserve future-relevant facts. Across SWE-bench Verified and BrowseComp, the paper reports up to +8.8 percentage points task accuracy and up to 39.7% lower end-to-end latency. Useful evidence for Compress and Context Handoff: judge summaries by whether the next step still has the facts and intent it needs.

State Contamination in Memory-Augmented LLM Agents

Introduces memory laundering: toxic or adversarial context gets compressed into memory summaries that look safe to ordinary detectors but still shape later behavior. The state-channel framing matters more than the toxicity domain. Raw transcript reuse carries overt contamination; compressed memory can carry hidden, sub-threshold influence. The mitigation result is practical: sanitize unsafe state before summarization, because cleaning only the completed summary can leave the influence intact. Good source for the safety side of persistent memory and compression.

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

Asks a better memory retrieval question: does this memory causally improve the next answer? CMI evaluates candidate memories under controlled interventions, then selects context that improves the response while suppressing irrelevant, stale, or harmful memories. The paper introduces Causal-LoCoMo, with useful memories, distractors, and synthetic harmful memories, and compares against vector, graph, reflection, summary, full-history, and no-memory baselines. Useful because it sharpens Select for persistent memory: semantic similarity is not enough when the chosen memory can actively mislead the agent.

Coding Agents & Harnesses

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful?

Counterintuitive finding: across multiple coding agents and LLMs, repository context files (AGENTS.md, .cursorrules) tend to reduce task success rates compared to providing no context, while increasing inference cost by over 20%. Overly detailed context files encourage broader exploration but make tasks harder through unnecessary requirements. The conclusion aligns with Select, Don’t Dump: a few targeted requirements outperform a detailed documentation dump. Human-written context files should describe only minimal requirements.

Structured Context Engineering for File-Native Agentic Systems

The largest empirical study on context format and structure to date. Tested 11 models across 4 formats and schemas up to 10,000 tables. Found that format matters less than model capability (21 percentage point gap between frontier and open source) and that novel compact formats can incur a “grep tax” where the model spends extra tokens trying to parse unfamiliar structures.

Coding Agents are Effective Long-Context Processors

Takes a different approach to the long-context problem: instead of larger windows or retrieval pipelines, let a coding agent work through the corpus as a file system. Using grep, terminal commands, Python scripts, and intermediate files, off-the-shelf coding agents beat published state-of-the-art by 17.3% on average across five benchmarks spanning 188K to three trillion tokens. The counterintuitive secondary finding: adding explicit retrieval tools to the agent did not help and sometimes degraded performance. Native tool proficiency and file system familiarity covered the retrieval function without a dedicated layer.

Context Engineering for Coding Agents

Practitioner walkthrough of context configuration surfaces across Claude Code, Cursor, and Windsurf. Covers AGENTS.md, .cursorrules, .windsurfrules, memory hooks, and MCP servers as context engineering levers. Particularly useful for the distinction between static context (config files) and dynamic context (memory, tool output). Published on martinfowler.com, which gives it reach beyond the AI-specialist audience.

Effective Harnesses for Long-Running Agents

Detailed account of why context compaction alone is not enough when agents work across multiple sessions. Anthropic’s solution centers on two engineered handoff mechanisms: an initializer agent that sets up the environment on the first run, and a coding agent that makes incremental progress and leaves structured artifacts for the next session. The specific mechanisms are instructive: a progress log for session handoff, a feature list in JSON to prevent premature “done” declarations, git commits as recovery checkpoints, and explicit verification loops before starting new work. The pattern is borrowed directly from how humans hand off engineering work across shifts.

Harness Design for Long-Running Application Development

Shows how Anthropic moved from a session-reset harness for Sonnet 4.5 to a simpler continuous-session harness for Opus 4.6, because the newer model no longer showed the same context anxiety. The practical lesson is good: harnesses encode assumptions about model weaknesses, and those assumptions go stale. It also makes a clean case for separating generator and evaluator agents, with sprint contracts and Playwright-based QA turning vague product quality into inspectable feedback.

Harness Engineering

Synthesis piece on the OpenAI team’s five-month experiment building a codebase maintained entirely by AI agents. Böckeler dissects their approach into three interlocking parts: context engineering (a continuously updated knowledge base plus dynamic runtime context), architectural constraints enforced both by agents and deterministic linters, and periodic cleanup agents that fight entropy. The useful claim is that context quality cannot be separated from code structure and maintenance loops. You cannot engineer context in isolation and expect it to hold.

Agentic Harness Engineering

Turns harness editing into a closed loop rather than a manual prompt-tweaking exercise. AHE gives editable harness components file-level representation, distills trajectory evidence into a drill-down corpus, and pairs every proposed edit with a predicted effect that can be checked later. Ten iterations lift Terminal-Bench 2 pass@1 from 69.7% to 77.0%, above the human-designed Codex-CLI result of 71.9%. The ablation is the useful part: tools, middleware, and long-term memory drive the gains; the system prompt does not.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Moves coding-agent evaluation away from localized edits and toward messy full-stack work. SaaSBench has 30 tasks across 6 SaaS domains, 5,370 validation nodes, 8 programming languages, 6 databases, and 13 frameworks. The headline finding fits the page well: over 95% of failures happen before agents reach deep business logic, usually during setup, integration, or debugging loops. That makes context engineering for coding agents broader than source selection; environment setup, dependencies, and validation surfaces are part of the context system.

SWE-grep and Fast Context

One of the clearest practitioner writeups on coding-agent context retrieval. Cognition reports that agent trajectories were spending more than 60% of the first turn retrieving context, then describes Fast Context as a specialized subagent that returns files and line ranges instead of free-form summaries. The mechanics are concrete: up to 8 parallel tool calls per turn, a maximum of 4 turns, custom grep/read/glob tools, and an RL reward over weighted file and line F1. It is a production answer to ContextBench’s over-exploration problem.

Infrastructure & Tools

Model Context Protocol (MCP)

Standardized protocol for context retrieval. “USB-C for AI.” Adopted by Block, OpenAI, Microsoft. Enables dynamic, information-rich environments rather than static prompts. The protocol layer that makes Progressive Disclosure practical at scale.

Context Lens

Framework-agnostic proxy that intercepts LLM API calls and visualizes context window composition in real time. See what your AI actually sees.

Agent Trace

Open specification for recording AI contributions in version-controlled codebases. Agent Trace tracks files, line ranges, conversations, model identifiers, related resources, and VCS revisions. The context engineering angle is practical: future agents need more than a git diff when they revisit a change; they need the session and reasoning trail behind it. This turns Write Outside the Window into an ecosystem-level artifact, where code provenance becomes retrievable context rather than only authorship metadata.

Case Studies

Context Engineering Case Studies: Etsy-Specific Q&A

How Etsy reduced hallucinations in company-specific question answering through explicit instructions and relevant contextual information. Practical example of the Pyramid pattern applied to enterprise knowledge retrieval.

Codified Context: Infrastructure for AI Agents in a Complex Codebase

Detailed account of context infrastructure built alongside a 108,000-line C# system over 283 development sessions. The architecture splits into three layers: a hot-memory constitution encoding conventions and retrieval hooks, 19 specialized domain agents, and a cold-memory store of 34 on-demand spec documents. Session-level metrics trace how the infrastructure grew and where it prevented failures. The hot/cold memory split is a direct implementation of Write Outside the Window in a long-running production codebase.

Field Maps

A Survey of Context Engineering for Large Language Models

Wide-coverage survey treating context engineering as a formal discipline. Breaks the field into retrieval, generation, processing, and management, then maps how those combine into RAG, memory systems, tool-integrated reasoning, and multi-agent architectures. Best used as a map of the space rather than a source for specific empirical claims. Also surfaces a structural asymmetry the individual benchmarks miss: models handle complex input contexts well but struggle to produce equivalently complex outputs.

The Future of Context Engineering

Maps the three core LLM limitations (context window, reasoning, memory) against the five mechanisms the brain evolved to work with limited working memory: selective attention, chunking, associative retrieval, cognitive offloading, and learning consolidation. The central argument: the brain never evolved bigger working memory; it evolved sharper attention and better retrieval. Worth reading for the framework it offers on which limitations will yield to further scaling and which will need architectural innovation to crack.

LLM-Oriented Information Retrieval

Good field map for retrieval as context engineering. The paper argues that modern IR is increasingly consumed by LLMs rather than humans, which changes the failure mode: irrelevant or misleading results become direct inputs to hallucination and reasoning failure. Its useful vocabulary is “usable evidence density and verifiability within a context window.” That lines up with Select and Progressive Disclosure better than generic RAG advice, because it treats denoising as the central retrieval problem rather than a ranking nicety.