Research

Papers, benchmarks, and articles that shaped the patterns on this site. Each entry is annotated with why it matters.

Foundations

NoLiMa: Long-Context Benchmark

Modarressi et al., Feb 2025

Why it matters

The benchmark that put hard numbers on context rot. 11 of 13 models dropped below half their short-context baseline at just 32k tokens. Not at the edge of their window. At a fraction of it. That undermines the “bigger window, better results” assumption that most context strategies rely on.

Context Rot

Chroma Research

Why it matters

Uses systematic “Needle in a Haystack” testing to document how increasing input tokens impacts LLM performance. The data behind the term “context rot” that the patterns on this site address.

Effective Context Engineering for AI Agents

Anthropic

Why it matters

Introduces the pyramid approach, the principle of finding the smallest set of high-signal tokens, and the concept of “right altitude” for instructions. Most of the patterns on this site trace back to principles articulated here.

Context Engineering for Agents

LangChain

Why it matters

Defines the four-strategy framework: Write (persist outside the window), Select (pull relevant context in), Compress (summarize and trim), Isolate (separate contexts per agent). Clean taxonomy that maps directly to the patterns here.

How We Built a Multi-Agent Research System

Anthropic

Why it matters

Sub-agents with isolated contexts outperformed a single agent, using 15x more tokens total but producing higher quality output. Structured note-taking to NOTES.md for persistence across agent boundaries. The primary case study behind the Isolate pattern.

Code Execution with MCP: Building More Efficient Agents

Anthropic, Nov 2025

Why it matters

Explains why direct tool calling scales poorly as the number of connected MCP servers grows. Two failure modes: loading all tool definitions upfront bloats the context window, and passing intermediate results back through the model doubles token usage for operations like “fetch this document and attach it elsewhere.” The proposed solution is to expose MCP servers as code APIs in an execution environment rather than as direct tools. The agent writes code to interact with them, processes intermediate data outside the model, and only surfaces the final result. Concrete illustration of why Progressive Disclosure and selective context matter even at the tool-use layer.

ContextBench: A Benchmark for Context Retrieval in Coding Agents

Li et al., Feb 2026 · 1,136 tasks · 66 repos · 8 languages

Why it matters

Existing coding benchmarks measure whether agents solve the task. ContextBench measures whether they retrieved the right code context along the way. With human-annotated gold contexts and per-step metrics for recall, precision, and efficiency, it exposes a consistent failure mode: agents explore far more context than they actually put to work, and sophisticated scaffolding produces only marginal improvement in retrieval quality. That gap between explored and used context is where selective context strategies pay off.
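To make the explored-versus-used gap concrete, here is a small sketch of set-based recall and precision against a gold context. The `(file, line)` representation and the example file names are assumptions for illustration; the benchmark's own metric definitions may differ.

```python
def context_scores(retrieved: set, gold: set) -> dict:
    """Recall and precision of retrieved context against a gold context set."""
    hits = len(retrieved & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return {"recall": recall, "precision": precision}


# Gold context: 10 lines in one file (hypothetical example data).
gold = {("auth.py", n) for n in range(10, 20)}

# The agent explored 100 lines across two files to find them.
explored = {("auth.py", n) for n in range(0, 50)} | {("db.py", n) for n in range(0, 50)}

scores = context_scores(explored, gold)
print(scores)
```

Here the agent finds everything it needs (recall 1.0), but only one in ten lines it read was relevant (precision 0.1). That is the over-exploration pattern in miniature.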

Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

Yuan et al., Mar 2026 · ICLR MemAgents Workshop · 3x3 study

Why it matters

Controlled study isolating where memory-augmented agent failures actually happen. Crossing three write strategies against three retrieval methods, retrieval dominates: accuracy swings 20 points across retrieval methods but only 3-8 points across write strategies. Raw chunking with zero LLM calls matches or beats expensive fact extraction and summarization. Retrieval failure accounts for 11-46% of errors depending on config; utilization failure sits stable at 4-8% regardless. The implication for memory system design: fix retrieval before adding write-time complexity.

Recent Additions

ACC: Compiling Agent Trajectories for Long-Context Training

Su et al., May 2026

Turns agent traces into long-context training examples. Standard agent SFT masks tool responses, which means the model is trained to choose the next tool but not to integrate the evidence scattered across prior observations. ACC compiles those trajectories into QA pairs where the question and the collected tool outputs sit in one long context. The results are worth watching: Qwen3-30B-A3B trained with ACC gained +18.1 on MRCR and +7.6 on GraphWalks, approaching the much larger Qwen3-235B-A22B. The context engineering angle is direct: tool traces can become supervision data for long-range context use.

Code as Agent Harness

Ning et al., May 2026

Survey framing code as the operational substrate of agents. The paper organizes the space around harness interfaces, harness mechanisms, and multi-agent scaling, then treats planning, memory, tool use, environment modeling, and execution-based verification as pieces of the same harness problem. Useful because it names the engineering layer that contextpatterns.com keeps circling: durable state, executable tools, and verifiable feedback loops are how context moves out of prose prompts and into systems.

From History to State: Constant-Context Skill Learning for LLM Agents

Xie et al., May 2026

Moves recurring procedural context out of the prompt and into lightweight task-family modules. At inference time the agent conditions on the current observation plus a compact state block, rather than carrying a long ReAct history and repeated skill instructions. Across ALFWorld, WebShop, and SciWorld, the approach reduces prompt tokens per turn by 2-7x while matching or exceeding strong agent-training baselines. This is a useful counterpoint to ever-larger context windows: some context should not be compressed or retrieved; it should be learned away.

A Language for Describing Agentic LLM Contexts

Pelc, Kaminka, and Goldberg, May 2026 · CAIS '26

Introduces ACDL, a notation for describing how an agent’s context is assembled and how it changes across turns. Most context engineering is still explained with prose, screenshots, or source code inspection, all of which hide the actual dynamics. ACDL gives names to role message sequences, dynamic content, time-indexed references, conditional structure, and iterative context assembly. If the notation catches on, it could become the missing design artifact between prompt text and implementation code.

The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

Kim, May 2026

Argues that better software engineering agents need data that captures where engineering context actually comes from: human-human conversations, human-AI sessions, and the surrounding multi-week project work. Solo coding traces miss the product decisions, tradeoffs, and ambiguous requirements that senior engineers rely on. The useful point for context engineering is uncomfortable but probably right: if the relevant context is formed outside the agent session, no amount of transcript polishing inside the session will fully recover it.

An AI Agent Execution Environment to Safeguard User Data

Stanley et al., Apr 2026

GAAP applies information-flow control to agent execution, tracking how private data is accessed and where it may be disclosed across both single tasks and later tasks. The important context engineering lesson is that context access and data release cannot be governed by prompt instructions alone. If an agent can see private data, prompt injection can try to route it somewhere else. GAAP makes the permission model part of the execution environment, which is where this control belongs.
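A toy illustration of the general idea, not GAAP's actual design: values carry a privacy label, the label propagates when values are combined, and the execution environment checks the label at every sink. The `Labeled` type, the sink names, and the allow-list are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeled:
    """A value plus a label saying whether it derives from private data."""
    value: str
    private: bool = False


def combine(*parts: Labeled) -> Labeled:
    """Anything built from private data is itself private."""
    return Labeled("".join(p.value for p in parts), any(p.private for p in parts))


# Hypothetical policy: the only sink allowed to receive private data.
ALLOWED_PRIVATE_SINKS = {"user_inbox"}


def send(sink: str, data: Labeled) -> str:
    """Enforced by the environment, regardless of what the prompt says."""
    if data.private and sink not in ALLOWED_PRIVATE_SINKS:
        raise PermissionError(f"private data may not flow to {sink}")
    return f"sent {len(data.value)} chars to {sink}"


message = combine(Labeled("Account number: "), Labeled("0000-EXAMPLE", private=True))
print(send("user_inbox", message))
```

An injected instruction can still ask the agent to call `send("external_webhook", message)`, but the call raises `PermissionError` because the check sits outside the model.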

Scaling Managed Agents: Decoupling the Brain from the Hands

Anthropic, Apr 2026

Managed Agents splits the agent into a durable session log, a stateless harness, and one or more sandboxes or tools. The strongest line in the piece is that the session is not Claude’s context window: the full event stream persists outside the model, and the harness decides which slices or transformations enter the next call. That is Write Outside the Window turned into platform architecture. The reported p50 time-to-first-token drop of roughly 60%, and p95 drop of over 90%, also show that context architecture affects latency as much as answer quality.
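The split between a durable log and a stateless harness can be sketched in a few lines. This is an illustration under assumptions, not Anthropic's implementation: the event schema, the `build_context` name, and the "goal plus last few events" selection rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    """Durable event stream that lives outside the model's context window."""
    events: list = field(default_factory=list)

    def append(self, kind: str, text: str) -> None:
        self.events.append({"seq": len(self.events), "kind": kind, "text": text})


def build_context(log: SessionLog, max_events: int = 3) -> list:
    """Stateless harness step: pick which slice of the log enters the next call."""
    goal = [e for e in log.events if e["kind"] == "goal"][:1]
    recent = [e for e in log.events if e["kind"] != "goal"][-max_events:]
    return goal + recent


log = SessionLog()
log.append("goal", "Fix the failing checkout test")
for i in range(10):
    log.append("tool_result", f"output of step {i}")

context = build_context(log)
print([e["seq"] for e in context])
```

The log keeps all 11 events; the next model call sees 4. Because `build_context` holds no state of its own, any harness instance can resume the session from the log.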

An Update on Recent Claude Code Quality Reports

Anthropic, Apr 2026

A rare public postmortem of context management failures in a production coding agent. The most relevant bug cleared older thinking blocks after an idle session, but kept doing it on every subsequent turn, making Claude forget why it had chosen prior edits and tool calls. A separate system prompt change that forced very short text between tool calls caused a measurable quality drop. Useful because it shows that context engineering failures can come from cache optimizations, stale-session handling, or one line in a system prompt.

Context Degradation

Long Context, Less Focus: A Scaling Gap in LLMs

Gu, Feb 2026 · 377k evaluation questions

Large-scale benchmark (PAPerBench, ~29,000 instances across 1k to 256k tokens) with theoretical analysis of attention dilution under context scaling. Finds consistent performance degradation in both personalization and privacy as context length increases. The theoretical contribution matters: it proves this is an inherent limitation of soft attention in fixed-capacity Transformers, independent of training data. Reinforces Context Rot with a mechanistic explanation for why it happens.

LOCA-bench: Long-Running Agent Context Rot

Zeng et al. (HKUST-NLP), Feb 2026

First benchmark to test context degradation specifically in long-running agentic scenarios. NoLiMa tests single-step retrieval; LOCA-bench tests agents that explore, act, and accumulate context over time. Confirms that advanced context management strategies substantially improve success rates in these scenarios.

Mitigating Conversational Inertia in Multi-Turn Agents

Wan et al., Feb 2026

Attention analysis reveals “conversational inertia”: models develop strong diagonal attention to previous responses as conversation histories grow, causing them to over-weight the most recent turn at the expense of integrating the full context. Directly explains why long-running conversations degrade and why rolling summaries outperform verbatim history at scale. Proposed mitigation introduces contrastive demonstrations that reduce inertia without retraining.

Context-Bench

Letta

Benchmark specifically for agentic context engineering. Evaluates how well models chain file operations, trace entity relationships, and manage multi-step retrieval. See the live leaderboard for current rankings.

How Contexts Fail and How to Fix Them

Drew Breunig

Taxonomy of context failure modes: poisoning (hallucination enters context), distraction (context overwhelms training), confusion (superfluous context influences response), clash (parts disagree). Useful diagnostic framework when things go wrong.

Agent Memory & Retrieval

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Wang et al., Mar 2026

An indexed memory mechanism that compresses context without discarding evidence. Rather than lossy summarization, Memex maintains a compact working context of structured summaries with stable indices, while full-fidelity artifacts live in an external store. The agent learns when to dereference an index to recover exact past evidence. Trained via reinforcement learning with reward shaping for memory usage under a context budget. Directly implements several patterns at once: Write Outside the Window for the external store, Compress for the working context, and Progressive Disclosure for the index-then-retrieve loop.
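The index-then-dereference loop can be sketched as a small data structure. This is an assumption-laden illustration, not the paper's system: the class and method names are hypothetical, the summary is supplied by the caller, and the learned policy for when to dereference (the RL part) is not modeled.

```python
class IndexedMemory:
    """Compact working context of summaries, with full artifacts stored externally."""

    def __init__(self) -> None:
        self._store = {}   # external store: index -> full-fidelity artifact
        self.working = []  # working context: (index, short summary) pairs

    def write(self, artifact: str, summary: str) -> str:
        """Archive the artifact and keep only its summary in working context."""
        index = f"m{len(self._store)}"  # stable index, never reused
        self._store[index] = artifact
        self.working.append((index, summary))
        return index

    def render_context(self) -> str:
        """What the model would see: indices and summaries only."""
        return "\n".join(f"[{i}] {s}" for i, s in self.working)

    def dereference(self, index: str) -> str:
        """Recover the exact original evidence behind an index."""
        return self._store[index]


memory = IndexedMemory()
log_output = "ERROR timeout at step 7\n" + "DEBUG heartbeat ok\n" * 500
idx = memory.write(log_output, "test run failed: timeout at step 7")
context = memory.render_context()
print(context)
```

The working context stays one line long while the full log remains recoverable through `memory.dereference(idx)`, which is the difference from lossy summarization.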

Contextual Retrieval

Anthropic

The just-in-time approach to context: maintain lightweight identifiers, load data dynamically at runtime. Progressive disclosure rather than pre-loading. The basis for the Progressive Disclosure pattern.

MemGPT / Letta

Letta

Memory-first agent framework. Treats the LLM like an operating system kernel with managed memory blocks. The architecture that inspired the Write Outside the Window pattern: persistent context with size limits, labels, and access patterns, managed through system calls.

RECOR: Reasoning-Focused Multi-Turn Conversational Retrieval

arXiv, Jan 2026

Benchmark combining multi-turn conversation with reasoning-intensive retrieval, closer to real-world usage than benchmarks that treat the two separately. Highlights a gap in existing evaluation: systems tuned on single-turn retrieval benchmarks perform significantly worse when the conversation history must inform what to retrieve. Useful reference when designing context pipelines for dialogue-heavy applications.

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Mar 2026 · 2022-2026 coverage

Memory-focused survey organized around a write-manage-read loop and a three-dimensional taxonomy: temporal scope, representational substrate, and control policy. Covers five mechanism families from context-resident compression to policy-learned management. The evaluation section is the most useful part: it traces the shift from static recall benchmarks to multi-session agentic tests that interleave memory with decision-making, and identifies gaps current systems haven’t closed.

Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

Chen et al., May 2026

Treats compaction as a failure point inside the agent loop. Slipstream runs summary generation asynchronously while the original agent keeps working, then validates the candidate summary against the agent’s actual continued reasoning. That gives compaction an independent signal instead of trusting the summary to preserve future-relevant facts. Across SWE-bench Verified and BrowseComp, the paper reports up to +8.8 percentage points task accuracy and up to 39.7% lower end-to-end latency. Useful evidence for Compress and Context Handoff: judge summaries by whether the next step still has the facts and intent it needs.

State Contamination in Memory-Augmented LLM Agents

Wang et al., May 2026

Introduces memory laundering: toxic or adversarial context gets compressed into memory summaries that look safe to ordinary detectors but still shape later behavior. The state-channel framing matters more than the toxicity domain. Raw transcript reuse carries overt contamination; compressed memory can carry hidden, sub-threshold influence. The mitigation result is practical: sanitize unsafe state before summarization, because cleaning only the completed summary can leave the influence intact. Good source for the safety side of persistent memory and compression.

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

Srivastava, May 2026

Asks a better memory retrieval question: does this memory causally improve the next answer? CMI evaluates candidate memories under controlled interventions, then selects context that improves the response while suppressing irrelevant, stale, or harmful memories. The paper introduces Causal-LoCoMo, with useful memories, distractors, and synthetic harmful memories, and compares against vector, graph, reflection, summary, full-history, and no-memory baselines. Useful because it sharpens Select for persistent memory: semantic similarity is not enough when the chosen memory can actively mislead the agent.
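A toy version of the selection rule, to show how it differs from similarity search. Everything here is hypothetical: `quality` stands in for whatever evaluator scores an answer produced with a given context, and the sketch tests one memory at a time, which is simpler than the controlled interventions the paper describes.

```python
def select_by_intervention(candidates: list, quality) -> list:
    """Keep a memory only if adding it improves the score over no memory."""
    baseline = quality([])
    kept = []
    for memory in candidates:
        gain = quality([memory]) - baseline
        if gain > 0:
            kept.append(memory)
    return kept


def toy_quality(context: list) -> int:
    """Stand-in evaluator: rewards the current fact, penalizes the stale one."""
    score = 0
    for memory in context:
        if "deploys are frozen on Fridays" in memory:
            score += 1
        if "deploys are allowed any day" in memory:
            score -= 1
    return score


candidates = [
    "2026 policy: deploys are frozen on Fridays",
    "2024 policy: deploys are allowed any day",  # similar topic, but stale
    "The office coffee machine was replaced",    # irrelevant
]
selected = select_by_intervention(candidates, toy_quality)
print(selected)
```

A vector search for "deploy policy" would likely rank both policy memories highly. The intervention test keeps only the one that helps.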

Coding Agents & Harnesses

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful?

Gloaguen et al. (ETH Zurich), Feb 2026

Counterintuitive finding: across multiple coding agents and LLMs, repository context files (AGENTS.md, .cursorrules) tend to reduce task success rates compared to providing no context, while increasing inference cost by over 20%. Overly detailed context files encourage broader exploration but make tasks harder through unnecessary requirements. The conclusion aligns with Select, Don’t Dump: a few targeted requirements outperform a detailed documentation dump. Human-written context files should describe only minimal requirements.

Structured Context Engineering for File-Native Agentic Systems

McMillan, Feb 2026 · 9,649 experiments

The largest empirical study on context format and structure to date. Tested 11 models across 4 formats and schemas up to 10,000 tables. Found that format matters less than model capability (a 21-percentage-point gap between frontier and open-source models) and that novel compact formats can incur a “grep tax” where the model spends extra tokens trying to parse unfamiliar structures.

Coding Agents are Effective Long-Context Processors

Cao et al., Mar 2026 · 5 benchmarks · 188K-3T tokens

Takes a different approach to the long-context problem: instead of larger windows or retrieval pipelines, let a coding agent work through the corpus as a file system. Using grep, terminal commands, Python scripts, and intermediate files, off-the-shelf coding agents beat published state-of-the-art by 17.3% on average across five benchmarks spanning 188K to three trillion tokens. The counterintuitive secondary finding: adding explicit retrieval tools to the agent did not help and sometimes degraded performance. Native tool proficiency and file system familiarity covered the retrieval function without a dedicated layer.

Context Engineering for Coding Agents

Birgitta Böckeler (Thoughtworks), Feb 2026

Practitioner walkthrough of context configuration surfaces across Claude Code, Cursor, and Windsurf. Covers AGENTS.md, .cursorrules, .windsurfrules, memory hooks, and MCP servers as context engineering levers. Particularly useful for the distinction between static context (config files) and dynamic context (memory, tool output). Published on martinfowler.com, which gives it reach beyond the AI-specialist audience.

Effective Harnesses for Long-Running Agents

Anthropic, Nov 2025

Detailed account of why context compaction alone is not enough when agents work across multiple sessions. Anthropic’s solution centers on two engineered handoff mechanisms: an initializer agent that sets up the environment on the first run, and a coding agent that makes incremental progress and leaves structured artifacts for the next session. The specific mechanisms are instructive: a progress log for session handoff, a feature list in JSON to prevent premature “done” declarations, git commits as recovery checkpoints, and explicit verification loops before starting new work. The pattern is borrowed directly from how humans hand off engineering work across shifts.
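The feature-list mechanism is easy to sketch. The schema below (`id` and `passes` fields) and the function names are assumptions for illustration, not the format from the article; the point is that "done" is decided by a file on disk rather than by the agent's own impression of its progress.

```python
import json

# Hypothetical feature list that an initializer agent might write on the first run.
FEATURES_JSON = """
[
  {"id": "login", "passes": true},
  {"id": "password-reset", "passes": false},
  {"id": "audit-log", "passes": false}
]
"""


def remaining_features(raw: str) -> list:
    """Features that have not yet been verified as passing."""
    return [f["id"] for f in json.loads(raw) if not f["passes"]]


def may_declare_done(raw: str) -> bool:
    """The session may report completion only when nothing remains."""
    return not remaining_features(raw)


todo = remaining_features(FEATURES_JSON)
print(todo, may_declare_done(FEATURES_JSON))
```

A new session can read the same file to pick its next task, so the list doubles as a handoff artifact.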

Harness Design for Long-Running Application Development

Anthropic, Mar 2026

Shows how Anthropic moved from a session-reset harness for Sonnet 4.5 to a simpler continuous-session harness for Opus 4.6, because the newer model no longer showed the same context anxiety. The practical lesson is good: harnesses encode assumptions about model weaknesses, and those assumptions go stale. It also makes a clean case for separating generator and evaluator agents, with sprint contracts and Playwright-based QA turning vague product quality into inspectable feedback.

Harness Engineering

Birgitta Böckeler (Thoughtworks), Feb 2026

Synthesis piece on the OpenAI team’s five-month experiment building a codebase maintained entirely by AI agents. Böckeler dissects their approach into three interlocking parts: context engineering (a continuously updated knowledge base plus dynamic runtime context), architectural constraints enforced both by agents and deterministic linters, and periodic cleanup agents that fight entropy. The useful claim is that context quality cannot be separated from code structure and maintenance loops. You cannot engineer context in isolation and expect it to hold.

Agentic Harness Engineering

Lin et al., Apr 2026, revised May 2026

Turns harness editing into a closed loop rather than a manual prompt-tweaking exercise. AHE gives editable harness components file-level representation, distills trajectory evidence into a drill-down corpus, and pairs every proposed edit with a predicted effect that can be checked later. Ten iterations lift Terminal-Bench 2 pass@1 from 69.7% to 77.0%, above the human-designed Codex-CLI result of 71.9%. The ablation is the useful part: tools, middleware, and long-term memory drive the gains; the system prompt does not.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Ren et al., May 2026

Moves coding-agent evaluation away from localized edits and toward messy full-stack work. SaaSBench has 30 tasks across 6 SaaS domains, 5,370 validation nodes, 8 programming languages, 6 databases, and 13 frameworks. The headline finding fits the page well: over 95% of failures happen before agents reach deep business logic, usually during setup, integration, or debugging loops. That makes context engineering for coding agents broader than source selection; environment setup, dependencies, and validation surfaces are part of the context system.

SWE-grep and Fast Context

Cognition, Oct 2025

One of the clearest practitioner writeups on coding-agent context retrieval. Cognition reports that agent trajectories were spending more than 60% of the first turn retrieving context, then describes Fast Context as a specialized subagent that returns files and line ranges instead of free-form summaries. The mechanics are concrete: up to 8 parallel tool calls per turn, a maximum of 4 turns, custom grep/read/glob tools, and an RL reward over weighted file and line F1. It is a production answer to ContextBench’s over-exploration problem.
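A reward over file and line F1 might look something like the sketch below. The exact weighting is not specified in the summary above, so the `beta` value (which favors precision when below 1) and the equal averaging of file and line scores are assumptions, as are all function names.

```python
def f_beta(predicted: set, gold: set, beta: float = 0.5) -> float:
    """F-beta score over two sets; beta < 1 weights precision above recall."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def expand(ranges: set) -> set:
    """Turn inclusive (file, start, end) ranges into a set of (file, line) pairs."""
    return {(f, n) for f, start, end in ranges for n in range(start, end + 1)}


def retrieval_reward(pred_ranges: set, gold_ranges: set) -> float:
    """Average of file-level and line-level scores (equal weights assumed)."""
    pred_lines, gold_lines = expand(pred_ranges), expand(gold_ranges)
    file_score = f_beta({f for f, _ in pred_lines}, {f for f, _ in gold_lines})
    line_score = f_beta(pred_lines, gold_lines)
    return (file_score + line_score) / 2


# Right file, but only half of the returned lines overlap the gold range.
reward = retrieval_reward({("cart.py", 1, 10)}, {("cart.py", 6, 15)})
print(reward)
```

In this example the file score is 1.0 and the line score is 0.5, giving a reward of 0.75. Scoring line ranges as well as files is what pushes a retriever toward tight spans instead of whole-file dumps.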

Infrastructure & Tools

Model Context Protocol (MCP)

Anthropic

Standardized protocol for context retrieval. “USB-C for AI.” Adopted by Block, OpenAI, Microsoft. Enables dynamic, information-rich environments rather than static prompts. The protocol layer that makes Progressive Disclosure practical at scale.

Context Lens

Open Source

Framework-agnostic proxy that intercepts LLM API calls and visualizes context window composition in real time. See what your AI actually sees.

Agent Trace

Open specification, Jan 2026

Open specification for recording AI contributions in version-controlled codebases. Agent Trace tracks files, line ranges, conversations, model identifiers, related resources, and VCS revisions. The context engineering angle is practical: future agents need more than a git diff when they revisit a change; they need the session and reasoning trail behind it. This turns Write Outside the Window into an ecosystem-level artifact, where code provenance becomes retrievable context rather than only authorship metadata.

Case Studies

Context Engineering Case Studies: Etsy-Specific Q&A

Etsy Engineering

How Etsy reduced hallucinations in company-specific question answering through explicit instructions and relevant contextual information. Practical example of the Pyramid pattern applied to enterprise knowledge retrieval.

Codified Context: Infrastructure for AI Agents in a Complex Codebase

Vasilopoulos, Feb 2026 · 108,000 lines · 283 sessions

Detailed account of context infrastructure built alongside a 108,000-line C# system over 283 development sessions. The architecture splits into three layers: a hot-memory constitution encoding conventions and retrieval hooks, 19 specialized domain agents, and a cold-memory store of 34 on-demand spec documents. Session-level metrics trace how the infrastructure grew and where it prevented failures. The hot/cold memory split is a direct implementation of Write Outside the Window in a long-running production codebase.

Field Maps

A Survey of Context Engineering for Large Language Models

Mei et al., Jul 2025 · 1,400+ papers

Wide-coverage survey treating context engineering as a formal discipline. Breaks the field into retrieval, generation, processing, and management, then maps how those combine into RAG, memory systems, tool-integrated reasoning, and multi-agent architectures. Best used as a map of the space rather than a source for specific empirical claims. Also surfaces a structural asymmetry the individual benchmarks miss: models handle complex input contexts well but struggle to produce equivalently complex outputs.

The Future of Context Engineering

Mart van der Jagt, Mar 2026

Maps the three core LLM limitations (context window, reasoning, memory) against the five mechanisms the brain evolved to work with limited working memory: selective attention, chunking, associative retrieval, cognitive offloading, and learning consolidation. The central argument: the brain never evolved bigger working memory; it evolved sharper attention and better retrieval. Worth reading for the framework it offers on which limitations will yield to further scaling and which will need architectural innovation to crack.

LLM-Oriented Information Retrieval

Dai et al., May 2026 · SIGIR 2026

Good field map for retrieval as context engineering. The paper argues that modern IR is increasingly consumed by LLMs rather than humans, which changes the failure mode: irrelevant or misleading results become direct inputs to hallucination and reasoning failure. Its useful vocabulary is “usable evidence density and verifiability within a context window.” That lines up with Select and Progressive Disclosure better than generic RAG advice, because it treats denoising as the central retrieval problem rather than a ranking nicety.