Recent Additions
Su et al., May 2026
Turns agent traces into long-context training examples. Standard agent SFT masks tool responses, which means the model is trained to choose the next tool but not to integrate the evidence scattered across prior observations. ACC compiles those trajectories into QA pairs where the question and the collected tool outputs sit in one long context. The results are worth watching: Qwen3-30B-A3B trained with ACC improved by 18.1 on MRCR and 7.6 on GraphWalks, approaching the much larger Qwen3-235B-A22B. The context engineering angle is direct: tool traces can become supervision data for long-range context use.
Ning et al., May 2026
Survey framing code as the operational substrate of agents. The paper organizes the space around harness interfaces, harness mechanisms, and multi-agent scaling, then treats planning, memory, tool use, environment modeling, and execution-based verification as pieces of the same harness problem. Useful because it names the engineering layer that contextpatterns.com keeps circling: durable state, executable tools, and verifiable feedback loops are how context moves out of prose prompts and into systems.
Xie et al., May 2026
Moves recurring procedural context out of the prompt and into lightweight task-family modules. At inference time the agent conditions on the current observation plus a compact state block, rather than carrying a long ReAct history and repeated skill instructions. Across ALFWorld, WebShop, and SciWorld, the approach reduces prompt tokens per turn by 2-7x while matching or exceeding strong agent-training baselines. This is a useful counterpoint to ever-larger context windows: some context should not be compressed or retrieved; it should be learned away.
Pelc, Kaminka, and Goldberg, May 2026 · CAIS '26
Introduces ACDL, a notation for describing how an agent’s context is assembled and how it changes across turns. Most context engineering is still explained with prose, screenshots, or source code inspection, all of which hide the actual dynamics. ACDL gives names to role message sequences, dynamic content, time-indexed references, conditional structure, and iterative context assembly. If the notation catches on, it could become the missing design artifact between prompt text and implementation code.
Kim, May 2026
Argues that better software engineering agents need data that captures where engineering context actually comes from: human-human conversations, human-AI sessions, and the surrounding multi-week project work. Solo coding traces miss the product decisions, tradeoffs, and ambiguous requirements that senior engineers rely on. The useful point for context engineering is uncomfortable but probably right: if the relevant context is formed outside the agent session, no amount of transcript polishing inside the session will fully recover it.
Stanley et al., Apr 2026
GAAP applies information-flow control to agent execution, tracking how private data is accessed and where it may be disclosed across both single tasks and later tasks. The important context engineering lesson is that context access and data release cannot be governed by prompt instructions alone. If an agent can see private data, prompt injection can try to route it somewhere else. GAAP makes the permission model part of the execution environment, which is where this control belongs.
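A minimal sketch of the general information-flow idea, not GAAP's actual design: values carry labels, labels propagate when values are combined, and the release check lives in code rather than in a prompt. The names `Labeled`, `combine`, and `release`, and the `"private"` label, are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Labeled:
    """A value plus the sensitivity labels it has picked up (hypothetical type)."""
    value: str
    labels: FrozenSet[str] = frozenset()


def combine(*parts: Labeled) -> Labeled:
    # Labels propagate: the result is as sensitive as its most sensitive input.
    labels = frozenset().union(*(p.labels for p in parts))
    return Labeled(" ".join(p.value for p in parts), labels)


def release(data: Labeled, sink_allows: FrozenSet[str]) -> str:
    # The check runs in the execution environment, so a prompt injection
    # cannot talk its way past it.
    blocked = data.labels - sink_allows
    if blocked:
        raise PermissionError(f"sink may not receive labels: {sorted(blocked)}")
    return data.value
```

With this shape, a message built from private data keeps the `"private"` label, and any sink that does not allow that label refuses it regardless of what the model was told to do.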
Anthropic, Apr 2026
Managed Agents splits the agent into a durable session log, a stateless harness, and one or more sandboxes or tools. The strongest line in the piece is that the session is not Claude’s context window: the full event stream persists outside the model, and the harness decides which slices or transformations enter the next call. That is Write Outside the Window turned into platform architecture. The reported p50 time-to-first-token drop of roughly 60%, and p95 drop of over 90%, also show that context architecture affects latency as well as answer quality.
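The split between a durable log and a stateless harness can be sketched as follows. This is an illustration under assumptions, not Anthropic's implementation: `Event`, `SessionLog`, `build_context`, and the character budget are hypothetical stand-ins, and a real harness would count tokens and could transform events rather than only slicing them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    kind: str  # e.g. "user", "assistant", "tool_result" (hypothetical kinds)
    text: str


@dataclass
class SessionLog:
    """Append-only record that lives outside the model's context window."""
    events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)


def build_context(log: SessionLog, max_chars: int) -> List[Event]:
    """Stateless selection: newest events that fit the budget, oldest first."""
    picked: List[Event] = []
    used = 0
    for event in reversed(log.events):
        if used + len(event.text) > max_chars:
            break
        picked.append(event)
        used += len(event.text)
    picked.reverse()
    return picked
```

The point of the shape is that `build_context` never mutates the log: dropping an event from one call does not lose it for later calls.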
Anthropic, Apr 2026
A rare public postmortem of context management failures in a production coding agent. The most relevant bug cleared older thinking blocks after an idle session, but kept doing it on every subsequent turn, making Claude forget why it had chosen prior edits and tool calls. A separate system prompt change that forced very short text between tool calls caused a measurable quality drop. Useful because it proves context engineering failures can come from cache optimizations, stale-session handling, or one line in a system prompt.
Context Degradation
Gu, Feb 2026 · 377k evaluation questions
Large-scale benchmark (PAPerBench, ~29,000 instances across 1k to 256k tokens) with theoretical analysis of attention dilution under context scaling. Finds consistent performance degradation in both personalization and privacy as context length increases. The theoretical contribution matters: it proves that this is an inherent limitation of soft attention in fixed-capacity Transformers, independent of training data. Reinforces Context Rot with a mechanistic explanation for why it happens.
Zeng et al. (HKUST-NLP), Feb 2026
First benchmark to test context degradation specifically in long-running agentic scenarios. NoLiMa tests single-step retrieval; LOCA-bench tests agents that explore, act, and accumulate context over time. Confirms that advanced context management strategies substantially improve success rates in these scenarios.
Wan et al., Feb 2026
Attention analysis reveals “conversational inertia”: models develop strong diagonal attention to previous responses as conversation histories grow, causing them to over-weight the most recent turn at the expense of integrating the full context. Directly explains why long-running conversations degrade and why rolling summaries outperform verbatim history at scale. The proposed mitigation uses contrastive demonstrations that reduce inertia without retraining.
Letta
Benchmark specifically for agentic context engineering. Evaluates how well models chain file operations, trace entity relationships, and manage multi-step retrieval. See the live leaderboard for current rankings.
Drew Breunig
Taxonomy of context failure modes: poisoning (hallucination enters context), distraction (context overwhelms training), confusion (superfluous context influences response), clash (parts disagree). Useful diagnostic framework when things go wrong.
Agent Memory & Retrieval
Wang et al., Mar 2026
An indexed memory mechanism that compresses context without discarding evidence. Rather than lossy summarization, Memex maintains a compact working context of structured summaries with stable indices, while full-fidelity artifacts live in an external store. The agent learns when to dereference an index to recover exact past evidence. Trained via reinforcement learning with reward shaping for memory usage under a context budget. Directly implements several patterns at once: Write Outside the Window for the external store, Compress for the working context, and Progressive Disclosure for the index-then-retrieve loop.
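The index-then-dereference structure can be sketched in a few lines. This is a toy stand-in, not Memex itself: the class `IndexedMemory`, its method names, and the `m0`, `m1` index scheme are hypothetical, the external store is just a dict, and the learned policy for when to dereference is not modeled.

```python
from typing import Dict


class IndexedMemory:
    """Compact summaries in the working context, full artifacts outside it."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}      # full-fidelity artifacts (external store stand-in)
        self._summaries: Dict[str, str] = {}  # what the model actually sees

    def write(self, artifact: str, summary: str) -> str:
        # Indices are stable: once assigned, an index always names the same artifact.
        index = f"m{len(self._store)}"
        self._store[index] = artifact
        self._summaries[index] = summary
        return index

    def working_context(self) -> str:
        # One short line per memory, each prefixed with its index.
        return "\n".join(f"[{i}] {s}" for i, s in self._summaries.items())

    def dereference(self, index: str) -> str:
        # Recover the exact original evidence, not a paraphrase of it.
        return self._store[index]
```

Because the summary and the artifact share an index, compression stops being lossy in the way that matters: the working context shrinks, but nothing is unrecoverable.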
Anthropic
The just-in-time approach to context: maintain lightweight identifiers, load data dynamically at runtime. Progressive disclosure rather than pre-loading. The basis for the Progressive Disclosure pattern.
Letta
Memory-first agent framework. Treats the LLM like an operating system kernel with managed memory blocks. The architecture that inspired the Write Outside the Window pattern: persistent context with size limits, labels, and access patterns, managed through system calls.
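A rough sketch of what a labeled, size-limited memory block might look like. This is not Letta's API: `MemoryBlock`, its fields, and the character-based limit are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryBlock:
    """A named slice of persistent context with a hard size limit (hypothetical)."""
    label: str
    limit: int  # maximum characters; a real system would more likely count tokens
    value: str = ""

    def append(self, text: str) -> bool:
        # Reject writes that would exceed the limit instead of silently truncating,
        # so the caller has to decide what to evict or compress.
        if len(self.value) + len(text) > self.limit:
            return False
        self.value += text
        return True

    def replace(self, old: str, new: str) -> bool:
        # Edit in place, subject to the same limit.
        if old not in self.value:
            return False
        updated = self.value.replace(old, new, 1)
        if len(updated) > self.limit:
            return False
        self.value = updated
        return True
```

The refusal on overflow is the design choice worth noticing: a limit that fails loudly forces memory management to be an explicit step rather than an accident.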
arXiv, Jan 2026
Benchmark combining multi-turn conversation with reasoning-intensive retrieval, closer to real-world usage than benchmarks that treat the two separately. Highlights a gap in existing evaluation: systems tuned on single-turn retrieval benchmarks perform significantly worse when the conversation history must inform what to retrieve. Useful reference when designing context pipelines for dialogue-heavy applications.
Mar 2026 · 2022-2026 coverage
Memory-focused survey organized around a write-manage-read loop and a three-dimensional taxonomy: temporal scope, representational substrate, and control policy. Covers five mechanism families from context-resident compression to policy-learned management. The evaluation section is the most useful part: it traces the shift from static recall benchmarks to multi-session agentic tests that interleave memory with decision-making, and identifies gaps current systems haven’t closed.
Chen et al., May 2026
Treats compaction as a failure point inside the agent loop. Slipstream runs summary generation asynchronously while the original agent keeps working, then validates the candidate summary against the agent’s actual continued reasoning. That gives compaction an independent signal instead of trusting the summary to preserve future-relevant facts. Across SWE-bench Verified and BrowseComp, the paper reports up to +8.8 percentage points task accuracy and up to 39.7% lower end-to-end latency. Useful evidence for Compress and Context Handoff: judge summaries by whether the next step still has the facts and intent it needs.
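The validation idea can be illustrated with a deliberately simple check: which facts did the agent keep using after the compaction point, and did the candidate summary retain them? This is a sketch under assumptions, not Slipstream's method; `missing_facts`, the tracked-fact list, and substring matching are all hypothetical simplifications, and the asynchronous part is omitted.

```python
from typing import List


def missing_facts(summary: str, later_steps: List[str], tracked_facts: List[str]) -> List[str]:
    """Facts the agent used after the compaction point that the summary dropped."""
    # A fact counts as "used" if it appears in any step the agent took
    # while the summary was being generated.
    used = [f for f in tracked_facts if any(f in step for step in later_steps)]
    # A candidate summary fails validation for every used fact it omits.
    return [f for f in used if f not in summary]


tracked = ["port 8080", "flaky test_login", "Python 3.11"]
candidate = "Service runs on port 8080. Using Python 3.11."
continued = ["Re-running flaky test_login with retries", "Checking port 8080 is free"]

dropped = missing_facts(candidate, continued, tracked)
```

Here `dropped` contains `"flaky test_login"`: the summary looked reasonable on its own, but the agent's actual next steps show it lost something that mattered.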
Wang et al., May 2026
Introduces memory laundering: toxic or adversarial context gets compressed into memory summaries that look safe to ordinary detectors but still shape later behavior. The state-channel framing matters more than the toxicity domain. Raw transcript reuse carries overt contamination; compressed memory can carry hidden, sub-threshold influence. The mitigation result is practical: sanitize unsafe state before summarization, because cleaning only the completed summary can leave the influence intact. Good source for the safety side of persistent memory and compression.
Srivastava, May 2026
Asks a better memory retrieval question: does this memory causally improve the next answer? CMI evaluates candidate memories under controlled interventions, then selects context that improves the response while suppressing irrelevant, stale, or harmful memories. The paper introduces Causal-LoCoMo, with useful memories, distractors, and synthetic harmful memories, and compares against vector, graph, reflection, summary, full-history, and no-memory baselines. Useful because it sharpens Select for persistent memory: semantic similarity is not enough when the chosen memory can actively mislead the agent.
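A toy version of intervention-based selection, assuming a scoring function is available: compare the answer quality with and without each candidate memory and keep only those that help. This is not the paper's algorithm; `select_by_intervention`, `toy_score`, and the example memories are hypothetical, and testing memories one at a time ignores interactions between them.

```python
from typing import Callable, List


def select_by_intervention(
    query: str,
    candidates: List[str],
    score: Callable[[str, List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Keep memories whose inclusion raises the score above the no-memory baseline."""
    baseline = score(query, [])
    selected = []
    for memory in candidates:
        gain = score(query, [memory]) - baseline
        if gain > min_gain:
            selected.append(memory)
    return selected


def toy_score(query: str, memories: List[str]) -> float:
    # Stand-in for a real evaluator: the current fact helps, the stale fact hurts.
    value = 0.5
    for m in memories:
        if "Lisbon" in m:
            value += 0.4
        if "Paris" in m:
            value -= 0.3
    return value


memories = ["User moved to Lisbon in 2025", "User lives in Paris", "User likes jazz"]
chosen = select_by_intervention("Where should we ship the order?", memories, toy_score)
```

All three memories would look plausible to a similarity search, but only the first survives: the stale one lowers the score and the irrelevant one leaves it unchanged.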
Coding Agents & Harnesses
Gloaguen et al. (ETH Zurich), Feb 2026
Counterintuitive finding: across multiple coding agents and LLMs, repository context files (AGENTS.md, .cursorrules) tend to reduce task success rates compared to providing no context, while increasing inference cost by over 20%. Overly detailed context files encourage broader exploration but make tasks harder through unnecessary requirements. The conclusion aligns with Select, Don’t Dump: a few targeted requirements outperform a detailed documentation dump. Human-written context files should describe only minimal requirements.
McMillan, Feb 2026 · 9,649 experiments
The largest empirical study on context format and structure to date. Tested 11 models across 4 formats and schemas up to 10,000 tables. Found that format matters less than model capability (21 percentage point gap between frontier and open-source models) and that novel compact formats can incur a “grep tax” where the model spends extra tokens trying to parse unfamiliar structures.
Cao et al., Mar 2026 · 5 benchmarks · 188K-3T tokens
Takes a different approach to the long-context problem: instead of larger windows or retrieval pipelines, let a coding agent work through the corpus as a file system. Using grep, terminal commands, Python scripts, and intermediate files, off-the-shelf coding agents beat published state-of-the-art by 17.3% on average across five benchmarks spanning 188K to three trillion tokens. The counterintuitive secondary finding: adding explicit retrieval tools to the agent did not help and sometimes degraded performance. Native tool proficiency and file system familiarity covered the retrieval function without a dedicated layer.
Birgitta Böckeler (Thoughtworks), Feb 2026
Practitioner walkthrough of context configuration surfaces across Claude Code, Cursor, and Windsurf. Covers AGENTS.md, .cursorrules, .windsurfrules, memory hooks, and MCP servers as context engineering levers. Particularly useful for the distinction between static context (config files) and dynamic context (memory, tool output). Published on martinfowler.com, which gives it reach beyond the AI-specialist audience.
Anthropic, Nov 2025
Detailed account of why context compaction alone is not enough when agents work across multiple sessions. Anthropic’s solution centers on two engineered handoff mechanisms: an initializer agent that sets up the environment on the first run, and a coding agent that makes incremental progress and leaves structured artifacts for the next session. The specific mechanisms are instructive: a progress log for session handoff, a feature list in JSON to prevent premature “done” declarations, git commits as recovery checkpoints, and explicit verification loops before starting new work. The pattern is borrowed directly from how humans hand off engineering work across shifts.
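The feature-list mechanism is easy to picture as a small gate over a JSON file. The sketch below assumes a schema the source does not specify: the field names `id`, `description`, and `passes`, and the helper names, are hypothetical.

```python
import json

# Hypothetical feature list; in practice this would live in a file in the repo.
FEATURES_JSON = """
[
  {"id": "login", "description": "User can log in", "passes": true},
  {"id": "export", "description": "User can export CSV", "passes": false}
]
"""


def remaining(features: list) -> list:
    """Feature ids that have not yet been verified as passing."""
    return [f["id"] for f in features if not f["passes"]]


def is_done(features: list) -> bool:
    # "Done" is a property of the list, not a judgment the agent makes mid-session.
    return not remaining(features)


features = json.loads(FEATURES_JSON)
todo = remaining(features)
```

A new session that reads this file sees `export` still outstanding, so it cannot declare the project finished just because the previous session's transcript sounded complete.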
Anthropic, Mar 2026
Shows how Anthropic moved from a session-reset harness for Sonnet 4.5 to a simpler continuous-session harness for Opus 4.6, because the newer model no longer showed the same context anxiety. The practical lesson is good: harnesses encode assumptions about model weaknesses, and those assumptions go stale. It also makes a clean case for separating generator and evaluator agents, with sprint contracts and Playwright-based QA turning vague product quality into inspectable feedback.
Birgitta Böckeler (Thoughtworks), Feb 2026
Synthesis piece on the OpenAI team’s five-month experiment building a codebase maintained entirely by AI agents. Böckeler dissects their approach into three interlocking parts: context engineering (a continuously updated knowledge base plus dynamic runtime context), architectural constraints enforced by both agents and deterministic linters, and periodic cleanup agents that fight entropy. The useful claim is that context quality cannot be separated from code structure and maintenance loops. You cannot engineer context in isolation and expect it to hold.
Lin et al., Apr 2026, revised May 2026
Turns harness editing into a closed loop rather than a manual prompt-tweaking exercise. AHE gives editable harness components file-level representation, distills trajectory evidence into a drill-down corpus, and pairs every proposed edit with a predicted effect that can be checked later. Ten iterations lift Terminal-Bench 2 pass@1 from 69.7% to 77.0%, above the human-designed Codex-CLI result of 71.9%. The ablation is the useful part: tools, middleware, and long-term memory drive the gains; the system prompt does not.
Ren et al., May 2026
Moves coding-agent evaluation away from localized edits and toward messy full-stack work. SaaSBench has 30 tasks across 6 SaaS domains, 5,370 validation nodes, 8 programming languages, 6 databases, and 13 frameworks. The headline finding fits the page well: over 95% of failures happen before agents reach deep business logic, usually during setup, integration, or debugging loops. That makes context engineering for coding agents broader than source selection; environment setup, dependencies, and validation surfaces are part of the context system.
Cognition, Oct 2025
One of the clearest practitioner writeups on coding-agent context retrieval. Cognition reports that agent trajectories were spending more than 60% of the first turn retrieving context, then describes Fast Context as a specialized subagent that returns files and line ranges instead of free-form summaries. The mechanics are concrete: up to 8 parallel tool calls per turn, a maximum of 4 turns, custom grep/read/glob tools, and an RL reward over weighted file and line F1. It is a production answer to ContextBench’s over-exploration problem.
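A reward over weighted file and line F1 might look roughly like this. The source does not give Cognition's weights or exact formula, so the equal 0.5 weights, the `(path, line_number)` representation, and the function names are assumptions.

```python
from typing import Set, Tuple

Line = Tuple[str, int]  # (file path, line number)


def f1(predicted: Set, relevant: Set) -> float:
    """Standard F1 over two sets; 0.0 when there is no overlap."""
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


def retrieval_reward(
    predicted: Set[Line],
    relevant: Set[Line],
    file_weight: float = 0.5,  # hypothetical weighting
    line_weight: float = 0.5,  # hypothetical weighting
) -> float:
    # File-level credit rewards finding the right files at all;
    # line-level credit rewards returning tight ranges within them.
    predicted_files = {path for path, _ in predicted}
    relevant_files = {path for path, _ in relevant}
    return (
        file_weight * f1(predicted_files, relevant_files)
        + line_weight * f1(predicted, relevant)
    )
```

Scoring both levels penalizes over-exploration directly: returning an extra irrelevant file lowers file precision, and returning whole files instead of ranges lowers line precision.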
Field Maps
Mei et al., Jul 2025 · 1,400+ papers
Wide-coverage survey treating context engineering as a formal discipline. Breaks the field into retrieval, generation, processing, and management, then maps how those combine into RAG, memory systems, tool-integrated reasoning, and multi-agent architectures. Best used as a map of the space rather than a source for specific empirical claims. Also surfaces a structural asymmetry the individual benchmarks miss: models handle complex input contexts well but struggle to produce equivalently complex outputs.
Mart van der Jagt, Mar 2026
Maps the three core LLM limitations (context window, reasoning, memory) against the five mechanisms the brain evolved to work with limited working memory: selective attention, chunking, associative retrieval, cognitive offloading, and learning consolidation. The central argument: the brain never evolved bigger working memory; it evolved sharper attention and better retrieval. Worth reading for the framework it offers on which limitations will yield to further scaling and which will need architectural innovation to crack.
Dai et al., May 2026 · SIGIR 2026
Good field map for retrieval as context engineering. The paper argues that modern IR is increasingly consumed by LLMs rather than humans, which changes the failure mode: irrelevant or misleading results become direct inputs to hallucination and reasoning failure. Its useful vocabulary is “usable evidence density and verifiability within a context window.” That lines up with Select and Progressive Disclosure better than generic RAG advice, because it treats denoising as the central retrieval problem rather than a ranking nicety.