Every LLM has a finite context window. Claude’s is 200K tokens, which sounds enormous — until an agent has been running for 20 minutes, making file reads, running commands, and accumulating conversation history. Without management, the context window fills up and the agent dies.
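This growth can be made concrete with a toy accumulation model. The starting size and the ~7K-tokens-per-turn figure below are illustrative assumptions fitted to the numbers in this chapter, not measurements:

```typescript
// Find the first turn at which accumulated context exceeds the window.
// startTokens: system prompt + first user message; perTurnTokens: average
// tokens added per turn by assistant messages and tool results.
function firstFailingTurn(
  windowTokens: number,
  startTokens: number,
  perTurnTokens: number,
): number {
  let total = startTokens;
  let turn = 1;
  while (total <= windowTokens) {
    turn++;
    total += perTurnTokens;
  }
  return turn;
}

// Starting at 4.5K and adding ~7K per turn, a 200K window is exhausted
// at turn 29.
const failsAt = firstFailingTurn(200_000, 4_500, 7_000); // 29
```

With these assumptions the session dies around turn 29, consistent with the danger zone at turn 25 and failure by turn 30 shown below.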
```
Turn 1:  System prompt (4K) + User message (0.5K) = 4.5K tokens
Turn 5:  + 5 assistant messages + 3 tool results  = 28K tokens
Turn 15: + 15 assistant messages + 40 tool results = 95K tokens
Turn 25: + 25 assistant messages + 80 tool results = 175K tokens ← Danger zone
Turn 30: 💥 Context window exceeded — conversation fails
```

Claude Code’s context management follows three inviolable rules:
1. **Recency wins.** The most recent messages are the most relevant. Compression always starts from the oldest messages and works forward.
2. **Tool results are summarizable.** A 500-line file read result can be summarized as "[Read /src/index.ts: 500 lines, TypeScript module with 12 exports]". The model already processed the content — it just needs a reminder that it existed.
3. **The system prompt is sacred.** It is never compressed, truncated, or modified. It contains identity, capabilities, and behavioral instructions that must remain intact throughout the entire conversation.
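The first and third rules can be sketched together: compression candidates are taken oldest-first, and the system prompt is never a candidate. This is a minimal sketch with an assumed message shape, not Claude Code's actual selection logic:

```typescript
interface SimpleMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Return indices of compressible messages, oldest first. The system
// prompt is excluded entirely; everything else is fair game, but the
// ascending order guarantees old messages are touched before new ones.
function compressionCandidates(messages: SimpleMessage[]): number[] {
  const indices: number[] = [];
  for (let i = 0; i < messages.length; i++) {
    if (messages[i].role !== 'system') indices.push(i);
  }
  return indices; // ascending index = oldest first
}
```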
```mermaid
graph TB
    subgraph "Context Window (200K tokens)"
        SP["🔒 System Prompt<br/>~4K tokens<br/>NEVER compressed"]
        subgraph "Compressible Zone"
            OLD["Old Messages<br/>Compressed first"]
            MID["Middle Messages<br/>Compressed second"]
            RECENT["Recent Messages<br/>Preserved as-is"]
        end
    end
    style SP fill:#fbbf24,stroke:#333
    style OLD fill:#94a3b8
    style MID fill:#cbd5e1
    style RECENT fill:#4ade80
```

Claude Code implements a progressive 4-layer compression strategy, where each layer is more aggressive than the last. Layers trigger sequentially as the context grows.
Layer 1: Snip

Trigger: Individual tool results exceed a size threshold (e.g., >10K characters).
Action: Replace the middle portion of large tool results with a [snipped] marker, preserving the beginning and end.
```typescript
function snipToolResult(content: string, maxChars: number): string {
  if (content.length <= maxChars) return content;

  const headSize = Math.floor(maxChars * 0.3); // Keep 30% from start
  const tailSize = Math.floor(maxChars * 0.3); // Keep 30% from end

  const head = content.slice(0, headSize);
  const tail = content.slice(-tailSize);
  const snipped = content.length - headSize - tailSize;

  return `${head}\n\n[... ${snipped} characters snipped ...]\n\n${tail}`;
}
```

Why keep head AND tail? For code files, the head contains imports and module declarations; the tail contains exports and the most recently-read function. Both are high-value context.
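To see the shape of the output, here is the function exercised on a synthetic 100-character result (the function body is repeated so the snippet runs standalone):

```typescript
function snipToolResult(content: string, maxChars: number): string {
  if (content.length <= maxChars) return content;
  const headSize = Math.floor(maxChars * 0.3); // keep 30% from start
  const tailSize = Math.floor(maxChars * 0.3); // keep 30% from end
  const head = content.slice(0, headSize);
  const tail = content.slice(-tailSize);
  const snipped = content.length - headSize - tailSize;
  return `${head}\n\n[... ${snipped} characters snipped ...]\n\n${tail}`;
}

// 100 chars in, maxChars = 40: keep 12 from each end, snip the middle 76.
const result = snipToolResult('x'.repeat(100), 40);
// result contains "[... 76 characters snipped ...]"
```

Note that the result keeps only 60% of `maxChars` (30% head + 30% tail), leaving headroom below the threshold rather than sitting exactly at it.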
Layer 2: Microcompact

Trigger: Total context exceeds ~60% of window capacity.
Action: Summarize old tool results into one-line descriptions using a lightweight model call.
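The microcompact sketch below reports its savings via a `countTokens` helper that these excerpts never define. A common stand-in (an assumption here, not Claude Code's actual counter) is the ~4-characters-per-token heuristic:

```typescript
// Rough token estimate: ~4 characters per token for English text and code.
// A heuristic stand-in, not a real tokenizer; production code should use
// the model provider's token counting API instead.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Simplified to raw strings here; the real helper would walk Message
// objects and add per-message overhead.
function countTotalTokens(texts: string[]): number {
  return texts.reduce((sum, t) => sum + countTokens(t), 0);
}
```

Undercounting is the dangerous direction (it delays compression until the window is already blown), so a heuristic should err high.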
```typescript
interface MicrocompactResult {
  original: ToolResultMessage;
  summary: string;
  tokensSaved: number;
}

async function microcompact(
  toolResult: ToolResultMessage,
  model: LLMClient,
): Promise<MicrocompactResult> {
  const summary = await model.complete({
    system: 'Summarize this tool result in ONE line. Include: tool name, key output, important values.',
    messages: [{ role: 'user', content: toolResult.content }],
    maxTokens: 100,
  });

  return {
    original: toolResult,
    summary: `[Compacted: ${summary}]`,
    tokensSaved: countTokens(toolResult.content) - countTokens(summary),
  };
}
```

Example:

```
/src/utils/parser.ts
[Compacted: ReadFile /src/utils/parser.ts — 245-line TypeScript file, exports parseConfig(), validateSchema(), 3 type definitions]
```

Layer 3: Auto Compact

Trigger: Total context exceeds ~80% of window capacity.
Action: Replace the oldest N messages with a single summary message.
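How many old messages is N? In the sketch below, the count is derived from a target reduction. A worked example of that arithmetic, with illustrative numbers:

```typescript
// Mirrors the targetReduction formula in autoCompact: compress enough
// of the oldest messages that what remains is ~50% of the threshold.
function targetReduction(totalTokens: number, threshold: number): number {
  return totalTokens - threshold * 0.5;
}

// With a 160K-token trigger and 170K tokens accumulated, ~90K of the
// oldest tokens get summarized, leaving ~80K of live context.
const example = targetReduction(170_000, 160_000); // 90_000
```

Compacting to 50% of the threshold (rather than just below it) buys headroom, so the next compaction is many turns away instead of immediate.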
```typescript
async function autoCompact(
  messages: Message[],
  threshold: number,
  model: LLMClient,
): Promise<Message[]> {
  const totalTokens = countTotalTokens(messages);
  if (totalTokens < threshold) return messages;

  // Find how many old messages to compress
  const targetReduction = totalTokens - (threshold * 0.5); // Compress to 50%
  let tokensToCompress = 0;
  let compactUpTo = 0;

  for (let i = 0; i < messages.length; i++) {
    tokensToCompress += countTokens(messages[i]);
    compactUpTo = i;
    if (tokensToCompress >= targetReduction) break;
  }

  // Summarize old messages
  const oldMessages = messages.slice(0, compactUpTo + 1);
  const summary = await model.complete({
    system: `Summarize this conversation segment. Preserve:
- Decisions made and their rationale
- Files modified and key changes
- Current task status and next steps
- Any errors encountered and resolutions`,
    messages: [{ role: 'user', content: formatMessages(oldMessages) }],
    maxTokens: 1000,
  });

  const summaryMessage: Message = {
    role: 'user',
    content: `[Conversation History Summary]\n${summary}\n[End Summary — recent messages follow]`,
  };

  return [summaryMessage, ...messages.slice(compactUpTo + 1)];
}
```

Layer 4: Hard Truncate

Trigger: Context exceeds ~95% of window capacity (emergency).
Action: Drop the oldest messages entirely, keeping only the system prompt and recent messages.
```typescript
function hardTruncate(
  messages: Message[],
  maxTokens: number,
  systemPromptTokens: number,
): Message[] {
  const budget = maxTokens - systemPromptTokens - 1000; // 1K safety margin
  const kept: Message[] = [];
  let usedTokens = 0;

  // Walk backward from most recent
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = countTokens(messages[i]);
    if (usedTokens + msgTokens > budget) break;
    kept.unshift(messages[i]);
    usedTokens += msgTokens;
  }

  return kept;
}
```

```mermaid
graph LR
    subgraph "Context Usage"
        A["0-40%<br/>No action"] --> B["40-60%<br/>Layer 1: Snip"]
        B --> C["60-80%<br/>Layer 2: Microcompact"]
        C --> D["80-95%<br/>Layer 3: Auto Compact"]
        D --> E["95%+<br/>Layer 4: Hard Truncate"]
    end
    style A fill:#4ade80
    style B fill:#a3e635
    style C fill:#facc15
    style D fill:#fb923c
    style E fill:#ef4444
```

```typescript
interface CompressionThresholds {
  // Layer 1: Snip individual results
  snipMaxChars: number; // Default: 10_000 chars per result

  // Layer 2: Microcompact old results
  microcompactTrigger: number; // Default: 0.6 (60% of window)

  // Layer 3: Auto compact conversation
  autoCompactTrigger: number; // Default: 0.8 (80% of window)

  // Layer 4: Emergency truncation
  hardTruncateTrigger: number; // Default: 0.95 (95% of window)

  // Window capacity
  maxContextTokens: number; // Default: 200_000 for Claude
}
```
```typescript
const defaultThresholds: CompressionThresholds = {
  snipMaxChars: 10_000,
  microcompactTrigger: 0.6,
  autoCompactTrigger: 0.8,
  hardTruncateTrigger: 0.95,
  maxContextTokens: 200_000,
};

async function manageContext(
  messages: Message[],
  systemPrompt: string,
  thresholds: CompressionThresholds,
  model: LLMClient,
): Promise<Message[]> {
  let managed = [...messages];
  const systemTokens = countTokens(systemPrompt);

  // Layer 1: Snip oversized tool results (always active)
  managed = managed.map(msg => {
    if (isToolResult(msg) && msg.content.length > thresholds.snipMaxChars) {
      return { ...msg, content: snipToolResult(msg.content, thresholds.snipMaxChars) };
    }
    return msg;
  });

  const totalTokens = () => systemTokens + countTotalTokens(managed);
  const usage = () => totalTokens() / thresholds.maxContextTokens;

  // Layer 2: Microcompact old tool results
  if (usage() > thresholds.microcompactTrigger) {
    const cutoff = Math.floor(managed.length * 0.5); // Compact oldest 50%
    for (let i = 0; i < cutoff; i++) {
      if (isToolResult(managed[i]) && !isAlreadyCompacted(managed[i])) {
        const compacted = await microcompact(managed[i], model);
        managed[i] = { ...managed[i], content: compacted.summary };
      }
    }
  }

  // Layer 3: Auto compact conversation
  if (usage() > thresholds.autoCompactTrigger) {
    managed = await autoCompact(managed, thresholds.autoCompactTrigger * thresholds.maxContextTokens, model);
  }

  // Layer 4: Hard truncation (emergency)
  if (usage() > thresholds.hardTruncateTrigger) {
    managed = hardTruncate(managed, thresholds.maxContextTokens, systemTokens);
  }

  return managed;
}
```

| Content Type | Priority | Compression Behavior |
|---|---|---|
| System prompt | 🔒 Sacred | Never touched |
| Last 3 user messages | 🔴 Critical | Never compressed |
| Last assistant message | 🔴 Critical | Never compressed |
| Recent tool results (last 5) | 🟡 High | Snipped if oversized |
| Old tool results | 🟢 Low | Microcompacted → dropped |
| Old assistant messages | 🟢 Low | Summarized → dropped |
| Error messages | 🟡 Medium | Preserved longer (debugging value) |
| File contents (large reads) | 🔵 Lowest | First to snip/compact |
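The priority table can be turned into code. This is a hypothetical `priorityFor` classifier; the message shape and the exact recency checks are assumptions read off the table, not Claude Code's implementation:

```typescript
type Priority = 'sacred' | 'critical' | 'high' | 'medium' | 'low' | 'lowest';

interface Msg {
  role: 'system' | 'user' | 'assistant' | 'tool_result';
  content: string;
  isError?: boolean;         // hypothetical flag for error results
  isLargeFileRead?: boolean; // hypothetical flag for large file reads
}

// Classify a message per the table: system prompt is sacred, recent
// messages are critical, errors are kept longer, and large file reads
// are the first candidates for snipping/compaction.
function priorityFor(msg: Msg, all: Msg[]): Priority {
  if (msg.role === 'system') return 'sacred';

  const recentUsers = all.filter(m => m.role === 'user').slice(-3);
  if (msg.role === 'user' && recentUsers.includes(msg)) return 'critical';

  const lastAssistant = [...all].reverse().find(m => m.role === 'assistant');
  if (msg === lastAssistant) return 'critical';

  if (msg.isError) return 'medium';
  if (msg.isLargeFileRead) return 'lowest';

  if (msg.role === 'tool_result') {
    const toolResults = all.filter(m => m.role === 'tool_result');
    return toolResults.slice(-5).includes(msg) ? 'high' : 'low';
  }
  return 'low'; // old assistant and user messages
}
```

A real implementation would key off actual message metadata rather than flags like `isLargeFileRead`, but the recency-weighted shape is the same.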
```typescript
// ============================================
// Reusable Context Window Manager
// ============================================

interface ContextManager {
  add(message: Message): void;
  getMessages(): Message[];
  getUsage(): { tokens: number; percentage: number };
  compact(): Promise<void>;
}

function createContextManager(
  maxTokens: number,
  model: LLMClient,
  options?: Partial<CompressionThresholds>,
): ContextManager {
  const messages: Message[] = [];
  const thresholds = { ...defaultThresholds, maxContextTokens: maxTokens, ...options };

  return {
    add(message: Message) {
      // Auto-snip on insertion
      if (isToolResult(message) && message.content.length > thresholds.snipMaxChars) {
        message = { ...message, content: snipToolResult(message.content, thresholds.snipMaxChars) };
      }
      messages.push(message);
    },

    getMessages() {
      return [...messages];
    },

    getUsage() {
      const tokens = countTotalTokens(messages);
      return { tokens, percentage: tokens / maxTokens };
    },

    async compact() {
      const managed = await manageContext(messages, '', thresholds, model);
      messages.length = 0;
      messages.push(...managed);
    },
  };
}
```

From analyzing Claude Code’s behavior in long sessions:
```
Session length: 45 minutes, 32 turns

Without compression:
  Total tokens accumulated: 287K ← Would exceed 200K window
  Session would fail at turn ~22

With 4-layer compression:
  Layer 1 (Snip):          287K → 198K (saved 89K from large file reads)
  Layer 2 (Microcompact):  198K → 142K (saved 56K from old tool results)
  Layer 3 (Auto Compact):  142K → 89K  (saved 53K from conversation summary)
  Layer 4 (Hard Truncate): Not triggered

Final context: 89K tokens (44% of window)
Session completed successfully ✅
```

```mermaid
graph LR
    subgraph "Quality-Savings Tradeoff"
        S["Snip<br/>Quality: 95%<br/>Savings: 30%"] --> M["Microcompact<br/>Quality: 85%<br/>Savings: 60%"]
        M --> A["Auto Compact<br/>Quality: 70%<br/>Savings: 80%"]
        A --> H["Hard Truncate<br/>Quality: 40%<br/>Savings: 95%"]
    end
    style S fill:#4ade80
    style M fill:#a3e635
    style A fill:#facc15
    style H fill:#ef4444
```

Each layer trades more quality for more savings. The progressive design means you only pay the quality cost when absolutely necessary.
Chatbot Systems
Any long-running conversation that accumulates history. Without compression, chat sessions have a hard ceiling.
Agent Frameworks
LangChain, AutoGen, CrewAI — any framework that accumulates tool results needs a compression strategy.
Document Processing
Systems that process large documents in chunks. Each chunk’s output needs eventual compression.
Multi-Turn Reasoning
Complex reasoning tasks (e.g., code review, debugging) that require many back-and-forth turns.
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| “Just use a bigger model” | Context windows have hard limits, and attention compute scales quadratically with context length | Compress proactively |
| Compress everything equally | Recent context is more valuable than old | Progressive, recency-weighted |
| Compress only when full | By then it’s too late; emergency truncation loses quality | Start at 60% capacity |
| Never compress system prompt | ✅ This is correct | Keep doing it |
| Summarize with the main model | Expensive; uses the same context you’re trying to save | Use a smaller, faster model |
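Pulling the thresholds together, the escalation ladder can be sketched as a small dispatcher. This is a hypothetical `layerFor` helper using the default trigger points from this chapter, not an API from Claude Code:

```typescript
type Layer = 'none' | 'snip' | 'microcompact' | 'autoCompact' | 'hardTruncate';

// Map context usage (0..1) to the most aggressive layer that should run,
// following the usage diagram: 40-60% snip, 60-80% microcompact,
// 80-95% auto compact, 95%+ hard truncate. (Per-result snipping is
// also always on for oversized results, regardless of usage.)
function layerFor(usage: number): Layer {
  if (usage >= 0.95) return 'hardTruncate';
  if (usage >= 0.8) return 'autoCompact';
  if (usage >= 0.6) return 'microcompact';
  if (usage >= 0.4) return 'snip';
  return 'none';
}
```

A loop like this, run before every model call, is what turns the anti-pattern "compress only when full" into the proactive strategy the table recommends.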