Pattern: Context Window Management

Every LLM has a finite context window. Claude’s is 200K tokens, which sounds enormous, until an agent has been running for 20 minutes, reading files, running commands, and accumulating conversation history. Without management, the context window fills up and the agent dies.

Turn 1: System prompt (4K) + User message (0.5K) = 4.5K tokens
Turn 5: + 5 assistant messages + 3 tool results = 28K tokens
Turn 15: + 15 assistant messages + 40 tool results = 95K tokens
Turn 25: + 25 assistant messages + 80 tool results = 175K tokens ← Danger zone
Turn 30: 💥 Context window exceeded — conversation fails
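To see why this blows up, here is a back-of-envelope growth model. The per-turn token figures are illustrative assumptions, not measured values:

```typescript
// Rough projection of context growth per turn (all constants are assumptions).
const SYSTEM_PROMPT = 4_000;          // tokens, matches the trace above
const USER_MESSAGE = 500;
const ASSISTANT_PER_TURN = 800;       // assumed average assistant message
const TOOL_RESULTS_PER_TURN = 3;      // assumed average tool calls per turn
const TOKENS_PER_TOOL_RESULT = 1_500; // assumed average tool result size
const WINDOW = 200_000;

// Total context after `turn` turns, with no compression at all.
function projectedTokens(turn: number): number {
  return (
    SYSTEM_PROMPT +
    USER_MESSAGE +
    turn * (ASSISTANT_PER_TURN + TOOL_RESULTS_PER_TURN * TOKENS_PER_TOOL_RESULT)
  );
}

// First turn at which the projection exceeds the window.
function firstFailingTurn(): number {
  let turn = 1;
  while (projectedTokens(turn) <= WINDOW) turn++;
  return turn;
}
```

Under these assumptions the ceiling lands in the mid-30s; heavier tool output, as in the trace above, pulls failure earlier.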

Claude Code’s context management follows three inviolable rules:

1. Recency wins. The most recent messages are the most relevant, so compression always starts from the oldest messages and works forward.

2. Tool results compress well. A 500-line file read result can be summarized as "[Read /src/index.ts: 500 lines, TypeScript module with 12 exports]". The model already processed the content; it just needs a reminder that it existed.

3. The system prompt is sacred. It is never compressed, truncated, or modified. It contains identity, capabilities, and behavioral instructions that must remain intact throughout the entire conversation.
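A minimal sketch of how those three rules constrain a compressor; the `Msg` type and the size of the protected recency window are assumptions for illustration:

```typescript
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Rule: the system prompt is excluded outright.
// Rule: recent messages are protected; candidates come from the oldest end.
// Rule: tool results are preferred targets, since they summarize well.
function compressionCandidates(messages: Msg[], keepRecent = 5): Msg[] {
  const body = messages.filter((m) => m.role !== "system");
  const compressible = body.slice(0, Math.max(0, body.length - keepRecent));
  // Tool results first, then other old messages, both oldest-first.
  return [
    ...compressible.filter((m) => m.role === "tool"),
    ...compressible.filter((m) => m.role !== "tool"),
  ];
}
```

The real ordering inside Claude Code is more nuanced; this only encodes the three rules themselves.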

graph TB
  subgraph "Context Window (200K tokens)"
    SP["🔒 System Prompt<br/>~4K tokens<br/>NEVER compressed"]
    style SP fill:#fbbf24,stroke:#333
    subgraph "Compressible Zone"
      OLD["Old Messages<br/>Compressed first"]
      MID["Middle Messages<br/>Compressed second"]
      RECENT["Recent Messages<br/>Preserved as-is"]
    end
    style OLD fill:#94a3b8
    style MID fill:#cbd5e1
    style RECENT fill:#4ade80
  end

Claude Code implements a progressive 4-layer compression strategy, where each layer is more aggressive than the last. Layers trigger sequentially as the context grows.

Layer 1: Snip (Head/Tail Truncation)

Trigger: Individual tool results exceed a size threshold (e.g., >10K characters).

Action: Replace the middle portion of large tool results with a [snipped] marker, preserving the beginning and end.

function snipToolResult(content: string, maxChars: number): string {
  if (content.length <= maxChars) return content;
  const headSize = Math.floor(maxChars * 0.3); // Keep 30% from start
  const tailSize = Math.floor(maxChars * 0.3); // Keep 30% from end
  const head = content.slice(0, headSize);
  const tail = content.slice(-tailSize);
  const snipped = content.length - headSize - tailSize;
  return `${head}\n\n[... ${snipped} characters snipped ...]\n\n${tail}`;
}

Why keep head AND tail? For code files, the head contains imports and module declarations; the tail contains exports and the most recently-read function. Both are high-value context.

Layer 2: Microcompact (Targeted Summarization)


Trigger: Total context exceeds ~60% of window capacity.

Action: Summarize old tool results into one-line descriptions using a lightweight model call.

interface MicrocompactResult {
  original: ToolResultMessage;
  summary: string;
  tokensSaved: number;
}

async function microcompact(
  toolResult: ToolResultMessage,
  model: LLMClient,
): Promise<MicrocompactResult> {
  const summary = await model.complete({
    system: 'Summarize this tool result in ONE line. Include: tool name, key output, important values.',
    messages: [{ role: 'user', content: toolResult.content }],
    maxTokens: 100,
  });
  return {
    original: toolResult,
    summary: `[Compacted: ${summary}]`,
    tokensSaved: countTokens(toolResult.content) - countTokens(summary),
  };
}

Example:

  • Before (847 tokens): Full file content of /src/utils/parser.ts
  • After (23 tokens): [Compacted: ReadFile /src/utils/parser.ts — 245-line TypeScript file, exports parseConfig(), validateSchema(), 3 type definitions]

Layer 3: Auto Compact (Conversation Summarization)


Trigger: Total context exceeds ~80% of window capacity.

Action: Replace the oldest N messages with a single summary message.

async function autoCompact(
  messages: Message[],
  threshold: number,
  model: LLMClient,
): Promise<Message[]> {
  const totalTokens = countTotalTokens(messages);
  if (totalTokens < threshold) return messages;

  // Find how many old messages to compress
  const targetReduction = totalTokens - (threshold * 0.5); // Compress down to 50% of threshold
  let tokensToCompress = 0;
  let compactUpTo = 0;
  for (let i = 0; i < messages.length; i++) {
    tokensToCompress += countTokens(messages[i]);
    compactUpTo = i;
    if (tokensToCompress >= targetReduction) break;
  }

  // Summarize old messages
  const oldMessages = messages.slice(0, compactUpTo + 1);
  const summary = await model.complete({
    system: `Summarize this conversation segment. Preserve:
- Decisions made and their rationale
- Files modified and key changes
- Current task status and next steps
- Any errors encountered and resolutions`,
    messages: [{ role: 'user', content: formatMessages(oldMessages) }],
    maxTokens: 1000,
  });

  const summaryMessage: Message = {
    role: 'user',
    content: `[Conversation History Summary]\n${summary}\n[End Summary — recent messages follow]`,
  };
  return [summaryMessage, ...messages.slice(compactUpTo + 1)];
}

Layer 4: Hard Truncate (Emergency Drop)

Trigger: Context exceeds ~95% of window capacity.

Action: Drop the oldest messages entirely, keeping only the system prompt and recent messages.

function hardTruncate(
  messages: Message[],
  maxTokens: number,
  systemPromptTokens: number,
): Message[] {
  const budget = maxTokens - systemPromptTokens - 1000; // 1K safety margin
  const kept: Message[] = [];
  let usedTokens = 0;
  // Walk backward from most recent
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = countTokens(messages[i]);
    if (usedTokens + msgTokens > budget) break;
    kept.unshift(messages[i]);
    usedTokens += msgTokens;
  }
  return kept;
}

graph LR
  subgraph "Context Usage"
    A["0-40%<br/>No action"] --> B["40-60%<br/>Layer 1: Snip"]
    B --> C["60-80%<br/>Layer 2: Microcompact"]
    C --> D["80-95%<br/>Layer 3: Auto Compact"]
    D --> E["95%+<br/>Layer 4: Hard Truncate"]
  end
  style A fill:#4ade80
  style B fill:#a3e635
  style C fill:#facc15
  style D fill:#fb923c
  style E fill:#ef4444

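Expressed as code, the usage zones above reduce to a small lookup. The thresholds mirror the chart (note the text elsewhere treats Layer 1 as always active per-result; the 40% band here follows the chart), and the `Layer` type is a hypothetical name:

```typescript
type Layer = "none" | "snip" | "microcompact" | "auto-compact" | "hard-truncate";

// Map context usage (a fraction, 0.0–1.0) to the most aggressive layer it triggers.
function activeLayer(usage: number): Layer {
  if (usage >= 0.95) return "hard-truncate";
  if (usage >= 0.8) return "auto-compact";
  if (usage >= 0.6) return "microcompact";
  if (usage >= 0.4) return "snip";
  return "none";
}
```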
interface CompressionThresholds {
  // Layer 1: Snip individual results
  snipMaxChars: number; // Default: 10_000 chars per result
  // Layer 2: Microcompact old results
  microcompactTrigger: number; // Default: 0.6 (60% of window)
  // Layer 3: Auto compact conversation
  autoCompactTrigger: number; // Default: 0.8 (80% of window)
  // Layer 4: Emergency truncation
  hardTruncateTrigger: number; // Default: 0.95 (95% of window)
  // Window capacity
  maxContextTokens: number; // Default: 200_000 for Claude
}

const defaultThresholds: CompressionThresholds = {
  snipMaxChars: 10_000,
  microcompactTrigger: 0.6,
  autoCompactTrigger: 0.8,
  hardTruncateTrigger: 0.95,
  maxContextTokens: 200_000,
};

async function manageContext(
  messages: Message[],
  systemPrompt: string,
  thresholds: CompressionThresholds,
  model: LLMClient,
): Promise<Message[]> {
  let managed = [...messages];
  const systemTokens = countTokens(systemPrompt);

  // Layer 1: Snip oversized tool results (always active)
  managed = managed.map(msg => {
    if (isToolResult(msg) && msg.content.length > thresholds.snipMaxChars) {
      return { ...msg, content: snipToolResult(msg.content, thresholds.snipMaxChars) };
    }
    return msg;
  });

  const totalTokens = () => systemTokens + countTotalTokens(managed);
  const usage = () => totalTokens() / thresholds.maxContextTokens;

  // Layer 2: Microcompact old tool results
  if (usage() > thresholds.microcompactTrigger) {
    const cutoff = Math.floor(managed.length * 0.5); // Compact oldest 50%
    for (let i = 0; i < cutoff; i++) {
      if (isToolResult(managed[i]) && !isAlreadyCompacted(managed[i])) {
        const compacted = await microcompact(managed[i], model);
        managed[i] = { ...managed[i], content: compacted.summary };
      }
    }
  }

  // Layer 3: Auto compact conversation
  if (usage() > thresholds.autoCompactTrigger) {
    managed = await autoCompact(managed, thresholds.autoCompactTrigger * thresholds.maxContextTokens, model);
  }

  // Layer 4: Hard truncation (emergency)
  if (usage() > thresholds.hardTruncateTrigger) {
    managed = hardTruncate(managed, thresholds.maxContextTokens, systemTokens);
  }

  return managed;
}

| Content Type | Priority | Compression Behavior |
| --- | --- | --- |
| System prompt | 🔒 Sacred | Never touched |
| Last 3 user messages | 🔴 Critical | Never compressed |
| Last assistant message | 🔴 Critical | Never compressed |
| Recent tool results (last 5) | 🟡 High | Snipped if oversized |
| Old tool results | 🟢 Low | Microcompacted → dropped |
| Old assistant messages | 🟢 Low | Summarized → dropped |
| Error messages | 🟡 Medium | Preserved longer (debugging value) |
| File contents (large reads) | 🔵 Lowest | First to snip/compact |

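As one sketch, the priority tiers in the table could be encoded when deciding compression order. The numeric `Priority` scale, the `Entry` shape, and its flags are hypothetical:

```typescript
type Priority = 0 | 1 | 2 | 3 | 4; // 0 = sacred ... 4 = first to compress

interface Entry {
  role: "system" | "user" | "assistant" | "tool";
  isError?: boolean;         // hypothetical flag: error output
  isLargeFileRead?: boolean; // hypothetical flag: large file contents
  ageRank: number;           // 0 = most recent message of its role
}

function priorityOf(e: Entry): Priority {
  if (e.role === "system") return 0;                       // never touched
  if (e.role === "user" && e.ageRank < 3) return 1;        // last 3 user messages
  if (e.role === "assistant" && e.ageRank === 0) return 1; // last assistant message
  if (e.isError) return 2;                                 // keep longer for debugging
  if (e.isLargeFileRead) return 4;                         // first to snip/compact
  if (e.role === "tool" && e.ageRank < 5) return 2;        // recent tool results
  return 3;                                                // everything old
}
```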
// ============================================
// Reusable Context Window Manager
// ============================================
interface ContextManager {
  add(message: Message): void;
  getMessages(): Message[];
  getUsage(): { tokens: number; percentage: number };
  compact(): Promise<void>;
}

function createContextManager(
  maxTokens: number,
  model: LLMClient,
  options?: Partial<CompressionThresholds>,
): ContextManager {
  const messages: Message[] = [];
  const thresholds = { ...defaultThresholds, maxContextTokens: maxTokens, ...options };
  return {
    add(message: Message) {
      // Auto-snip on insertion
      if (isToolResult(message) && message.content.length > thresholds.snipMaxChars) {
        message = { ...message, content: snipToolResult(message.content, thresholds.snipMaxChars) };
      }
      messages.push(message);
    },
    getMessages() {
      return [...messages];
    },
    getUsage() {
      const tokens = countTotalTokens(messages);
      return { tokens, percentage: tokens / maxTokens };
    },
    async compact() {
      // The system prompt lives outside this manager, so pass an empty string:
      // only conversation messages are counted and compressed here.
      const managed = await manageContext(messages, '', thresholds, model);
      messages.length = 0;
      messages.push(...managed);
    },
  };
}


From analyzing Claude Code’s behavior in long sessions:

Session length: 45 minutes, 32 turns

Without compression:
  Total tokens accumulated: 287K ← would exceed the 200K window
  Session would fail at turn ~22

With 4-layer compression:
  Layer 1 (Snip): 287K → 198K (saved 89K from large file reads)
  Layer 2 (Microcompact): 198K → 142K (saved 56K from old tool results)
  Layer 3 (Auto Compact): 142K → 89K (saved 53K from conversation summary)
  Layer 4 (Hard Truncate): not triggered

Final context: 89K tokens (44% of window)
Session completed successfully ✅

graph LR
  subgraph "Quality-Savings Tradeoff"
    S["Snip<br/>Quality: 95%<br/>Savings: 30%"] --> M["Microcompact<br/>Quality: 85%<br/>Savings: 60%"]
    M --> A["Auto Compact<br/>Quality: 70%<br/>Savings: 80%"]
    A --> H["Hard Truncate<br/>Quality: 40%<br/>Savings: 95%"]
  end
  style S fill:#4ade80
  style M fill:#a3e635
  style A fill:#facc15
  style H fill:#ef4444

Each layer trades more quality for more savings. The progressive design means you only pay the quality cost when absolutely necessary.

Chatbot Systems

Any long-running conversation that accumulates history. Without compression, chat sessions have a hard ceiling.

Agent Frameworks

LangChain, AutoGen, CrewAI — any framework that accumulates tool results needs a compression strategy.

Document Processing

Systems that process large documents in chunks. Each chunk’s output needs eventual compression.

Multi-Turn Reasoning

Complex reasoning tasks (e.g., code review, debugging) that require many back-and-forth turns.

| Anti-Pattern | Why It Fails | Better Approach |
| --- | --- | --- |
| "Just use a bigger model" | Context windows have hard limits; cost scales quadratically | Compress proactively |
| Compress everything equally | Recent context is more valuable than old | Progressive, recency-weighted compression |
| Compress only when full | By then it's too late; emergency truncation loses quality | Start compressing at 60% capacity |
| Never compress the system prompt | ✅ This one is correct | Keep doing it |
| Summarize with the main model | Expensive; uses the same context you're trying to save | Use a smaller, faster model |