Every LLM has a finite context window. Claude’s is 200K tokens, which sounds enormous — until an agent has been running for 20 minutes, making file reads, running commands, and accumulating conversation history. Without management, the context window fills up and the agent dies.
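This growth can be made concrete with a toy accumulation model. The starting size and the ~7K-tokens-per-turn figure below are illustrative assumptions fitted to the numbers in this chapter, not measurements:

```typescript
// Find the first turn at which accumulated context exceeds the window.
// startTokens: system prompt + first user message; perTurnTokens: average
// tokens added per turn by assistant messages and tool results.
function firstFailingTurn(
  windowTokens: number,
  startTokens: number,
  perTurnTokens: number,
): number {
  let total = startTokens;
  let turn = 1;
  while (total <= windowTokens) {
    turn++;
    total += perTurnTokens;
  }
  return turn;
}

// Starting at 4.5K and adding ~7K per turn, a 200K window is exhausted
// at turn 29.
const failsAt = firstFailingTurn(200_000, 4_500, 7_000); // 29
```

With these assumptions the session dies around turn 29, consistent with the danger zone at turn 25 and failure by turn 30 shown below.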
```
Turn 1:  System prompt (4K) + User message (0.5K) = 4.5K tokens
Turn 5:  + 5 assistant messages + 3 tool results  = 28K tokens
Turn 15: + 15 assistant messages + 40 tool results = 95K tokens
Turn 25: + 25 assistant messages + 80 tool results = 175K tokens ← Danger zone
Turn 30: 💥 Context window exceeded — conversation fails
```

Claude Code’s context management follows three inviolable rules:
1. **Recency wins.** The most recent messages are the most relevant. Compression always starts from the oldest messages and works forward.
2. **Tool results are summarizable.** A 500-line file read result can be summarized as "[Read /src/index.ts: 500 lines, TypeScript module with 12 exports]". The model already processed the content — it just needs a reminder that it existed.
3. **The system prompt is sacred.** It is never compressed, truncated, or modified. It contains identity, capabilities, and behavioral instructions that must remain intact throughout the entire conversation.
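The first and third rules can be sketched together: compression candidates are taken oldest-first, and the system prompt is never a candidate. This is a minimal sketch with an assumed message shape, not Claude Code's actual selection logic:

```typescript
interface SimpleMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Return indices of compressible messages, oldest first. The system
// prompt is excluded entirely; everything else is fair game, but the
// ascending order guarantees old messages are touched before new ones.
function compressionCandidates(messages: SimpleMessage[]): number[] {
  const indices: number[] = [];
  for (let i = 0; i < messages.length; i++) {
    if (messages[i].role !== 'system') indices.push(i);
  }
  return indices; // ascending index = oldest first
}
```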
```mermaid
graph TB
    subgraph "Context Window (200K tokens)"
        SP["🔒 System Prompt<br/>~4K tokens<br/>NEVER compressed"]
        subgraph "Compressible Zone"
            OLD["Old Messages<br/>Compressed first"]
            MID["Middle Messages<br/>Compressed second"]
            RECENT["Recent Messages<br/>Preserved as-is"]
        end
    end
    style SP fill:#fbbf24,stroke:#333
    style OLD fill:#94a3b8
    style MID fill:#cbd5e1
    style RECENT fill:#4ade80
```

Claude Code implements a progressive 4-layer compression strategy, where each layer is more aggressive than the last. Layers trigger sequentially as the context grows.
Layer 1: Snip

Trigger: Individual tool results exceed a size threshold (e.g., >10K characters).
Action: Replace the middle portion of large tool results with a [snipped] marker, preserving the beginning and end.
```typescript
function snipToolResult(content: string, maxChars: number): string {
  if (content.length <= maxChars) return content;

  const headSize = Math.floor(maxChars * 0.3); // Keep 30% from start
  const tailSize = Math.floor(maxChars * 0.3); // Keep 30% from end

  const head = content.slice(0, headSize);
  const tail = content.slice(-tailSize);
  const snipped = content.length - headSize - tailSize;

  return `${head}\n\n[... ${snipped} characters snipped ...]\n\n${tail}`;
}
```

Why keep head AND tail? For code files, the head contains imports and module declarations; the tail contains exports and the most recently-read function. Both are high-value context.
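To see the shape of the output, here is the function exercised on a synthetic 100-character result (the function body is repeated so the snippet runs standalone):

```typescript
function snipToolResult(content: string, maxChars: number): string {
  if (content.length <= maxChars) return content;
  const headSize = Math.floor(maxChars * 0.3); // keep 30% from start
  const tailSize = Math.floor(maxChars * 0.3); // keep 30% from end
  const head = content.slice(0, headSize);
  const tail = content.slice(-tailSize);
  const snipped = content.length - headSize - tailSize;
  return `${head}\n\n[... ${snipped} characters snipped ...]\n\n${tail}`;
}

// 100 chars in, maxChars = 40: keep 12 from each end, snip the middle 76.
const result = snipToolResult('x'.repeat(100), 40);
// result contains "[... 76 characters snipped ...]"
```

Note that the result keeps only 60% of `maxChars` (30% head + 30% tail), leaving headroom below the threshold rather than sitting exactly at it.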
Layer 2: Microcompact

Trigger: Total context exceeds ~60% of window capacity.
Action: Summarize old tool results into one-line descriptions using a lightweight model call.
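The microcompact sketch below reports its savings via a `countTokens` helper that these excerpts never define. A common stand-in (an assumption here, not Claude Code's actual counter) is the ~4-characters-per-token heuristic:

```typescript
// Rough token estimate: ~4 characters per token for English text and code.
// A heuristic stand-in, not a real tokenizer; production code should use
// the model provider's token counting API instead.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Simplified to raw strings here; the real helper would walk Message
// objects and add per-message overhead.
function countTotalTokens(texts: string[]): number {
  return texts.reduce((sum, t) => sum + countTokens(t), 0);
}
```

Undercounting is the dangerous direction (it delays compression until the window is already blown), so a heuristic should err high.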
```typescript
interface MicrocompactResult {
  original: ToolResultMessage;
  summary: string;
  tokensSaved: number;
}

async function microcompact(
  toolResult: ToolResultMessage,
  model: LLMClient,
): Promise<MicrocompactResult> {
  const summary = await model.complete({
    system: 'Summarize this tool result in ONE line. Include: tool name, key output, important values.',
    messages: [{ role: 'user', content: toolResult.content }],
    maxTokens: 100,
  });

  return {
    original: toolResult,
    summary: `[Compacted: ${summary}]`,
    tokensSaved: countTokens(toolResult.content) - countTokens(summary),
  };
}
```

Example:

```
/src/utils/parser.ts
[Compacted: ReadFile /src/utils/parser.ts — 245-line TypeScript file, exports parseConfig(), validateSchema(), 3 type definitions]
```

Layer 3: Auto Compact

Trigger: Total context exceeds ~80% of window capacity.
Action: Replace the oldest N messages with a single summary message.
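How many old messages is N? In the sketch below, the count is derived from a target reduction. A worked example of that arithmetic, with illustrative numbers:

```typescript
// Mirrors the targetReduction formula in autoCompact: compress enough
// of the oldest messages that what remains is ~50% of the threshold.
function targetReduction(totalTokens: number, threshold: number): number {
  return totalTokens - threshold * 0.5;
}

// With a 160K-token trigger and 170K tokens accumulated, ~90K of the
// oldest tokens get summarized, leaving ~80K of live context.
const example = targetReduction(170_000, 160_000); // 90_000
```

Compacting to 50% of the threshold (rather than just below it) buys headroom, so the next compaction is many turns away instead of immediate.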
```typescript
async function autoCompact(
  messages: Message[],
  threshold: number,
  model: LLMClient,
): Promise<Message[]> {
  const totalTokens = countTotalTokens(messages);
  if (totalTokens < threshold) return messages;

  // Find how many old messages to compress
  const targetReduction = totalTokens - (threshold * 0.5); // Compress to 50%
  let tokensToCompress = 0;
  let compactUpTo = 0;

  for (let i = 0; i < messages.length; i++) {
    tokensToCompress += countTokens(messages[i]);
    compactUpTo = i;
    if (tokensToCompress >= targetReduction) break;
  }

  // Summarize old messages
  const oldMessages = messages.slice(0, compactUpTo + 1);
  const summary = await model.complete({
    system: `Summarize this conversation segment. Preserve:
- Decisions made and their rationale
- Files modified and key changes
- Current task status and next steps
- Any errors encountered and resolutions`,
    messages: [{ role: 'user', content: formatMessages(oldMessages) }],
    maxTokens: 1000,
  });

  const summaryMessage: Message = {
    role: 'user',
    content: `[Conversation History Summary]\n${summary}\n[End Summary — recent messages follow]`,
  };

  return [summaryMessage, ...messages.slice(compactUpTo + 1)];
}
```

Layer 4: Hard Truncate

Trigger: Context exceeds ~95% of window capacity (emergency).
Action: Drop the oldest messages entirely, keeping only the system prompt and recent messages.
```typescript
function hardTruncate(
  messages: Message[],
  maxTokens: number,
  systemPromptTokens: number,
): Message[] {
  const budget = maxTokens - systemPromptTokens - 1000; // 1K safety margin
  const kept: Message[] = [];
  let usedTokens = 0;

  // Walk backward from most recent
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = countTokens(messages[i]);
    if (usedTokens + msgTokens > budget) break;
    kept.unshift(messages[i]);
    usedTokens += msgTokens;
  }

  return kept;
}
```

```mermaid
graph LR
    subgraph "Context Usage"
        A["0-40%<br/>No action"] --> B["40-60%<br/>Layer 1: Snip"]
        B --> C["60-80%<br/>Layer 2: Microcompact"]
        C --> D["80-95%<br/>Layer 3: Auto Compact"]
        D --> E["95%+<br/>Layer 4: Hard Truncate"]
    end
    style A fill:#4ade80
    style B fill:#a3e635
    style C fill:#facc15
    style D fill:#fb923c
    style E fill:#ef4444
```

```typescript
interface CompressionThresholds {
  // Layer 1: Snip individual results
  snipMaxChars: number; // Default: 10_000 chars per result

  // Layer 2: Microcompact old results
  microcompactTrigger: number; // Default: 0.6 (60% of window)

  // Layer 3: Auto compact conversation
  autoCompactTrigger: number; // Default: 0.8 (80% of window)

  // Layer 4: Emergency truncation
  hardTruncateTrigger: number; // Default: 0.95 (95% of window)

  // Window capacity
  maxContextTokens: number; // Default: 200_000 for Claude
}
```
```typescript
const defaultThresholds: CompressionThresholds = {
  snipMaxChars: 10_000,
  microcompactTrigger: 0.6,
  autoCompactTrigger: 0.8,
  hardTruncateTrigger: 0.95,
  maxContextTokens: 200_000,
};

async function manageContext(
  messages: Message[],
  systemPrompt: string,
  thresholds: CompressionThresholds,
  model: LLMClient,
): Promise<Message[]> {
  let managed = [...messages];
  const systemTokens = countTokens(systemPrompt);

  // Layer 1: Snip oversized tool results (always active)
  managed = managed.map(msg => {
    if (isToolResult(msg) && msg.content.length > thresholds.snipMaxChars) {
      return { ...msg, content: snipToolResult(msg.content, thresholds.snipMaxChars) };
    }
    return msg;
  });

  const totalTokens = () => systemTokens + countTotalTokens(managed);
  const usage = () => totalTokens() / thresholds.maxContextTokens;

  // Layer 2: Microcompact old tool results
  if (usage() > thresholds.microcompactTrigger) {
    const cutoff = Math.floor(managed.length * 0.5); // Compact oldest 50%
    for (let i = 0; i < cutoff; i++) {
      if (isToolResult(managed[i]) && !isAlreadyCompacted(managed[i])) {
        const compacted = await microcompact(managed[i], model);
        managed[i] = { ...managed[i], content: compacted.summary };
      }
    }
  }

  // Layer 3: Auto compact conversation
  if (usage() > thresholds.autoCompactTrigger) {
    managed = await autoCompact(managed, thresholds.autoCompactTrigger * thresholds.maxContextTokens, model);
  }

  // Layer 4: Hard truncation (emergency)
  if (usage() > thresholds.hardTruncateTrigger) {
    managed = hardTruncate(managed, thresholds.maxContextTokens, systemTokens);
  }

  return managed;
}
```

| Content Type | Priority | Compression Behavior |
|---|---|---|
| System prompt | 🔒 Sacred | Never touched |
| Last 3 user messages | 🔴 Critical | Never compressed |
| Last assistant message | 🔴 Critical | Never compressed |
| Recent tool results (last 5) | 🟡 High | Snipped if oversized |
| Old tool results | 🟢 Low | Microcompacted → dropped |
| Old assistant messages | 🟢 Low | Summarized → dropped |
| Error messages | 🟡 Medium | Preserved longer (debugging value) |
| File contents (large reads) | 🔵 Lowest | First to snip/compact |
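The priority table can be turned into code. This is a hypothetical `priorityFor` classifier; the message shape and the exact recency checks are assumptions read off the table, not Claude Code's implementation:

```typescript
type Priority = 'sacred' | 'critical' | 'high' | 'medium' | 'low' | 'lowest';

interface Msg {
  role: 'system' | 'user' | 'assistant' | 'tool_result';
  content: string;
  isError?: boolean;         // hypothetical flag for error results
  isLargeFileRead?: boolean; // hypothetical flag for large file reads
}

// Classify a message per the table: system prompt is sacred, recent
// messages are critical, errors are kept longer, and large file reads
// are the first candidates for snipping/compaction.
function priorityFor(msg: Msg, all: Msg[]): Priority {
  if (msg.role === 'system') return 'sacred';

  const recentUsers = all.filter(m => m.role === 'user').slice(-3);
  if (msg.role === 'user' && recentUsers.includes(msg)) return 'critical';

  const lastAssistant = [...all].reverse().find(m => m.role === 'assistant');
  if (msg === lastAssistant) return 'critical';

  if (msg.isError) return 'medium';
  if (msg.isLargeFileRead) return 'lowest';

  if (msg.role === 'tool_result') {
    const toolResults = all.filter(m => m.role === 'tool_result');
    return toolResults.slice(-5).includes(msg) ? 'high' : 'low';
  }
  return 'low'; // old assistant and user messages
}
```

A real implementation would key off actual message metadata rather than flags like `isLargeFileRead`, but the recency-weighted shape is the same.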
```typescript
// ============================================
// Reusable Context Window Manager
// ============================================

interface ContextManager {
  add(message: Message): void;
  getMessages(): Message[];
  getUsage(): { tokens: number; percentage: number };
  compact(): Promise<void>;
}

function createContextManager(
  maxTokens: number,
  model: LLMClient,
  options?: Partial<CompressionThresholds>,
): ContextManager {
  const messages: Message[] = [];
  const thresholds = { ...defaultThresholds, maxContextTokens: maxTokens, ...options };

  return {
    add(message: Message) {
      // Auto-snip on insertion
      if (isToolResult(message) && message.content.length > thresholds.snipMaxChars) {
        message = { ...message, content: snipToolResult(message.content, thresholds.snipMaxChars) };
      }
      messages.push(message);
    },

    getMessages() {
      return [...messages];
    },

    getUsage() {
      const tokens = countTotalTokens(messages);
      return { tokens, percentage: tokens / maxTokens };
    },

    async compact() {
      const managed = await manageContext(messages, '', thresholds, model);
      messages.length = 0;
      messages.push(...managed);
    },
  };
}
```

From analyzing Claude Code’s behavior in long sessions:
```
Session length: 45 minutes, 32 turns

Without compression:
  Total tokens accumulated: 287K ← Would exceed 200K window
  Session would fail at turn ~22

With 4-layer compression:
  Layer 1 (Snip):          287K → 198K (saved 89K from large file reads)
  Layer 2 (Microcompact):  198K → 142K (saved 56K from old tool results)
  Layer 3 (Auto Compact):  142K → 89K  (saved 53K from conversation summary)
  Layer 4 (Hard Truncate): Not triggered

Final context: 89K tokens (44% of window)
Session completed successfully ✅
```

```mermaid
graph LR
    subgraph "Quality-Savings Tradeoff"
        S["Snip<br/>Quality: 95%<br/>Savings: 30%"] --> M["Microcompact<br/>Quality: 85%<br/>Savings: 60%"]
        M --> A["Auto Compact<br/>Quality: 70%<br/>Savings: 80%"]
        A --> H["Hard Truncate<br/>Quality: 40%<br/>Savings: 95%"]
    end
    style S fill:#4ade80
    style M fill:#a3e635
    style A fill:#facc15
    style H fill:#ef4444
```

Each layer trades more quality for more savings. The progressive design means you only pay the quality cost when absolutely necessary.
Chatbot Systems
Any long-running conversation that accumulates history. Without compression, chat sessions have a hard ceiling.
Agent Frameworks
LangChain, AutoGen, CrewAI — any framework that accumulates tool results needs a compression strategy.
Document Processing
Systems that process large documents in chunks. Each chunk’s output needs eventual compression.
Multi-Turn Reasoning
Complex reasoning tasks (e.g., code review, debugging) that require many back-and-forth turns.
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| “Just use a bigger model” | Context windows have hard limits, and attention compute scales quadratically with context length | Compress proactively |
| Compress everything equally | Recent context is more valuable than old | Progressive, recency-weighted |
| Compress only when full | By then it’s too late; emergency truncation loses quality | Start at 60% capacity |
| Never compress system prompt | ✅ This is correct | Keep doing it |
| Summarize with the main model | Expensive; uses the same context you’re trying to save | Use a smaller, faster model |
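Pulling the thresholds together, the escalation ladder can be sketched as a small dispatcher. This is a hypothetical `layerFor` helper using the default trigger points from this chapter, not an API from Claude Code:

```typescript
type Layer = 'none' | 'snip' | 'microcompact' | 'autoCompact' | 'hardTruncate';

// Map context usage (0..1) to the most aggressive layer that should run,
// following the usage diagram: 40-60% snip, 60-80% microcompact,
// 80-95% auto compact, 95%+ hard truncate. (Per-result snipping is
// also always on for oversized results, regardless of usage.)
function layerFor(usage: number): Layer {
  if (usage >= 0.95) return 'hardTruncate';
  if (usage >= 0.8) return 'autoCompact';
  if (usage >= 0.6) return 'microcompact';
  if (usage >= 0.4) return 'snip';
  return 'none';
}
```

A loop like this, run before every model call, is what turns the anti-pattern "compress only when full" into the proactive strategy the table recommends.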