Pattern: Fork & Cache Reuse

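The Essence of the Pattern

When a parent agent spawns a sub-agent (a fork), the sub-agent usually needs the same system prompt, conversation context, and tool definitions. The naive approach sends an independent API request for every fork and pays full price for an identical prefix each time.

Fork & Cache Reuse uses Claude's prompt cache to make sure every sub-agent shares the same prefix, starting with the system prompt, so that already-cached tokens are reused instead of being reprocessed.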

graph TB
    subgraph "Without Cache Reuse"
        P1["Parent Agent<br/>System: 4K tokens<br/>Context: 20K tokens"]
        F1["Fork A<br/>System: 4K tokens ← re-processed<br/>Context: 20K tokens ← re-processed"]
        F2["Fork B<br/>System: 4K tokens ← re-processed<br/>Context: 20K tokens ← re-processed"]
        P1 --> F1
        P1 --> F2
    end

    subgraph "With Cache Reuse"
        P2["Parent Agent<br/>System: 4K tokens<br/>Context: 20K tokens"]
        F3["Fork A<br/>System: 4K tokens ← CACHED ✅<br/>Context: 20K tokens ← CACHED ✅"]
        F4["Fork B<br/>System: 4K tokens ← CACHED ✅<br/>Context: 20K tokens ← CACHED ✅"]
        P2 --> F3
        P2 --> F4
    end

    style P1 fill:#94a3b8
    style F1 fill:#fca5a5
    style F2 fill:#fca5a5
    style P2 fill:#4ade80
    style F3 fill:#4ade80
    style F4 fill:#4ade80

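How the Prompt Cache Works

Claude's prompt cache is based on prefix matching:

Request A: [System Prompt][Context A][User Message A]
                         ↑ cache breakpoint

Request B: [System Prompt][Context B][User Message B]
           ^^^^^^^^^^^^^^^ cache hit if the prefix is identical

The cache matches from the start of the request. The shared prefix is served from cache; matching stops at the first point where the content diverges, and everything after that point is processed at full price. This has a key implication for fork design: whatever the forks share must come first, and anything fork-specific must come last.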
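
To make the prefix-matching behavior concrete, here is a minimal sketch that models a request as an ordered list of text blocks and counts how many leading blocks two requests share. The `Block` type and the `sharedPrefixLength` helper are hypothetical names for illustration; the real cache works on the serialized request, not on this simplified model.

```typescript
// Hypothetical model: a request is an ordered list of labeled text blocks.
type Block = { label: string; text: string };

// Count how many leading blocks are identical in both requests.
// Matching stops at the first difference, mirroring prefix-based caching.
function sharedPrefixLength(a: Block[], b: Block[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i].text === b[i].text) {
    i++;
  }
  return i;
}

const requestA: Block[] = [
  { label: 'system', text: 'You are a code reviewer.' },
  { label: 'context', text: 'Context A' },
  { label: 'user', text: 'User Message A' },
];
const requestB: Block[] = [
  { label: 'system', text: 'You are a code reviewer.' },
  { label: 'context', text: 'Context B' },
  { label: 'user', text: 'User Message B' },
];

// Only the system prompt is shared, so only that block can be a cache hit.
console.log(sharedPrefixLength(requestA, requestB)); // 1
```

The same helper can double as a guard in tests: if two forks that are supposed to share a prefix report a shorter match than expected, something volatile has crept into the shared part.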

The Prefix Rule

// ✅ Cache-friendly: identical prefix
const forkARequest = {
  system: SHARED_SYSTEM_PROMPT,      // same for every fork
  messages: [
    ...sharedContext,                 // shared conversation history
    { role: 'user', content: 'Task specific to Fork A' },  // diverges here
  ],
};

const forkBRequest = {
  system: SHARED_SYSTEM_PROMPT,      // ← same prefix: cached
  messages: [
    ...sharedContext,                 // ← same prefix: cached
    { role: 'user', content: 'Task specific to Fork B' },  // diverges here
  ],
};

// ❌ Cache-unfriendly: different prefix
const divergentForkRequest = {
  system: SHARED_SYSTEM_PROMPT + '\nYou are Sub-Agent B.',  // different!
  messages: [
    { role: 'user', content: 'Fork-specific preamble' },    // different!
    ...sharedContext,                                        // too late: the prefix has already diverged
  ],
};
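
The requests above only arrange the prefix. With the Anthropic Messages API, caching is usually opted into by marking the end of the shared prefix with a `cache_control` breakpoint. The sketch below assumes that field shape (`cache_control: { type: 'ephemeral' }` on a content block) and uses hypothetical local types and helpers (`TextBlock`, `ChatMessage`, `markSharedPrefix`, `buildForkRequest`) rather than SDK types, so it should be checked against the current API reference before use.

```typescript
// Hypothetical local types; field names are assumed to follow the Messages API.
type TextBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};
type ChatMessage = { role: 'user' | 'assistant'; content: TextBlock[] };

// Return a copy of the shared messages with a cache breakpoint on the final
// block, so everything up to and including that block is eligible for caching.
function markSharedPrefix(shared: ChatMessage[]): ChatMessage[] {
  if (shared.length === 0) return [];
  const copy = shared.map(m => ({
    role: m.role,
    content: m.content.map(b => ({ ...b })),
  }));
  const lastMessage = copy[copy.length - 1];
  const lastBlock = lastMessage.content[lastMessage.content.length - 1];
  if (lastBlock) lastBlock.cache_control = { type: 'ephemeral' };
  return copy;
}

// Build a fork request: identical marked prefix first, fork-specific task last.
function buildForkRequest(system: string, shared: ChatMessage[], task: string) {
  return {
    system,
    messages: [
      ...markSharedPrefix(shared),
      { role: 'user' as const, content: [{ type: 'text' as const, text: task }] },
    ],
  };
}
```

Because every fork calls `markSharedPrefix` on the same input, the marked prefix is identical across forks, and the breakpoint sits just before the point where they diverge.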

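Economic Analysis

Scenario: Code Review with 5 File Forks

The parent agent reviews a PR and forks sub-agents to analyze each changed file independently.

Shared prefix:
  - System prompt:         4,000 tokens
  - Project context:       8,000 tokens
  - PR description:        2,000 tokens
  - Shared instructions:   1,000 tokens
  Shared prefix total:    15,000 tokens

Fork-specific suffix:
  - File content:         ~3,000 tokens (each)
  - Analysis prompt:        ~500 tokens (each)
  Total per fork:         ~3,500 tokens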

Cost comparison (Claude Sonnet pricing, assuming $3/M for fresh input and $0.30/M for cached input):

Metric	Without cache	With cache	Savings
Parent request	15,000 tokens	15,000 tokens	n/a
Fork A input	18,500 tokens	3,500 fresh + 15,000 cached	90% saved on the 15K
Fork B input	18,500 tokens	3,500 fresh + 15,000 cached	90% saved on the 15K
Fork C input	18,500 tokens	3,500 fresh + 15,000 cached	90% saved on the 15K
Fork D input	18,500 tokens	3,500 fresh + 15,000 cached	90% saved on the 15K
Fork E input	18,500 tokens	3,500 fresh + 15,000 cached	90% saved on the 15K
Total input	107,500 tokens	15,000 + 17,500 fresh + 75,000 cached	n/a
Effective cost	107,500 × $3/M	32,500 × $3/M + 75,000 × $0.30/M	~$0.32 → ~$0.12
Savings	n/a	n/a	~60%
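
The arithmetic behind the comparison can be reproduced with a few lines. This is a sketch under stated assumptions: the prices are the ones quoted for this scenario, and the 1.25× cache-write multiplier (which the comparison leaves out) is an assumed value for the surcharge on the request that first writes the prefix.

```typescript
// Prices in USD per million input tokens. The base and cached-read prices are
// the ones quoted for this scenario; the write multiplier is an assumption.
const BASE_PRICE = 3.0;
const CACHED_READ_PRICE = 0.3;
const CACHE_WRITE_MULTIPLIER = 1.25;

const SHARED_PREFIX = 15_000; // tokens shared by the parent and every fork
const FORK_SUFFIX = 3_500;    // fork-specific tokens
const FORK_COUNT = 5;

const usd = (tokens: number, pricePerMillion: number): number =>
  (tokens * pricePerMillion) / 1_000_000;

// Without caching: the parent and every fork pay full price for the prefix.
const withoutCache = usd(
  SHARED_PREFIX + FORK_COUNT * (SHARED_PREFIX + FORK_SUFFIX),
  BASE_PRICE,
);

// With caching: the parent writes the prefix once, forks read it at the cached rate.
const withCache =
  usd(SHARED_PREFIX, BASE_PRICE * CACHE_WRITE_MULTIPLIER) +
  usd(FORK_COUNT * FORK_SUFFIX, BASE_PRICE) +
  usd(FORK_COUNT * SHARED_PREFIX, CACHED_READ_PRICE);

// Fraction of input cost saved by caching.
const savings = 1 - withCache / withoutCache;

console.log({ withoutCache, withCache, savings });
```

With the write surcharge included, the cached total comes to roughly $0.13 against roughly $0.32 uncached, a saving of about 59%, so the headline figure of around 60% holds either way.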

Implementation Pattern

Step 1: Design a Cache-Friendly System Prompt

// The system prompt is structured to maximize cache reuse
function buildSystemPrompt(config: AgentConfig): string {
  // Static section: identical for every agent and fork
  const staticSection = [
    IDENTITY_PROMPT,          // "You are Claude, made by Anthropic..."
    CAPABILITIES_PROMPT,      // tool descriptions, behavior rules
    SAFETY_PROMPT,            // safety policy
  ].join('\n\n');

  // Semi-static section: differs per project, but is shared across forks
  const projectSection = [
    `Project: ${config.projectName}`,
    `Working directory: ${config.cwd}`,
    config.claudeMdContent,   // contents of CLAUDE.md
  ].join('\n\n');

  // Fork-specific content goes in messages, not in the system prompt
  return `${staticSection}\n\n${projectSection}`;
}

Step 2: Fork with a Shared Prefix

interface ForkOptions {
  task: string;
  parentMessages: Message[];
  sharedPrefixLength: number;  // how many messages make up the shared prefix
}

function createFork(
  parentSystemPrompt: string,
  options: ForkOptions,
): APIRequest {
  // Reuse the parent's message prefix so the cache can be reused
  const sharedMessages = options.parentMessages.slice(0, options.sharedPrefixLength);

  return {
    system: parentSystemPrompt,  // identical: served from cache
    messages: [
      ...sharedMessages,         // identical prefix: served from cache
      {
        role: 'user',
        content: `Sub-task: ${options.task}`,  // fork-specific: not cached
      },
    ],
  };
}

Step 3: Orchestrate Forks Against a Warm Cache

async function executeForksWithCacheReuse(
  systemPrompt: string,
  sharedMessages: Message[],
  tasks: string[],
): Promise<ForkResult[]> {
  // Step 1: the parent's last request has already warmed the cache
  // (the system prompt and shared messages are now cached)

  // Step 2: fire all forks; they all share the cached prefix
  const forkPromises = tasks.map(task =>
    callAPI({
      system: systemPrompt,      // cache hit
      messages: [
        ...sharedMessages,       // cache hit
        { role: 'user', content: `Analyze: ${task}` },
      ],
    })
  );

  // Step 3: every fork reads the shared prefix from cache at roughly a 90% discount
  return Promise.all(forkPromises);
}
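
The orchestration above relies on the parent having already written the cache. When that is not the case, forks fired at the same instant may all miss, because a cache entry generally becomes readable only after the request that writes it has started responding. One way to handle this, sketched here with hypothetical names (`ForkRequest`, `ApiCall`, `forkWithLeader`), is to run the first fork alone and fan out the rest afterwards.

```typescript
// Hypothetical request shape and API function type for illustration.
type ForkRequest = {
  system: string;
  messages: { role: 'user' | 'assistant'; content: string }[];
};
type ApiCall<R> = (request: ForkRequest) => Promise<R>;

// Run the first fork alone so its request can populate the cache,
// then launch the remaining forks in parallel against the warm prefix.
async function forkWithLeader<R>(
  callApi: ApiCall<R>,
  system: string,
  shared: ForkRequest['messages'],
  tasks: string[],
): Promise<R[]> {
  const toRequest = (task: string): ForkRequest => ({
    system,
    messages: [...shared, { role: 'user', content: `Analyze: ${task}` }],
  });
  if (tasks.length === 0) return [];

  const [leader, ...followers] = tasks;
  // Cold request: expected to write the shared prefix into the cache.
  const first = await callApi(toRequest(leader));
  // Warm requests: expected to read the shared prefix from the cache.
  const rest = await Promise.all(followers.map(t => callApi(toRequest(t))));
  return [first, ...rest];
}
```

The trade-off is latency: the followers wait for one full round trip. Unlike a dedicated warm-up request, though, the leader does useful work, so no tokens are spent purely on warming.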

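Cache Alignment Strategies

Strategy 1: Static Prefix + Dynamic Suffix

[Cached] System Prompt → Project Context → Shared History
[Fresh]  Fork-specific task description

The simplest and most effective strategy. It fits any scenario in which all forks share the same conversation context.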

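Strategy 2: Checkpoint-Based Caching

// Snapshot of the conversation that forks can share as a stable prefix
interface CacheCheckpoint {
  messages: Message[];
  tokenCount: number;
  timestamp: number;
}

// Create an explicit cache checkpoint at a conversation milestone
function createCacheCheckpoint(messages: Message[]): CacheCheckpoint {
  return {
    messages: [...messages],
    tokenCount: countTokens(messages),
    timestamp: Date.now(),
  };
}

// Forks reference the checkpoint rather than the live conversation
function forkFromCheckpoint(
  systemPrompt: string,        // must be identical for every fork
  checkpoint: CacheCheckpoint,
  task: string,
): APIRequest {
  return {
    system: systemPrompt,
    messages: [
      ...checkpoint.messages,  // identical for every fork that uses this checkpoint
      { role: 'user', content: task },
    ],
  };
}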

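Strategy 3: Layered Caching

graph LR
    L1["Layer 1: Identity<br/>~2K tokens<br/>Cache: all requests"] --> L2["Layer 2: Project<br/>~5K tokens<br/>Cache: same project"]
    L2 --> L3["Layer 3: Conversation<br/>~10K tokens<br/>Cache: forks of the same session"]
    L3 --> L4["Layer 4: Fork task<br/>~2K tokens<br/>Not cached"]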

    style L1 fill:#4ade80
    style L2 fill:#a3e635
    style L3 fill:#facc15
    style L4 fill:#94a3b8
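
Layering can be expressed by ordering content from most stable to least stable and ending each cacheable layer with its own breakpoint. The sketch below covers the first two layers and assumes that the Messages API accepts `system` as a list of text blocks, each optionally carrying `cache_control`; `SystemBlock` and `buildLayeredSystem` are hypothetical names. The conversation layer would get its breakpoint in `messages`, and the fork task gets none.

```typescript
// Hypothetical local type; field names are assumed to follow the Messages API,
// where `system` may be given as a list of text blocks.
type SystemBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};

// Order layers from most stable to least stable and end each cached layer
// with a breakpoint, so a change in a later layer leaves earlier layers valid.
function buildLayeredSystem(identity: string, project: string): SystemBlock[] {
  return [
    // Layer 1: identity, shared by all requests
    { type: 'text', text: identity, cache_control: { type: 'ephemeral' } },
    // Layer 2: project, shared by requests in the same project
    { type: 'text', text: project, cache_control: { type: 'ephemeral' } },
  ];
}
```

With this layout, switching projects invalidates only layer 2 and everything after it, while the identity layer can still be read from cache. The API limits how many breakpoints one request may carry, so layers should be chosen sparingly.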

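Limitations and Trade-offs

Cache TTL

The prompt cache has a time to live (5 minutes by default), which is refreshed each time the cached prefix is read. If the gap between fork launches exceeds the TTL, the earlier cache entry may already have expired.

// Mitigation: launch all forks as close together as possible

// ✅ Good: all forks start within milliseconds of each other
async function launchForksTogether(tasks: string[]) {
  return Promise.all(tasks.map(t => launchFork(t)));
}

// ❌ Bad: sequential with delays; any gap longer than the TTL between two
// requests means the next fork misses the cache
async function launchForksSlowly(tasks: string[]) {
  for (const task of tasks) {
    await launchFork(task);  // 30s per fork = 2.5 minutes for 5 forks
    await delay(30_000);
  }
}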
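
When forks cannot all be launched at once, it helps to track when the shared prefix was last used and to decide per launch whether the cache is plausibly still warm. The sketch below is a local heuristic only: `PrefixClock` is a hypothetical helper, the 5-minute TTL is the default mentioned above, and the safety margin is an arbitrary choice.

```typescript
// Assumed default TTL of 5 minutes.
const DEFAULT_TTL_MS = 5 * 60 * 1000;

// Track when the shared prefix was last written or read, and report whether
// a fork launched at `now` would plausibly still find it cached.
class PrefixClock {
  private lastUsedAt: number | null = null;
  private readonly ttlMs: number;

  constructor(ttlMs: number = DEFAULT_TTL_MS) {
    this.ttlMs = ttlMs;
  }

  // Call after every request that wrote or read the shared prefix.
  touch(now: number): void {
    this.lastUsedAt = now;
  }

  // The safety margin covers network latency and clock skew.
  isLikelyWarm(now: number, safetyMarginMs: number = 30_000): boolean {
    if (this.lastUsedAt === null) return false;
    return now - this.lastUsedAt < this.ttlMs - safetyMarginMs;
  }
}
```

A scheduler could call `isLikelyWarm(Date.now())` before each late fork and, when it returns false, treat that fork as a cold request that rewrites the prefix.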

Prefix Rigidity

Any change to the shared prefix invalidates the cache. This creates a tension between personalization and caching:

// ❌ Cache-unfriendly: per-fork customization in the system prompt
system: `${BASE_PROMPT}\nYou are analyzing file: ${filename}`

// ✅ Cache-friendly: customization in messages
system: BASE_PROMPT,
messages: [...shared, { role: 'user', content: `Analyze file: ${filename}` }]

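Minimum Cache Size

Claude's prompt cache has a minimum cacheable prefix length that depends on the model (typically 1,024 tokens for Sonnet-class models and 2,048 or more for some others; check the current documentation for the model in use). Prefixes below the minimum are not cached, so very short system prompts see no benefit.

Cost Overhead

Cache writes carry a small surcharge (about 25% on the request that first writes the prefix). Savings materialize only when later requests read the cache. If the prefix is written for a single fork and never read again, the total ends up slightly higher than without caching.

Scenario	Cache benefit
1 fork	❌ Net cost increase (write overhead)
2 forks	⚠️ Marginal; worthwhile mainly with a large shared prefix
3+ forks	✅ Significant savings
5+ forks	✅✅ Large savings (60%+)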
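
The break-even point can be estimated for the shared prefix alone. The sketch below assumes a dedicated warm-up request that writes the prefix at 1.25× the base price and forks that read it at 0.1×; both multipliers are assumptions derived from the roughly 25% surcharge and 90% discount quoted in this document, and the function names are hypothetical.

```typescript
// Relative prices for the shared prefix, in units of the base input price.
const WRITE_COST = 1.25; // warm-up request writes the cache (about 25% extra)
const READ_COST = 0.1;   // each fork reads it (about 90% off)

// Prefix cost when a dedicated warm-up request writes the cache
// and every fork then reads it.
function prefixCostWithCache(forks: number): number {
  return WRITE_COST + READ_COST * forks;
}

// Prefix cost when every fork pays full price.
function prefixCostWithoutCache(forks: number): number {
  return forks;
}

// Fraction saved on the shared prefix; a negative value means caching costs more.
function prefixSavings(forks: number): number {
  return 1 - prefixCostWithCache(forks) / prefixCostWithoutCache(forks);
}

for (const n of [1, 2, 3, 5]) {
  console.log(`${n} fork(s): ${(prefixSavings(n) * 100).toFixed(1)}% saved on the prefix`);
}
```

Under these assumptions a single fork pays about 35% more for the prefix, two forks save a little over a quarter, and five forks save about 65%. These figures cover the prefix only: fork-specific tokens and the warm-up request's own output dilute them, which is why small fork counts pay off mainly when the shared prefix is large.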

Reusable Template

// ============================================
// Fork & Cache Reuse manager
// ============================================

interface CacheAwareForkManager {
  // The prompt and shared messages are captured by createForkManager
  warmCache(): Promise<void>;
  fork(task: string): Promise<ForkResult>;
  forkAll(tasks: string[]): Promise<ForkResult[]>;
}

function createForkManager(
  apiClient: APIClient,
  systemPrompt: string,
  sharedMessages: Message[],
): CacheAwareForkManager {
  let cacheWarmed = false;

  return {
    async warmCache() {
      // Send a lightweight request to populate the cache
      await apiClient.complete({
        system: systemPrompt,
        messages: [...sharedMessages, { role: 'user', content: 'Acknowledge.' }],
        maxTokens: 10,
      });
      cacheWarmed = true;
    },

    async fork(task: string) {
      return apiClient.complete({
        system: systemPrompt,
        messages: [
          ...sharedMessages,
          { role: 'user', content: task },
        ],
      });
    },

    async forkAll(tasks: string[]) {
      if (!cacheWarmed) await this.warmCache();

      // Launch all forks at once to maximize cache hits
      return Promise.all(tasks.map(task => this.fork(task)));
    },
  };
}
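
Whether the manager is actually hitting the cache can be checked from the usage data returned with each response. The sketch below assumes the usage fields `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`, with `input_tokens` counting only uncached input; the `Usage` type and `cacheHitRatio` helper are hypothetical and should be checked against the API reference.

```typescript
// Assumed shape of the usage object returned with each response.
type Usage = {
  input_tokens: number;                        // uncached input tokens
  cache_creation_input_tokens?: number | null; // tokens written to the cache
  cache_read_input_tokens?: number | null;     // tokens read from the cache
};

// Fraction of all input tokens across the given responses that were read from cache.
function cacheHitRatio(usages: Usage[]): number {
  let read = 0;
  let total = 0;
  for (const u of usages) {
    const cachedRead = u.cache_read_input_tokens ?? 0;
    const cachedWrite = u.cache_creation_input_tokens ?? 0;
    read += cachedRead;
    total += u.input_tokens + cachedRead + cachedWrite;
  }
  return total === 0 ? 0 : read / total;
}
```

A ratio near zero after `forkAll` suggests that the prefix diverged between forks, fell below the minimum cacheable size, or expired before the forks ran.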

Decision Guide

graph TD
    A["Need sub-agents?"] -->|Yes| B["How many forks?"]
    A -->|No| Z["No cache optimization needed"]

    B -->|"1"| C["Skip cache optimization<br/>Write overhead is not worth it"]
    B -->|"2"| D["Use when shared prefix > 10K tokens"]
    B -->|"3 or more"| E["Always use Fork & Cache"]

    E --> F["Shared prefix > 1024 tokens?"]
    F -->|Yes| G["✅ Implement this pattern"]
    F -->|No| H["Extend the system prompt<br/>to reach the minimum"]

    G --> I["Launch forks within<br/>the cache TTL window"]
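
The flowchart can also be written as a small function, which is handy when the fork count is only known at runtime. This is a sketch that mirrors the chart: the `Decision` labels and `decideCacheStrategy` are hypothetical names, the minimum cacheable size is a parameter because it depends on the model, and treating two forks with a small prefix as "skip" is an assumption, since the chart leaves that branch implicit.

```typescript
type Decision =
  | 'no-optimization'      // no sub-agents needed
  | 'skip-cache'           // write overhead not worth it
  | 'use-cache'            // implement the pattern
  | 'extend-prompt-first'; // prefix is below the minimum cacheable size

// Thresholds mirror the flowchart; minCacheableTokens is model-dependent.
function decideCacheStrategy(
  forkCount: number,
  sharedPrefixTokens: number,
  minCacheableTokens: number = 1024,
): Decision {
  if (forkCount <= 0) return 'no-optimization';
  if (forkCount === 1) return 'skip-cache';
  if (forkCount === 2) {
    // Assumption: below 10K tokens the chart's implicit answer is to skip.
    return sharedPrefixTokens > 10_000 ? 'use-cache' : 'skip-cache';
  }
  return sharedPrefixTokens > minCacheableTokens ? 'use-cache' : 'extend-prompt-first';
}
```

For the code-review scenario (5 forks, 15,000 shared tokens) this returns `'use-cache'`; the forks should then be launched within the cache TTL window.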