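Error Handling

Claude Code's error handling is elaborate because the tool has to cope gracefully with many kinds of API failure, from transient rate limits and authentication errors to a full API outage. The system centers on withRetry.ts and errors.ts in src/services/api/.

Error Handling Architecture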

graph TB
    subgraph "API Call"
        CALL["claude.ts → API request"]
    end

    subgraph "withRetry.ts"
        RETRY["Retry loop (max 10 attempts)"]
        CLASS["Error classification"]
        BACKOFF["Exponential backoff + jitter"]
        FASTMODE["Fast mode fallback"]
        FALLBACK["Model fallback (529)"]
        PERSIST["Persistent retry (unattended)"]
    end

    subgraph "errors.ts"
        MSG["User-facing error messages"]
        CLASSIFY["Error type classification"]
        ANALYTICS["Analytics tagging"]
    end

    subgraph "Consumer"
        UI["REPL UI display"]
        SDK_OUT["SDK error messages"]
    end

    CALL --> RETRY
    RETRY -->|Retryable| BACKOFF
    RETRY -->|Fast mode| FASTMODE
    RETRY -->|Repeated 529| FALLBACK
    RETRY -->|Unattended 429/529| PERSIST
    RETRY -->|Non-retryable| CLASS
    CLASS --> MSG
    MSG --> UI
    MSG --> SDK_OUT
    BACKOFF --> RETRY
    FASTMODE --> RETRY

The withRetry Pattern

The withRetry() function in src/services/api/withRetry.ts is an async generator that wraps an API call in retry logic:

export async function* withRetry<T>(
  getClient: () => Promise<Anthropic>,
  operation: (client: Anthropic, attempt: number, context: RetryContext) => Promise<T>,
  options: RetryOptions,
): AsyncGenerator<SystemAPIErrorMessage, T> {
  const maxRetries = getMaxRetries(options)  // Default: 10
  const retryContext: RetryContext = {
    model: options.model,
    thinkingConfig: options.thinkingConfig,
    ...(isFastModeEnabled() && { fastMode: options.fastMode }),
  }

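  let client: Anthropic | null = null
  let consecutive529Errors = options.initialConsecutive529Errors ?? 0
  let lastError: unknown  // Most recent failure, reported if retries run out

  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    if (options.signal?.aborted) throw new APIUserAbortError()

    try {
      // Build the client on first use, or rebuild it after an auth error on
      // the last attempt (simplified here to a 401 check)
      const lastAttemptWasAuthError =
        lastError instanceof APIError && lastError.status === 401
      if (client === null || lastAttemptWasAuthError) {
        client = await getClient()
      }
      return await operation(client, attempt, retryContext)
    } catch (error) {
      lastError = error
      // Classification and retry logic...
    }
  }
  throw new CannotRetryError(lastError, retryContext)
}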

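Key design point: while it waits between attempts, withRetry yields SystemAPIErrorMessage values (for display in the UI), and it delivers the final result through the generator's return value.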
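Because the final result travels through the generator's return value, a caller cannot rely on for-await-of, which discards that value. The following is a minimal sketch of one way to drain such a generator. StatusMessage, retryingOperation, and drain are hypothetical names for illustration, not part of the Claude Code source.

```typescript
// Hypothetical stand-in for the status messages that withRetry yields.
type StatusMessage = { kind: 'retrying'; attempt: number }

// Toy generator with the same shape as withRetry: it yields a status message
// per failed attempt, then hands back the result as its return value.
async function* retryingOperation(
  failuresBeforeSuccess: number,
): AsyncGenerator<StatusMessage, string> {
  for (let attempt = 1; attempt <= failuresBeforeSuccess; attempt++) {
    yield { kind: 'retrying', attempt }
  }
  return 'final result'
}

// Call next() by hand until done is true, forwarding each yielded status
// to a callback and returning the generator's return value.
async function drain<Y, R>(
  gen: AsyncGenerator<Y, R>,
  onStatus: (status: Y) => void,
): Promise<R> {
  while (true) {
    const step = await gen.next()
    if (step.done) return step.value
    onStatus(step.value)
  }
}
```

A caller would pass a UI callback as onStatus and await drain(...) to obtain the final result.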

Distinguishing HTTP Errors

Claude Code classifies errors by HTTP status code and handles each class differently:

429: Rate Limit

// Rate limit handling depends on subscription type
if (error.status === 429) {
  // ClaudeAI subscribers (Max/Pro): don't retry (wait could be hours)
  // Enterprise subscribers: retry (typically PAYG, short limits)
  // API key users: retry with backoff
  return !isClaudeAISubscriber() || isEnterpriseSubscriber()
}

Rate-limit responses carry headers that guide this behavior:

// Headers checked for rate limiting
'anthropic-ratelimit-unified-representative-claim'  // 'five_hour' | 'seven_day'
'anthropic-ratelimit-unified-overage-status'         // 'allowed' | 'rejected'
'anthropic-ratelimit-unified-reset'                  // Unix timestamp
'anthropic-ratelimit-unified-overage-disabled-reason' // Why extra usage is blocked
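To make the header semantics concrete, here is a small sketch that parses the four headers into a typed object. The header names and value sets come from the list above. HeaderReader, UnifiedRateLimitInfo, and readUnifiedRateLimit are hypothetical names, and treating the reset value as Unix seconds is an assumption based on the reset-delay code later in this document.

```typescript
// Minimal header accessor (hypothetical); a fetch Headers object satisfies it.
interface HeaderReader {
  get(key: string): string | null
}

interface UnifiedRateLimitInfo {
  claim: 'five_hour' | 'seven_day' | null
  overageAllowed: boolean | null
  resetAt: Date | null
  overageDisabledReason: string | null
}

function readUnifiedRateLimit(headers: HeaderReader): UnifiedRateLimitInfo {
  const claim = headers.get('anthropic-ratelimit-unified-representative-claim')
  const overage = headers.get('anthropic-ratelimit-unified-overage-status')
  const rawReset = headers.get('anthropic-ratelimit-unified-reset')
  // Assumption: the reset header is a Unix timestamp in seconds.
  const resetSec = rawReset ? Number(rawReset) : NaN
  return {
    // Unrecognized or missing values become null instead of being guessed.
    claim: claim === 'five_hour' || claim === 'seven_day' ? claim : null,
    overageAllowed:
      overage === 'allowed' ? true : overage === 'rejected' ? false : null,
    resetAt: Number.isFinite(resetSec) ? new Date(resetSec * 1000) : null,
    overageDisabledReason: headers.get(
      'anthropic-ratelimit-unified-overage-disabled-reason',
    ),
  }
}
```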

529: Server Overloaded

export function is529Error(error: unknown): boolean {
  if (!(error instanceof APIError)) return false
  return (
    error.status === 529 ||
    // SDK sometimes fails to pass 529 status during streaming
    (error.message?.includes('"type":"overloaded_error"') ?? false)
  )
}

529 errors are handled according to the query source, and only foreground queries are retried:

// Only foreground sources retry on 529 to avoid amplification
const FOREGROUND_529_RETRY_SOURCES = new Set<QuerySource>([
  'repl_main_thread',
  'sdk',
  'agent:custom',
  'agent:default',
  'compact',
  'auto_mode',
  // Background sources (summaries, titles, classifiers) bail immediately
])

function shouldRetry529(querySource: QuerySource | undefined): boolean {
  return querySource === undefined || FOREGROUND_529_RETRY_SOURCES.has(querySource)
}

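Design rationale: during a capacity cascade, each retry amplifies load by 3-10x. Background queries (title generation, suggestions) are invisible to the user, so they should fail silently rather than deepen the cascade.

500+: Server Errors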

// Always retry internal server errors
if (error.status && error.status >= 500) return true

401: Authentication Errors

if (error.status === 401) {
  // Clear cached API key and retry
  clearApiKeyHelperCache()

  // For OAuth: force token refresh
  if (lastError instanceof APIError && lastError.status === 401) {
    const failedAccessToken = getClaudeAIOAuthTokens()?.accessToken
    if (failedAccessToken) {
      await handleOAuth401Error(failedAccessToken)
    }
  }

  return true  // Retry with refreshed credentials
}

Connection Errors (ECONNRESET/EPIPE)

function isStaleConnectionError(error: unknown): boolean {
  if (!(error instanceof APIConnectionError)) return false
  const details = extractConnectionErrorDetails(error)
  return details?.code === 'ECONNRESET' || details?.code === 'EPIPE'
}

// On stale connection: disable keep-alive and reconnect
if (isStaleConnection) {
  disableKeepAlive()
  client = await getClient()  // Force new connection
}

Exponential Backoff and Jitter

export const BASE_DELAY_MS = 500

export function getRetryDelay(
  attempt: number,
  retryAfterHeader?: string | null,
  maxDelayMs = 32000,
): number {
  // Honor server's Retry-After header if present
  if (retryAfterHeader) {
    const seconds = parseInt(retryAfterHeader, 10)
    if (!isNaN(seconds)) return seconds * 1000
  }

  // Exponential backoff: 500ms, 1s, 2s, 4s, 8s, 16s, 32s (capped)
  const baseDelay = Math.min(
    BASE_DELAY_MS * Math.pow(2, attempt - 1),
    maxDelayMs,
  )
  // Add up to 25% jitter to prevent thundering herd
  const jitter = Math.random() * 0.25 * baseDelay
  return baseDelay + jitter
}

Retry delay sequence under the default settings:

Attempt	Base delay	With jitter (approx.)
1	500ms	500-625ms
2	1,000ms	1,000-1,250ms
3	2,000ms	2,000-2,500ms
4	4,000ms	4,000-5,000ms
5	8,000ms	8,000-10,000ms
6	16,000ms	16,000-20,000ms
7+	32,000ms	32,000-40,000ms
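The table can be reproduced with a few lines of code. This sketch applies the same formula as getRetryDelay, without the Retry-After branch. The injectable random parameter and the name backoffDelay are assumptions added here so that the bounds are repeatable.

```typescript
const BASE_DELAY_MS = 500

// Same backoff formula as above; `random` is injected (hypothetical parameter)
// so the jitter can be pinned to its extremes.
function backoffDelay(
  attempt: number,
  random: () => number = Math.random,
  maxDelayMs = 32000,
): number {
  const baseDelay = Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), maxDelayMs)
  return baseDelay + random() * 0.25 * baseDelay
}

const attemptNumbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

// Jitter pinned to 0: the "Base delay" column.
const lowerBounds = attemptNumbers.map(a => backoffDelay(a, () => 0))
// Jitter pinned to 1: the upper edge of the "With jitter" column.
// Math.random() never returns exactly 1, so real delays stay just below this.
const upperBounds = attemptNumbers.map(a => backoffDelay(a, () => 1))

// Total time spent waiting across 10 retries, before any Retry-After override.
const totalLowerMs = lowerBounds.reduce((sum, d) => sum + d, 0) // 159500
const totalUpperMs = upperBounds.reduce((sum, d) => sum + d, 0) // 199375
```

With the default cap, ten retries therefore add roughly 2.7 to 3.3 minutes of waiting in total.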

Fast Mode → Normal Mode Fallback

When fast mode is active, a 429 or 529 error triggers a fallback mechanism:

const SHORT_RETRY_THRESHOLD_MS = 20 * 1000      // 20 seconds
const MIN_COOLDOWN_MS = 10 * 60 * 1000           // 10 minutes
const DEFAULT_FAST_MODE_FALLBACK_HOLD_MS = 30 * 60 * 1000  // 30 minutes

if (wasFastModeActive && (error.status === 429 || is529Error(error))) {
  const retryAfterMs = getRetryAfterMs(error)

  if (retryAfterMs !== null && retryAfterMs < SHORT_RETRY_THRESHOLD_MS) {
    // Short retry-after (<20s): wait and retry with fast mode still active
    // Preserves prompt cache (same model name)
    await sleep(retryAfterMs, options.signal)
    continue
  }

  // Long or unknown retry-after: enter cooldown
  const cooldownMs = Math.max(
    retryAfterMs ?? DEFAULT_FAST_MODE_FALLBACK_HOLD_MS,
    MIN_COOLDOWN_MS,
  )
  triggerFastModeCooldown(Date.now() + cooldownMs, cooldownReason)
  retryContext.fastMode = false
  continue
}

graph TB
    A["429/529 in Fast Mode"] --> B{Retry-After < 20s?}
    B -->|Yes| C["Wait & retry<br/>(keep fast mode)"]
    B -->|No| D["Enter cooldown<br/>(switch to normal)"]
    D --> E["Cooldown for<br/>max(retryAfter, 10min)"]
    E --> F["Retry with<br/>normal mode model"]

    A --> G{Overage disabled?}
    G -->|Yes| H["Permanently disable<br/>fast mode"]
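The short-versus-long branch can be isolated as a pure function, which makes the thresholds easy to check. The constants are copied from the snippet above. FastModeDecision and decideFastModeFallback are hypothetical names, and the sketch leaves out the overage-disabled branch.

```typescript
const SHORT_RETRY_THRESHOLD_MS = 20 * 1000
const MIN_COOLDOWN_MS = 10 * 60 * 1000
const DEFAULT_FAST_MODE_FALLBACK_HOLD_MS = 30 * 60 * 1000

type FastModeDecision =
  | { action: 'retry_fast'; waitMs: number }
  | { action: 'cooldown'; cooldownMs: number }

// retryAfterMs is null when the server sent no usable Retry-After value.
function decideFastModeFallback(retryAfterMs: number | null): FastModeDecision {
  if (retryAfterMs !== null && retryAfterMs < SHORT_RETRY_THRESHOLD_MS) {
    // Short wait: stay in fast mode and retry after the requested delay.
    return { action: 'retry_fast', waitMs: retryAfterMs }
  }
  // Long or unknown wait: cool down for at least MIN_COOLDOWN_MS.
  const cooldownMs = Math.max(
    retryAfterMs ?? DEFAULT_FAST_MODE_FALLBACK_HOLD_MS,
    MIN_COOLDOWN_MS,
  )
  return { action: 'cooldown', cooldownMs }
}
```

Under these constants, a 5 second Retry-After keeps fast mode, a 60 second Retry-After is raised to the 10 minute minimum cooldown, and a missing Retry-After falls back to the 30 minute default.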

Model Fallback (529 → Sonnet)

After 3 consecutive 529 errors, Claude Code can switch from the primary model to a fallback model:

const MAX_529_RETRIES = 3

if (is529Error(error)) {
  consecutive529Errors++
  if (consecutive529Errors >= MAX_529_RETRIES) {
    if (options.fallbackModel) {
      // Throw special error — caller catches and retries with fallback model
      throw new FallbackTriggeredError(options.model, options.fallbackModel)
    }

    // External users with no fallback: give up
    if (process.env.USER_TYPE === 'external') {
      throw new CannotRetryError(
        new Error(REPEATED_529_ERROR_MESSAGE),
        retryContext,
      )
    }
  }
}

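FallbackTriggeredError is caught in query.ts, which reissues the API call with the fallback model.

Persistent Retry for Unattended Sessions

For unattended (headless) sessions, Claude Code supports persistent retry, which retries indefinitely and emits keep-alive heartbeats: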
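The caller side of this handoff might look roughly like the sketch below. It is not the code in query.ts: the FallbackTriggeredError class here is a simplified stand-in, and queryWithFallback and callModel are hypothetical names.

```typescript
// Simplified stand-in for the FallbackTriggeredError described above.
class FallbackTriggeredError extends Error {
  readonly originalModel: string
  readonly fallbackModel: string

  constructor(originalModel: string, fallbackModel: string) {
    super(`Falling back from ${originalModel} to ${fallbackModel}`)
    this.originalModel = originalModel
    this.fallbackModel = fallbackModel
    // Keeps instanceof working even when compiled to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

// callModel is a hypothetical placeholder for the real API call path.
async function queryWithFallback<T>(
  model: string,
  callModel: (model: string) => Promise<T>,
): Promise<T> {
  try {
    return await callModel(model)
  } catch (error) {
    if (error instanceof FallbackTriggeredError) {
      // Reissue the request once against the fallback model.
      return await callModel(error.fallbackModel)
    }
    throw error
  }
}
```

Signaling the switch with a dedicated error type keeps the retry loop unaware of how the caller rebuilds the request for a different model.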

const PERSISTENT_MAX_BACKOFF_MS = 5 * 60 * 1000     // 5 minutes max backoff
const PERSISTENT_RESET_CAP_MS = 6 * 60 * 60 * 1000   // 6 hours max wait
const HEARTBEAT_INTERVAL_MS = 30_000                   // 30 second heartbeats

function isPersistentRetryEnabled(): boolean {
  return isEnvTruthy(process.env.CLAUDE_CODE_UNATTENDED_RETRY)
}

When persistent retry is active:

if (persistent) {
  // Chunk long sleeps to emit heartbeats
  let remaining = delayMs
  while (remaining > 0) {
    if (options.signal?.aborted) throw new APIUserAbortError()

    // Yield status message as heartbeat
    yield createSystemAPIErrorMessage(error, remaining, reportedAttempt, maxRetries)

    const chunk = Math.min(remaining, HEARTBEAT_INTERVAL_MS)
    await sleep(chunk, options.signal)
    remaining -= chunk
  }

  // Clamp the attempt counter so the for-loop never terminates
  if (attempt >= maxRetries) attempt = maxRetries
}

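Why heartbeats? The host environment (a CI system or an orchestrator) may kill an idle session. Each yielded SystemAPIErrorMessage produces stdout activity via QueryEngine, which keeps the session alive.

For 429 errors that carry a rate-limit reset header, persistent retry honors the exact reset time: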
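The chunking arithmetic is easier to see in isolation. This sketch reuses HEARTBEAT_INTERVAL_MS from the snippet above and returns the sequence of sleep durations, with one heartbeat emitted before each chunk. heartbeatChunks is a hypothetical helper, not a function in the source.

```typescript
const HEARTBEAT_INTERVAL_MS = 30_000

// Split one long delay into sleeps no longer than the heartbeat interval.
// The loop mirrors the `remaining` bookkeeping in the persistent retry code.
function heartbeatChunks(
  delayMs: number,
  intervalMs = HEARTBEAT_INTERVAL_MS,
): number[] {
  const chunks: number[] = []
  let remaining = delayMs
  while (remaining > 0) {
    const chunk = Math.min(remaining, intervalMs)
    chunks.push(chunk)
    remaining -= chunk
  }
  return chunks
}

// A 75 second delay becomes three sleeps: [30000, 30000, 15000]
const exampleChunks = heartbeatChunks(75_000)
```

At the 5 minute maximum backoff, this works out to 10 heartbeats per wait.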

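function getRateLimitResetDelayMs(error: APIError): number | null {
  const resetHeader = error.headers?.get?.('anthropic-ratelimit-unified-reset')
  if (!resetHeader) return null
  const resetUnixSec = Number(resetHeader)
  // Ignore a malformed header instead of sleeping for NaN milliseconds
  if (!Number.isFinite(resetUnixSec)) return null
  const delayMs = resetUnixSec * 1000 - Date.now()
  // A reset time in the past gives no useful delay
  if (delayMs <= 0) return null
  return Math.min(delayMs, PERSISTENT_RESET_CAP_MS)
}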

Error Classification for Analytics

The classifyAPIError() function in errors.ts maps errors to normalized tags:

export function classifyAPIError(error: unknown): string {
  // error is unknown, so read the message defensively before matching on it
  const message = error instanceof Error ? error.message : ''
  if (error instanceof APIConnectionTimeoutError) return 'api_timeout'
  if (message.includes(REPEATED_529_ERROR_MESSAGE)) return 'repeated_529'
  if (error instanceof APIError && error.status === 429) return 'rate_limit'
  if (error instanceof APIError && error.status === 529) return 'server_overload'
  if (message.includes('prompt is too long')) return 'prompt_too_long'
  if (message.includes('x-api-key')) return 'invalid_api_key'
  // status may be undefined on some APIError instances
  if (error instanceof APIError && (error.status ?? 0) >= 500) return 'server_error'
  if (error instanceof APIConnectionError) {
    const details = extractConnectionErrorDetails(error)
    if (details?.isSSLError) return 'ssl_cert_error'
    return 'connection_error'
  }
  return 'unknown'
}

Full classification taxonomy:

Error type	HTTP status	Description
`api_timeout`	n/a	Connection timed out
`rate_limit`	429	Rate limited
`server_overload`	529	API overloaded
`repeated_529`	529	3 or more consecutive 529s
`prompt_too_long`	400	Input exceeds the context window
`pdf_too_large`	400	PDF exceeds the page limit
`image_too_large`	400	Image exceeds the size limit
`tool_use_mismatch`	400	tool_use/tool_result pairing error
`invalid_model`	400	Model name not recognized
`credit_balance_low`	n/a	Insufficient credit balance
`invalid_api_key`	401	Invalid API key
`token_revoked`	403	OAuth token revoked
`auth_error`	401/403	Generic authentication failure
`server_error`	500+	Internal server error
`connection_error`	n/a	Network connection failure
`ssl_cert_error`	n/a	SSL/TLS certificate problem

User-Facing Error Messages

The getAssistantMessageFromError() function converts API errors into user-friendly messages:

export function getAssistantMessageFromError(
  error: unknown,
  model: string,
): AssistantMessage {
  // Timeout → "Request timed out"
  // Image too large → "Image was too large. Try resizing..."
  // Prompt too long → "Prompt is too long"
  // 429 with headers → Specific rate limit message with reset time
  // 401 → "Please run /login" or "Invalid API key"
  // 403 OAuth revoked → "OAuth token revoked · Please run /login"
  // 529 → "Repeated 529 Overloaded errors"
  // Generic → "API Error: {message}"
}

Context-Aware Messages

Error messages adapt to the execution context:

// Interactive mode gets UI hints
'PDF too large. Double press esc to go back and try again'

// SDK/headless mode gets actionable advice
'PDF too large. Try reading the file a different way (e.g., extract text with pdftotext).'
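A selection like this can be sketched as a single branch on the execution mode. The two strings are the ones quoted above. The pdfTooLargeMessage function and its isInteractive flag are hypothetical, since the source does not show how the mode is detected.

```typescript
// isInteractive is a hypothetical flag: true for the REPL, false for SDK/headless runs.
function pdfTooLargeMessage(isInteractive: boolean): string {
  return isInteractive
    ? // A person at the keyboard can act on a UI hint.
      'PDF too large. Double press esc to go back and try again'
    : // A headless caller needs advice it can act on programmatically.
      'PDF too large. Try reading the file a different way (e.g., extract text with pdftotext).'
}
```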

Error Handling Flow Summary

flowchart TD
    ERR["API Error"] --> IS_ABORT{Aborted?}
    IS_ABORT -->|Yes| THROW_ABORT["Throw APIUserAbortError"]
    IS_ABORT -->|No| IS_FAST{Fast mode active?}

    IS_FAST -->|Yes| FAST_429{429/529?}
    FAST_429 -->|Short retry| FAST_RETRY["Wait, keep fast mode"]
    FAST_429 -->|Long retry| FAST_COOL["Cooldown, switch normal"]

    IS_FAST -->|No| IS_529{529?}
    IS_529 -->|Yes| FG{Foreground query?}
    FG -->|No| DROP["Drop immediately<br/>(no amplification)"]
    FG -->|Yes| COUNT{3+ consecutive?}
    COUNT -->|Yes| FALLBACK["FallbackTriggeredError<br/>(switch model)"]
    COUNT -->|No| RETRY_529["Retry with backoff"]

    IS_529 -->|No| IS_429{429?}
    IS_429 -->|Yes| SUB{Subscriber type?}
    SUB -->|ClaudeAI Max/Pro| NO_RETRY["Show rate limit message"]
    SUB -->|Enterprise/API| RETRY_429["Retry with backoff"]

    IS_429 -->|No| IS_AUTH{401/403?}
    IS_AUTH -->|Yes| REFRESH["Refresh credentials, retry"]
    IS_AUTH -->|No| IS_5XX{5xx?}
    IS_5XX -->|Yes| RETRY_5XX["Retry with backoff"]
    IS_5XX -->|No| CANNOT_RETRY["CannotRetryError"]
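The flowchart can also be read as a single decision function. The sketch below encodes only the branches drawn above. ErrorSituation, RetryAction, and decideRetryAction are hypothetical names. Two points are assumptions, because the chart does not draw them: an error that is neither 429 nor 529 in fast mode falls through to the remaining checks, and the no-fallback-model branch is left out.

```typescript
type SubscriberKind = 'claude_ai' | 'enterprise' | 'api_key'

interface ErrorSituation {
  aborted: boolean
  fastModeActive: boolean
  httpStatus: number | undefined
  shortRetryAfter: boolean // Retry-After present and under the short threshold
  foreground: boolean // query source is a foreground source
  consecutive529: number
  subscriber: SubscriberKind
}

type RetryAction =
  | 'abort'
  | 'fast_retry'
  | 'fast_cooldown'
  | 'drop'
  | 'fallback_model'
  | 'retry_with_backoff'
  | 'show_rate_limit'
  | 'refresh_and_retry'
  | 'cannot_retry'

function decideRetryAction(s: ErrorSituation): RetryAction {
  if (s.aborted) return 'abort'
  const is429or529 = s.httpStatus === 429 || s.httpStatus === 529
  if (s.fastModeActive && is429or529) {
    return s.shortRetryAfter ? 'fast_retry' : 'fast_cooldown'
  }
  if (s.httpStatus === 529) {
    // Background queries bail immediately to avoid amplifying load.
    if (!s.foreground) return 'drop'
    return s.consecutive529 >= 3 ? 'fallback_model' : 'retry_with_backoff'
  }
  if (s.httpStatus === 429) {
    return s.subscriber === 'claude_ai' ? 'show_rate_limit' : 'retry_with_backoff'
  }
  if (s.httpStatus === 401 || s.httpStatus === 403) return 'refresh_and_retry'
  if (s.httpStatus !== undefined && s.httpStatus >= 500) return 'retry_with_backoff'
  return 'cannot_retry'
}
```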