84994f9282
Move persistent memory from the user message into system blocks with cache_control: ephemeral on the last block. The static prefix (system prompt + memory, ~3-5K tokens typically) is identical between the two LLM calls of a tool_use round-trip and stable across turns within the 5-minute cache TTL. Without this, the tool-call retrieval architecture roughly doubled input token cost on retrieval-needed turns (full context billed twice). With cache reads at ~10% of standard input, the duplication cost drops by ~90% — the "twice as expensive" hit becomes "slightly more expensive plus tool overhead." client_time stays in the user message (per-turn dynamic, should not be in the cached prefix).