84994f92827e411e7be9d10ce6a1b85e65046f56
Move persistent memory from the user message into system blocks with cache_control: ephemeral on the last block. The static prefix (system prompt + memory, ~3-5K tokens typically) is identical between the two LLM calls of a tool_use round-trip and stable across turns within the 5-minute cache TTL. Without this, the tool-call retrieval architecture roughly doubled input token cost on retrieval-needed turns (full context billed twice). With cache reads at ~10% of standard input, the duplication cost drops by ~90% — the "twice as expensive" hit becomes "slightly more expensive plus tool overhead." client_time stays in the user message (per-turn dynamic, should not be in the cached prefix).
Description
No description provided
Languages
Python
95.9%
HTML
3.7%
Shell
0.4%