aaronAI

T

aaron 84994f9282 api.py: prompt-cache system prompt and memory across tool_use round-trip

Move persistent memory from the user message into system blocks with
cache_control: ephemeral on the last block. The static prefix (system prompt +
memory, ~3-5K tokens typically) is identical between the two LLM calls of a
tool_use round-trip and stable across turns within the 5-minute cache TTL.

Without this, the tool-call retrieval architecture roughly doubled input
token cost on retrieval-needed turns (full context billed twice). With cache
reads at ~10% of standard input, the duplication cost drops by ~90% — the
"twice as expensive" hit becomes "slightly more expensive plus tool overhead."

client_time stays in the user message (per-turn dynamic, should not be in the
cached prefix).

2026-05-19 23:13:43 +00:00

deprecated

chore: archive deprecated chromadb and migration scripts

2026-04-28 00:15:46 +00:00

docs

docs: OCR install record for 2026-05-04

2026-05-04 16:58:30 +00:00

experiments

embeddings: backfill type and created_at (Improvement #2 part A)

2026-05-03 23:58:53 +00:00

scripts

api.py: prompt-cache system prompt and memory across tool_use round-trip