b936931668
Production incident 2026-05-01: F14 re-cascade attempt surfaced three
compounding issues in cascade resilience.
stage3_worker.py changes:
- MAX_CHUNKS_PER_SAGA=10 — large documents split into multiple bulk
commits, all sharing the same saga tag for Graphiti document linking.
Original implementation sent all chunks as one saga; 17-19 chunk sagas
deadlocked sidecar's Python-side coordination.
- recover_wedge() function — restarts aaronai-graphiti.service when
consecutive_failures hits threshold. Mirrors Stage 2 pattern.
- run() loop adds consecutive_failures counter with threshold-2
escalation. Resolves F28 + F29 from code review.
- Worker version bumped 2.0 -> 2.1.
- post_bulk() helper extracts shared HTTP POST + error handling.
Outside-repo changes (system config, separately documented):
- WatchdogSec=600 commented in stage2 + stage3 systemd unit files.
Workers have no sd_notify support; per-request timeouts in code
handle the actual failure modes.
- /etc/sudoers.d/aaron-aaronai created with NOPASSWD entries for
systemctl restart ollama and restart aaronai-graphiti.service.
Stage 2's existing recover_wedge() was silently broken since
deployment due to this gap.
.gitignore — added rules for *.bak files, runtime artifacts
(watcher_heartbeat, dreamer_state.json, corpus_integrity_report.json,
watcher_state.json, watcher_status.json), Python cruft, virtual env,
.env, editor/OS files, and Aaron AI runtime data (conversations.db,
sessions.db, memory.md, settings.json).
Untracked 11 files that shouldn't have been committed in 465f2f7
(this morning): backup files and runtime artifacts.
Re-cascading Shop Class (414KB) and BirdAI-Experiments-Log.md (192KB)
through the patched worker after re-extracting full text from disk.
Cascade in progress at commit time.