Bug: when a row in stage_3_queue gets re-enqueued (same source ingested
again after Stage 2 re-runs), the ON CONFLICT (source) DO UPDATE clause
updated content fields and reset enqueued_at, completed_at, failed_at,
attempts — but did not reset started_at, failure_reason, or
external_job_id.
Stale started_at from a prior attempt makes the row invisible to the
Stage 3 worker's claim filter (which uses started_at IS NULL). The row
sits queued forever; Stage 3 never picks it up; the source effectively
fails silently after a re-trigger.
Discovered tonight while testing the bulk pathway after the substrate
fix: a journal entry that had been ingested earlier (and manually marked
completed during recovery from a worker timeout) showed enqueued_at
from the new touch but started_at from the original 01:40 attempt. The fix
extends the upsert clause to NULL all run-state fields, including
started_at, failure_reason, and external_job_id, so a re-enqueue behaves
as a fresh attempt.
After the fix, the re-triggered journal entry routed cleanly through Stage 2 →
Stage 3 → bulk pathway → sidecar bulk job → 60ms commit (worst-case
dedup against already-known content).
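
A minimal sketch of the extended upsert, for illustration only. It uses an in-memory SQLite database as a stand-in for Postgres (both accept this ON CONFLICT ... DO UPDATE form). The trimmed schema, the placeholder timestamps, and resetting attempts to 0 are assumptions; only the column names come from the message above.

```python
import sqlite3

# Hypothetical, trimmed-down schema: just the columns named in the message.
SCHEMA = """
CREATE TABLE stage_3_queue (
    source          TEXT PRIMARY KEY,
    content         TEXT,
    enqueued_at     TEXT,
    started_at      TEXT,
    completed_at    TEXT,
    failed_at       TEXT,
    failure_reason  TEXT,
    external_job_id TEXT,
    attempts        INTEGER NOT NULL DEFAULT 0
)
"""

# Re-enqueue clears every run-state field so the row reads as a fresh attempt.
UPSERT = """
INSERT INTO stage_3_queue (source, content, enqueued_at)
VALUES (?, ?, ?)
ON CONFLICT (source) DO UPDATE SET
    content         = excluded.content,
    enqueued_at     = excluded.enqueued_at,
    started_at      = NULL,
    completed_at    = NULL,
    failed_at       = NULL,
    failure_reason  = NULL,
    external_job_id = NULL,
    attempts        = 0
"""


def enqueue(conn, source, content, now):
    conn.execute(UPSERT, (source, content, now))
    conn.commit()


def claimable_sources(conn):
    # Stand-in for the worker's claim filter, which keys on started_at IS NULL.
    rows = conn.execute(
        "SELECT source FROM stage_3_queue WHERE started_at IS NULL ORDER BY source"
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    enqueue(conn, "journal.md", "v1", "t0")
    # Simulate a prior attempt that left run-state behind.
    conn.execute(
        "UPDATE stage_3_queue SET started_at = 't1', completed_at = 't2', attempts = 1"
    )
    conn.commit()
    before = claimable_sources(conn)  # row is hidden from the claim filter
    enqueue(conn, "journal.md", "v2", "t3")
    after = claimable_sources(conn)   # re-enqueue makes it claimable again
    return before, after
```

Calling demo() returns the claimable sources before and after the re-enqueue, which is the behavior the bug report describes: hidden while started_at is stale, visible once the upsert clears it.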
Major rewrite of the Graphiti sidecar. Two architectural changes:
PATTERN 1 ASYNC JOB MODEL
Submission and completion are decoupled. POST /episodes and
POST /episodes/bulk return job_id immediately; the actual graphiti-core
work happens in a background asyncio task. Submitters poll
GET /jobs/{job_id} until terminal status (committed | failed).
Why: tonight's smoke test confirmed that bulk ingest against the
4,222-entity graph was committing successfully even when the worker's
HTTP read-timeout fired. The synchronous interface was producing
false-negative failures — work succeeded but the worker stopped
listening at the 10-minute read-timeout. The three days of 'saga deadlock'
failures are now reframed as a scaling pathology of unindexed similarity
search, not substrate deadlocks. Pattern 1 separates submission from
completion observation so the worker can't false-negative this way.
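
The submitter side of this pattern can be sketched roughly as below. This is not the worker's actual code: fetch_status is a hypothetical callable wrapping GET /jobs/{job_id}, the response is assumed to be a dict with a "status" key, and the overall deadline is an added safeguard rather than something the message specifies.

```python
import time

TERMINAL_STATUSES = {"committed", "failed"}


def wait_for_job(fetch_status, job_id, interval_s=2.0, max_wait_s=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a job until it reaches a terminal status.

    fetch_status(job_id) is assumed to return a dict with a "status" key.
    Raises TimeoutError if the job is still non-terminal after max_wait_s.
    """
    deadline = clock() + max_wait_s
    while True:
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job['status']} after {max_wait_s}s"
            )
        sleep(interval_s)
```

Because each poll is a short request, a slow graph commit no longer looks like a failed HTTP call; the submitter only gives up when its own deadline passes.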
Architectural commitments:
- One in-flight job per sidecar (per graph). Concurrent jobs against
the same graph would race on graphiti-core's bulk-resolve path (no
transaction boundary). Concurrent multi-tenancy is 'run multiple
sidecars,' not 'make one sidecar concurrency-safe across graphs.'
- Postgres-backed job state. Survives sidecar restart. On startup the
sidecar resets any 'running' rows to 'queued' (their previous run
died); the background worker picks them up naturally.
- Both endpoints async-shaped for parity. Bulk pathway preserved —
load-bearing for first-run corpus migration. Single-episode
preserved — load-bearing for state-superseding content per the
Stage 2/3 routing rule. graphiti-core's add_episode and
add_episode_bulk are unchanged underneath; the async wrapper sits
between the HTTP layer and the library call.
- Polling cadence: 2s flat at the worker. Job claims use
FOR UPDATE SKIP LOCKED so the design is safe for future multi-sidecar
deployment without changes.
Postgres helpers (_pg, _job_insert, _job_get, _job_claim_next,
_job_complete, _job_fail, _startup_recovery) replace the synchronous
graphiti.add_episode call with persistent job state. Background worker
loop catches everything, logs everything, never dies from an unexpected
error.
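
A rough sketch of how the claim query and the never-dying loop might fit together. The table name, column names, and helper signatures are hypothetical; only the FOR UPDATE SKIP LOCKED clause, the running-to-queued startup reset, and the 2s cadence come from the notes above. The database helpers are injected as coroutines so the loop can be exercised without Postgres.

```python
import asyncio
import logging

log = logging.getLogger("sidecar.jobs")

# Hypothetical table/column names. SKIP LOCKED lets a second sidecar skip
# a row that another transaction is already claiming.
CLAIM_NEXT_SQL = """
UPDATE jobs SET status = 'running'
WHERE job_id = (
    SELECT job_id FROM jobs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING job_id, payload
"""

# Rows left 'running' by a dead process go back to the queue on startup.
STARTUP_RECOVERY_SQL = "UPDATE jobs SET status = 'queued' WHERE status = 'running'"


async def worker_loop(claim_next, run_job, complete, fail, stop, poll_s=2.0):
    """Process one job at a time until stop is set; survive unexpected errors."""
    while not stop.is_set():
        try:
            job = await claim_next()
            if job is None:
                await asyncio.sleep(poll_s)
                continue
            try:
                result = await run_job(job)
            except Exception as exc:
                log.exception("job %s failed", job["job_id"])
                await fail(job["job_id"], repr(exc))
            else:
                await complete(job["job_id"], result)
        except Exception:
            # For example a database hiccup in claim/complete/fail:
            # log it and keep looping rather than letting the task die.
            log.exception("worker loop error; continuing")
            await asyncio.sleep(poll_s)
```

Running a single loop per sidecar is what enforces the one-in-flight-job commitment in this sketch; the locking clause only matters once more than one sidecar shares the table.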
SEARCH_INTERFACE BRIDGE
graphiti-core 0.29.0 builds FalkorSearchOperations as
driver._search_ops in FalkorDriver.__init__ but never assigns it to
driver.search_interface. search_utils.py:edge_similarity_search and
node_similarity_search check 'if driver.search_interface:' and
delegate when present, falling through to interpreted-Cypher cosine
math when not. The naming mismatch between the two halves of
graphiti-core means the per-driver implementation never gets used.
Bridge after Graphiti instance construction:
driver.search_interface = driver._search_ops
This activates the per-driver path which (with our vendored patches)
uses db.idx.vector.queryNodes for FalkorDB's native vector index.
Empirical result: single-episode add_episode against a 4,277-entity
graph went from indefinite hang to 8.2 seconds.
The bridge is also a candidate for an upstream PR — pick one name and
stick to it across the codebase. Tonight it's local.
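
A guarded variant of the one-line bridge, as a sketch. The attribute names _search_ops and search_interface are the ones described above; the helper name and the two guards (skip when there is nothing to bridge, skip when something is already wired) are assumptions added so the patch degrades quietly if upstream changes.

```python
def bridge_search_interface(driver):
    """Expose driver._search_ops as driver.search_interface when unset.

    Returns True if the bridge was applied, False otherwise.
    """
    search_ops = getattr(driver, "_search_ops", None)
    if search_ops is None:
        return False  # nothing to bridge (other driver or library version)
    if getattr(driver, "search_interface", None) is not None:
        return False  # already wired; leave it alone
    driver.search_interface = search_ops
    return True
```

The return value gives the sidecar something to log at startup, which makes it visible when an upstream release makes the bridge unnecessary.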
Adds EXTRACTION_INSTRUCTIONS_V1 constant passed to the sidecar via
custom_extraction_instructions on both bulk and single-episode pathways.
graphiti-core inserts the text into entity and edge extraction prompts
only; it does NOT enter dedup prompts (that's the encoder-stays-naive
commitment).
Architectural posture: the encoder is content-naive. It does not draw on
prior knowledge of the user, the substrate, or the cycle's accumulated
work. Schema and personality live in the cycle's consolidated substrate
where the dream phase shapes them. The encoder produces source-grounded
ground truth for the cycle to work from.
Empirical validation in tonight's smoke test: 30+ verb-shaped predicates
from 3 chunks of real content, including IS_AUTOBIOGRAPHICAL_TO,
INFORMED_DESIGN_OF, EVALUATED_DOMAIN_PURITY, DISCONFIRMED_HYPOTHESIS_ABOUT.
Compare to default extraction's 4 predicate types across 22,289 edges.
RELATES_TO appears once as appropriate fallback rather than collapsing
everything generic.
Bumps WORKER_VERSION to 2.4.
Reads new routing columns from stage_3_queue (state_type, state_type_confidence,
supersedes_prior_state, state_type_rationale) and dispatches each row to one of
two ingest pathways:
- BULK pathway (existing, renamed from ingest_to_graphiti to ingest_bulk):
safer-cheaper default. Used when supersedes=false OR confidence=low OR
routing fields are NULL (legacy rows). Skips edge invalidation per
graphiti-core's bulk semantics.
- SINGLE-EPISODE pathway (new, ingest_single_episode): used only when
supersedes_prior_state=true AND confidence in {medium, high}. Per-chunk
POST to /episodes (singular endpoint) with shared saga tag. Each call
independent — own timeout, own retry envelope.
Routing decision isolated in should_route_single_episode() with unit-tested
truth table covering all eight (supersedes × confidence) combinations.
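
The routing rule above could be written roughly as follows. This is a sketch, not the worker's code: the parameter names mirror the queue columns, and reading "eight combinations" as supersedes in {true, false} crossed with confidence in {low, medium, high, NULL} is an assumption.

```python
def should_route_single_episode(supersedes_prior_state, state_type_confidence):
    """True only for superseding content with medium or high confidence.

    NULL routing fields (legacy rows) arrive as None and fall through to bulk.
    """
    return (
        supersedes_prior_state is True
        and state_type_confidence in ("medium", "high")
    )


# Assumed truth table: (supersedes, confidence) -> single-episode?
TRUTH_TABLE = {
    (True, "high"): True,
    (True, "medium"): True,
    (True, "low"): False,
    (True, None): False,
    (False, "high"): False,
    (False, "medium"): False,
    (False, "low"): False,
    (False, None): False,
}
```

Everything that is not an explicit, confident supersede lands on the bulk pathway, matching the safer-cheaper default described above.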
Per-chunk heartbeat (heartbeat_row): single-episode pathway updates
stage_3_queue.started_at after each successful chunk POST so a long-running
document doesn't cross the 10-minute stale threshold mid-process and get
re-dequeued. started_at semantics now: 'last activity timestamp' rather
than 'began at'. Best-effort; failures are logged, not raised.
Partial-success on chunk failure: previously-committed chunks stay in the
graph; the function raises with detail (single_episode_partial: chunk N/M
failed, succeeded K). The row is marked failed_at with that detail. Re-
ingestion would re-POST chunks 1..N-1 against the graph; graphiti's dedup
handles them as no-ops.
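
The heartbeat and partial-success behavior can be sketched as a single loop. post_episode and heartbeat are hypothetical injected callables standing in for the per-chunk POST to /episodes and for heartbeat_row; the error text follows the format quoted above.

```python
import logging

log = logging.getLogger("stage3")


def ingest_single_episode(chunks, post_episode, heartbeat):
    """POST chunks one at a time; return the number of chunks committed.

    A failed chunk raises with partial-success detail. Chunks committed
    before the failure are left in the graph.
    """
    total = len(chunks)
    succeeded = 0
    for index, chunk in enumerate(chunks, start=1):
        try:
            post_episode(chunk)
        except Exception as exc:
            raise RuntimeError(
                f"single_episode_partial: chunk {index}/{total} failed, "
                f"succeeded {succeeded}"
            ) from exc
        succeeded += 1
        try:
            heartbeat()  # best-effort refresh of started_at
        except Exception:
            log.warning("heartbeat failed after chunk %d/%d", index, total,
                        exc_info=True)
    return succeeded
```

Keeping the heartbeat inside its own try block is what makes it best-effort: a transient database error delays the stale-threshold refresh but never aborts an ingest that is otherwise succeeding.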
DB connection scoping: process_one no longer holds one Postgres connection
across the whole ingest call (which can run for an hour on long single-episode
documents). Each DB write gets a short-lived connection.
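
One way to express the short-lived connection scoping is a small context manager, sketched below. The connect argument is a hypothetical zero-argument factory (for example a partial over the Postgres driver's connect call); the only requirement is that the returned object has commit, rollback, and close methods.

```python
from contextlib import contextmanager


@contextmanager
def short_lived(connect):
    """Open a connection for one write, then commit (or roll back) and close."""
    conn = connect()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```

Each write in process_one would then sit in its own with block, so no connection stays open while the long ingest call is running.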
Phase A item 3 of three. Closes the mechanical-patches block. Item 4
(custom_extraction_instructions text design) is the remaining intellectual
work; sidecar and worker plumbing is now ready for it.
- BulkEpisodeRequest: new optional custom_extraction_instructions field
with comment noting graphiti-core inserts it into extract_nodes/extract_edges
prompts only, NOT dedupe prompts (verified by reading prompts directory)
- EpisodeRequest: new optional saga field, plumbed through to add_episode
for upcoming Stage 3 single-episode pathway
- Both handlers use conditional kwargs construction so existing callers
see no behavioral change
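
The conditional kwargs construction mentioned in the last item might look like this. The helper name is hypothetical, and the sketch deliberately does not reproduce graphiti-core's real call signatures; it only shows the pattern of forwarding a field when the caller actually supplied it.

```python
def optional_kwargs(**candidates):
    """Keep only the keyword arguments whose value is not None."""
    return {name: value for name, value in candidates.items() if value is not None}


# Usage sketch: a request without the new fields yields no extra kwargs,
# so the underlying library call is made exactly as before.
legacy = optional_kwargs(saga=None, custom_extraction_instructions=None)
tagged = optional_kwargs(saga="doc-123", custom_extraction_instructions=None)
```

Here legacy is an empty dict and tagged contains only the saga key, which is why existing callers see no change.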
Phase A item 1 of three. Items 2 (stage2_worker) and 3 (stage3_worker) follow.
Mirrors stage2_worker v2.1 (da98019) resilience fixes:
- Absolute paths for /usr/bin/sudo and /bin/systemctl
- Log stdout/stderr when sidecar restart fails
- Reset consecutive_failures even when wedge recovery fails (prevents
permanent stuck state if restart itself is broken)
Three classes of silent failure converted to clean terminal states:
- Mistral timeout: previously left rows in zombie state (started_at set,
failed_at null, attempts incremented past retry threshold, row invisible
to selection query). Now sets failed_at with reason
'mistral_timeout_after_300s'. Surfaced 2026-05-01 when 17 documents
accumulated in this state during the Stage 3 saga deadlock incident.
- Mistral parse failure: run_mistral returns {'error': 'parse_failed'} on
JSON decode failure but process_one wasn't checking, so empty orientation
('Active frames: . Frame relationships: ...') was shipped to Stage 3.
This is F22 from the 2026-04-30 code review. Now sets failed_at with
reason 'mistral_parse_failure'.
- Wedge recovery hammering: consecutive_failures was only reset on
successful Ollama restart. With the sudo path bug (also fixed here),
recovery always failed, so every subsequent failure re-attempted restart.
Now resets the counter regardless and logs the failure visibly.
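
The first two cases amount to mapping a Mistral outcome to a terminal failure reason before anything is shipped to Stage 3. A minimal sketch, assuming a hypothetical helper name and a timed_out flag supplied by the caller; the reason strings and the {'error': 'parse_failed'} shape are the ones quoted above.

```python
def failure_reason_for(result, timed_out=False, timeout_s=300):
    """Return a terminal failure reason, or None when the result is usable."""
    if timed_out:
        return f"mistral_timeout_after_{timeout_s}s"
    if result.get("error") == "parse_failed":
        return "mistral_parse_failure"
    return None
```

A non-None return would be written to failed_at and failure_reason, so the row ends in a visible terminal state instead of a zombie one.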
Also: subprocess.run now uses absolute paths (/usr/bin/sudo,
/bin/systemctl) instead of relying on PATH, fixing the 'No such file or
directory: sudo' error that broke Stage 2's recover_wedge() since
deployment. F45-adjacent — sudoers entries were added 2026-05-01 but the
PATH issue was masking that fix.
Worker version bumped to 2.1 to match Stage 3's resilience patch level.
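
A sketch of the wedge-recovery shape after these fixes. The function names, the 60-second timeout, and the injectable run and restart parameters are assumptions for illustration; the absolute paths, the logging of stdout/stderr, and the unconditional counter reset come from the notes above.

```python
import logging
import subprocess

log = logging.getLogger("worker.recovery")

SUDO = "/usr/bin/sudo"
SYSTEMCTL = "/bin/systemctl"


def restart_service(unit, run=subprocess.run):
    """Restart a systemd unit via absolute paths; return True on success."""
    proc = run([SUDO, SYSTEMCTL, "restart", unit],
               capture_output=True, text=True, timeout=60)
    if proc.returncode != 0:
        log.error("restart of %s failed: stdout=%r stderr=%r",
                  unit, proc.stdout, proc.stderr)
        return False
    return True


def on_failure(consecutive_failures, threshold, restart):
    """Return the new counter value after one more failure."""
    consecutive_failures += 1
    if consecutive_failures < threshold:
        return consecutive_failures
    try:
        ok = restart()
    except Exception:
        log.exception("wedge recovery raised")
        ok = False
    if not ok:
        log.error("wedge recovery failed; resetting counter anyway")
    return 0  # reset regardless, so a broken restart is not retried every time
```

Resetting to zero even on a failed restart means the next recovery attempt only happens after another full run of consecutive failures.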
Production incident 2026-05-01: F14 re-cascade attempt surfaced three
compounding issues in cascade resilience.
stage3_worker.py changes:
- MAX_CHUNKS_PER_SAGA=10 — large documents split into multiple bulk
commits, all sharing the same saga tag for Graphiti document linking.
Original implementation sent all chunks as one saga; 17-19 chunk sagas
deadlocked sidecar's Python-side coordination.
- recover_wedge() function — restarts aaronai-graphiti.service when
consecutive_failures hits threshold. Mirrors Stage 2 pattern.
- run() loop adds consecutive_failures counter with threshold-2
escalation. Resolves F28 + F29 from code review.
- Worker version bumped 2.0 -> 2.1.
- post_bulk() helper extracts shared HTTP POST + error handling.
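
The chunk-splitting change can be sketched as below. MAX_CHUNKS_PER_SAGA and the shared saga tag are from the notes above; the helper names, the payload shape passed to post_bulk, and the return value are assumptions.

```python
MAX_CHUNKS_PER_SAGA = 10


def batch_chunks(chunks, size=MAX_CHUNKS_PER_SAGA):
    """Split chunks into consecutive batches of at most `size` items."""
    return [chunks[i:i + size] for i in range(0, len(chunks), size)]


def ingest_in_batches(chunks, saga, post_bulk):
    """Send each batch as its own bulk commit, all under the same saga tag."""
    batches = batch_chunks(chunks)
    for batch in batches:
        # Hypothetical payload shape; post_bulk is an injected callable.
        post_bulk({"saga": saga, "episodes": batch})
    return len(batches)
```

Under this sketch a 19-chunk document becomes two commits, of 10 and 9 chunks, that still link to one document through the shared saga tag.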
Outside-repo changes (system config, separately documented):
- WatchdogSec=600 commented out in stage2 + stage3 systemd unit files.
Workers have no sd_notify support; per-request timeouts in code
handle the actual failure modes.
- /etc/sudoers.d/aaron-aaronai created with NOPASSWD entries for
systemctl restart ollama and restart aaronai-graphiti.service.
Stage 2's existing recover_wedge() was silently broken since
deployment due to this gap.
.gitignore — added rules for *.bak files, runtime artifacts
(watcher_heartbeat, dreamer_state.json, corpus_integrity_report.json,
watcher_state.json, watcher_status.json), Python cruft, virtual env,
.env, editor/OS files, and Aaron AI runtime data (conversations.db,
sessions.db, memory.md, settings.json).
Untracked 11 files that shouldn't have been committed in 465f2f7
(this morning): backup files and runtime artifacts.
Re-cascading Shop Class (414KB) and BirdAI-Experiments-Log.md (192KB)
through the patched worker after re-extracting full text from disk.
Cascade in progress at commit time.
- api.py: strip CV pinning workaround (parity violation, see architecture doc)
- dream.py: F1 — retrieve_graphiti() now accepts excluded_sources, over-fetches
3x and filters in-process. Was silently dropping the parameter; would have
confounded E3 with broken cross-stage exclusion in Graphiti arm.
- watcher.py + ingest.py: F14 — drop full_text[:50000] truncation. Was
propagating through entire cascade. Postgres TEXT can hold up to 1GB.
- corpus_integrity.py: F37 — same truncation, third path now clean.
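
The F1 fix (over-fetch, then filter in-process) might be sketched like this. The search callable, the dict-shaped hits with a "source" key, and over-fetching only when exclusions are supplied are all assumptions; the 3x factor is from the note above.

```python
def retrieve_filtered(search, query, limit, excluded_sources=None, overfetch=3):
    """Return up to `limit` hits whose source is not excluded.

    search(query, n) is a hypothetical callable returning up to n hits,
    each a dict with a "source" key.
    """
    excluded = set(excluded_sources or ())
    # Ask for extra results so filtering can still fill `limit` slots.
    fetch_n = limit * overfetch if excluded else limit
    hits = search(query, fetch_n)
    kept = [hit for hit in hits if hit.get("source") not in excluded]
    return kept[:limit]
```

Over-fetching is a heuristic: if more than two thirds of the top results come from excluded sources, the call can still return fewer than limit hits.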
Backups: api.py.bak.*, dream.py.bak.*, watcher.py.bak.*, ingest.py.bak.*,
corpus_integrity.py.bak.* timestamped pre-fix.
Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected
by F14, 414KB).