Commit Graph

65 Commits

Author SHA1 Message Date
aaron 7f07972109 stage2_worker: ON CONFLICT clause resets all run-state fields on re-enqueue
Bug: when a row in stage_3_queue gets re-enqueued (same source ingested
again after Stage 2 re-runs), the ON CONFLICT (source) DO UPDATE clause
updated content fields and reset enqueued_at, completed_at, failed_at,
attempts — but did not reset started_at, failure_reason, or
external_job_id.

Stale started_at from a prior attempt makes the row invisible to the
Stage 3 worker's claim filter (which uses started_at IS NULL). The row
sits queued forever; Stage 3 never picks it up; the source effectively
fails silently after a re-trigger.

Discovered tonight while testing the bulk pathway after the substrate
fix: a journal entry that had been ingested earlier (and manually marked
completed during recovery from a worker timeout) showed enqueued_at
from the new touch but started_at from the original 01:40 attempt. Fix
extends the upsert clause to NULL all run-state fields so re-enqueue
behaves as 'fresh attempt.'

After fix, re-triggered journal entry routed cleanly through Stage 2 →
Stage 3 → bulk pathway → sidecar bulk job → 60ms commit (worst-case
dedup against already-known content).
2026-05-02 05:20:14 +00:00
aaron f645b74b1c graphiti_service: v2.0 — Pattern 1 async job model + search_interface bridge
Major rewrite of the Graphiti sidecar. Two architectural changes:

PATTERN 1 ASYNC JOB MODEL

Submission and completion are decoupled. POST /episodes and
POST /episodes/bulk return job_id immediately; the actual graphiti-core
work happens in a background asyncio task. Submitters poll
GET /jobs/{job_id} until terminal status (committed | failed).

Why: tonight's smoke test confirmed that bulk ingest against the
4,222-entity graph was committing successfully even when the worker's
HTTP read-timeout fired. The synchronous interface was producing
false-negative failures — work succeeded but the worker stopped
listening at the 10-minute read-timeout. Three days of 'saga deadlock'
failures reframe as scaling pathology of unindexed similarity search,
not substrate deadlocks. Pattern 1 separates submission from completion
observation so the worker can't false-negative this way.

Architectural commitments:

- One in-flight job per sidecar (per graph). Concurrent jobs against
  the same graph would race on graphiti-core's bulk-resolve path (no
  transaction boundary). Concurrent multi-tenancy is 'run multiple
  sidecars,' not 'make one sidecar concurrency-safe across graphs.'

- Postgres-backed job state. Survives sidecar restart. On startup the
  sidecar resets any 'running' rows to 'queued' (their previous run
  died); the background worker picks them up naturally.

- Both endpoints async-shaped for parity. Bulk pathway preserved —
  load-bearing for first-run corpus migration. Single-episode
  preserved — load-bearing for state-superseding content per the
  Stage 2/3 routing rule. graphiti-core's add_episode and
  add_episode_bulk are unchanged underneath; the async wrapper sits
  between the HTTP layer and the library call.

- Polling cadence: 2s flat at the worker, FOR UPDATE SKIP LOCKED so
  the design is safe for future multi-sidecar deployment without
  changes.

Postgres helpers (_pg, _job_insert, _job_get, _job_claim_next,
_job_complete, _job_fail, _startup_recovery) replace the synchronous
graphiti.add_episode call with persistent job state. Background worker
loop catches everything, logs everything, never dies from an unexpected
error.

SEARCH_INTERFACE BRIDGE

graphiti-core 0.29.0 builds FalkorSearchOperations as
driver._search_ops in FalkorDriver.__init__ but never assigns it to
driver.search_interface. search_utils.py:edge_similarity_search and
node_similarity_search check 'if driver.search_interface:' and
delegate when present, falling through to interpreted-Cypher cosine
math when not. The naming mismatch between the two halves of
graphiti-core means the per-driver implementation never gets used.

Bridge after Graphiti instance construction:
  driver.search_interface = driver._search_ops

This activates the per-driver path which (with our vendored patches)
uses db.idx.vector.queryNodes for FalkorDB's native vector index.
Empirical result: single-episode add_episode against a 4,277-entity
graph went from indefinite hang to 8.2 seconds.

The bridge is also a candidate for an upstream PR — pick one name and
stick to it across the codebase. Tonight it's local.
2026-05-02 05:19:46 +00:00
aaron d7b2a850c4 stage3_worker: v2.4 — encoder extraction instructions v1.0
Adds EXTRACTION_INSTRUCTIONS_V1 constant passed to the sidecar via
custom_extraction_instructions on both bulk and single-episode pathways.
graphiti-core inserts the text into entity and edge extraction prompts
only; it does NOT enter dedup prompts (that's the encoder-stays-naive
commitment).

Architectural posture: the encoder is content-naive. It does not draw on
prior knowledge of the user, the substrate, or the cycle's accumulated
work. Schema and personality live in the cycle's consolidated substrate
where the dream phase shapes them. The encoder produces source-grounded
ground truth for the cycle to work from.

Empirical validation in tonight's smoke test: 30+ verb-shaped predicates
from 3 chunks of real content, including IS_AUTOBIOGRAPHICAL_TO,
INFORMED_DESIGN_OF, EVALUATED_DOMAIN_PURITY, DISCONFIRMED_HYPOTHESIS_ABOUT.
Compare to default extraction's 4 predicate types across 22,289 edges.
RELATES_TO appears once as appropriate fallback rather than collapsing
everything generic.

Bumps WORKER_VERSION to 2.4.
2026-05-02 05:15:17 +00:00
aaron e7de7fb64b stage3_worker: v2.3 — bulk-vs-single-episode routing on Stage 2 state-type
Reads new routing columns from stage_3_queue (state_type, state_type_confidence,
supersedes_prior_state, state_type_rationale) and dispatches each row to one of
two ingest pathways:

  - BULK pathway (existing, renamed from ingest_to_graphiti to ingest_bulk):
    safer-cheaper default. Used when supersedes=false OR confidence=low OR
    routing fields are NULL (legacy rows). Skips edge invalidation per
    graphiti-core's bulk semantics.

  - SINGLE-EPISODE pathway (new, ingest_single_episode): used only when
    supersedes_prior_state=true AND confidence in {medium, high}. Per-chunk
    POST to /episodes (singular endpoint) with shared saga tag. Each call
    independent — own timeout, own retry envelope.

Routing decision isolated in should_route_single_episode() with unit-tested
truth table covering all eight (supersedes × confidence) combinations.

Per-chunk heartbeat (heartbeat_row): single-episode pathway updates
stage_3_queue.started_at after each successful chunk POST so a long-running
document doesn't cross the 10-minute stale threshold mid-process and get
re-dequeued. started_at semantics now: 'last activity timestamp' rather
than 'began at'. Best-effort; failures logged not raised.

Partial-success on chunk failure: previously-committed chunks stay in the
graph; the function raises with detail (single_episode_partial: chunk N/M
failed, succeeded K). The row is marked failed_at with that detail. Re-
ingestion would re-POST chunks 1..N-1 against the graph; graphiti's dedup
handles them as no-ops.

DB connection scoping: process_one no longer holds one Postgres connection
across the whole ingest call (which can run an hour for long single-episode
documents). Each DB write gets a short-lived connection.

Phase A item 3 of three. Closes the mechanical-patches block. Item 4
(custom_extraction_instructions text design) is the remaining intellectual
work; sidecar and worker plumbing is now ready for it.
2026-05-01 19:07:41 +00:00
aaron 70e87e3ab5 stage2_worker: v2.2 — add state-type classification for Stage 3 routing
Mistral pass now produces two concerns in a single flat JSON output:
  (a) orientation context (existing four fields, unchanged semantics)
  (b) state-type classification: state_type (current/reference/historical),
      state_type_confidence (low/medium/high), supersedes_prior_state (bool),
      state_type_rationale (text)

Routing fields written as explicit columns on stage_3_queue (separate
ALTER TABLE migration adds them: state_type, state_type_confidence,
supersedes_prior_state, state_type_rationale + index on supersedes).

Safe-cheap defaults on malformed Mistral output: state_type='reference',
confidence='low', supersedes=false. All defaults route to bulk pathway
(no temporal invalidation cost) so Mistral parse drift can't accidentally
trigger expensive single-episode ingest.

Phase A item 2 of three. Sidecar (item 1, commit 8b0a163) already plumbs
custom_extraction_instructions through to /episodes/bulk. Stage 3 routing
logic (item 3) follows.
2026-05-01 19:02:11 +00:00
aaron 8b0a163670 graphiti_service: expose custom_extraction_instructions on /episodes/bulk; add saga on /episodes
- BulkEpisodeRequest: new optional custom_extraction_instructions field
  with comment noting graphiti-core inserts it into extract_nodes/extract_edges
  prompts only, NOT dedupe prompts (verified by reading prompts directory)
- EpisodeRequest: new optional saga field, plumbed through to add_episode
  for upcoming Stage 3 single-episode pathway
- Both handlers use conditional kwargs construction so existing callers
  see no behavioral change

Phase A item 1 of three. Items 2 (stage2_worker) and 3 (stage3_worker) follow.
2026-05-01 18:57:31 +00:00
aaron 1a8e0353f5 stage3_worker: v2.2 — absolute sudo/systemctl paths, error logging, reset failure counter on recovery failure
Mirrors stage2_worker v2.1 (da98019) resilience fixes:
- Absolute paths for /usr/bin/sudo and /bin/systemctl
- Log stdout/stderr when sidecar restart fails
- Reset consecutive_failures even when wedge recovery fails (prevents
  permanent stuck state if restart itself is broken)
2026-05-01 18:40:25 +00:00
aaron da980193dd stage2_worker: v2.1 — terminal failure states + sudo path fix
Three classes of silent failure converted to clean terminal states:

- Mistral timeout: previously left rows in zombie state (started_at set,
  failed_at null, attempts incremented past retry threshold, row invisible
  to selection query). Now sets failed_at with reason
  'mistral_timeout_after_300s'. Surfaced 2026-05-01 when 17 documents
  accumulated in this state during the Stage 3 saga deadlock incident.

- Mistral parse failure: run_mistral returns {'error': 'parse_failed'} on
  JSON decode failure but process_one wasn't checking, so empty orientation
  ('Active frames: . Frame relationships: ...') was shipped to Stage 3.
  This is F22 from the 2026-04-30 code review. Now sets failed_at with
  reason 'mistral_parse_failure'.

- Wedge recovery hammering: consecutive_failures was only reset on
  successful Ollama restart. With the sudo path bug (also fixed here),
  recovery always failed, so every subsequent failure re-attempted restart.
  Now resets the counter regardless and logs the failure visibly.

Also: subprocess.run now uses absolute paths (/usr/bin/sudo,
/bin/systemctl) instead of relying on PATH, fixing the 'No such file or
directory: sudo' error that broke Stage 2's recover_wedge() since
deployment. F45-adjacent — sudoers entries were added 2026-05-01 but the
PATH issue was masking that fix.

Worker version bumped to 2.1 to match Stage 3's resilience patch level.
2026-05-01 17:28:53 +00:00
aaron b936931668 Stage 3 worker v2.1 — saga-size limit + wedge detection + sudoers fixes
Production incident 2026-05-01: F14 re-cascade attempt surfaced three
compounding issues in cascade resilience.

stage3_worker.py changes:
- MAX_CHUNKS_PER_SAGA=10 — large documents split into multiple bulk
  commits, all sharing the same saga tag for Graphiti document linking.
  Original implementation sent all chunks as one saga; 17-19 chunk sagas
  deadlocked sidecar's Python-side coordination.
- recover_wedge() function — restarts aaronai-graphiti.service when
  consecutive_failures hits threshold. Mirrors Stage 2 pattern.
- run() loop adds consecutive_failures counter with threshold-2
  escalation. Resolves F28 + F29 from code review.
- Worker version bumped 2.0 -> 2.1.
- post_bulk() helper extracts shared HTTP POST + error handling.

Outside-repo changes (system config, separately documented):
- WatchdogSec=600 commented in stage2 + stage3 systemd unit files.
  Workers have no sd_notify support; per-request timeouts in code
  handle the actual failure modes.
- /etc/sudoers.d/aaron-aaronai created with NOPASSWD entries for
  systemctl restart ollama and restart aaronai-graphiti.service.
  Stage 2's existing recover_wedge() was silently broken since
  deployment due to this gap.

.gitignore — added rules for *.bak files, runtime artifacts
(watcher_heartbeat, dreamer_state.json, corpus_integrity_report.json,
watcher_state.json, watcher_status.json), Python cruft, virtual env,
.env, editor/OS files, and Aaron AI runtime data (conversations.db,
sessions.db, memory.md, settings.json).

Untracked 11 files that shouldn't have been committed in 465f2f7
(this morning): backup files and runtime artifacts.

Re-cascading Shop Class (414KB) and BirdAI-Experiments-Log.md (192KB)
through the patched worker after re-extracting full text from disk.
Cascade in progress at commit time.
2026-05-01 05:18:09 +00:00
aaron 465f2f725b Code review fixes: CV pinning, F1 (excluded_sources), F14 (50KB truncation), F37
- api.py: strip CV pinning workaround (parity violation, see architecture doc)
- dream.py: F1 — retrieve_graphiti() now accepts excluded_sources, over-fetches
  3x and filters in-process. Was silently dropping the parameter; would have
  confounded E3 with broken cross-stage exclusion in Graphiti arm.
- watcher.py + ingest.py: F14 — drop full_text[:50000] truncation. Was
  propagating through entire cascade. Postgres TEXT can hold up to 1GB.
- corpus_integrity.py: F37 — same truncation, third path now clean.

Backups: api.py.bak.*, dream.py.bak.*, watcher.py.bak.*, ingest.py.bak.*,
corpus_integrity.py.bak.* timestamped pre-fix.

Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected
by F14, 414KB).
2026-05-01 02:26:37 +00:00
aaron 25e42c0231 corpus_integrity.py: write unreadables with retry_count=0 so OCR can retry when it ships 2026-04-30 22:03:48 +00:00
aaron 7822fb1cc1 corpus_integrity.py: write unreadable files to ingest_failures for UI visibility 2026-04-30 21:59:06 +00:00
aaron 74e2c34f43 corpus integrity: ingest_failures tracking in watcher, reconciliation script, corpus status/retry/reconcile endpoints 2026-04-30 21:54:39 +00:00
aaron f11cacd9c9 add experiment scripts and results; watcher.py latest changes 2026-04-30 18:06:03 +00:00
aaron 1cf26df450 api.py: return error_type=transcription_failed on Whisper crash, frontend retry logic can now distinguish from network failures 2026-04-30 17:45:47 +00:00
aaron 7cd765146a stage3_worker.py: log sidecar response body on non-200 2026-04-30 17:37:28 +00:00
aaron 58515ebec0 graphiti_service.py: add traceback logging, log file handler for all endpoints 2026-04-30 17:36:19 +00:00
aaron 91166367fa E3: add Graphiti retrieval branch to dream.py, E3 experiment script with blinding 2026-04-30 17:17:28 +00:00
aaron 2b3c2380a0 watcher.py: in-process ingest, embedder loaded once at startup, startup recovery, heartbeat, no duplicate logging 2026-04-30 16:42:44 +00:00
aaron 2fb50cce71 ingest.py: guard Stage 2 enqueue behind SKIP_STAGE2_ENQUEUE env var for migration runs 2026-04-30 16:20:11 +00:00
aaron c08f57a6f2 stage2/3 workers: remove duplicate StreamHandler, stdout captured by systemd 2026-04-30 16:12:51 +00:00
aaron cae7fb8775 dream.py v1.1: score-band exclusion for Early REM, DREAMER_VERSION constant, manifest versioning 2026-04-30 15:51:11 +00:00
aaron b53717af5b dream.py: enrich manifest with retrieval breadth metrics 2026-04-30 06:14:55 +00:00
aaron 2b9a1782c1 feat: stage2/3 pipeline, taxonomy-free cascade, E1.8/E4 experiments, corpus migration state 2026-04-30 04:04:31 +00:00
aaron 62b5b5453a fix: max_coroutines=2, saga support in sidecar; stage3 chunking; TIMEOUT_MAX 0 persistent in falkordb compose 2026-04-30 04:01:02 +00:00
aaron 95d022ec64 fix: FalkorDriver database=aaron, build indices on correct graph 2026-04-29 21:34:20 +00:00
aaron d91a5675ff capture: public SSE endpoint for transcription completion events 2026-04-29 18:00:54 +00:00
aaron c42d898504 emit capture_saved SSE event when async transcription completes 2026-04-29 17:58:01 +00:00
aaron a05fcec882 async voice transcription — return immediately, whisper runs in background 2026-04-29 17:48:22 +00:00
aaron eb7cf3be10 upgrade whisper small -> large-v3, bump cpu_threads to 8 2026-04-29 17:35:03 +00:00
aaron 3f6c435be4 add client_time to chat context — user-supplied, not logged 2026-04-29 17:26:03 +00:00
aaron 21557790d9 capture: return error_type on transcription failure instead of HTTP 500 2026-04-29 17:04:56 +00:00
aaron 794e0aeddd update whisper prompt: add BirdAI stack terms, remove stale ChromaDB 2026-04-29 16:47:30 +00:00
aaron d271e17929 add sourcing constraint to system prompt, close hallucination gap 2026-04-29 16:37:39 +00:00
aaron 5d83fb7601 fix: load_dotenv override=True, option b source exclusion 2026-04-29 16:32:09 +00:00
aaron 83d4f60d0d option b: cross-night source exclusion in dream pipeline 2026-04-29 16:19:52 +00:00
aaron b6fe350ab2 experiments: add consistency test and briefing generator results + scripts 2026-04-28 02:47:41 +00:00
aaron 037d747573 chore: archive deprecated chromadb and migration scripts 2026-04-28 00:15:46 +00:00
aaron d5b5c2ec14 Graphiti sidecar service + SentenceTransformer embedder — self-hosted, no OpenAI dependency 2026-04-27 18:21:22 +00:00
aaron 4ee2567400 Add SentenceTransformer embedder for Graphiti — self-hosted, no OpenAI dependency 2026-04-27 18:18:37 +00:00
aaron a1f732fc9e Dreamer: manifest writer, Late REM v1.2 (remove coherence pull) 2026-04-27 16:54:18 +00:00
aaron 03b3f012c3 Dreamer: prompt versioning, Early REM v1.1, prompt signature in headers 2026-04-27 16:50:21 +00:00
aaron 6776637178 Remove hardcoded PG password fallbacks — require PG_DSN env var in all scripts 2026-04-27 05:16:37 +00:00
aaron a1f5c1049a Fix dreamer status display, watcher excludes Media/, remove NVM debt item 2026-04-27 05:08:01 +00:00
aaron d3239aba17 Image capture — extend /api/capture for image+voice, Claude vision description, Media/ WebDAV, watcher excludes Media/ 2026-04-27 04:28:31 +00:00
aaron ef2fddc47f Redesign dreamer — interdependent pipeline, NREM→Early REM→Late REM→Synthesis 2026-04-26 23:41:24 -04:00
aaron 7af246ac01 APScheduler — replace systemd timers, in-process dream and ingest scheduling 2026-04-27 03:04:33 +00:00
aaron 9b312d936f Add SSE endpoint and dream notify — /api/events and /api/events/notify 2026-04-27 02:20:50 +00:00
aaron 9088b5643d Add /api/dreamer/status and /api/dreamer/run endpoints 2026-04-27 01:27:09 +00:00
aaron a07de922df Add /api/capture and /api/captures endpoints — auth-free, WebDAV delivery to Journal/Captures/ 2026-04-26 22:39:55 +00:00