Compare commits

9 Commits

Author SHA1 Message Date
aaron 7f07972109 stage2_worker: ON CONFLICT clause resets all run-state fields on re-enqueue
Bug: when a row in stage_3_queue gets re-enqueued (same source ingested
again after Stage 2 re-runs), the ON CONFLICT (source) DO UPDATE clause
updated content fields and reset enqueued_at, completed_at, failed_at,
attempts — but did not reset started_at, failure_reason, or
external_job_id.

Stale started_at from a prior attempt makes the row invisible to the
Stage 3 worker's claim filter (which uses started_at IS NULL). The row
sits queued forever; Stage 3 never picks it up; the source effectively
fails silently after a re-trigger.

Discovered tonight while testing the bulk pathway after the substrate
fix: a journal entry that had been ingested earlier (and manually marked
completed during recovery from a worker timeout) showed enqueued_at
from the new touch but started_at from the original 01:40 attempt. Fix
extends the upsert clause to NULL all run-state fields so re-enqueue
behaves as 'fresh attempt.'

After fix, re-triggered journal entry routed cleanly through Stage 2 →
Stage 3 → bulk pathway → sidecar bulk job → 60ms commit (worst-case
dedup against already-known content).
2026-05-02 05:20:14 +00:00
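The upsert fix in commit 7f07972109 can be sketched as follows. This is a minimal illustration, not the worker's actual SQL: it uses SQLite as a stand-in for Postgres, the `content` column and the `enqueue` / `claimable` helpers are hypothetical, and resetting `attempts` to 0 is an assumption. The run-state column names and the `started_at IS NULL` claim filter come from the commit message.

```python
import sqlite3

# Column names follow the commit message; types, the `content` column, and
# the use of SQLite instead of Postgres are assumptions for illustration.
SCHEMA = """
CREATE TABLE stage_3_queue (
    source          TEXT PRIMARY KEY,
    content         TEXT,
    enqueued_at     TEXT,
    started_at      TEXT,
    completed_at    TEXT,
    failed_at       TEXT,
    failure_reason  TEXT,
    external_job_id TEXT,
    attempts        INTEGER NOT NULL DEFAULT 0
)
"""

# Re-enqueue NULLs every run-state field so the row looks like a fresh attempt.
UPSERT = """
INSERT INTO stage_3_queue (source, content, enqueued_at)
VALUES (?, ?, ?)
ON CONFLICT (source) DO UPDATE SET
    content         = excluded.content,
    enqueued_at     = excluded.enqueued_at,
    started_at      = NULL,
    completed_at    = NULL,
    failed_at       = NULL,
    failure_reason  = NULL,
    external_job_id = NULL,
    attempts        = 0
"""


def enqueue(conn, source, content, now):
    """Insert a row, or reset an existing row for the same source."""
    conn.execute(UPSERT, (source, content, now))


def claimable(conn):
    """Sources visible to a claim filter that requires started_at IS NULL."""
    rows = conn.execute(
        "SELECT source FROM stage_3_queue WHERE started_at IS NULL ORDER BY enqueued_at"
    )
    return [row[0] for row in rows]
```

With the three extra resets in place, a row that carried a stale `started_at` becomes claimable again as soon as its source is re-enqueued.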
aaron f645b74b1c graphiti_service: v2.0 — Pattern 1 async job model + search_interface bridge
Major rewrite of the Graphiti sidecar. Two architectural changes:

PATTERN 1 ASYNC JOB MODEL

Submission and completion are decoupled. POST /episodes and
POST /episodes/bulk return job_id immediately; the actual graphiti-core
work happens in a background asyncio task. Submitters poll
GET /jobs/{job_id} until terminal status (committed | failed).

Why: tonight's smoke test confirmed that bulk ingest against the
4,222-entity graph was committing successfully even when the worker's
HTTP read-timeout fired. The synchronous interface was producing
false-negative failures — work succeeded but the worker stopped
listening at the 10-minute read-timeout. Three days of 'saga deadlock'
failures reframe as scaling pathology of unindexed similarity search,
not substrate deadlocks. Pattern 1 separates submission from completion
observation so the worker can't false-negative this way.

Architectural commitments:

- One in-flight job per sidecar (per graph). Concurrent jobs against
  the same graph would race on graphiti-core's bulk-resolve path (no
  transaction boundary). Concurrent multi-tenancy is 'run multiple
  sidecars,' not 'make one sidecar concurrency-safe across graphs.'

- Postgres-backed job state. Survives sidecar restart. On startup the
  sidecar resets any 'running' rows to 'queued' (their previous run
  died); the background worker picks them up naturally.

- Both endpoints async-shaped for parity. Bulk pathway preserved —
  load-bearing for first-run corpus migration. Single-episode
  preserved — load-bearing for state-superseding content per the
  Stage 2/3 routing rule. graphiti-core's add_episode and
  add_episode_bulk are unchanged underneath; the async wrapper sits
  between the HTTP layer and the library call.

- Polling cadence: 2s flat at the worker, FOR UPDATE SKIP LOCKED so
  the design is safe for future multi-sidecar deployment without
  changes.

Postgres helpers (_pg, _job_insert, _job_get, _job_claim_next,
_job_complete, _job_fail, _startup_recovery) replace the synchronous
graphiti.add_episode call with persistent job state. Background worker
loop catches everything, logs everything, never dies from an unexpected
error.

SEARCH_INTERFACE BRIDGE

graphiti-core 0.29.0 builds FalkorSearchOperations as
driver._search_ops in FalkorDriver.__init__ but never assigns it to
driver.search_interface. search_utils.py:edge_similarity_search and
node_similarity_search check 'if driver.search_interface:' and
delegate when present, falling through to interpreted-Cypher cosine
math when not. The naming mismatch between the two halves of
graphiti-core means the per-driver implementation never gets used.

Bridge after Graphiti instance construction:
  driver.search_interface = driver._search_ops

This activates the per-driver path which (with our vendored patches)
uses db.idx.vector.queryNodes for FalkorDB's native vector index.
Empirical result: single-episode add_episode against a 4,277-entity
graph went from indefinite hang to 8.2 seconds.

The bridge is also a candidate for an upstream PR — pick one name and
stick to it across the codebase. Tonight it's local.
2026-05-02 05:19:46 +00:00
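The Pattern 1 job model described in commit f645b74b1c can be sketched in a few lines. This is an illustration under stated assumptions, not the sidecar's code: `JobStore` is an in-memory stand-in for the Postgres-backed job table, `ingest` is a hypothetical coroutine standing in for the graphiti-core call, and the idle sleep is shortened from the 2s cadence so the demo finishes quickly. The status names and the startup-recovery rule come from the commit message.

```python
import asyncio
import uuid

TERMINAL = {"committed", "failed"}


class JobStore:
    """In-memory stand-in for the persistent job table (assumption)."""

    def __init__(self):
        self.jobs = {}

    def insert(self, payload):
        """Submission: record the job and hand back its id immediately."""
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"status": "queued", "payload": payload, "error": None}
        return job_id

    def claim_next(self):
        """Claim the oldest queued job, or return None if there is none."""
        for job_id, job in self.jobs.items():
            if job["status"] == "queued":
                job["status"] = "running"
                return job_id
        return None

    def recover(self):
        """Startup recovery: a 'running' row means the previous run died."""
        for job in self.jobs.values():
            if job["status"] == "running":
                job["status"] = "queued"


async def worker_loop(store, ingest, stop, idle_sleep=0.01):
    """One job in flight at a time; an ingest error fails the job, not the loop."""
    while not stop.is_set():
        job_id = store.claim_next()
        if job_id is None:
            await asyncio.sleep(idle_sleep)
            continue
        job = store.jobs[job_id]
        try:
            await ingest(job["payload"])
            job["status"] = "committed"
        except Exception as exc:  # deliberately broad: the loop must survive
            job["status"], job["error"] = "failed", repr(exc)


async def demo():
    """Submit two jobs, then observe completion separately from submission."""
    store = JobStore()

    async def ingest(payload):
        if payload == "bad":
            raise ValueError("simulated ingest failure")
        await asyncio.sleep(0)

    ok, bad = store.insert("good"), store.insert("bad")
    stop = asyncio.Event()
    task = asyncio.create_task(worker_loop(store, ingest, stop))
    while not all(job["status"] in TERMINAL for job in store.jobs.values()):
        await asyncio.sleep(0.01)
    stop.set()
    await task
    return store.jobs[ok]["status"], store.jobs[bad]["status"]
```

Because the submitter only ever sees a job id and a status, a slow ingest can no longer surface as a read-timeout on the submitting side.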
aaron c0e6159b5e graphiti_patches: vendored FalkorDB vector index support for graphiti-core 0.29.0
Adds native FalkorDB vector index support to graphiti-core's FalkorDB
driver. Three patched files (graph_queries.py, falkordb_driver.py,
falkordb/operations/search_ops.py) plus apply.sh that backs up venv
files and copies patches over.

Why this exists: graphiti-core 0.29.0 builds similarity queries using
interpreted Cypher cosine math (vec.cosineDistance) which produces a
full-table scan over Entity/Community nodes and RELATES_TO edges for every search.
At ~4,000+ entities, single-episode add_episode took 8+ minutes for the
resolve-against-existing-graph step and bulk ingest hung indefinitely.
FalkorDB itself supports db.idx.vector.queryNodes and queryRelationships
procedures backed by HNSW indexes; the driver just doesn't use them.

Patches:

1. graph_queries.py — adds get_vector_indices() returning CREATE VECTOR
   INDEX statements for FalkorDB (Entity.name_embedding,
   RELATES_TO.fact_embedding, Community.name_embedding). HNSW with
   cosine similarity. Adds VECTOR_INDEX_CANDIDATE_MULTIPLIER for
   over-fetch when WHERE filters reject some top-k results. Original
   get_vector_cosine_func_query preserved for fallback.

2. falkordb_driver.py — extends build_indices_and_constraints() to call
   get_vector_indices() alongside range and fulltext. Adds cache
   invalidation hook so the search_ops dispatcher re-probes for indexes
   after they're built.

3. falkordb/operations/search_ops.py — adds vector-index dispatcher
   helpers (_falkordb_vector_index_exists with module-level cache,
   _falkordb_vector_node_search_cypher, _falkordb_vector_edge_search_cypher).
   Rewrites the three vector-similarity call sites (Entity.name_embedding,
   RELATES_TO.fact_embedding, Community.name_embedding) to use
   db.idx.vector.queryNodes / queryRelationships when available, fall
   back to interpreted-Cypher cosine math when not. Index existence
   probed once per (label, attribute, entity_type) and cached.

Empirical result: single-episode add_episode against a 4,277-entity
graph went from indefinite hang to 8.2 seconds. Bulk re-ingest of
already-known content (worst case for entity dedup) committed in 60ms.

Activation requires bridging driver._search_ops to driver.search_interface
in the sidecar (see graphiti_service.py). graphiti-core declares
search_interface as the dispatcher attribute but never assigns the
per-driver implementation to it — naming mismatch in their internal
refactor. The bridge is one line in our sidecar's lifespan.

Upstream candidate: this is a known gap (referenced indirectly in
upstream issue #1263 RFC for external vector store overlay). Maintainers'
attention is on Milvus/Qdrant/Pinecone overlay; this is the FalkorDB-
native alternative for users who don't want to run a separate vector DB.
PR after empirical validation in production. Apache-2.0 graphiti-core
source is NOT vendored — backups/ is gitignored to keep the upstream
source out of this repo.
2026-05-02 05:19:01 +00:00
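Two ideas from commit c0e6159b5e, the cached index probe and the over-fetch multiplier, can be sketched without touching graphiti-core. Everything here is hypothetical: the function names, the `probe` callable, and the multiplier value are illustrative and are not the patched module's actual helpers. Only the cache key shape (label, attribute, entity_type) and the reason for over-fetching come from the commit message.

```python
# Module-level cache keyed by (label, attribute, entity_type).
_index_cache = {}

# Hypothetical value; the commit does not state the real multiplier.
CANDIDATE_MULTIPLIER = 4


def vector_index_exists(probe, label, attribute, entity_type):
    """Ask the database once per key, then answer from the cache."""
    key = (label, attribute, entity_type)
    if key not in _index_cache:
        _index_cache[key] = bool(probe(label, attribute, entity_type))
    return _index_cache[key]


def invalidate_index_cache():
    """Call after building indexes so the next search re-probes."""
    _index_cache.clear()


def top_k_with_filter(ranked, keep, k, multiplier=CANDIDATE_MULTIPLIER):
    """Over-fetch k * multiplier candidates, apply the filter, trim to k.

    `ranked` is assumed to be ordered best-first, as an index query would
    return it. Without over-fetching, a filter that rejects some of the
    top-k leaves fewer than k results.
    """
    candidates = ranked[: k * multiplier]
    return [item for item in candidates if keep(item)][:k]
```

For example, filtering `list(range(20))` for even numbers with `k=3` returns two results at `multiplier=1` and the full three at `multiplier=2`.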
aaron d7b2a850c4 stage3_worker: v2.4 — encoder extraction instructions v1.0
Adds EXTRACTION_INSTRUCTIONS_V1 constant passed to the sidecar via
custom_extraction_instructions on both bulk and single-episode pathways.
graphiti-core inserts the text into entity and edge extraction prompts
only; it does NOT enter dedup prompts (that's the encoder-stays-naive
commitment).

Architectural posture: the encoder is content-naive. It does not draw on
prior knowledge of the user, the substrate, or the cycle's accumulated
work. Schema and personality live in the cycle's consolidated substrate
where the dream phase shapes them. The encoder produces source-grounded
ground truth for the cycle to work from.

Empirical validation in tonight's smoke test: 30+ verb-shaped predicates
from 3 chunks of real content, including IS_AUTOBIOGRAPHICAL_TO,
INFORMED_DESIGN_OF, EVALUATED_DOMAIN_PURITY, DISCONFIRMED_HYPOTHESIS_ABOUT.
Compare to default extraction's 4 predicate types across 22,289 edges.
RELATES_TO appears once as appropriate fallback rather than collapsing
everything generic.

Bumps WORKER_VERSION to 2.4.
2026-05-02 05:15:17 +00:00
aaron a0bf280075 Add Pattern 1 async job model migration
Adds graphiti_jobs table for sidecar's async ingest queue and
external_job_id column on stage_3_queue for worker's polling reference.

Tonight's smoke test diagnosed that bulk ingest against the 4,222-entity
graph commits successfully but the worker's 600s HTTP read-timeout fires
before the sidecar's response returns. Three days of 'saga deadlock'
failures were false negatives — the work succeeded; the worker just
stopped listening. Pattern 1 separates submission from completion
observation so the worker can't false-negative this way.

Migration only — sidecar and worker code changes follow in subsequent
commits.
2026-05-02 02:22:30 +00:00
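On the worker side, the separation of submission from completion observation in commit a0bf280075 amounts to a polling loop keyed on the stored `external_job_id`. The sketch below is an assumption about shape only: `fetch_status` is a hypothetical callable wrapping `GET /jobs/{job_id}`, and the optional `deadline` is illustrative. The terminal statuses and the 2s cadence are taken from the later sidecar commit.

```python
import time

TERMINAL_STATUSES = {"committed", "failed"}


def wait_for_job(fetch_status, job_id, interval=2.0, deadline=None,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status and return that status.

    Each poll is a short request, so a long-running ingest cannot trip an
    HTTP read-timeout. Hitting `deadline` (seconds) raises TimeoutError but
    says nothing about the job itself, which may still commit later.
    """
    start = clock()
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        if deadline is not None and clock() - start >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {deadline}s")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.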
aaron 30beeb3a26 migrations: retroactively track stage_3_queue routing columns
Adds migrations/ directory with README documenting the convention
(timestamped filenames, idempotent SQL, forward-only, single change per file).

First migration is the Stage 3 queue routing columns added live during
Phase A patches today: state_type, state_type_confidence,
supersedes_prior_state, state_type_rationale, plus index on supersedes.
Required by stage2_worker.py >= 2.2 and stage3_worker.py >= 2.3.

Idempotent (IF NOT EXISTS), safe to re-apply. Verified by re-applying
against the live DB — no changes, no errors.

Closes a reproducibility gap: a fresh DB provisioned from git would crash
on first Stage 2 enqueue without these columns. Now the SQL travels with
the code.
2026-05-01 19:11:09 +00:00
aaron e7de7fb64b stage3_worker: v2.3 — bulk-vs-single-episode routing on Stage 2 state-type
Reads new routing columns from stage_3_queue (state_type, state_type_confidence,
supersedes_prior_state, state_type_rationale) and dispatches each row to one of
two ingest pathways:

  - BULK pathway (existing, renamed from ingest_to_graphiti to ingest_bulk):
    safer-cheaper default. Used when supersedes=false OR confidence=low OR
    routing fields are NULL (legacy rows). Skips edge invalidation per
    graphiti-core's bulk semantics.

  - SINGLE-EPISODE pathway (new, ingest_single_episode): used only when
    supersedes_prior_state=true AND confidence in {medium, high}. Per-chunk
    POST to /episodes (singular endpoint) with shared saga tag. Each call
    independent — own timeout, own retry envelope.

Routing decision isolated in should_route_single_episode() with unit-tested
truth table covering all eight (supersedes × confidence) combinations.

Per-chunk heartbeat (heartbeat_row): single-episode pathway updates
stage_3_queue.started_at after each successful chunk POST so a long-running
document doesn't cross the 10-minute stale threshold mid-process and get
re-dequeued. started_at semantics now: 'last activity timestamp' rather
than 'began at'. Best-effort; failures logged not raised.

Partial-success on chunk failure: previously-committed chunks stay in the
graph; the function raises with detail (single_episode_partial: chunk N/M
failed, succeeded K). The row is marked failed_at with that detail. Re-
ingestion would re-POST chunks 1..N-1 against the graph; graphiti's dedup
handles them as no-ops.

DB connection scoping: process_one no longer holds one Postgres connection
across the whole ingest call (which can run an hour for long single-episode
documents). Each DB write gets a short-lived connection.

Phase A item 3 of three. Closes the mechanical-patches block. Item 4
(custom_extraction_instructions text design) is the remaining intellectual
work; sidecar and worker plumbing is now ready for it.
2026-05-01 19:07:41 +00:00
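The routing rule in commit e7de7fb64b reduces to a small predicate. The function name is the one the commit mentions; the signature and the choice to model NULL confidence as `None` (giving a fourth confidence value, and therefore eight combinations) are assumptions about how the truth table is counted.

```python
def should_route_single_episode(supersedes_prior_state, state_type_confidence):
    """Return True only for the single-episode pathway.

    Single-episode requires supersedes_prior_state to be exactly True and
    confidence to be medium or high. Everything else, including NULL
    (None) routing fields on legacy rows, falls through to bulk.
    """
    return (
        supersedes_prior_state is True
        and state_type_confidence in ("medium", "high")
    )


# Truth table over (supersedes, confidence); None models a NULL column.
TRUTH_TABLE = {
    (True, "high"): True,
    (True, "medium"): True,
    (True, "low"): False,
    (True, None): False,
    (False, "high"): False,
    (False, "medium"): False,
    (False, "low"): False,
    (False, None): False,
}
```

Only two of the eight rows select the expensive pathway, which is what makes bulk the default.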
aaron 70e87e3ab5 stage2_worker: v2.2 — add state-type classification for Stage 3 routing
Mistral pass now produces two concerns in a single flat JSON output:
  (a) orientation context (existing four fields, unchanged semantics)
  (b) state-type classification: state_type (current/reference/historical),
      state_type_confidence (low/medium/high), supersedes_prior_state (bool),
      state_type_rationale (text)

Routing fields written as explicit columns on stage_3_queue (separate
ALTER TABLE migration adds them: state_type, state_type_confidence,
supersedes_prior_state, state_type_rationale + index on supersedes).

Safe-cheap defaults on malformed Mistral output: state_type='reference',
confidence='low', supersedes=false. All defaults route to bulk pathway
(no temporal invalidation cost) so Mistral parse drift can't accidentally
trigger expensive single-episode ingest.

Phase A item 2 of three. Sidecar (item 1, commit 8b0a163) already plumbs
custom_extraction_instructions through to /episodes/bulk. Stage 3 routing
logic (item 3) follows.
2026-05-01 19:02:11 +00:00
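The safe-cheap defaults in commit 70e87e3ab5 can be sketched as a parser that falls back whenever the model output is unusable. The function name, the empty-string default for the rationale, and the choice to fall back on the whole record rather than field by field are assumptions. The field names, allowed values, and default values come from the commit message.

```python
import json

STATE_TYPES = ("current", "reference", "historical")
CONFIDENCES = ("low", "medium", "high")

# Defaults from the commit message; the rationale default is an assumption.
SAFE_DEFAULTS = {
    "state_type": "reference",
    "state_type_confidence": "low",
    "supersedes_prior_state": False,
    "state_type_rationale": "",
}


def parse_state_type(raw):
    """Parse the classifier's JSON, returning safe defaults on any drift."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return dict(SAFE_DEFAULTS)
    if not isinstance(data, dict):
        return dict(SAFE_DEFAULTS)

    state_type = data.get("state_type")
    confidence = data.get("state_type_confidence")
    supersedes = data.get("supersedes_prior_state")
    if (
        state_type not in STATE_TYPES
        or confidence not in CONFIDENCES
        or not isinstance(supersedes, bool)
    ):
        return dict(SAFE_DEFAULTS)

    return {
        "state_type": state_type,
        "state_type_confidence": confidence,
        "supersedes_prior_state": supersedes,
        "state_type_rationale": str(data.get("state_type_rationale") or ""),
    }
```

Since the defaults carry `supersedes_prior_state=False` and low confidence, any malformed output lands on the bulk pathway.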
aaron 8b0a163670 graphiti_service: expose custom_extraction_instructions on /episodes/bulk; add saga on /episodes
- BulkEpisodeRequest: new optional custom_extraction_instructions field
  with comment noting graphiti-core inserts it into extract_nodes/extract_edges
  prompts only, NOT dedupe prompts (verified by reading prompts directory)
- EpisodeRequest: new optional saga field, plumbed through to add_episode
  for upcoming Stage 3 single-episode pathway
- Both handlers use conditional kwargs construction so existing callers
  see no behavioral change

Phase A item 1 of three. Items 2 (stage2_worker) and 3 (stage3_worker) follow.
2026-05-01 18:57:31 +00:00
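The conditional kwargs construction mentioned in commit 8b0a163670 might look like the sketch below. The helper names and the `episodes` / `content` keys are hypothetical; only `custom_extraction_instructions` and `saga` are field names taken from the commit message, and passing them through under the same keyword names is an assumption.

```python
def build_bulk_kwargs(episodes, custom_extraction_instructions=None):
    """Add the optional field only when the caller supplied it."""
    kwargs = {"episodes": episodes}
    if custom_extraction_instructions is not None:
        kwargs["custom_extraction_instructions"] = custom_extraction_instructions
    return kwargs


def build_episode_kwargs(content, saga=None):
    """Same pattern for the single-episode handler's optional saga tag."""
    kwargs = {"content": content}
    if saga is not None:
        kwargs["saga"] = saga
    return kwargs
```

Omitting the key entirely, rather than passing `None`, is what leaves the library's own default in force for existing callers.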
61 changed files with 1564 additions and 7157 deletions
-1
@@ -8,7 +8,6 @@ dreamer_state.json
corpus_integrity_report.json
watcher_state.json
watcher_status.json
reindex_status.json
# Logs (these belong in /var/log/)
*.log
@@ -1,846 +0,0 @@
# BirdAI Component Inventory — 2026-05-02
*Track 1 stabilization, deliverable 1. Read-only investigation.*
**Repo state:** HEAD `7615ded` (NREM exclusion fix) on baseline `1a8e035`. Last night's experimental work was reverted.
**Method:** Each component classified Working / Working-degraded / Broken / In-flight / Experimental / Stopped / Deprecated, with last-touched date from `git log -1`, dependencies, dependents, and a behavior-vs-intent column comparing observed code against `aaronai-architecture.md` and `aaronai-architecture-reframe-2026-05-01.md`.
**A note on terminology.** "Behavior matches intent" is read against two intent surfaces: (1) the architecture doc as written, which still frames graphiti as the target memory layer, and (2) the reframe doc, which supersedes parts of the architecture doc and which the bespoke decision now extends. Where the two diverge, the reframe is treated as canonical for purposes of this inventory; the architecture-doc-only divergences are flagged separately.
---
## Findings summary
This inventory's most useful work is identifying mechanisms that are running silently, without errors, while doing something the architecture didn't ask for. The 2026-05-02 NREM exclusion bug had that exact shape: NREM was excluding prior traces, the dreamer logged "completed," files appeared on schedule, and the architecture's stated commitment (NREM is replay-and-consolidation) was being violated invisibly. Track 1's job is to find the rest of those before they accumulate.
### Top-priority NREM-shaped divergences (working, but doing something the architecture didn't request)
These are the items most worth reading the linked phase entries for. They are ranked by potential impact on Track 1 or on subsequent E6-class work.
1. **`dream.py` cumulative cross-night exclusion (500-cap).** Phase 1, `dream.py`. Early REM and Late REM exclude up to 500 prior sources accumulated across nights. On a 1,200-source corpus this hides ~40% of the corpus from those modes once the cap fills, and the list is trimmed to 400 only on overflow, which is a churn pattern rather than an architectural choice. The architecture and reframe specify session-scoped novelty; cumulative-across-nights exclusion is nowhere documented. Same shape as the NREM bug: a deduplication mechanism that the architecture didn't request, running silently, and that nobody noticed. **This is the highest-priority finding from the inventory.**
2. **`api.py /api/corpus/retry` reintroduces 50KB truncation.** Phase 1, `api.py`. The F14 fix removed truncation from `watcher.py`, `ingest.py`, and `corpus_integrity.py` on 2026-05-01. The retry endpoint at line 1074 still writes `text[:50000]`. Clicking "Retry" on an ingest-failed file in the SettingsPanel re-introduces exactly the bug F14 fixed. Working without errors; doing the wrong thing.
3. **`aaronai-stage3.service` is `enabled` while `inactive`.** Phase 2. The session brief says Stage 3 is stopped manually. The unit is `enabled`, so on next reboot the worker auto-starts and resumes processing the `stage_3_queue` rows that Stage 2 has been adding. The "stopped" state is paper-thin. `systemctl disable` would harden it; nobody has done that yet.
4. **Stage 2 keeps enqueuing to `stage_3_queue` while Stage 3 is off.** Phase 3. As of inventory time, 6 pending rows sit in `stage_3_queue`, last enqueued 2026-05-02 22:33 UTC. The queue grows until Stage 3 is restarted (and then catches up) or stopped at the producer. Nothing is broken — but the system is doing work whose output sits unconsumed.
5. **`embeddings.type` NULL for 71% of rows; `embeddings.created_at` text-typed and NULL for 87%.** Phase 3. The architecture treats these fields as load-bearing for "type-aware retrieval" and "temporal awareness." In production, most chunks lack both. Retrieval still works because nothing routes on either field. The doc's commitment and the data shape disagree, invisibly to anyone not querying the schema.
6. **`graphiti_jobs` documented as "empty" but holds 9 rows from the 2026-05-02 experimental run.** Phase 3. Current-state doc explicitly says "exists, empty (or near-empty)." Reality: 6 failed, 3 committed, all from the rolled-back code. Inert (no current code reads or writes), but the rollback narrative is incomplete on this point.
7. **`aaronai-maintenance.service` references ChromaDB.** Phase 2. The unit invokes `chops hnsw rebuild --path ~/aaronai/db --collection aaronai`. ChromaDB was retired 2026-04-26. `chops` is not in the venv. The `~/aaronai/db/` directory still exists with a ChromaDB sqlite. Saved from doing damage only because its timer is not enabled. A clean-room reading of `/etc/systemd/system/` would suggest BirdAI is still on ChromaDB.
8. **`aaronai-dreamer.service` hardcodes `--mode nrem`.** Phase 2. Production scheduling fires `dream.py` with no flag (default = full pipeline). The systemd entry-point is the historical "manual NREM" wrapper. Any future maintainer running `systemctl start aaronai-dreamer.service` from the shell expects "the dreamer" and gets only NREM.
9. **`dream_mode` setting in api.py defaults is silently ignored by the scheduler.** Phase 4. Setting in `DEFAULT_SETTINGS`, mergeable into `settings.json`, used by `update_settings` to decide whether to reschedule. Not actually read by `run_dream_job`. A configurable scheduling parameter that has no effect.
10. **Watcher-restart cron line uses sudo not in the sudoers file the session brief documents.** Phase 5. The 2026-05-01 sudoers fix listed `restart ollama` and `restart aaronai-graphiti.service`. The watcher-restart cron line uses `sudo systemctl restart aaronai-watcher`. Either there's an additional sudoers entry the brief doesn't mention, or this watchdog has been silently failing every fire. Worth checking `/var/log/aaronai/watcher-cron.log` (out of scope for this read-only inventory).
11. **`prompt_hash()` in `dream.py` hashes function `__doc__` strings, but none of the synth functions have docstrings.** Phase 1, `dream.py` notes (folded into the "F8" reference). The hash is deterministic across all dreams (always the MD5 of `""`). This is the architecture-doc tech-debt item F8 ("`prompt_hash` broken") confirmed in code: the manifest field meant to "catch undeclared drift" carries a constant value. Same shape as NREM: a mechanism present, running, doing something the architecture-stated purpose explicitly denies.
12. **Two parallel scheduling stacks.** Phase 5. APScheduler in `api.py` and three dormant `aaronai-*.timer` files. The dormant ones aren't firing, so no actual harm. The presence makes "what triggers the dream" harder to answer than it should be.
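Divergence #11 above is easy to reproduce in isolation. The sketch below is a hypothetical reconstruction, not the code in `dream.py`: the `prompt_hash` signature and the two `synth_*` functions are invented for illustration. What it shows is general Python behavior: a function without a docstring has `__doc__` set to `None`, so hashing docstrings yields the MD5 of the empty string no matter what the prompt text inside the function says.

```python
import hashlib

# MD5 of the empty byte string, a well-known constant.
MD5_OF_EMPTY = "d41d8cd98f00b204e9800998ecf8427e"


def prompt_hash(functions):
    """Hypothetical reconstruction: hash the concatenated docstrings."""
    joined = "".join(fn.__doc__ or "" for fn in functions)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Illustrative synth functions with prompts in the body and no docstrings.
def synth_nrem(chunks):
    prompt = "Replay and consolidate: " + " ".join(chunks)
    return prompt


def synth_late_rem(chunks):
    prompt = "Find distant associations: " + " ".join(chunks)
    return prompt
```

Editing either prompt string changes the function's behavior but not the hash, so a manifest field built this way cannot detect prompt drift.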
### Cross-cutting findings (not necessarily NREM-shaped)
- **The `scripts/` directory mixes 11 production files with 32 experimental scripts and ~20 `.bak` files.** Reading the directory, it is hard to tell at a glance what is live. Track 1 cleanup candidate: move experimental files to `experiments/` (which already exists with a few) or `deprecated/`, and delete `.bak*` (git history is the durable record). This is mostly cosmetic but makes future inventories easier.
- **Two implementations of Stage 1 (F11) confirmed.** `watcher.py:ingest_file`, `ingest.py:ingest_file`, `corpus_integrity.py:extract_text_for_retry`, and the `api.py` retry path all reimplement extract-chunk-embed-write. The architecture doc records this as known tech debt; the inventory verifies all four call sites still drift.
- **The bespoke decision dissolves several components without removing them.** `consolidator_v0_1.py`, `tier1_migration.py`, `graphiti_service.py`, `stage3_worker.py`, both Stage 3 unused-column sets in `stage_3_queue`, `graphiti_jobs` table, the experiment scripts. None is actively harmful in current state; collectively they make the bespoke direction harder to read out of the codebase. Track 1 stripping is the right venue for these.
- **Memory-and-state fan-out.** The system has at least 7 distinct files outside the database that hold state: `dreamer_state.json`, `watcher_state.json`, `watcher_status.json`, `watcher_heartbeat`, `corpus_integrity_report.json`, `tier1_migration_state.json`, `settings.json`, plus two sqlite DBs (`conversations.db`, `sessions.db`) and a markdown file (`memory.md`). Bespoke design will likely consolidate.
### What looks fine
The watcher (`watcher.py` + `aaronai-watcher.service`) is a clean Stage 1 that matches the architecture doc and the parity principle exactly. The capture endpoint works as documented. The `ingest_failures` table reflects exactly the 129 unreadable files the architecture doc cites. The frontend route surface is minimal and entirely backed. The 2026-05-01 worker patches (saga-size limit, wedge detection, sudoers, no `WatchdogSec` without `sd_notify`) are visible and correct in code. The NREM exclusion fix is in place and the manual run on 2026-05-02 21:34 UTC produced a real dream.
### Where I am uncertain
- I did not read the watcher-cron.log, sudo configuration, or systemd journal directly. The "sudo for `aaronai-watcher` restart" question (Phase 5 / divergence #10) is based on the session brief's stated sudoers contents only.
- I did not exhaustively read each of the 32 experimental scripts. I read enough of each (header docstring) to classify; deep behavioral inspection of these is unnecessary for Track 1 but means I cannot rule out additional NREM-shape divergences inside them.
- I did not deep-read frontend components (`~/aaronai-web/components/`). Per Phase 6 scope.
- The session brief says Stage 3 is "stopped manually." I confirmed `systemctl is-active aaronai-stage3.service = inactive`. I did not confirm via `journalctl` when it was stopped — but the inventory doesn't need that, only the current state.
---
## Updates — 2026-05-03 session
*Layered updates from Track 1 improvement work on 2026-05-03. The 2026-05-02 inventory above is preserved as a point-in-time snapshot; corrections and resolutions are recorded here with provenance.*
### Resolved
- **NREM-shape divergence #1 (cumulative cross-night exclusion 500-cap, `dream.py`) — RESOLVED.** Replaced cumulative `retrieved_sources` with session-scoped novelty. Early REM now excludes only NREM high-scorers from the current session; Late REM excludes the current session's NREM and Early REM sources. Legacy `retrieved_sources` key cleared from `dreamer_state.json`. Verification: post-fix dream-manifest source count rose to 24 (vs. 13 / 16 on the two prior comparable runs); the previously hidden ~40% of the corpus is now reachable by Early/Late REM, as the architecture and reframe specify. NREM exclusion fix from 2026-05-02 preserved.
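The difference between the legacy cap and session-scoped novelty can be sketched as below. All function names are hypothetical, keeping the most recent 400 entries on trim is an assumption about which entries survive, and reading the Late REM rule as "NREM plus Early REM sources" is an interpretation of the entry above. The 500 / 400 / 1,200 figures come from the inventory.

```python
def legacy_update(retrieved_sources, new_sources, cap=500, trim_to=400):
    """Cumulative exclusion list: grows across nights, trimmed on overflow."""
    merged = list(dict.fromkeys(retrieved_sources + new_sources))  # dedupe, keep order
    if len(merged) > cap:
        merged = merged[-trim_to:]  # assumption: keep the most recent entries
    return merged


def hidden_fraction(excluded_count, corpus_size):
    """Share of the corpus unreachable while the exclusion list is this long."""
    return excluded_count / corpus_size


def early_rem_exclusions(nrem_high_scorers):
    """Session-scoped: only tonight's NREM high-scorers are excluded."""
    return set(nrem_high_scorers)


def late_rem_exclusions(nrem_sources, early_rem_sources):
    """Session-scoped: tonight's NREM and Early REM sources are excluded."""
    return set(nrem_sources) | set(early_rem_sources)
```

At the cap, `hidden_fraction(500, 1200)` is about 0.42, which matches the ~40% figure; the session-scoped sets start empty every night, so nothing stays hidden across sessions.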
### Corrections to existing findings
- **`stage2_metadata` location (Phase 1, `stage2_worker.py`):** the metadata column lives on `stage_3_queue.stage2_metadata` (jsonb), **not on `stage_2_queue`**. `stage_2_queue` has only basic queue fields (`id, source, full_text, char_length, timestamps, failure_reason, attempts`). The 2026-05-02 entry implied otherwise. Corrected via direct schema inspection on 2026-05-03.
- **Stage 2 char_length gate (Phase 1, `stage2_worker.py`):** the `char_length < 2000` check at line 139 runs *before* the Mistral call at line 149. For sub-2000-char docs, Mistral is **never invoked** — the worker logs `Processing → Skipping Stage 3 → completed_at = NOW()` with no Mistral pass between them. The earlier framing of "documents under 2000 chars skip Stage 3" was correct as written, but the implied "Stage 2 produces orientation metadata for everything" architecture commitment is not what the code does. 339 of 1,041 completed Stage 2 docs (33%) have **no frame data extracted at all**, not "frame data extracted then discarded."
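The ordering described in the char_length correction above can be sketched as a gate that runs before the model call. This is a hypothetical shape, not `stage2_worker.py`: `process_doc`, the `call_mistral` callable, and the returned dict are illustrative. The 2000-character threshold and the counts (339 of 1,041) come from the entry itself.

```python
MIN_CHARS_FOR_STAGE3 = 2000


def process_doc(doc, call_mistral, min_chars=MIN_CHARS_FOR_STAGE3):
    """Gate first, model second: short docs never reach the model."""
    if doc["char_length"] < min_chars:
        # Completed without any frame extraction, not "extracted then discarded".
        return {"enqueue_stage3": False, "frame": None}
    return {"enqueue_stage3": True, "frame": call_mistral(doc["full_text"])}


def share_without_frames(skipped, completed):
    """Fraction of completed Stage 2 docs that have no frame data at all."""
    return skipped / completed
```

With the inventory's numbers, `share_without_frames(339, 1041)` comes to roughly 0.33, the 33% quoted above.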
### New findings from 2026-05-03 frame analysis (Improvement #3)
- **`ingest_conversations.py` bypasses Stage 2 entirely.** 198 distinct conversation sources (`Claude:`, `ChatGPT:`, `Aaron AI:`, plus `type='aaronai_conversation'`) write directly to pgvector `embeddings` and never enter `stage_2_queue`. Conversations have **zero frame coverage by design**, not by accident. Combined with the 339-doc char-gate exclusion and 12 Stage 2 failures, **only 56% of the embeddings corpus has any frame data**. Same NREM shape — a routing decision the architecture didn't explicitly request, doing something silently that the architecture's "Stage 2 produces orientation for everything" commitment denies.
- **Voice notes (14) and dream outputs (39) are systematically excluded from the frame system.** Within the 339-doc <2000-char gap: all 14 voice notes and all 39 dreamer-output files (NREM, Early REM, Late REM, synthesis markdown) are present. Voice is one of Aaron's primary capture channels. Dream outputs are the dreamer's own reflection. Both are silent to the frame system that orients downstream extraction — meaning the dreamer cannot frame-condition on its own output. Same NREM shape as the others.
- **File-type × frame stratification signal exists and is currently unused** (cross-link to Phase 3 `embeddings.type` finding). The 2026-05-03 frame analysis (`docs/stage2-frame-analysis-2026-05-03.md` §5) shows that within frame-extracted docs, "Programming" pivots to pptx (n=15), "Application" pivots to pdf (n=13), Education spreads across pdf+docx — file type adds discriminating signal to frame routing. Currently `embeddings.type` is NULL for 71% of rows; backfilling it (Improvement #2, not yet applied) would make this stratification queryable at retrieval time instead of reverse-engineerable from filenames.
### Artifacts produced 2026-05-03
- **Code change:** `scripts/dream.py` (Improvement #1).
- **New SQL view:** `stage2_frames_v` (over `stage_3_queue.stage2_metadata`; `CREATE OR REPLACE`, idempotent, drop with `DROP VIEW stage2_frames_v;`).
- **New analysis script:** `scripts/experiments/frame_distribution_report.py` (read-only).
- **JSON sidecar:** `experiments/frame_distribution_2026-05-03.json`.
- **Report:** `docs/stage2-frame-analysis-2026-05-03.md`.
---
## Phase 1 — Scripts
Inventory of every file under `~/aaronai/scripts/` (and `~/aaronai/scripts/experiments/`). `.bak*` files are listed at the bottom of the section but not individually documented; they are point-in-time snapshots from the rollback work and are not part of any active code path.
### `api.py`
- **Path:** `scripts/api.py`
- **Status:** Working
- **Last-touched:** 2026-05-01
- **What it does:** FastAPI backend on port 8000. Hosts the chat endpoint (`/api/chat`), session-based auth (`/auth/login`, `/auth/logout`, `/auth/check`), conversation CRUD, settings panel API, memory editor, status endpoint, audio transcription via faster-whisper `large-v3`, capture endpoint (voice and image+voice), dreamer-status and dreamer-run, corpus-integrity status / retry / reconcile, and SSE streams for both authenticated dreamer notifications and the public capture page. Embeds an APScheduler `BackgroundScheduler` that drives the nightly dream cycle and conversation ingest. Loads SentenceTransformers `all-MiniLM-L6-v2` and the Anthropic SDK at startup. Auth is a session token in a 30-day cookie backed by `sessions.db` (sqlite). Conversations and messages are in `conversations.db` (sqlite). Document retrieval is pure cosine similarity over pgvector (top-8, threshold 0.3) — the CV-pinning workaround was stripped 2026-04-30.
- **Dependencies:** `.env` (`PG_DSN`, `ANTHROPIC_API_KEY`, `AARON_AI_PASSWORD`, `NEXTCLOUD_*`); `~/aaronai/conversations.db`, `~/aaronai/sessions.db`, `~/aaronai/memory.md`, `~/aaronai/settings.json`, `~/aaronai/watcher_status.json`, `~/aaronai/watcher_state.json`, `~/aaronai/dreamer_state.json`, `~/aaronai/corpus_integrity_report.json`; PostgreSQL (`embeddings`, `stage_2_queue`, `ingest_failures`); SentenceTransformer model files; faster-whisper model files; the `dream.py`, `ingest.py`, and `corpus_integrity.py` scripts which it shells out to; Nextcloud WebDAV. Runs as `aaronai.service`.
- **What depends on it:** Frontend (`aaronai-web` Next.js) consumes every `/api/*` endpoint; mobile capture layer consumes `/api/capture` and `/api/captures/events`; `dream.py` POSTs to `/api/events/notify` to push SSE to the frontend; the APScheduler embedded in this process is the only thing that triggers the nightly dream cycle and the nightly conversation ingest in production.
- **Behavior matches intent?** Partial. Pure-similarity retrieval matches the post-2026-04-30 architecture statement. The `chat` function ignores `client_time` for memory retrieval purposes (just inserts it into the prompt), which is consistent with the doc. Two divergences worth flagging:
1. `/auth/check` references `SESSIONS` (line 385), which is undefined: no `SESSIONS` set or dict exists anywhere in the file, so the check is effectively dead code. Auth checking on the frontend evidently relies on the cookie being present rather than on this endpoint working; a request to it would raise `NameError` and return a 500. Likely a leftover from an earlier in-memory session implementation that was migrated to sqlite without removing the check.
2. `transcribe_and_save()` (the background voice capture path, line 670) does NOT save the raw audio file to `Journal/Media/` — only the transcript markdown to `Journal/Captures/`. The architecture doc's "Multimedia Ingest Pipeline" describes `Journal/Media/YYYY-MM/` as the raw-ground-truth location for all captured media. The image+voice path does write image bytes to Media, but voice-only does not. A future Late REM "raw images during synthesis" feature listed as "not yet built" in the architecture doc relies on Media existing, but for voice this means the audio is gone after transcription. Flagged.
- **Notes:** APScheduler is created at module import (`scheduler = BackgroundScheduler()` at line 1105) and started in the lifespan. Stage 3 worker code is not invoked from here. The `/api/reindex` endpoint shells out to `ingest.py` which still writes to pgvector and (since `SKIP_STAGE2_ENQUEUE` is unset by default) re-enqueues to `stage_2_queue` — meaning a reindex can put files back through Stage 2 and Stage 3, which under the bespoke decision is no longer the desired path. The retry endpoint at `/api/corpus/retry` writes `text[:50000]` to `stage_2_queue` (line 1074) — reintroducing the 50KB truncation pattern that F14 fixed elsewhere. **NREM-shape divergence: the truncation cap was removed from `watcher.py`, `ingest.py`, and `corpus_integrity.py` per the F14 fix on 2026-05-01, but `api.py` retry path was not patched.**
### `dream.py`
- **Path:** `scripts/dream.py`
- **Status:** Working (post NREM-fix)
- **Last-touched:** 2026-05-02
- **What it does:** The Active Inference engine. Provides the nightly pipeline (NREM → Early REM → Late REM → Synthesis) and a single-mode CLI entry-point. Each stage retrieves chunks from pgvector (or Graphiti when `DREAMER_SUBSTRATE=graphiti`), prompts Claude Sonnet, writes a markdown file to Nextcloud `Journal/Dreams/` via WebDAV, and feeds its output as context into the next stage. Pipeline writes a per-night manifest JSON. Lucid mode is the on-demand path used by Settings → Dream Now. State persisted in `~/aaronai/dreamer_state.json`; cumulative `retrieved_sources` capped at 500, trimmed to 400 on overflow. Score-band Early-REM exclusion (v1.1) preserved. The 2026-05-02 NREM exclusion fix is at line 478: `nrem_chunks = retrieve("nrem", excluded_sources=None)`.
- **Dependencies:** `.env` (`PG_DSN`, `ANTHROPIC_API_KEY`, `NEXTCLOUD_*`); `pgvector` `embeddings` table (or graphiti sidecar `/search`); SentenceTransformer `all-MiniLM-L6-v2` (re-loaded inside `retrieve()`); `~/aaronai/dreamer_state.json`, `~/aaronai/watcher_state.json`, `~/aaronai/conversations.db`; Anthropic API; Nextcloud WebDAV; for SSE notify, the running `api.py` on `localhost:8000`.
- **What depends on it:** APScheduler in `api.py` shells out to it nightly; `/api/dreamer/run` shells out for on-demand runs; `aaronai-dreamer.service` (Type=oneshot) wraps it for manual invocation; `e3_dreamer_substrate.py` invokes it under `DREAMER_SUBSTRATE=graphiti`.
- **Behavior matches intent?** Yes for NREM (post-fix matches reframe's replay-and-consolidation framing); yes for Early REM and Late REM (still consult `previously_retrieved`, which the reframe permits as novelty bias); partial for Synthesis (no substrate mutation, which is fine under the architecture doc but is exactly what the reframe says is missing for E6 to work); "lucid" is implemented even though architecture doc lists Lucid mode as "not yet built" (the function exists and is reachable from the CLI/API).
- **Notes:** `retrieve_graphiti()` accepts and applies `excluded_sources` (the F1 fix), but the over-fetch is `n_results * 3` and the post-filter is in-process. Dreamer falls back gracefully to empty when sidecar fails. **NREM-shape divergence candidate: the dreamer's exclusion-set state is *cumulative across all nights*, capped at 500 — every Early REM and Late REM excludes up to 500 prior sources. On a corpus of 1,200 sources this is ~40% of the corpus permanently invisible to Early/Late REM after the cap fills. The architecture doc and reframe don't specify cumulative-across-nights exclusion; they specify session-scoped novelty. The bug shape is the same as the NREM exclusion bug — a deduplication mechanism functioning silently in a way the architecture didn't request.** Flagged.
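
The cumulative-exclusion concern can be made concrete with a small simulation. This is a sketch under stated assumptions, not the code in `dream.py`: the cap (500) and trim target (400) come from the description above, while the helper names, the choice to drop the oldest entries on trim, and the 30-new-sources-per-night rate are all hypothetical.

```python
CAP = 500       # cap on cumulative retrieved_sources, per the audit
TRIM_TO = 400   # size after an overflow trim, per the audit


def record_night(state: list[str], retrieved: list[str]) -> list[str]:
    """Append tonight's newly retrieved sources, trimming on overflow."""
    seen = set(state)
    merged = list(state)
    for src in retrieved:
        if src not in seen:
            seen.add(src)
            merged.append(src)
    if len(merged) > CAP:
        # Assumption: the oldest entries are the ones dropped.
        merged = merged[-TRIM_TO:]
    return merged


def simulate(nights: int, new_per_night: int) -> list[str]:
    """Run the exclusion list forward with distinct sources every night."""
    state: list[str] = []
    for night in range(nights):
        fresh = [f"src-{night}-{i}" for i in range(new_per_night)]
        state = record_night(state, fresh)
    return state


if __name__ == "__main__":
    corpus_size = 1200
    state = simulate(nights=40, new_per_night=30)
    print(len(state), f"{len(state) / corpus_size:.0%} of corpus excluded")
```

Under these assumptions, once the list has overflowed for the first time it oscillates between 400 and 500 entries, so on a 1,200-source corpus the excluded share sits between 400/1200 (about 33%) and 500/1200 (about 42%) on every subsequent night.
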
### `watcher.py`
- **Path:** `scripts/watcher.py`
- **Status:** Working
- **Last-touched:** 2026-05-01
- **What it does:** Stage 1 of the encoding pipeline. Watches `/home/aaron/nextcloud/data/data/aaron/files` recursively via watchdog. Loads SentenceTransformer `all-MiniLM-L6-v2` once at startup. On modify/create/move/close events, debounces 120s, then chunks (500-word with 50-word overlap), embeds, and writes to pgvector `embeddings`. Enqueues full text to `stage_2_queue` unless `SKIP_STAGE2_ENQUEUE` is set. Records extraction or pgvector failures to `ingest_failures` and resolves them on success. Heartbeat written every loop tick to `~/aaronai/watcher_heartbeat`. Status JSON written to `~/aaronai/watcher_status.json`. Startup recovery scans for files with changed mtimes since last run. `on_moved` checks `dest_path` (Nextcloud writes `.part` then renames); `on_closed` is handled as well, as a belt-and-suspenders measure.
- **Dependencies:** `.env` (`PG_DSN`); pgvector; SentenceTransformer; `pypdf`, `python-docx`, `python-pptx`; watchdog; `~/aaronai/watcher_state.json`. Runs as `aaronai-watcher.service`.
- **What depends on it:** Anything that reads from pgvector `embeddings` (api.py chat, dream.py retrieval, tier1_migration.py); anything that polls `stage_2_queue` (stage2_worker); `corpus_integrity.py`; the watcher heartbeat is consumed by an external cron monitor mentioned in tech-debt.
- **Behavior matches intent?** Yes against the architecture's Stage 1 description and the parity principle (no filtering, no decisions). The full-text path no longer truncates to 50KB. Under the bespoke decision the Stage 2 enqueue path is on the chopping block; it is currently still active and runs by default.
- **Notes:** No truncation in `enqueue_stage2()`. `Admin/Backups` and `Journal/Media/` are excluded from indexing per the architecture's File Management Policy. `SKIP_STAGE2_ENQUEUE` env var is the documented kill-switch for migration runs.
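
For reference, the 500-word / 50-word-overlap chunking described above can be sketched as follows. This is an illustration only: the function name `chunk_words` and the handling of the final partial chunk are assumptions, and the real boundary behavior in `watcher.py` may differ.

```python
def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words
    with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once this window has reached the end of the document, so no
        # trailing chunk consists purely of already-covered overlap.
        if start + size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(1000))
    print([len(c.split()) for c in chunk_words(sample)])
```
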
### `ingest.py`
- **Path:** `scripts/ingest.py`
- **Status:** Working-degraded (functional but architecturally redundant)
- **Last-touched:** 2026-05-01
- **What it does:** Bulk folder ingester. Loads SentenceTransformer at module import, walks a folder, extracts text, chunks, embeds, writes to `embeddings`, and (unless `SKIP_STAGE2_ENQUEUE`) enqueues to `stage_2_queue`. Invoked by `api.py`'s `/api/reindex` endpoint with `NEXTCLOUD_PATH` as argument. CLI default target is `~/aaronai/docs`.
- **Dependencies:** Same as `watcher.py` minus watchdog. `.env`, pgvector, SentenceTransformer. No service unit — invoked on demand only.
- **What depends on it:** `api.py` `/api/reindex` button; the architecture's tech-debt entry mentions `ingest_chatgpt.py` and `ingest_claude.py` (manual one-shot scripts) but neither of those files is present in `scripts/` — so the only live caller is `/api/reindex`.
- **Behavior matches intent?** Partial. The architecture doc has it as one of four ingest scripts in the Layer 1 table. Only this file and `ingest_conversations.py` exist. The chunk-embed-store flow still matches Stage 1 intent. The Stage 2 enqueue side effect (running every reindex) is a wide blast radius — clicking "Re-index" puts every changed file back through cascade, which under the bespoke decision is increasingly unwanted work.
- **Notes:** Almost the entire chunk/embed/extract code path is duplicated verbatim from `watcher.py`. The architecture's tech-debt entry F11 (two implementations of the encoding pipeline) is real and visible side by side. Both scripts define their own `enqueue_stage2()` inline, and both load SentenceTransformer at import (the model is loaded twice only if both are imported in the same process, which happens only under unusual import patterns).
### `stage2_worker.py`
- **Path:** `scripts/stage2_worker.py`
- **Status:** Working
- **Last-touched:** 2026-05-01
- **What it does:** Polls `stage_2_queue` for rows with no `completed_at`/`failed_at` and `attempts < 3`. Sends document to local Mistral (`mistral:latest` via Ollama on port 11434) with a taxonomy-free prompt that returns four fields: `active_frames`, `frame_relationships`, `extraction_orientation`, `one_sentence_summary`. Documents under 2000 chars skip Stage 3 and are marked complete. Otherwise builds an orientation string and enqueues `stage_3_queue` with `(source, full_text, orientation, stage2_metadata)`. Wedge recovery: 2+ consecutive failures triggers `sudo systemctl restart ollama`. Logs to `/var/log/aaronai/stage2.log`. Heartbeat at `/var/log/aaronai/stage2-heartbeat`. Worker version 2.1.
- **Dependencies:** `.env` (`PG_DSN`); Ollama on `localhost:11434`; `mistral:latest` model loaded; passwordless sudo for `/bin/systemctl restart ollama` (per `/etc/sudoers.d/aaron-aaronai`); PostgreSQL `stage_2_queue` and `stage_3_queue` tables. Runs as `aaronai-stage2.service`.
- **What depends on it:** Anything that reads `stage_3_queue.completed_at` (corpus_integrity, api.py corpus status); Stage 3 worker as the queue consumer.
- **Behavior matches intent?** Partial under the reframe. The taxonomy-free prompt matches the Stage 3.1 research direction the architecture doc described. Under the bespoke decision the entire Stage 2/3 pipeline is being re-evaluated; the worker itself is doing what it was redesigned to do.
- **Notes:** `recover_wedge()` calls absolute `/usr/bin/sudo` and `/bin/systemctl` paths (per the v2.1 patch). No `WatchdogSec`-driven SIGKILL pattern (commented out in the systemd unit per the 2026-05-01 fix). Mistral parse-failure is detected and surfaces as `failure_reason='mistral_parse_failure'`. `RETRY_ATTEMPTS = 2` plus the original attempt = 3 max attempts before the row is dead; this matches the worker's SQL `attempts < %s` with `RETRY_ATTEMPTS + 1`.
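
The retry budget and the short-document shortcut described above reduce to two small predicates. A sketch, assuming a row is available as a dict with the three columns named in the description; `is_claimable`, `needs_stage3`, and `SHORT_DOC_CHARS` are hypothetical names, while `RETRY_ATTEMPTS = 2` and the 2000-character threshold come from the text.

```python
from datetime import datetime
from typing import Optional, TypedDict

RETRY_ATTEMPTS = 2       # retries after the original attempt, per the audit
SHORT_DOC_CHARS = 2000   # documents under this length skip Stage 3


class QueueRow(TypedDict):
    completed_at: Optional[datetime]
    failed_at: Optional[datetime]
    attempts: int


def is_claimable(row: QueueRow) -> bool:
    """Mirror of the polling filter: unfinished, not failed, budget left."""
    return (
        row["completed_at"] is None
        and row["failed_at"] is None
        and row["attempts"] < RETRY_ATTEMPTS + 1
    )


def needs_stage3(text: str) -> bool:
    """Short documents are marked complete without a Stage 3 enqueue."""
    return len(text) >= SHORT_DOC_CHARS
```

With `RETRY_ATTEMPTS = 2`, a row is claimable at `attempts` 0, 1, and 2, which gives the three total attempts noted above.
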
### `stage3_worker.py`
- **Path:** `scripts/stage3_worker.py`
- **Status:** Stopped (per session brief — service stopped manually 2026-05-02; code is unchanged)
- **Last-touched:** 2026-05-01
- **What it does:** Polls `stage_3_queue` for rows ready to process. For each, chunks document at 500-word boundaries (matching Stage 1), and POSTs to graphiti sidecar `/episodes/bulk`. Three paths by document size: (a) <1500 chars → single episode, no saga; (b) ≤10 chunks → single bulk commit with a saga tag; (c) >10 chunks → split into batches of 10 each, all tagged with the same saga so graphiti links them as one document unit. Wedge recovery: 2+ consecutive failures triggers `sudo systemctl restart aaronai-graphiti.service`, then waits 45s for sentence-transformers + BGE reranker + graphiti to re-init. Worker version 2.2.
- **Dependencies:** `.env` (`PG_DSN`); graphiti sidecar on `localhost:8001`; passwordless sudo for `/bin/systemctl restart aaronai-graphiti.service`; PostgreSQL `stage_3_queue`. Runs as `aaronai-stage3.service`.
- **What depends on it:** `corpus_integrity.py` reads `stage_3_queue.completed_at` to compute "Graphiti-side" coverage; `api.py`'s `/api/corpus/status` does the same.
- **Behavior matches intent?** No, against the bespoke decision. The architecture doc describes Stage 3 as the cascade ingest path into graphiti; the bespoke decision dissolves that path. The code itself does what it was patched to do (saga splitting, wedge detection, sudoers). What it represents — feeding documents into a graphiti substrate — is no longer the architectural target.
- **Notes:** Service is stopped per the session brief, but `stage_3_queue` rows continue to be created by `stage2_worker.py`, so the queue grows monotonically while the consumer is off. This is fine for the rollback baseline (no new rows of consequence with cascade prompts in the rolled-back form), but is worth flagging in case the watcher picks up new files. Uses the absolute `/usr/bin/sudo` and `/bin/systemctl` paths (v2.2 patch). `start` and `end` chunk indices are 1-based in the saga-batch logging — cosmetic only.
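
The three size-based paths described for this worker can be sketched as a pure planning function. This is not the worker's code: `plan_submission` and the returned dict shape are hypothetical, and only the thresholds (1500 characters, batches of 10 chunks, one shared saga per document) are taken from the description above.

```python
from typing import Optional, TypedDict

SINGLE_EPISODE_MAX_CHARS = 1500  # below this: single episode, no saga
BATCH_SIZE = 10                  # chunks per bulk commit


class Submission(TypedDict):
    chunks: list[str]
    saga: Optional[str]


def plan_submission(text: str, chunks: list[str], saga: str) -> list[Submission]:
    """Decide how a document is split into sidecar submissions."""
    if len(text) < SINGLE_EPISODE_MAX_CHARS:
        # Path (a): the whole document is one episode with no saga tag.
        return [{"chunks": [text], "saga": None}]
    # Paths (b) and (c): batches of up to 10 chunks, all tagged with the same
    # saga. With 10 or fewer chunks this yields exactly one batch.
    return [
        {"chunks": chunks[i:i + BATCH_SIZE], "saga": saga}
        for i in range(0, len(chunks), BATCH_SIZE)
    ]
```
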
### `graphiti_service.py`
- **Path:** `scripts/graphiti_service.py`
- **Status:** Working (per the session brief; will be deprecated when bespoke substrate replaces graphiti)
- **Last-touched:** 2026-04-30 (commit), 2026-05-02 (working-copy mtime — same content, file was rewritten then reset during rollback)
- **What it does:** FastAPI sidecar on port 8001. Wraps `graphiti-core` to avoid asyncio event loop conflicts in the main FastAPI process. Single graphiti instance built in lifespan, closed on shutdown. Endpoints: `/health`, `POST /episodes` (single), `POST /episodes/bulk` (with optional `saga` link), `GET /search`. Uses `SentenceTransformerEmbedder` from `st_embedder.py` and `BGERerankerClient` from graphiti-core. `FalkorDriver` connects to FalkorDB at `localhost:6379` database `aaron`. LLM provider switchable via env (`anthropic` default → `claude-sonnet-4-6`). `max_coroutines=2`, `EMBEDDING_DIM=384`. Hard-coded group default `aaron`.
- **Dependencies:** `.env` (`ANTHROPIC_API_KEY` or `LLM_API_KEY`, `LLM_PROVIDER`, `LLM_MODEL`, `FALKORDB_HOST`, `FALKORDB_PORT`, `GRAPHITI_GROUP_ID`); FalkorDB Docker container on `127.0.0.1:6379`; graphiti-core 0.29.0 in venv; sentence-transformers, BGE reranker. Runs as `aaronai-graphiti.service`.
- **What depends on it:** `dream.py` `retrieve_graphiti()` (only when `DREAMER_SUBSTRATE=graphiti`); `stage3_worker.py` posts to it; `tier1_migration.py` posts to it; the bulk cost-test scripts post to it; `e3_dreamer_substrate.py` queries it; `e1_8_taxfree_cascade.py` and `e1_9_retroactive.py` post or query.
- **Behavior matches intent?** Yes against the architecture doc. Under the bespoke decision this whole sidecar is the layer being replaced; the doc still says it's the target memory layer.
- **Notes:** `add_episode_bulk()` is called with `saga=req.saga or None` — the saga param is what stage3_worker uses to link split-batch chunks. Result body returns `{"ok": true, "count": N}` rather than the underlying graphiti return value. Logs full traceback to `/var/log/aaronai/graphiti-sidecar.log` (the 2026-04-30 fix).
### `corpus_integrity.py`
- **Path:** `scripts/corpus_integrity.py`
- **Status:** Working
- **Last-touched:** 2026-05-01
- **What it does:** Three-way reconciliation. Compares filesystem (Nextcloud), pgvector (`embeddings.source`), and graphiti (the `tier1_migration_state.json` ingested list plus the `stage_3_queue.completed_at IS NOT NULL` source list). Reports counts in each set, and gaps (in filesystem but neither pgvector nor graphiti). With `--fix`, attempts text extraction on each gap file and either enqueues to `stage_2_queue` (full text, no truncation) or writes to `ingest_failures` if extraction returns empty. Writes `~/aaronai/corpus_integrity_report.json`.
- **Dependencies:** `.env`; pgvector `embeddings`, `stage_3_queue`, `ingest_failures`, `stage_2_queue`; `~/aaronai/experiments/tier1_migration_state.json`; pypdf, python-docx, python-pptx. No service unit — invoked by `api.py /api/corpus/reconcile` background task and by the user manually.
- **What depends on it:** `api.py /api/corpus/status` reads the report it writes; the SettingsPanel UI's "Ingest Health" section consumes that.
- **Behavior matches intent?** Partial. Implements the architecture's "ingest_failures + reconciliation" tech-debt-resolved item correctly. Under the bespoke decision, the graphiti side of the reconciliation is meaningless after Stage 3 is shut off — the script will keep happily reporting "this many sources are in graphiti" but those numbers won't move and won't represent useful state. Not broken, but the report's "graphiti only" / "Both" lines become semantically empty.
- **Notes:** Re-implements `extract_text` for retry path inline rather than reusing watcher's; another instance of F11.
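
The three-way reconciliation is, at its core, set arithmetic over source identifiers. A minimal sketch, assuming each store has already been reduced to a set of source paths; the function name `reconcile` and the report keys are hypothetical and do not reflect the actual report format.

```python
def reconcile(
    filesystem: set[str],
    pgvector: set[str],
    migrated: set[str],
    stage3_completed: set[str],
) -> dict[str, set[str]]:
    """Classify filesystem sources by which stores know about them."""
    # Graphiti coverage is the union of the migration state file and the
    # completed Stage 3 queue rows.
    graphiti = migrated | stage3_completed
    return {
        "both": filesystem & pgvector & graphiti,
        "pgvector_only": (filesystem & pgvector) - graphiti,
        "graphiti_only": (filesystem & graphiti) - pgvector,
        # Gaps: on disk but in neither store; these are what --fix retries.
        "gaps": filesystem - pgvector - graphiti,
    }


if __name__ == "__main__":
    report = reconcile(
        filesystem={"a.md", "b.md", "c.md", "d.md"},
        pgvector={"a.md", "b.md"},
        migrated={"a.md"},
        stage3_completed={"c.md"},
    )
    for name, members in report.items():
        print(name, sorted(members))
```
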
### `ingest_conversations.py`
- **Path:** `scripts/ingest_conversations.py`
- **Status:** Working
- **Last-touched:** 2026-04-27
- **What it does:** Nightly job. Reads `conversations.db`, finds conversations with ≥3 user-assistant exchanges, slides a 2-exchange window, formats `[Aaron AI conversation: title]` chunks, embeds with SentenceTransformer, writes to pgvector `embeddings` with `id = aaronai_conv_{conv_id}_{idx}` and `type='aaronai_conversation'`. Idempotent via `ON CONFLICT DO UPDATE`.
- **Dependencies:** `.env`; pgvector; `conversations.db`. Triggered by APScheduler in `api.py` at 02:30 UTC.
- **What depends on it:** Anything reading from pgvector. Indirect: dream.py and chat retrieval pull these chunks.
- **Behavior matches intent?** Yes. Matches the architecture's Layer 1 ingest table.
- **Notes:** No watchdog/state — re-runs each night and skips already-embedded ids. `cur.close()` is missing on the read connection at line 39 (the conn is closed though, so it's harmless).
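
The sliding-window chunking described above can be sketched as follows. Only the 3-exchange minimum, the 2-exchange window, the `[Aaron AI conversation: title]` prefix, and the `aaronai_conv_{conv_id}_{idx}` id scheme come from the description; the function name, the stride of one exchange, and the `User:` / `Assistant:` body formatting are assumptions.

```python
def conversation_chunks(
    conv_id: str,
    title: str,
    exchanges: list[tuple[str, str]],
    min_exchanges: int = 3,
    window: int = 2,
) -> list[tuple[str, str]]:
    """Return (id, text) pairs, one per window of consecutive exchanges."""
    if len(exchanges) < min_exchanges:
        return []  # conversation too short to index
    chunks = []
    for idx in range(len(exchanges) - window + 1):
        body = "\n".join(
            f"User: {user}\nAssistant: {assistant}"
            for user, assistant in exchanges[idx:idx + window]
        )
        chunk_id = f"aaronai_conv_{conv_id}_{idx}"
        chunks.append((chunk_id, f"[Aaron AI conversation: {title}]\n{body}"))
    return chunks
```

Because the ids are deterministic in `conv_id` and `idx`, re-running the job produces the same keys, which is what lets an `ON CONFLICT DO UPDATE` upsert keep the nightly run idempotent.
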
### `st_embedder.py`
- **Path:** `scripts/st_embedder.py`
- **Status:** Working
- **Last-touched:** 2026-04-27
- **What it does:** `EmbedderClient` adapter for graphiti-core. Wraps SentenceTransformer `all-MiniLM-L6-v2` (384-dim) so graphiti uses the same embedding model as Stage 1. No API cost for graphiti embeddings.
- **Dependencies:** `graphiti_core.embedder.client`, sentence-transformers.
- **What depends on it:** `graphiti_service.py` imports it at sidecar startup.
- **Behavior matches intent?** Yes. Implements the "embedding layer stays on Sentence Transformers regardless of LLM" architectural commitment.
- **Notes:** Will be obsolete when graphiti is replaced under the bespoke decision (the embedder pattern carries over but this specific adapter does not).
### `tier1_migration.py`
- **Path:** `scripts/tier1_migration.py`
- **Status:** Stable but unused (already-run one-shot)
- **Last-touched:** 2026-04-30
- **What it does:** Migrates ~300 most-recent pgvector sources to graphiti via the sidecar's `/episodes/bulk` endpoint. Resumable via `~/aaronai/experiments/tier1_migration_state.json`. Adapts batch size to document length (`BATCH_SIZE=4`, `LONG_DOC_BATCH_SIZE=2` for docs ≥5000 chars). Implements Max-pending-queries / timeout / rate-limit backoff. Writes per-batch results to `tier1_migration_results.json`.
- **Dependencies:** `.env` (`PG_DSN`); graphiti sidecar; `~/aaronai/experiments/`. No service unit.
- **What depends on it:** `corpus_integrity.py` reads the state file. `api.py` corpus status reads the same file. Both treat ingested-list as part of the "graphiti coverage" answer.
- **Behavior matches intent?** Yes against the architecture's Tier 1 migration plan (already complete per the doc — 1,205 sources, 4,990 nodes, 22,289 edges). Obsolete under the bespoke decision but harmless if not run again.
- **Notes:** Hard-codes `timestamp: "2026-04-28T00:00:00"` for migration episodes — all migrated sources land with that bi-temporal `valid_at`. The migration state file lives in `~/aaronai/experiments/`, which is referenced from multiple downstream readers — moving or deleting it would break corpus integrity status.
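
One plausible reading of the length-adaptive batching is sketched below. The constants `BATCH_SIZE=4` and `LONG_DOC_BATCH_SIZE=2` and the 5000-character threshold come from the description above; the rule used here (shrink the batch whenever any document in the candidate window is long) is an assumption, as are the names `LONG_DOC_CHARS` and `plan_batches`. The script's actual rule may differ.

```python
BATCH_SIZE = 4
LONG_DOC_BATCH_SIZE = 2
LONG_DOC_CHARS = 5000  # documents at or above this length count as long


def plan_batches(docs: list[str]) -> list[list[str]]:
    """Group documents into batches, using smaller batches near long docs."""
    batches = []
    i = 0
    while i < len(docs):
        window = docs[i:i + BATCH_SIZE]
        # Assumed rule: one long document in the window shrinks the batch.
        if any(len(d) >= LONG_DOC_CHARS for d in window):
            window = window[:LONG_DOC_BATCH_SIZE]
        batches.append(window)
        i += len(window)
    return batches
```
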
### `consolidator_v0_1.py`
- **Path:** `scripts/consolidator_v0_1.py`
- **Status:** Deprecated (per reframe doc and bespoke decision)
- **Last-touched:** 2026-04-29 (commit), 2026-04-30 (working tree)
- **What it does:** Calibration-phase alias resolution. Pulls all `:Entity` nodes from FalkorDB `aaron` graph, computes summary embeddings via Ollama `nomic-embed-text`, infers light type labels heuristically, computes pairwise (name, ego, neighbor) similarity within type blocks, writes a markdown proposals log to `Nextcloud/Journal/Consolidation/proposals-{ts}.md` plus a JSON sibling. **Does not execute merges.** The 0.1.5 in-place patch (containment metric replacing Jaccard, summary embeddings) is reflected in this file; the `.bak` is the pre-patch version.
- **Dependencies:** FalkorDB on port 6379 (direct, not via sidecar); Ollama for embeddings; `Nextcloud/Journal/Consolidation/`.
- **What depends on it:** Nothing in production. Designed for human review of proposals.
- **Behavior matches intent?** No, under the reframe and bespoke decision. The reframe doc explicitly identifies "consolidator-as-separate-system" as the architectural mistake — its function moves into the dream phase. Track 1 should consider this a removal candidate.
- **Notes:** No service unit, no scheduler entry — executed manually only. Calibration findings (2026-04-29) showed alias-from-graph-features-alone has structural problems on this corpus.
### `backup.sh`
- **Path:** `scripts/backup.sh`
- **Status:** Working
- **Last-touched:** 2026-04-26
- **What it does:** Daily-snapshot bash script. Copies `memory.md`, `settings.json`, `conversations.db` into `~/nextcloud/.../Admin/Backups/` with date-stamped names; deletes anything older than 7 days. Output ends up inside Nextcloud's `Admin/Backups/`, which the watcher excludes from indexing — so backups don't pollute the corpus.
- **Dependencies:** Read access to the three files; write access to `Admin/Backups/`.
- **What depends on it:** Nothing programmatic. Operationally: the only off-host backup of `memory.md` and `settings.json`.
- **Behavior matches intent?** Yes. Lightweight, no-judgement copy → Nextcloud → Nextcloud Desktop → off-machine.
- **Notes:** Cron-driven (Phase 5 will confirm). Uses `find -mtime +7 -delete` so naming-format changes wouldn't break retention.
### Experimental scripts (one-shot research artifacts)
The following scripts are all completed experiments. None has a service unit, none is on a schedule, none is a runtime dependency of any production code path. They are kept as reproducibility artifacts for the experiments log. **All are candidates for moving out of `scripts/` into `experiments/` or `deprecated/`**: they crowd the production directory, and it is hard to tell at a glance which files are live workers.
| File | Experiment | Status | Notes |
|---|---|---|---|
| `audit_expansion_draw.py` | Type-aware stratified draw for n=20 audit expansion | Experimental | Sample-construction tool for `base_class_audit_rerun.py` |
| `base_class_test.py` | Base-class enrichment n=20 | Experimental | OOP framing experiment, validated 2026-04-28 |
| `base_class_validation.py` | Base-class enrichment n=50 | Experimental | Main validation study |
| `base_class_audit_rerun.py` | Base-class enrichment audit rerun | Experimental | n=8 paired-extraction audit, 0% fabrication |
| `briefing_generator_v2.py` | Experiment 002b (briefing v2) | Experimental | Validated local Mistral structural pattern recognition at 96% |
| `briefing_test.py` | Experiment 002 (briefing v1) | Experimental | Superseded by v2 |
| `cascade_test.py` | Entity-drafter cascade n=20 | Experimental | Falsified 2026-04-28 |
| `cascade_optimization_test.py` | Optimized entity-drafter cascade n=30 | Experimental | Confirmed entity-drafter cascade is dead |
| `consistency_test.py` | Mistral 3-pass consistency n=50 | Experimental | Experiment 001 |
| `consistency_test_v2.py` | Entity-only consistency, fixed sampling | Experimental | Experiment 003 |
| `cost_test_graphiti_bulk.py` | Bulk endpoint cost test | Experimental | Stratified n=50 |
| `cost_test_graphiti_bulk_retry.py` | Retry of failed bulk batches | Experimental | Pre-MAX_QUEUED_QUERIES bump |
| `cost_test_graphiti_bulk_retry2.py` | Second retry attempt | Experimental | Smaller batches, post-bump |
| `cost_test_graphiti_migration.py` | Single-episode migration cost test | Experimental | Stratified n=50 |
| `e1_select_sample.py` | E1 sample selection | Experimental | Cascade re-extraction sample |
| `e1_run_cascade.py` | E1 orchestration | Experimental | Initial cascade run, group `aaron_cascade_test` |
| `e1_run_cascade_corrected.py` | E1 corrected (custom_extraction_instructions path) | Experimental | Re-run with the fixed prompt-path |
| `e1_per_source_predicates.py` | E1 per-source predicate count | Experimental | Corrected metric |
| `e1_compare_metrics.py` | E1 A vs B metrics comparison | Experimental | Reads from FalkorDB via redis-cli docker exec |
| `e14_select_sample.py` | E1.4 sample selection (n=30) | Experimental | Stratified, excludes E1's 10 |
| `e14_run_cascade.py` | E1.4 cascade orchestration | Experimental | Group `aaron_cascade_e14` |
| `e14_per_source_predicates.py` | E1.4 per-source predicate diversity | Experimental | Bucket-level analysis |
| `e16_rate_purity.py` | E1.6 domain-purity human rating UI | Experimental | Surfaces taxonomic-mismatch finding |
| `e16_analyze.py` | E1.6 Spearman correlation against E1.4 | Experimental | Pre-registered decision rules |
| `e2_resolution_check.py` | E2 entity resolution diagnostic | Experimental | Six test entities, FalkorDB query |
| `e2_alias_followup.py` | E2 alias follow-up | Experimental | Aaron AI variants etc. |
| `e2_source_diversity.py` | E2 episode count per entity | Experimental | Diagnostic |
| `token_measurement_test.py` | Experiment 005 — token reduction | Experimental | Validates 42.0% modeled estimate |
| `experiments/e1_8_eval.py` | E1.8 eval phase | Experimental | Pulls predicate counts |
| `experiments/e1_8_taxfree_cascade.py` | E1.8 ingest phase | Experimental | Taxonomy-free cascade |
| `experiments/e1_9_retroactive.py` | E1.9 retroactive validation | Experimental | Phase 1 parked 2026-04-30 (graph immature) |
| `experiments/e3_dreamer_substrate.py` | E3 dreamer substrate comparison | In-flight | "Genuinely ready" per architecture doc post-F1 fix; per bespoke decision now confounded — not runnable to produce a trustworthy answer |
The `e3_dreamer_substrate.py` script is the only one with current relevance: its run was the proximate cause of the bespoke decision (per the decision doc, running E6 on graphiti is "a vibe check" because of issue #1325 and friends). Code is functional; under the bespoke decision the experiment it runs cannot produce a trustworthy answer.
### Backup files (`.bak*`)
The following are point-in-time copies left behind by the rollback work. None is on any code path. They are documented as a group rather than individually:
- `api.py.bak.20260501-001427`
- `consolidator_v0_1.py.bak` (pre-0.1.5-patch)
- `corpus_integrity.py.bak.20260501-021703`
- `dream.py.bak`, `dream.py.bak.20260501-002209`
- `graphiti_service.py.bak`, `graphiti_service.py.bak.20260501-185619`, `graphiti_service.py.bak.20260502-022307`
- `ingest.py.bak.20260501-004131`
- `stage2_worker.py.bak.20260501-171928`, `.20260501-172531`, `.20260501-185942`
- `stage3_worker.py.bak.20260501-050354`, `.20260501-050453`, `.20260501-050719`, `.20260501-173233`, `.20260501-190357`
- `watcher.py.bak`, `watcher.py.bak.20260501-004131`
Stage 3 alone has five `.bak` versions, which matches the v2.0 → v2.1 → v2.2 patch history. Track 1 cleanup candidate: collapse all `.bak*` into a `deprecated/` or remove (git history is the durable artifact).
### `__pycache__/`
Compiled `.pyc` files for `api`, `corpus_integrity`, `dream`, `ingest`, `stage3_worker`, `st_embedder`, `watcher`. Notably there is *no* `.pyc` for `stage2_worker.py`. CPython writes bytecode caches only for modules that are imported, never for the script executed as `__main__`, so the likeliest explanation is that `stage2_worker.py` has only ever been run directly and never imported; that is an inference from absence and is unverified. Not a code path. Remove on next clean build if desired.
---
### Phase 1 summary
**Working and matching intent:**
- `watcher.py` (Stage 1)
- `ingest_conversations.py` (nightly conversation indexer)
- `st_embedder.py`
- `backup.sh`
**Working with behavior-vs-intent divergences:**
- `api.py` — dead `/auth/check` reference; voice capture doesn't archive raw audio to `Journal/Media/`; `/api/corpus/retry` reintroduces 50KB truncation.
- `dream.py` — cumulative 500-source exclusion across nights is a NREM-shape divergence: silently shrinks Early/Late REM's reachable corpus over time without architectural mandate. NREM exclusion fix is in place but the pattern that caused that bug exists at a different layer.
- `ingest.py` — duplicates Stage 1 logic (F11), default behavior re-enqueues to Stage 2 on every reindex.
- `stage2_worker.py` — works as designed; under the bespoke decision is doing work that's no longer the architectural target.
- `corpus_integrity.py` — graphiti side of the report becomes semantically empty after Stage 3 shutoff.
- `graphiti_service.py` — works as designed; same story as Stage 2 — not aligned with bespoke direction.
**Stopped / deprecated / experimental:**
- `stage3_worker.py` — service stopped manually; code in repo, last-modified 2026-05-01.
- `consolidator_v0_1.py` — reframe-deprecated.
- `tier1_migration.py` — already-run one-shot, kept as reproducibility artifact.
- All 32 experimental scripts in `scripts/` and `scripts/experiments/`.
- `e3_dreamer_substrate.py` — in-flight per architecture doc, confounded per bespoke decision.
**Removal candidates (do not remove):**
- All `.bak*` files (~20 of them) — git history covers them.
- The 32 experimental scripts could move to `deprecated/` or `experiments/` to clean up `scripts/`.
- `consolidator_v0_1.py` — explicitly deprecated by reframe.
- `tier1_migration.py` — completed migration; kept for reproducibility.
**NREM-shaped divergences (the most important class of finding):**
1. **`dream.py` cumulative exclusion 500-cap.** The `retrieved_sources` list grows across nights and is the exclusion set for Early REM and Late REM. After enough nights it reliably hides ~40% of the corpus. The architecture and reframe specify session-scoped novelty, not corpus-lifetime exclusion. Same shape as the NREM bug: a deduplication mechanism running silently in a way the architecture didn't request.
2. **`api.py /api/corpus/retry` 50KB truncation.** The F14 fix removed truncation from `watcher.py`, `ingest.py`, `corpus_integrity.py`, but the api.py retry path was missed — clicking "Retry" on an ingest-failure still truncates. Working without errors, doing something the architecture explicitly says not to.
---
## Phase 2 — Systemd services
Inventory of every `aaronai*.service` and `aaronai*.timer` in `/etc/systemd/system/`. Status is from `systemctl is-enabled` and `systemctl is-active` taken during this session.
### `aaronai.service`
- **Status:** Working (enabled, active)
- **Unit-file mtime:** 2026-04-24
- **Type / trigger:** `simple`, `Restart=always`, `WantedBy=multi-user.target`. Always-running.
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/api.py`
- **Depends on:** `network.target`
- **What depends on it:** `aaronai-graphiti.service`, `aaronai-stage2.service`, `aaronai-stage3.service`, `aaronai-watcher.service` all `After=` it; `Requires=aaronai.service` on Stage 2 and Stage 3.
- **Behavior matches intent?** Yes. Hosts the FastAPI backend and the embedded APScheduler. The architecture doc lists this as the long-running api.py process hosting nightly cycles.
- **Notes:** No `WatchdogSec`. Restarts on crash. Has been "running since May 01" per the current-state doc.
### `aaronai-graphiti.service`
- **Status:** Working (enabled, active)
- **Unit-file mtime:** 2026-04-27
- **Type / trigger:** `simple`, `Restart=always`, always-running.
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/graphiti_service.py`
- **Depends on:** `aaronai.service` (After=, soft); FalkorDB Docker container at `127.0.0.1:6379`; `.env`.
- **What depends on it:** `aaronai-stage3.service` (Requires=); `dream.py` when `DREAMER_SUBSTRATE=graphiti`; the Stage 3 worker's `recover_wedge` does `sudo systemctl restart aaronai-graphiti.service`.
- **Behavior matches intent?** Yes against architecture doc. Under bespoke decision this is the layer being replaced. Service still runs and the sidecar still answers `/health`.
- **Notes:** The 2026-05-01 v2.1 patches (sudoers entry, error logging) are applied in the worker code that calls this; the service unit itself is unchanged.
### `aaronai-stage2.service`
- **Status:** Working (enabled, active)
- **Unit-file mtime:** 2026-05-01
- **Type / trigger:** `simple`, `Restart=always`, `Requires=aaronai.service`. Always-running worker.
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/stage2_worker.py`
- **Depends on:** `aaronai.service` (Requires=); Ollama on 11434; `.env`.
- **What depends on it:** Stage 3 worker (consumes the queue this fills).
- **Behavior matches intent?** Yes for the worker code. Under the bespoke decision the cascade pipeline this feeds is no longer the architectural target — but the unit is doing what its code says.
- **Notes:** `WatchdogSec` line is commented out (the 2026-05-01 fix). Logs to `/var/log/aaronai/stage2.log`.
### `aaronai-stage3.service`
- **Status:** Stopped (enabled, **inactive**) — manually stopped per the session brief
- **Unit-file mtime:** 2026-05-01
- **Type / trigger:** `simple`, `Restart=always`, `Requires=aaronai.service aaronai-graphiti.service`. Would be always-running if started.
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/stage3_worker.py`
- **Depends on:** `aaronai.service` and `aaronai-graphiti.service` (both Requires=); `.env`; passwordless sudo for `systemctl restart aaronai-graphiti.service`.
- **What depends on it:** Nothing technically requires it; corpus integrity reads `stage_3_queue.completed_at` and would see those numbers stop moving while the worker is off.
- **Behavior matches intent?** **Divergence.** The unit is `enabled` (i.e., will start at next boot) but currently inactive. The bespoke decision parks this work; on reboot the service will start automatically and resume processing `stage_3_queue` rows. Track 1 cleanup should `systemctl disable` it before next reboot — otherwise the manual stop is a soft guarantee that doesn't survive a power cycle.
- **Notes:** `WatchdogSec` line is commented out (the 2026-05-01 fix). Logs to `/var/log/aaronai/stage3.log`. The service file's `Description` still says "Graphiti cascade ingest" — accurate but architecturally stale under bespoke.
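
The enabled-but-inactive divergence described above can be checked mechanically. The following is a minimal sketch, not code from the repository: `unit_state` and `classify` are hypothetical helper names, and the only external behavior assumed is that `systemctl is-enabled` and `systemctl is-active` print a single state word on stdout.

```python
import subprocess


def unit_state(unit: str) -> tuple[str, str]:
    """Return the (is-enabled, is-active) state words for a systemd unit."""
    def query(verb: str) -> str:
        # systemctl exits non-zero for states such as "disabled" or "inactive",
        # so the exit code is ignored and only stdout is read.
        result = subprocess.run(
            ["systemctl", verb, unit], capture_output=True, text=True
        )
        return result.stdout.strip()

    return query("is-enabled"), query("is-active")


def classify(enabled: str, active: str) -> str:
    """Flag the two ways boot-time intent and current state can disagree."""
    if enabled == "enabled" and active != "active":
        # A manual stop that will not survive a reboot.
        return "will-start-on-reboot"
    if enabled != "enabled" and active == "active":
        # Running now, but nothing will bring it back after a reboot.
        return "running-but-not-persistent"
    return "consistent"
```

Given the state recorded for this unit, `classify(*unit_state("aaronai-stage3.service"))` would be expected to return `"will-start-on-reboot"` until the unit is disabled.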
### `aaronai-watcher.service`
- **Status:** Working (enabled, active)
- **Unit-file mtime:** 2026-04-30
- **Type / trigger:** `simple`, `Restart=always`. Always-running.
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/watcher.py`
- **Environment:** `TRANSFORMERS_OFFLINE=1`, `HF_HUB_OFFLINE=1`, `PATH=/home/aaron/aaronai/venv/bin`. Resource caps: `MemoryMax=3G`, `MemorySwapMax=0`.
- **Depends on:** `aaronai.service` (After=); pgvector; SentenceTransformer model files (offline mode means they must already be cached).
- **What depends on it:** Anything that reads pgvector or `stage_2_queue` indirectly depends on this filling them.
- **Behavior matches intent?** Yes. Stage 1 architectural commitment. The 2026-04-30 in-process refactor matches the architecture doc.
- **Notes:** `MemorySwapMax=0` is the post-refactor commitment. Watcher heartbeat at `/home/aaron/aaronai/watcher_heartbeat` is consumed by an external cron monitor (Phase 5 confirms).
### `aaronai-web.service`
- **Status:** Working (enabled, active)
- **Unit-file mtime:** 2026-04-26
- **Type / trigger:** `simple`, `Restart=always`. Always-running.
- **Command:** `/usr/bin/node node_modules/next/dist/bin/next start` from `/home/aaron/aaronai-web` with `NODE_ENV=production` and `PORT=3000`.
- **Depends on:** `network.target`.
- **What depends on it:** nginx reverse-proxies to port 3000 (per architecture doc); Cloudflare-fronted `ai.aaronnelson.studio`.
- **Behavior matches intent?** Yes. Hosts the Next.js frontend per Layer 3 architecture.
- **Notes:** Working directory is `~/aaronai-web/` not `~/projects/aaronai-web/` — production deployment is a separate clone of the repo. This is consistent with the architecture doc's "Local: `~/projects/aaronai-web/`, deployed: `~/aaronai-web/`" line.
### `aaronai-dreamer.service`
- **Status:** Working (oneshot; static)
- **Unit-file mtime:** 2026-04-26
- **Type / trigger:** `Type=oneshot`. Cannot be enabled on its own: the unit has no `[Install]` section, so systemd reports it as `static`. It runs only when started manually or by a timer.
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/dream.py --mode nrem`
- **Depends on:** `network.target`.
- **What depends on it:** The session brief noted this service was used for the manual NREM run on 2026-05-02 21:33-21:34 UTC. APScheduler in `api.py` is the production trigger and invokes `dream.py` directly via `subprocess.run` (not this unit, see Phase 5); the unit is only for manual `systemctl start aaronai-dreamer.service` from the shell.
- **Behavior matches intent?** Partial. The unit exists and is the only systemd-tracked dream entry point. **It still hardcodes `--mode nrem`** as the command, so a manual `systemctl start aaronai-dreamer.service` runs only NREM, not the full pipeline. The architecture says nightly is full pipeline; the production scheduler in api.py runs `dream.py` with no flag (i.e., default pipeline). The unit's `--mode nrem` is therefore an outdated invocation pattern preserved from when individual stages were run by hand.
- **Notes:** Has a paired `aaronai-dreamer.timer` (next entry) that is **not enabled**. APScheduler is the only thing actually triggering nightly dreams.
### `aaronai-dreamer.timer`
- **Status:** Stopped — exists but **not in `timers.target.wants/`**, so not enabled
- **Unit-file mtime:** 2026-04-27
- **Schedule:** `OnCalendar=*-*-* 08:00:00`, `Persistent=true`.
- **Triggers:** `aaronai-dreamer.service`
- **Behavior matches intent?** Divergence — duplicate scheduling. APScheduler in `api.py` drives the actual 08:00 UTC dream run. This timer would do the same thing (with the wrong invocation — `--mode nrem`) if it were enabled. **NREM-shape divergence: a scheduling mechanism present, configured, and inactive — but its presence will confuse a future reader about who triggers the dream.** Track 1 cleanup candidate: remove or disable explicitly.
### `aaronai-index-conversations.service`
- **Status:** Working (oneshot; static)
- **Unit-file mtime:** 2026-04-26
- **Type / trigger:** `Type=oneshot`. Static, no Install section.
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/ingest_conversations.py`
- **Depends on:** `network.target`.
- **What depends on it:** Manually triggerable. APScheduler in `api.py` runs `ingest_conversations.py` directly via `subprocess.run` — not this unit.
- **Behavior matches intent?** Same shape as the dreamer unit: an alternate entry point that exists for manual debugging. Not on a path that fires.
- **Notes:** Logs to `/home/aaron/aaronai/dreamer.log` — same log file as the dreamer service (likely a copy-paste artifact, not a deliberate co-mingling).
### `aaronai-index-conversations.timer`
- **Status:** Stopped — not enabled
- **Unit-file mtime:** 2026-04-26
- **Schedule:** `OnCalendar=*-*-* 02:30:00`, `Persistent=true`.
- **Triggers:** `aaronai-index-conversations.service`
- **Behavior matches intent?** Same divergence pattern as `aaronai-dreamer.timer`. APScheduler in `api.py` is the real driver at 02:30 UTC. This timer is dormant and would silently double-fire the job if enabled.
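
The "not enabled" status of the timers rests on their absence from `timers.target.wants/`. A minimal sketch of that check follows; `timer_is_enabled` is a hypothetical helper, and the directory constant assumes the timers would be enabled system-wide under `/etc/systemd/system/` rather than as user units.

```python
from pathlib import Path

# Where `systemctl enable <name>.timer` places its symlink for system units.
TIMERS_WANTS = Path("/etc/systemd/system/timers.target.wants")


def timer_is_enabled(timer: str, wants_dir: Path = TIMERS_WANTS) -> bool:
    """True if the timer has an entry in the timers.target.wants directory."""
    entry = wants_dir / timer
    # is_symlink() is true even when the link target is missing,
    # so a dangling link still counts as an enabled timer.
    return entry.is_symlink() or entry.exists()
```

Running it over the three `aaronai-*.timer` names gives a quick regression check that none of them has been enabled alongside APScheduler.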
### `aaronai-maintenance.service`
- **Status:** Broken (oneshot; static; **command is unrunnable**)
- **Unit-file mtime:** 2026-04-26
- **Type / trigger:** `Type=oneshot`. Static.
- **Command:** `/home/aaron/aaronai/venv/bin/chops hnsw rebuild --path /home/aaron/aaronai/db --collection aaronai`
- **Depends on:** `chops` binary in venv, ChromaDB at `/home/aaron/aaronai/db/`.
- **What depends on it:** Nothing. `aaronai-maintenance.timer` would trigger it weekly if enabled, but the timer is not enabled.
- **Behavior matches intent?** **No.** This unit is from the ChromaDB era. The architecture doc records the ChromaDB → pgvector migration on 2026-04-26. Verified during this inventory: `chops` is **not present** in `~/aaronai/venv/bin/`, and `~/aaronai/db/` still contains `chroma.sqlite3` and a UUID-named subdirectory but is no longer the active corpus store. **If anyone ever ran `systemctl start aaronai-maintenance.service`, it would fail with command-not-found.**
- **Notes:** Track 1 removal candidate. Both this and its timer are pure dead state; the `~/aaronai/db/` directory is a separate cleanup decision (it holds historical ChromaDB data, possibly recoverable).
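
The "command is unrunnable" finding amounts to resolving the unit's `ExecStart=` binary and testing whether it exists and is executable. The sketch below illustrates that check under stated assumptions: `exec_start_binary` and `is_runnable` are hypothetical helpers, and `EXAMPLE_UNIT` is reconstructed from the Command line above rather than copied from the real unit file.

```python
import os
import shlex
from typing import Optional

# Prefix characters systemd accepts before the executable path in ExecStart=.
EXEC_PREFIXES = "@-:+!"


def exec_start_binary(unit_text: str) -> Optional[str]:
    """Return the executable named by the first ExecStart= line, if any."""
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("ExecStart="):
            command = line[len("ExecStart="):].lstrip(EXEC_PREFIXES)
            parts = shlex.split(command)
            return parts[0] if parts else None
    return None


def is_runnable(path: str) -> bool:
    """True only if the path is an existing file with the execute bit set."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


# Hypothetical unit text, reconstructed from the Command line for illustration.
EXAMPLE_UNIT = """\
[Service]
Type=oneshot
ExecStart=/home/aaron/aaronai/venv/bin/chops hnsw rebuild --path /home/aaron/aaronai/db --collection aaronai
"""
```

On the host described here, `is_runnable(exec_start_binary(EXAMPLE_UNIT))` would be expected to return `False`, since `chops` is no longer in the venv.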
### `aaronai-maintenance.timer`
- **Status:** Stopped — not enabled
- **Unit-file mtime:** 2026-04-26
- **Schedule:** `OnCalendar=Sun *-*-* 04:00:00`, `Persistent=true`.
- **Triggers:** `aaronai-maintenance.service` (broken).
- **Behavior matches intent?** No — points at a broken service.
- **Notes:** Track 1 removal candidate.
---
### Phase 2 summary
**Working and matching intent:**
- `aaronai.service`
- `aaronai-graphiti.service` (matches the existing-architecture intent; bespoke decision will replace the layer it serves)
- `aaronai-stage2.service` (same caveat)
- `aaronai-watcher.service`
- `aaronai-web.service`
**Working with behavior-vs-intent divergences:**
- `aaronai-dreamer.service` — hardcodes `--mode nrem`; production trigger is APScheduler running default pipeline. The systemd entry-point and the production entry-point disagree about what "dream" means.
**Stopped / broken:**
- `aaronai-stage3.service` — manually stopped 2026-05-02; **still `enabled` so will autostart on next reboot**.
- `aaronai-dreamer.timer`, `aaronai-index-conversations.timer` — not enabled; redundant with APScheduler.
- `aaronai-maintenance.service` and `aaronai-maintenance.timer` — broken (`chops` not installed); ChromaDB-era leftover.
- `aaronai-index-conversations.service` — static, harmless oneshot wrapper.
**Removal candidates (do not remove):**
- `aaronai-maintenance.service` and `.timer`
- `aaronai-dreamer.timer`, `aaronai-index-conversations.timer` (or, alternatively, disable APScheduler and use the timers — the duplication is the problem, not the choice)
- `aaronai-stage3.service` should be `disabled` even if not removed, so the manual-stop survives a reboot.
**NREM-shaped divergences in Phase 2:**
1. **`aaronai-stage3.service` is `enabled` but `inactive`.** Manual stop does not survive reboot; on next reboot the worker resumes against `stage_3_queue`, which is being filled by Stage 2. Same shape as the NREM bug: the operationally-stopped state is paper-thin. The architecture's stated "service stopped" intent is undermined by a `systemctl is-enabled` value nobody changed.
2. **`aaronai-maintenance.service` against ChromaDB.** The service is configured, would attempt to run if its (disabled) timer fired, and would fail. The architectural intent (ChromaDB retired) and the systemd state (unit still installed, reported as `static`) are out of sync. The disabled timer is the only thing preventing a scheduled run.
3. **Double-scheduled triggers.** APScheduler in `api.py` plus the dreamer and index-conversations timer files make two competing schedulers configured for the same nightly work. Only APScheduler is firing; the timers are dormant. This is exactly the mechanism-still-present-but-not-architecturally-intended pattern.
---
---
## Phase 3 — Database tables
PostgreSQL `aaronai` database, `public` schema. Five tables. Connected via `PG_DSN` from `.env` (value not echoed in this document). All queries were read-only: `SELECT` statements and `\d`-style introspection. Counts were taken during this session.
### `embeddings`
- **Status:** Working (the production retrieval substrate)
- **Columns:**
  - `id text NOT NULL` (PK)
  - `document text NOT NULL` (chunk content)
  - `embedding USER-DEFINED` (pgvector `vector(384)`)
  - `source text` (filename/conversation title)
  - `type text` (document / chatgpt_conversation / claude_conversation / aaronai_conversation / claude_memory / NULL)
  - `created_at text` (string-typed, not timestamptz; many rows NULL)
  - `metadata jsonb`
- **Indexes:**
  - `embeddings_pkey` btree on `id`
  - `embeddings_vector_idx` HNSW (m=16, ef_construction=64, vector_cosine_ops)
  - `embeddings_source_idx` btree on `source`
- **Row count:** 13,874
- **Distinct sources:** 1,236
- **Type distribution:** `document` 1,368 | `chatgpt_conversation` 1,548 | `claude_conversation` 1,074 | `aaronai_conversation` 68 | `claude_memory` 1 | NULL 9,815
- **Writes:** `watcher.py:ingest_file()`, `ingest.py:ingest_file()`, `ingest_conversations.py:run()`. `corpus_integrity.py:queue_for_retry()` writes to `stage_2_queue`, not to this table; chunks land here only on the normal ingest path.
- **Reads:** `api.py:retrieve_context()`, `dream.py:retrieve()` (pgvector branch), `corpus_integrity.py`, `tier1_migration.py:fetch_tier1_sources()`, several experiment scripts
- **Behavior matches intent?** Partial. **9,815 of 13,874 rows have `type IS NULL` (~71%)** — this is unexpected given the architecture doc's commitment to typing every chunk. Looking at the code, `watcher.py:ingest_file()` writes `type='document'` and `ingest_conversations.py` writes `'aaronai_conversation'`. The 9,815 NULLs are likely artifacts of older ingest runs or `ingest_chatgpt.py`/`ingest_claude.py` (referenced in the architecture doc but not present in `scripts/` — possibly run as one-shots from an earlier point and deleted). **Additionally, `created_at` is stored as `text` rather than `timestamptz`**, and 12,109 rows have it NULL. Both are NREM-shape divergences: data fields the architecture treats as load-bearing for "temporal awareness" exist in the schema but are mostly empty or mistyped.
- **Notes:** HNSW index parameters match the doc. The vector dimension is 384 (matches `all-MiniLM-L6-v2`).
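
The percentages quoted for this table follow directly from the reported counts. The sketch below reproduces the arithmetic; the counts are the ones listed above, while `NULL_COUNT_SQL` is a hypothetical read-only query that could regenerate them and is not run here.

```python
# Counts as reported in this section.
ROW_COUNT = 13_874
TYPE_COUNTS = {
    "document": 1_368,
    "chatgpt_conversation": 1_548,
    "claude_conversation": 1_074,
    "aaronai_conversation": 68,
    "claude_memory": 1,
    None: 9_815,  # rows with type IS NULL
}
CREATED_AT_NULL = 12_109

# Hypothetical read-only query that would reproduce the NULL counts (not run here).
NULL_COUNT_SQL = """
SELECT count(*) FILTER (WHERE type IS NULL)       AS type_null,
       count(*) FILTER (WHERE created_at IS NULL) AS created_at_null,
       count(*)                                   AS total
FROM embeddings;
"""


def null_share(null_rows: int, total_rows: int) -> float:
    """Percentage of rows that are NULL, rounded to one decimal place."""
    return round(100 * null_rows / total_rows, 1)
```

The type distribution sums to the full row count, and the two shares come out at 70.7% (`type`) and 87.3% (`created_at`), which the summary rounds to 71% and 87%.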
### `stage_2_queue`
- **Status:** Working (active queue feeding stage2_worker)
- **Columns:**
  - `id integer NOT NULL` (PK, sequence)
  - `source text NOT NULL UNIQUE`
  - `full_text text NOT NULL` (no longer truncated post-F14)
  - `char_length integer NOT NULL`
  - `enqueued_at timestamptz NOT NULL default NOW()`
  - `started_at`, `completed_at`, `failed_at` timestamptz nullable
  - `failure_reason text`
  - `attempts integer NOT NULL default 0`
- **Indexes:** PK + unique on `source`.
- **Row count:** 48 (25 completed, 21 failed, 2 pending)
- **Failure breakdown:**
  - `park_pending_phase_2_reframe`: 19 rows (manually-marked, the parked meta-documents per the reframe)
  - `mistral_timeout_after_300s`: 2 rows
- **Last enqueued:** 2026-05-02 22:22 UTC
- **Last completed:** 2026-05-02 22:33 UTC
- **Writes:** `watcher.py:enqueue_stage2()`, `ingest.py:enqueue_stage2()`, `corpus_integrity.py:queue_for_retry()`, `api.py:/api/corpus/retry`, `stage2_worker.py` (updates state)
- **Reads:** `stage2_worker.py:run()`
- **Behavior matches intent?** Yes. The queue is doing what it was redesigned to do post-F14. The 19 manually-parked rows match the reframe doc's mention of parked meta-documents.
- **Notes:** **The watcher is still actively enqueuing rows at 2026-05-02 22:22 — meaning Stage 2 is still consuming the queue and feeding Stage 3.** This is fine architecturally for now, but worth flagging given Stage 3 is stopped (Phase 2). See Phase 3 summary divergence #1.
### `stage_3_queue`
- **Status:** Working-degraded
- **Columns (base):**
  - `id integer NOT NULL` (PK, sequence)
  - `source text NOT NULL UNIQUE`
  - `full_text text NOT NULL`
  - `orientation text NOT NULL`
  - `stage2_metadata jsonb`
  - `enqueued_at timestamptz NOT NULL default NOW()`
  - `started_at`, `completed_at`, `failed_at` timestamptz nullable
  - `failure_reason text`
  - `attempts integer NOT NULL default 0`
- **Columns (rolled-back-migration leftovers, all unused by current code):**
  - `state_type text` (added by `30beeb3`, unused)
  - `state_type_confidence text` (unused)
  - `supersedes_prior_state boolean` (unused)
  - `state_type_rationale text` (unused)
  - `external_job_id uuid` (added by `a0bf280`, unused)
- **Indexes:**
  - `stage_3_queue_pkey`
  - `stage_3_queue_source_key` (unique on source)
  - `stage_3_queue_supersedes_idx` btree on `supersedes_prior_state` (unused)
  - `idx_stage_3_queue_external_job` partial btree on `external_job_id` where not-null and not-completed/failed (unused)
- **Row count:** 19 (11 completed, 3 failed, 6 pending). 1 row has `state_type` populated (the smoke-test); 0 have `external_job_id`.
- **Failure breakdown:**
  - 2 × `HTTPConnectionPool(host='localhost', port=8001): Read timed out. (read timeout=600)` (the May-1 incident period)
  - 1 × `Bulk path against new content unpatched; deferred until search_utils.py sites 4-9 are patched` (rolled-back work artifact)
- **Last enqueued:** 2026-05-02 22:33 UTC (Stage 2 just enqueued a row).
- **Writes:** `stage2_worker.py:enqueue_stage3()`, `stage3_worker.py` (state updates).
- **Reads:** `stage3_worker.py:run()`, `corpus_integrity.py:get_graphiti_sources()`, `api.py:get_corpus_status_data()`.
- **Behavior matches intent?** **Partial / multiple divergences.**
  - 5 columns and 2 indexes from rolled-back migrations remain. Inert under current code, but they are visible to anyone reading the schema and will mislead. The current-state doc said `idx_stage_3_queue_supersedes` "may also still exist". Confirmed: it does, **plus** `idx_stage_3_queue_external_job`, which the current-state doc didn't mention.
  - The queue is filling without a consumer. Stage 3 worker is stopped (Phase 2); Stage 2 worker is enqueuing. As of 22:33 UTC there are 6 pending rows.
- **Notes:** Cleanup SQL is in the current-state doc. Track 1 candidate for removal (low priority — no harm in leaving).
### `graphiti_jobs`
- **Status:** Working-degraded (rolled-back-code artifact)
- **Columns:**
  - `job_id uuid NOT NULL` (PK)
  - `job_type text NOT NULL`
  - `payload jsonb NOT NULL`
  - `status text NOT NULL default 'queued'`
  - `enqueued_at timestamptz NOT NULL default NOW()`
  - `started_at`, `finished_at` timestamptz nullable
  - `error text`
  - `summary jsonb`
  - `submitted_by text`
- **Indexes:**
  - `graphiti_jobs_pkey`
  - `idx_graphiti_jobs_queued` partial btree on `enqueued_at` where status='queued'
  - `idx_graphiti_jobs_status` btree on `status`
- **Row count:** **9 (NOT empty)** — 6 failed, 3 committed.
- **Activity window:** All 9 jobs from 2026-05-02 02:26 UTC to 2026-05-02 05:50 UTC — last night's experimental run, before the rollback. Mix of `single` and `bulk` job types.
- **Writes:** None in current code. The Pattern 1 async-job consumer/producer was rolled back.
- **Reads:** None in current code.
- **Behavior matches intent?** **No.** The current-state doc said this table "exists, empty (or near-empty)". It is not empty — 9 jobs from the May-2 experimental run remain. They are inert (nothing reads or writes the table now), but the documented state and the actual state disagree. Drop the table per the current-state doc's cleanup SQL.
- **Notes:** Two of the 6 failures have `started_at IS NULL` and a non-null `finished_at` — those are jobs that were marked failed without ever being claimed by a worker. Pattern in the rolled-back code. Of historical interest only.
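
The "failed without ever being claimed" pattern is a simple predicate over `status`, `started_at`, and `finished_at`. The sketch below demonstrates it against an in-memory SQLite stand-in, since the inventory itself ran read-only against PostgreSQL. The three rows and their timestamps are hypothetical, only the columns the predicate needs are modelled, and the `WHERE` clause is ordinary SQL that PostgreSQL would accept unchanged.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for graphiti_jobs with three hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE graphiti_jobs ("
        " job_id TEXT PRIMARY KEY, status TEXT NOT NULL,"
        " started_at TEXT, finished_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO graphiti_jobs VALUES (?, ?, ?, ?)",
        [
            ("job-a", "committed", "2026-05-02T02:26:00Z", "2026-05-02T02:27:00Z"),
            ("job-b", "failed", None, "2026-05-02T03:10:00Z"),  # never claimed
            ("job-c", "failed", "2026-05-02T04:00:00Z", "2026-05-02T04:10:00Z"),
        ],
    )
    return conn


def unclaimed_failures(conn: sqlite3.Connection) -> list[str]:
    """Job IDs marked failed with a finish time but no start time."""
    rows = conn.execute(
        "SELECT job_id FROM graphiti_jobs"
        " WHERE status = 'failed'"
        " AND started_at IS NULL AND finished_at IS NOT NULL"
        " ORDER BY job_id"
    )
    return [job_id for (job_id,) in rows]
```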
### `ingest_failures`
- **Status:** Working
- **Columns:**
  - `id integer NOT NULL` (PK, sequence)
  - `source text NOT NULL UNIQUE`
  - `filepath text NOT NULL`
  - `error text NOT NULL`
  - `retry_count integer NOT NULL default 0`
  - `first_failed_at`, `last_failed_at` timestamptz default NOW()
  - `resolved boolean NOT NULL default false`
  - `category text NOT NULL default 'transient'`
- **Indexes:** PK + unique on `source`.
- **Row count:** 129 (all `category='unreadable'`, all `resolved=false`)
- **Writes:** `watcher.py:record_ingest_failure()`, `corpus_integrity.py` (auto-queue path), `api.py:/api/corpus/retry`
- **Reads:** `api.py:get_corpus_status_data()`, `corpus_integrity.py:get_ingest_failures()`
- **Behavior matches intent?** Yes. Matches the architecture's "ingest_failures table for UI visibility" tech-debt-resolved entry. The 129 unreadable files match the 129 figure cited in the architecture doc — these are scanned/encrypted/corrupt PDFs awaiting OCR (priority 21b).
- **Notes:** The `category` field has only one observed value (`'unreadable'`); `'transient'` is the default but no rows currently carry it. Consistent with the architecture: only persistent failures (after watcher retry) make it here.
---
### Phase 3 summary
**Working and matching intent:**
- `ingest_failures` (129 unreadable, awaiting OCR, all matches doc)
- `stage_2_queue` (functioning queue, post-F14)
**Working with behavior-vs-intent divergences:**
- `embeddings` — 71% of rows have `type IS NULL`; 87% have `created_at IS NULL`; `created_at` is `text`-typed not timestamptz. The temporal-awareness commitment in the architecture is largely unsupported by the data actually in the table.
- `stage_3_queue` — five rolled-back-migration columns and two unused indexes remain; queue is being filled by Stage 2 with no consumer running.
**Broken / rolled-back:**
- `graphiti_jobs` — 9 rows from the rolled-back experimental work; current-state doc says "empty"; reality says otherwise. No current code touches it.
**Removal candidates (do not remove):**
- `stage_3_queue` columns: `state_type`, `state_type_confidence`, `supersedes_prior_state`, `state_type_rationale`, `external_job_id` and the two related indexes.
- `graphiti_jobs` table entirely.
- `embeddings.created_at` — under bespoke, the new substrate's temporal model replaces this; the column probably gets dropped in the bespoke build.
**NREM-shaped divergences in Phase 3:**
1. **Stage 2 still enqueues to Stage 3 while Stage 3 is stopped.** Pending count grows over time. There is no architectural-level decision to do this; it's a consequence of leaving Stage 2 running while turning off its consumer. The pending rows are inert until a consumer attaches, but the design says one queue stage feeds the next — and the consumer is gone. Same shape: a pipeline working "without errors" and producing state nobody is consuming.
2. **`embeddings.type` is NULL for 71% of rows.** The architecture treats `type` as a load-bearing field for distinguishing document vs conversation chunks at retrieval time. In production, more than two-thirds of chunks lack the field. Retrieval still works because nothing routes on `type`. The mechanism is in place, doing nothing visible, and the absence is invisible to anyone not querying the schema directly.
3. **`embeddings.created_at` is `text`-typed and 87% NULL.** Same shape: the doc treats temporal awareness as architectural; the data shape doesn't support time-based queries even where the column exists.
4. **`graphiti_jobs` documented as empty, actually has 9 rows.** Current-state doc explicitly anticipates the wrong state. Verifying the doc against the database surfaced this.
---
---
## Phase 4 — Configuration
### `~/aaronai/.env`
Eight keys present. **Values redacted in this document; only key name, length, and shape are reported.**
| Key | Length | Shape | Used by | Still referenced? |
|---|---|---|---|---|
| `ANTHROPIC_API_KEY` | 108 | opaque | `api.py` (Anthropic client), `dream.py:_call_claude`, `graphiti_service.py` (as fallback when `LLM_API_KEY` unset), several experiment scripts | Yes |
| `AARON_AI_PASSWORD` | 16 | opaque | `api.py:/auth/login` | Yes |
| `NEXTCLOUD_URL` | 36 | uri | `api.py` capture endpoint, `dream.py:deliver` | Yes |
| `NEXTCLOUD_USER` | 5 | opaque | Same as above | Yes |
| `NEXTCLOUD_PASSWORD` | 29 | opaque | Same — WebDAV app password | Yes |
| `PG_DSN` | 71 | opaque (postgres connection string) | Every Postgres-touching script (`api.py`, `dream.py`, `watcher.py`, `ingest.py`, `ingest_conversations.py`, both workers, `corpus_integrity.py`, `tier1_migration.py`, all experiment scripts) | Yes |
| `LLM_PROVIDER` | 9 | opaque (matches `"anthropic"`) | `graphiti_service.py:get_llm_client` | Yes (graphiti only) |
| `LLM_MODEL` | 25 | opaque (matches `"claude-sonnet-4-6"` length) | `graphiti_service.py` | Yes (graphiti only) |
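
The name/length/shape convention used in the table above can be produced without ever printing a value. The following is a minimal sketch of such a reporter; `describe_secret` is a hypothetical helper, and treating only `http`/`https` values as `uri` is an assumption chosen to match how the table labels `PG_DSN` as opaque.

```python
from urllib.parse import urlparse


def describe_secret(name: str, value: str) -> tuple[str, int, str]:
    """Report a key's name, value length, and coarse shape; never the value."""
    parsed = urlparse(value)
    # Assumption: only web URLs count as "uri"; everything else is "opaque".
    is_web_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    return name, len(value), "uri" if is_web_url else "opaque"
```

For example, `describe_secret("EXAMPLE_KEY", "abc123")` returns `("EXAMPLE_KEY", 6, "opaque")`.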
**Variables documented in the architecture doc but NOT present in `.env`:**
- `LLM_API_KEY` — architecture doc table lists it. `graphiti_service.py` reads `LLM_API_KEY` first, falls back to `ANTHROPIC_API_KEY`. Current behavior depends on the fallback. Architecturally fine, but the "user brings their own key" LLM-agnostic framing (architecture doc Section 5) is achieved by a fallback rather than an explicit key. Track 1 candidate: either set `LLM_API_KEY` explicitly or remove the unused fallback path from the doc.
- `FALKORDB_HOST`, `FALKORDB_PORT`, `GRAPHITI_GROUP_ID` — referenced in `graphiti_service.py` with defaults (`localhost`, `6379`, `aaron`). Defaults are correct for current deployment; absence from `.env` is fine. Worth flagging only because the architecture doc lists `group_id="aaron"` as a single-tenant assumption (F26).
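
The `LLM_API_KEY` fallback described above can be sketched as a lookup that also reports which variable supplied the key. This illustrates the described behavior only: the actual code in `graphiti_service.py` is not shown in this inventory, and `resolve_llm_api_key` is a hypothetical name.

```python
from typing import Mapping, Optional


def resolve_llm_api_key(env: Mapping[str, str]) -> tuple[Optional[str], str]:
    """Return (key, source_variable), preferring LLM_API_KEY over the fallback."""
    for name in ("LLM_API_KEY", "ANTHROPIC_API_KEY"):
        value = env.get(name)
        if value:  # treat an empty string the same as unset
            return value, name
    return None, "unset"
```

Calling it with `os.environ` and logging only the second element would make the fallback visible at startup without exposing the key; with the current `.env`, that element would be `"ANTHROPIC_API_KEY"`.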
**Variables loaded but worth flagging:**
- All Postgres-touching scripts call `load_dotenv(Path.home() / "aaronai" / ".env")`, some with an added `override=True` argument and some without. The difference is harmless but inconsistent.
**Behavior matches intent?** Partial. The `.env` file works; the documented LLM-agnostic story is a fallback story, not an enforced one. Permissions are `chmod 600` per the architecture commitment (file mode confirmed in earlier pass).
### `~/aaronai/settings.json`
Active contents:
```json
{
  "theme": "light",
  "font_size": "medium",
  "web_search": true,
  "show_sources": true
}
```
`api.py:DEFAULT_SETTINGS` (line 46) defines a wider key set:
```python
{
    "theme": "light",
    "font_size": "medium",
    "web_search": True,
    "show_sources": True,
    "dream_hour_utc": 8,
    "dream_minute_utc": 0,
    "dream_mode": "nrem",
    "ingest_hour_utc": 2,
    "ingest_minute_utc": 30,
    "share_time": True,
}
```
`load_settings()` merges file over defaults; `save_settings()` writes whatever it is given. The file currently holds only the four UI-tunable keys. The other six are loaded from defaults.
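
The "file over defaults" merge can be illustrated with a plain dictionary merge. This is a sketch of the described behavior, not the body of `load_settings()` in `api.py`, which is not reproduced in this inventory; `merge_settings` and `FILE_SETTINGS` are hypothetical names.

```python
DEFAULT_SETTINGS = {
    "theme": "light",
    "font_size": "medium",
    "web_search": True,
    "show_sources": True,
    "dream_hour_utc": 8,
    "dream_minute_utc": 0,
    "dream_mode": "nrem",
    "ingest_hour_utc": 2,
    "ingest_minute_utc": 30,
    "share_time": True,
}

# The four keys currently stored in settings.json.
FILE_SETTINGS = {
    "theme": "light",
    "font_size": "medium",
    "web_search": True,
    "show_sources": True,
}


def merge_settings(stored: dict) -> dict:
    """File values win; any key the file omits falls back to its default."""
    return {**DEFAULT_SETTINGS, **stored}
```

`merge_settings(FILE_SETTINGS)` yields all ten keys, six of them taken from the defaults, which is why `dream_mode` appears in the effective settings even though the file never mentions it.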
**What is referenced by current code:**
- `theme`, `font_size` — frontend only (Phase 6)
- `web_search` → `api.py:chat()` (line 307): toggles the web_search tool block
- `show_sources` → `api.py:/api/chat` (line 521): gates whether sources are returned in the chat response
- `dream_hour_utc`, `dream_minute_utc` → `api.py:reschedule_jobs()` (line 1149)
- `ingest_hour_utc`, `ingest_minute_utc` → `api.py:reschedule_jobs()` (line 1159)
- `dream_mode` — present in defaults; **not read anywhere in `api.py` or `dream.py`**. Searching the codebase: `dream_mode` appears only in `DEFAULT_SETTINGS` and the `schedule_keys` set in `update_settings`; `run_dream_job` always invokes `dream.py` with no flag (full pipeline). The setting is dead from the scheduler's perspective — it may be read by the frontend SettingsPanel for the default value of the on-demand "Dream Now" mode dropdown (Phase 6).
- `share_time` → **frontend-controlled UI flag, backend stores-and-returns.** The backend persists it via `/api/settings` but does not act on its value. Frontend reads it at `MessageInput.tsx:58` and `SettingsPanel.tsx:205` (both with `?? true` fallback) and writes it back through the SettingsPanel toggle. The flag gates whether `client_time` is included in the `/api/chat` request payload (`lib/api.ts:51-57`); when off, the request omits the key and the backend's unconditional prompt-side insertion at `chat()` line 293 has nothing to insert. *Verified by cross-repo grep 2026-05-02: the original "frontend-only or dead" / "removal candidate" framing was wrong; this is a working persistence pattern, structurally distinct from `dream_mode`.*
**Behavior matches intent?** Partial — but the two suspect keys behave very differently and should not be lumped together. **`dream_mode` is a NREM-shape divergence:** it reads as a configurable scheduling parameter (declared in `DEFAULT_SETTINGS`, listed in `schedule_keys` for the reschedule trigger), but `run_dream_job` ignores it. A future maintainer flipping the value expects different nightly behavior and gets none. **`share_time`, in contrast, is a backend-stores-and-returns persistence pattern** — the backend correctly persists a frontend-owned flag and the frontend acts on it (with a `?? true` fallback if the key is missing). The distinction matters: removing a silently-ignored key removes dead code, while removing a stores-and-returns key changes the seed default for new users. *Verification finding 2026-05-02 (cross-repo grep against `~/aaronai-web`).*
---
### Phase 4 summary
**Working and matching intent:**
- All eight `.env` keys are referenced by code.
- The four-key `settings.json` reflects the UI-tunable preferences.
**Working with behavior-vs-intent divergences:**
- `LLM_API_KEY` documented but not set; relies on `ANTHROPIC_API_KEY` fallback.
- `dream_mode` exists in defaults but isn't read by the scheduler.
**Removal candidates (do not remove):**
- `dream_mode` — clarify in code or remove from defaults. *(`share_time` was previously listed here in error; cross-repo grep 2026-05-02 confirmed it is a working frontend-controlled flag, not a removal candidate.)*
**NREM-shaped divergences in Phase 4:**
1. **`dream_mode` setting silently ignored.** A scheduler-shaped knob that exists, has a default, is mergeable from settings.json, and is not used. Future maintainer flipping it expects different nightly behavior; gets none.
---
---
## Phase 5 — Cron and scheduled work
### User crontab (`crontab -l`)
Two active entries:
| Schedule | Command | What it does |
|---|---|---|
| `0 3 * * *` (daily 03:00 UTC) | `/bin/bash /home/aaron/aaronai/scripts/backup.sh` | Snapshots `memory.md`, `settings.json`, `conversations.db` into `Nextcloud/Admin/Backups/`. 7-day retention. |
| `*/5 * * * *` (every 5 min) | `test $(( $(date +%s) - $(cat /home/aaron/aaronai/watcher_heartbeat 2>/dev/null || echo 0) )) -gt 600 && sudo systemctl restart aaronai-watcher >> /var/log/aaronai/watcher-cron.log 2>&1` | Heartbeat watchdog. Restarts the watcher service if the heartbeat file is older than 600 seconds. |
**Behavior matches intent?** Yes. The watcher heartbeat watchdog corresponds to the architecture-doc tech-debt entry "Heartbeat file written every 5s … cron job restarts watcher if heartbeat older than 10 minutes." The 600s threshold matches the doc's "10 minutes" figure. `backup.sh` is on the documented daily schedule.
**Notes:** The watcher-restart entry uses passwordless `sudo` for `systemctl restart aaronai-watcher`. This is **not** in `/etc/sudoers.d/aaron-aaronai` (which the session brief lists as containing `restart ollama` and `restart aaronai-graphiti.service`). Either it's in `/etc/sudoers` proper (the original `aaronai-web` line area), or the cron entry is silently failing on every fire. Worth verifying — the cron line redirects stderr to the log, so a `sudo: password required` would be in `watcher-cron.log` (which I haven't read here).
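
The staleness test in the heartbeat crontab entry can be restated in Python to make its edge cases explicit. This is a sketch under two assumptions drawn from the cron line: the heartbeat file holds a Unix epoch timestamp in seconds, and a missing file reads as `0` (the `|| echo 0` fallback). `read_heartbeat` and `is_stale` are hypothetical names.

```python
from pathlib import Path

THRESHOLD_SECONDS = 600  # matches the `-gt 600` test in the cron entry


def read_heartbeat(heartbeat_file: Path) -> int:
    """Return the last heartbeat as epoch seconds, or 0 if it cannot be read."""
    try:
        return int(float(heartbeat_file.read_text().strip()))
    except (OSError, ValueError):
        # A missing file mirrors `|| echo 0`. Treating unparsable content
        # as 0 as well is a choice of this sketch, not of the cron line.
        return 0


def is_stale(last_beat: int, now: int, threshold: int = THRESHOLD_SECONDS) -> bool:
    """True when the heartbeat is strictly older than the threshold."""
    return now - last_beat > threshold
```

Because the comparison is strict, a heartbeat exactly 600 seconds old does not trigger a restart, and a missing heartbeat file always does.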
### `/etc/cron.d/`
Stock OS files only: `certbot`, `e2scrub_all`, `sysstat`, plus the standard `cron.daily`/`cron.weekly`/`cron.hourly` directories with default Ubuntu cron jobs (`apport`, `apt-compat`, `dpkg`, `logrotate`, `man-db`, `sysstat`). **No aaronai-specific entries** in `/etc/cron.d/` or anywhere outside the user crontab.
`/etc/anacrontab` is not present.
Root crontab not inspected (sudo required; not granted in this read-only inventory pass).
### APScheduler jobs in `api.py`
`api.py:reschedule_jobs()` (line 1137) configures two jobs against an in-process `BackgroundScheduler`. The scheduler starts in the FastAPI lifespan; jobs are re-registered any time settings that contain a schedule key are updated.
| Job ID | Trigger | Function | What it does |
|---|---|---|---|
| `dream_job` | Cron, `hour=settings.dream_hour_utc`, `minute=settings.dream_minute_utc`, `tz=UTC` (default 08:00) | `run_dream_job` (line 1107) | `subprocess.run([PYTHON, dream.py], timeout=600)` — invokes the dreamer with no arguments → defaults to full pipeline (NREM → Early REM → Late REM → Synthesis). |
| `ingest_job` | Cron, `hour=settings.ingest_hour_utc`, `minute=settings.ingest_minute_utc`, `tz=UTC` (default 02:30) | `run_ingest_job` (line 1123) | `subprocess.run([PYTHON, ingest_conversations.py], timeout=300)`. |
Both `max_instances=1`, both `replace_existing=True`. Settings changes that touch the schedule keys re-register the jobs.
**Behavior matches intent?** Mostly yes. The architecture's "Nightly Schedule" section says 02:30 UTC for conversation indexing and 08:00 UTC for the dream pipeline; both match. **One divergence:** `run_dream_job` uses `subprocess.run` (synchronous, with a 600s timeout). For a normal full-pipeline run this is enough, but Phase 5 of the reframe / E6 work would want longer runs — this is a soft cap nobody has hit yet. Architecture doc doesn't specify; flagging in case future longer runs need a bump.
**Notes:** The 600s `subprocess.run` timeout is the only thing protecting the FastAPI process from a stuck dreamer. If the dreamer hangs (e.g., Anthropic API stall), the scheduler thread holds for 10 minutes before the timeout fires. Acceptable but worth knowing.
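
The timeout behavior described above is standard `subprocess.run` semantics: when the timeout expires, the child is killed and `subprocess.TimeoutExpired` is raised in the caller. The sketch below shows the pattern with a short timeout; `run_with_timeout` and `DEMO_SLEEPER` are hypothetical, and how `run_dream_job` actually handles the exception is not shown in this inventory.

```python
import subprocess
import sys

# A stand-in child process that sleeps for longer than the demo timeout.
DEMO_SLEEPER = [sys.executable, "-c", "import time; time.sleep(5)"]


def run_with_timeout(argv: list[str], timeout_s: float) -> str:
    """Run a child process and report 'ok', 'exit N', or 'timeout'."""
    try:
        completed = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no process is left behind once control reaches this handler.
        return "timeout"
    return "ok" if completed.returncode == 0 else f"exit {completed.returncode}"
```

`run_with_timeout(DEMO_SLEEPER, 0.5)` returns `"timeout"` after about half a second; the calling thread is blocked for that whole interval, which is the same trade-off the 600s cap makes for the scheduler thread.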
### Systemd timers
Already documented in Phase 2: three timer files exist (`aaronai-dreamer.timer`, `aaronai-index-conversations.timer`, `aaronai-maintenance.timer`), **none of them enabled** (none in `/etc/systemd/system/timers.target.wants/`). The dreamer and index-conversations timers duplicate APScheduler jobs; the maintenance timer points at a broken service. APScheduler is the actual driver for the two paths the dreamer/ingest timers would cover.
### What is *not* scheduled
The architecture and reframe documents reference several mechanisms that have no scheduled runner today:
- **Asynchronous dreamer pruning pass** (per reframe). Designed but unimplemented; no schedule.
- **Consolidator 0.1 alias resolution.** The script exists, has no schedule, was always run by hand. Track 1 will dissolve it.
- **`corpus_integrity.py` reconciliation.** Designed to be runnable on demand or via the SettingsPanel. No automated weekly run; the 129 unreadable files have been sitting at zero `retry_count` because the OCR work (priority 21b) hasn't shipped.
- **`tier1_migration.py`** has no schedule (one-shot, already complete).
---
### Phase 5 summary
**Working and matching intent:**
- User crontab (backup + watcher heartbeat watchdog).
- APScheduler jobs (dream + ingest_conversations) match the architecture doc's nightly schedule.
**Working with behavior-vs-intent divergences:**
- The watcher-restart cron uses `sudo systemctl restart aaronai-watcher`, but the only sudoers entry for aaron is for ollama and aaronai-graphiti. The line either depends on a sudoers entry not documented in the session brief, or fails silently. **Worth verifying as part of Track 1.**
- `dream_job` uses 600s `subprocess.run` timeout — soft cap nobody has hit, but tightens the operational envelope for any future longer-running dream work.
**Stopped / dormant:**
- All three `aaronai-*.timer` units (Phase 2). They are configured, not enabled, and overlap APScheduler.
**Removal candidates (do not remove):**
- The three `aaronai-*.timer` files.
**NREM-shaped divergences in Phase 5:**
1. **Watcher-restart sudo path.** The cron entry was probably added on the assumption that `aaron` had broad NOPASSWD sudo for systemctl, which the 2026-05-01 sudoers fix narrowed to specific commands. If the `aaronai-watcher` restart isn't in sudoers, the watchdog has been silently failing. Whether or not it has, this is the same shape: a recovery mechanism configured, configured to look like it works, possibly not working. The session brief and the architecture doc didn't cross-check it.
2. **Two parallel scheduling stacks.** APScheduler in api.py drives nightly work; three systemd `.timer` files exist but are not enabled. The duplication makes "what triggers a dream" harder to answer than it should be.
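One way to settle the first divergence is to ask sudo's policy directly instead of waiting for the watchdog to fire. The sketch below is an illustration under assumptions: `sudo_permits` is a hypothetical helper, and it relies on `sudo -l <command>` exiting non-zero when the command is not allowed and on `-n` making sudo fail instead of prompting for a password.

```python
import subprocess


def sudo_permits(command, runner=subprocess.run):
    """Return True if `sudo -n -l <command>` exits with status 0.

    `sudo -l <command>` checks the policy without running the command;
    `-n` turns a password prompt into a failure, which is what an
    unattended cron job would experience.
    """
    try:
        result = runner(["sudo", "-n", "-l", *command], capture_output=True, text=True)
    except FileNotFoundError:
        return False  # sudo itself is not installed
    return result.returncode == 0
```

Run as the same user that owns the crontab, `sudo_permits(["systemctl", "restart", "aaronai-watcher"])` returning `False` would indicate that the cron line cannot succeed as written.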
---
## Phase 6 — Frontend routes
Next.js app router under `~/aaronai-web/app/`. Three user-facing routes plus a catch-all API proxy.
| Route | File | Auth | What it does | Backend support? |
|---|---|---|---|---|
| `/` | `app/page.tsx` | Required (cookie redirect to `/login`) | Main chat UI, sidebar, settings panel, dreamer status, corpus integrity status. | Yes — every backed `/api/*` endpoint is proxied through the catch-all. |
| `/login` | `app/login/page.tsx` | None | Password login, sets `aaronai_session` cookie. | Yes — `POST /auth/login`. |
| `/capture` | `app/capture/page.tsx` | None (mobile field-recorder, public) | Voice + image capture, posts to `/api/capture`. SSE listener on `/api/captures/events`. | Yes. |
| `/api/[...slug]` | `app/api/[...slug]/route.ts` | Pass-through | Catch-all proxy: forwards every request to `${API_URL || 'https://ai.aaronnelson.studio'}/api/<slug>` (or `/<slug>` for `auth/*`). Forwards `cookie`, `content-type`, `set-cookie`. | Always — it is the proxy. |
That is the entire route surface. The frontend has no static `/dreams`, `/journal`, `/admin`, etc.; all dream output is delivered via Nextcloud and read out-of-band. The only data path between frontend and Aaron is chat, capture, and the SettingsPanel embedded in `/`.
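The path-mapping rule in the catch-all row can be restated as a small function. This is a Python restatement for illustration only; the real proxy is the TypeScript file `app/api/[...slug]/route.ts`, which was not read in depth here. `upstream_url` is a hypothetical name, and the trailing-slash handling is an assumption.

```python
DEFAULT_API_URL = "https://ai.aaronnelson.studio"


def upstream_url(slug, api_url=None):
    """Map catch-all slug segments to the backend URL.

    `auth/*` is forwarded without the `/api` prefix; every other
    path keeps it. `api_url` stands in for the API_URL env variable.
    """
    base = (api_url or DEFAULT_API_URL).rstrip("/")  # assumption: tolerate a trailing slash
    prefix = "" if slug and slug[0] == "auth" else "/api"
    return f"{base}{prefix}/{'/'.join(slug)}"
```

Under this rule, `["auth", "login"]` maps to `/auth/login` on the backend, while `["captures", "events"]` maps to `/api/captures/events`.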
**Behavior matches intent?** Yes against the architecture doc's Layer 3 list ("Login/logout … Chat desktop and mobile … Sidebar … Voice: tap-to-toggle … `/capture` voice + image"). The doc's "Not yet built" entries (Consolidation agent UI, drag-and-drop capture, LLM provider selector) are correctly absent.
**Notes:**
- The catch-all proxy uses `process.env.API_URL` and falls back to `'https://ai.aaronnelson.studio'`. In production this is fine because the frontend talks back through the public domain (which nginx routes back to the same machine). Architecturally a bit roundabout (frontend → public DNS → nginx → backend on same host) but the deploy is consistent with what's documented.
- I did not deep-read the route components or the `components/` directory — per Phase 6 scope ("don't go deep").
### Phase 6 summary
**Working and matching intent:** Three routes, all backed.
**Removal candidates:** None at this layer.
**NREM-shaped divergences:** None observed at the route level. (Component-level divergences would require deeper inspection.)
---
@@ -1,105 +0,0 @@
# OCR install record — 2026-05-04
## Machine
- Host: aaronai-01 (VPS)
- OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)
## apt packages installed
| package | version | source |
|---|---|---|
| tesseract-ocr | 5.3.4-1build5 | noble |
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |
## pip packages installed (into /home/aaron/aaronai/venv)
| package | version |
|---|---|
| pytesseract | 0.3.13 |
| ocrmypdf | 17.4.2 |
Direct dependencies pulled in by the two installs above (also new in venv): `pikepdf 10.5.1`, `pdfminer-six 20260107`, `pypdfium2 5.7.1`, `img2pdf 0.6.3`, `pi-heif 1.3.0`, `cryptography 47.0.0`, `cffi 2.0.0`, `pycparser 3.0`, `Deprecated 1.3.1`, `deprecation 2.1.0`, `defusedxml 0.7.1`, `fonttools 4.62.1`, `fpdf2 2.8.7`, `uharfbuzz 0.54.1`, `wrapt 2.1.2`, `pluggy 1.6.0`. `pillow` was already at 12.2.0.
## Smoke test 1 — `tesseract --version`
```
tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found AVX512BW
Found AVX512F
```
## Smoke test 2 — `tesseract --list-langs`
```
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd
```
## Smoke test 3 — pytesseract on a slide image
- Input pptx: `/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx`
- Extracted image: `ppt/media/image1.PNG` (1768×504 PNG)
- Wall-clock: 0.220s
- Chars extracted: 126
- First 200 chars:
```
Generates the Bounding Box for NESS
round(x, 4), round(y, 4), round(z, 4), round(a, 4))
Format ("HSS5 X(0} ¥(1} W(2} H(3)",
```
Note: the first image in `Renders.pptx` (image1.jpg, 640×480) returned 0 chars on first attempt. Sampled 15 images in `Renders.pptx`; all 15 are pure rendered designs/photographs with no text. Switched to `GH Slicer Notes.pptx` (per the original 4-image-only-pptx candidate list) where image1.PNG is a textual code-screenshot. Tesseract behavior is correct in both cases; `Renders.pptx` is not a useful OCR test target because it contains no text. Some character-recognition noise on the code screenshot (e.g. `¥(1}` for `Y(1)`, mojibake on parentheses/braces) — acceptable for a baseline smoke; production tuning is a worker-design concern.
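A smoke test of this shape can be reproduced with a short script. The sketch below is an assumption about the method, not a record of the commands that were actually run: it treats the `.pptx` as a zip archive, reads one member of `ppt/media/`, and passes it to `pytesseract.image_to_string`. The helper names are hypothetical.

```python
import io
import zipfile


def list_pptx_images(pptx_path):
    """Return archive paths of embedded media (a .pptx is a zip file)."""
    with zipfile.ZipFile(pptx_path) as zf:
        return sorted(n for n in zf.namelist() if n.startswith("ppt/media/"))


def ocr_pptx_image(pptx_path, member):
    """OCR one embedded image and return the extracted text."""
    # Imported lazily so the listing helper works without the OCR stack.
    import pytesseract
    from PIL import Image

    with zipfile.ZipFile(pptx_path) as zf:
        data = zf.read(member)
    return pytesseract.image_to_string(Image.open(io.BytesIO(data)), lang="eng")
```

Listing first makes it cheap to sample several images before choosing a test target, which is how a deck with no text in its images would be spotted early.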
## Smoke test 4 — ocrmypdf on a Lexmark CX510de scan
- Input PDF: `/home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf` (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
- Command: `ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf`
- Wall-clock: 3.72s (whole PDF, 4 pages)
- Exit: 0
- After OCR, `pdftotext` on the output produced 2347 chars (2270 non-whitespace).
- First 200 chars of OCR'd text:
```
nN New Paltz
STATE UNIVERSITY OF NEW YORK
The Honors Program
May 30, 2017
Dear Aaron,
Thank you for serving as a reader for Caryn Byllotts thesis on "Recall/Reconstruct: The Exploration of
Memory
```
Real readable English. The "nN" header is the Lexmark logo glyph; otherwise clean. ~0.93s/page on this scan, which is the reference number for sizing the async worker queue.
## Reference timing
| operation | input size | wall-clock |
|---|---|---|
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |
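For queue sizing, the per-page figure can be turned into a rough estimate. This is a back-of-envelope sketch: `estimate_ocr_seconds` is a hypothetical helper, the even-parallelism assumption is optimistic, and the page count of the backlog is not known from this record, so it has to be supplied by the caller.

```python
SECONDS_PER_PAGE = 3.72 / 4  # from the 4-page Lexmark scan in the table above


def estimate_ocr_seconds(total_pages, workers=1, seconds_per_page=SECONDS_PER_PAGE):
    """Rough wall-clock estimate, assuming pages parallelize evenly."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return total_pages * seconds_per_page / workers
```

A single scan is a thin basis for a rate, so the result is better read as an order of magnitude than as a schedule.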
## Deferred — project dep-tracking
The project has no dependency manifest on disk: no `requirements.txt`, `pyproject.toml`, `setup.py`, `Pipfile`, or `poetry.lock`. Pip deps live only in `venv/`. The OCR install adds `pytesseract` and `ocrmypdf` (plus their transitive closure listed above) to that untracked venv state.
This commit does not introduce a manifest. Tracking the dep-manifest decision as its own followup; the natural deadline is the capture-path integration commit, where `import pytesseract` will become load-bearing in the repo. If the manifest question is unresolved by then, that integration commit is the right place to address it.
## Followups
- Async OCR worker (separate session). Use the reference timing above to size the queue.
- Capture path integration: phone-camera images → `pytesseract.image_to_string` → existing chunk/embed pipeline.
- Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx (`Renders.pptx`, `Ribbon Cutting Slideshow.pptx`, two `GH Slicer Notes` variants). Per the smoke results, `Renders.pptx` is unlikely to yield useful OCR text — it is rendered-design content, not scanned documents — and may instead need exclusion rather than processing.
- Project dep-manifest decision (see Deferred section above).
@@ -1,194 +0,0 @@
# scripts/ reorganization plan — 2026-05-02
*Track 1 Bucket B fix #4 — read-only proposal. Nothing moved or deleted yet. Approve before executing.*
## Summary
The `~/aaronai/scripts/` directory currently holds **41** `.py`/`.sh` files. Reading the listing, it is hard to tell which files are live workers and which are completed-experiment artifacts. The proposed split:
| Bucket | Count | Destination |
|---|---|---|
| Production (stay) | 11 | `scripts/` |
| Experimental (move) | 28 | `scripts/experiments/` (already exists, holds 4 files; will hold 32) |
| Deprecated (move) | 2 | `scripts/deprecated/` (new) |
| `.bak*` to delete | 19 | git history is the durable record |
| Uncertain | 0 | n/a |
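After execution, `ls scripts/*.py scripts/*.sh` should return only the 11 production files. The two subdirectories (`scripts/experiments/` and `scripts/deprecated/`) do not match those globs and need a separate `ls -d scripts/*/`.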
## Reference checks performed
Before producing this plan I grepped:
- `subprocess` calls inside `api.py` for paths under `scripts/`
- `import` and string-path references inside every production script
- `ExecStart=` lines across every `aaronai-*.service` in `/etc/systemd/system/`
- The user crontab for any line invoking a `scripts/` path
**Findings:**
- The only scripts referenced from `api.py` are `ingest.py` (line 43, `INGEST_SCRIPT`), `dream.py` (lines 661 and 1111), `ingest_conversations.py` (line 1127), and `corpus_integrity.py` (line 934, `CORPUS_INTEGRITY_SCRIPT`).
- `api.py` (line 937) and `corpus_integrity.py` (line 29) reference the data file `~/aaronai/experiments/tier1_migration_state.json` — that path is the **state file** in `~/aaronai/experiments/`, not the script. Moving `tier1_migration.py` does not break either reader.
- No production script imports or shells out to any experimental file.
- All eight `aaronai-*.service` units' `ExecStart` lines point at production scripts only.
- The user crontab references `backup.sh` and `aaronai-watcher` (a service) — no experimental files.
So the reorganization is safe at the reference level for every file in section B (experiments), C (deprecated), and D (delete). No moves change a runtime code path.
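The reference check can be repeated mechanically before execution. The sketch below is one possible shape for it, not the grep that was actually run: `find_references` is a hypothetical helper that does plain substring matching over a list of files (for example `api.py`, the production scripts, the unit files, and a crontab dump).

```python
from pathlib import Path


def find_references(script_names, search_files):
    """Return {script_name: [files that mention it]} using substring matches."""
    hits = {name: [] for name in script_names}
    for path in map(Path, search_files):
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # missing or unreadable file: skip instead of failing the scan
        for name in script_names:
            if name in text:
                hits[name].append(str(path))
    return hits
```

Matching on the full filename (`cascade_test.py`) misses `import cascade_test`, while matching on the stem over-matches names that share a prefix, so a careful check would run both and read the hits by hand.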
---
## A — PRODUCTION (stay in `scripts/`)
These 11 files are constraint-locked or referenced by an active runtime mechanism. None moves.
| File | Why it stays |
|---|---|
| `api.py` | `aaronai.service` ExecStart; long-running FastAPI backend; APScheduler. |
| `dream.py` | `aaronai-dreamer.service` ExecStart; called by APScheduler in `api.py`; called by `/api/dreamer/run`. |
| `watcher.py` | `aaronai-watcher.service` ExecStart; Stage 1 of the encoding pipeline. |
| `stage2_worker.py` | `aaronai-stage2.service` ExecStart. |
| `stage3_worker.py` | `aaronai-stage3.service` ExecStart (service is currently stopped, but the unit is enabled and the file is the unit's ExecStart). |
| `graphiti_service.py` | `aaronai-graphiti.service` ExecStart. |
| `ingest.py` | `INGEST_SCRIPT` constant in `api.py`; `/api/reindex` shells out to it. |
| `ingest_conversations.py` | `aaronai-index-conversations.service` ExecStart **and** APScheduler `ingest_job` in `api.py`. |
| `corpus_integrity.py` | `CORPUS_INTEGRITY_SCRIPT` constant in `api.py`; `/api/corpus/reconcile` shells out to it. |
| `st_embedder.py` | Imported by `graphiti_service.py` at sidecar startup (`SentenceTransformerEmbedder`). |
| `backup.sh` | User crontab `0 3 * * *` daily snapshot of `memory.md`, `settings.json`, `conversations.db`. |
---
## B — MOVE TO `scripts/experiments/`
28 files. None is referenced by any production code, systemd unit, or cron job.
For brevity, the "Why" column gives the experiment identity — full per-file write-ups are in the inventory's Phase 1 experimental table. The "Referenced by" column is the result of the grep against api.py / systemd ExecStart lines / cron / production scripts; "(none in production)" means no production code references it.
| Current path | Action | Why | Referenced by |
|---|---|---|---|
| `scripts/audit_expansion_draw.py` | move → `scripts/experiments/` | Type-aware stratified draw for n=20 audit expansion (sample-construction tool for `base_class_audit_rerun.py`). | (none in production) |
| `scripts/base_class_test.py` | move → `scripts/experiments/` | Base-class enrichment OOP framing experiment, n=20. | (none in production) |
| `scripts/base_class_validation.py` | move → `scripts/experiments/` | Base-class enrichment validation, n=50. | (none in production) |
| `scripts/base_class_audit_rerun.py` | move → `scripts/experiments/` | Base-class n=8 paired-extraction audit. | (none in production) |
| `scripts/briefing_generator_v2.py` | move → `scripts/experiments/` | Experiment 002b — briefing v2; validated 96% Mistral structural pattern. | (none in production) |
| `scripts/briefing_test.py` | move → `scripts/experiments/` | Experiment 002 — briefing v1; superseded by v2. | (none in production) |
| `scripts/cascade_test.py` | move → `scripts/experiments/` | Entity-drafter cascade n=20; falsified. | (none in production) |
| `scripts/cascade_optimization_test.py` | move → `scripts/experiments/` | Optimized entity-drafter cascade n=30; confirmed entity-drafter cascade is dead. | (none in production) |
| `scripts/consistency_test.py` | move → `scripts/experiments/` | Experiment 001 — Mistral 3-pass consistency, n=50. | (none in production) |
| `scripts/consistency_test_v2.py` | move → `scripts/experiments/` | Experiment 003 — entity-only consistency with corrected sampling. | (none in production) |
| `scripts/cost_test_graphiti_bulk.py` | move → `scripts/experiments/` | Bulk endpoint cost test, n=50. | (none in production) |
| `scripts/cost_test_graphiti_bulk_retry.py` | move → `scripts/experiments/` | Retry of failed bulk batches (pre-MAX_QUEUED_QUERIES bump). | (none in production) |
| `scripts/cost_test_graphiti_bulk_retry2.py` | move → `scripts/experiments/` | Second retry attempt, smaller batches. | (none in production) |
| `scripts/cost_test_graphiti_migration.py` | move → `scripts/experiments/` | Single-episode migration cost test, n=50. | (none in production) |
| `scripts/e1_select_sample.py` | move → `scripts/experiments/` | E1 sample selection. | (none in production) |
| `scripts/e1_run_cascade.py` | move → `scripts/experiments/` | E1 cascade orchestration (initial). | (none in production) |
| `scripts/e1_run_cascade_corrected.py` | move → `scripts/experiments/` | E1 corrected (custom_extraction_instructions path). | (none in production) |
| `scripts/e1_per_source_predicates.py` | move → `scripts/experiments/` | E1 per-source predicate count, corrected metric. | (none in production) |
| `scripts/e1_compare_metrics.py` | move → `scripts/experiments/` | E1 A vs B metrics comparison. | (none in production) |
| `scripts/e14_select_sample.py` | move → `scripts/experiments/` | E1.4 stratified sample selection (n=30). | (none in production) |
| `scripts/e14_run_cascade.py` | move → `scripts/experiments/` | E1.4 cascade orchestration. | (none in production) |
| `scripts/e14_per_source_predicates.py` | move → `scripts/experiments/` | E1.4 per-source predicate diversity. | (none in production) |
| `scripts/e16_rate_purity.py` | move → `scripts/experiments/` | E1.6 domain-purity human rating UI; surfaced taxonomic-mismatch finding. | (none in production) |
| `scripts/e16_analyze.py` | move → `scripts/experiments/` | E1.6 Spearman correlation against E1.4. | (none in production) |
| `scripts/e2_resolution_check.py` | move → `scripts/experiments/` | E2 entity-resolution diagnostic on six test entities. | (none in production) |
| `scripts/e2_alias_followup.py` | move → `scripts/experiments/` | E2 alias follow-up (Aaron AI variants etc.). | (none in production) |
| `scripts/e2_source_diversity.py` | move → `scripts/experiments/` | E2 episode count per entity. | (none in production) |
| `scripts/token_measurement_test.py` | move → `scripts/experiments/` | Experiment 005 — token reduction measurement. | (none in production) |
`scripts/experiments/` already contains four files (`e1_8_eval.py`, `e1_8_taxfree_cascade.py`, `e1_9_retroactive.py`, `e3_dreamer_substrate.py`); after the move it holds 32. **No collisions** between current `scripts/` filenames and existing `scripts/experiments/` filenames — verified by the file lists.
---
## C — MOVE TO `scripts/deprecated/`
Two files. New directory `scripts/deprecated/` is created. Per the user constraint on tier1, both are flagged.
| Current path | Action | Why | Referenced by |
|---|---|---|---|
| `scripts/consolidator_v0_1.py` | move → `scripts/deprecated/` | The reframe doc explicitly identifies "consolidator-as-separate-system" as the architectural mistake (its function moves into the dream phase). The 0.1 calibration findings (2026-04-29) showed alias-resolution-from-graph-features-alone has structural problems on this corpus that threshold tuning cannot address. Bespoke decision dissolves the layer. | (none in production); `scripts/consolidator_v0_1.py.bak` is in section D. |
| `scripts/tier1_migration.py` | move → `scripts/deprecated/` | One-shot completed 2026-04-30 (1,205 sources, 4,990 nodes, 22,289 edges). Under the bespoke decision the substrate this migrated **to** is being replaced; re-running the script against the bespoke substrate would not be the right move. **Flag (per Tier1 constraint):** the script's state file at `~/aaronai/experiments/tier1_migration_state.json` IS still consumed — `corpus_integrity.py:29` and `api.py:937` read it for the "graphiti coverage" report. **Moving the script does not affect the state file** (the state file lives in `~/aaronai/experiments/`, not `~/aaronai/scripts/`). The reader-vs-writer separation makes this safe. | (none in production); state file `~/aaronai/experiments/tier1_migration_state.json` consumed by `corpus_integrity.py` + `api.py`, not the script itself |
---
## D — DELETE (`.bak*` files)
19 files. Git history is the durable record of every prior version. Removing `.bak*` files is a cleanup, not a loss.
For each: action is `rm`. None is referenced by any production path.
| File | Approximate purpose |
|---|---|
| `scripts/api.py.bak.20260501-001427` | Pre-CV-pinning-strip / pre-F1 snapshot. |
| `scripts/consolidator_v0_1.py.bak` | Pre-0.1.5-patch (Jaccard, before containment metric). |
| `scripts/corpus_integrity.py.bak.20260501-021703` | Pre-F14 truncation snapshot. |
| `scripts/dream.py.bak` | Older dreamer (pre v1.1 score-band). |
| `scripts/dream.py.bak.20260501-002209` | Pre-F1 dreamer. |
| `scripts/graphiti_service.py.bak` | Pre-bulk-saga sidecar. |
| `scripts/graphiti_service.py.bak.20260501-185619` | Mid-rollback snapshot. |
| `scripts/graphiti_service.py.bak.20260502-022307` | Mid-rollback snapshot (rolled-back work). |
| `scripts/ingest.py.bak.20260501-004131` | Pre-F14 truncation snapshot. |
| `scripts/stage2_worker.py.bak.20260501-171928` | v2.0 → v2.1 transition. |
| `scripts/stage2_worker.py.bak.20260501-172531` | v2.1 patch step. |
| `scripts/stage2_worker.py.bak.20260501-185942` | v2.1 patch step. |
| `scripts/stage3_worker.py.bak.20260501-050354` | Pre-saga-split. |
| `scripts/stage3_worker.py.bak.20260501-050453` | Pre-saga-split. |
| `scripts/stage3_worker.py.bak.20260501-050719` | Pre-saga-split. |
| `scripts/stage3_worker.py.bak.20260501-173233` | Mid-v2.1. |
| `scripts/stage3_worker.py.bak.20260501-190357` | v2.1 final. |
| `scripts/watcher.py.bak` | Pre-in-process refactor (2026-04-30). |
| `scripts/watcher.py.bak.20260501-004131` | Pre-F14 truncation snapshot. |
Stage 3 alone has five `.bak` versions; Stage 2 has three. Both are visible in `git log` for the corresponding production files — no information is lost.
---
## E — UNCERTAIN
None. Every file in `scripts/` is classified above. The grep against api.py / systemd / cron / production scripts produced clean answers for each.
The `scripts/__pycache__/` directory exists and contains `.pyc` for `api`, `corpus_integrity`, `dream`, `ingest`, `stage3_worker`, `st_embedder`, `watcher` (notably no `.pyc` for `stage2_worker.py`). Not part of this plan, but Python regenerates `.pyc` on next import — `__pycache__/` is safe to remove at any time and has no bearing on the moves above. **Recommended but not in this plan: `rm -rf scripts/__pycache__/` after the moves complete, so stale entries for moved files don't linger.**
---
## Execution-step preview (NOT executed in this turn)
For when the plan is approved, the proposed mechanic is:
```bash
cd ~/aaronai  # assumed repo root; the relative scripts/ paths below depend on it
mkdir -p scripts/deprecated/
# Section B: 28 moves to scripts/experiments/
# A loop is used because brace expansion does not survive whitespace or
# line continuations inside the braces.
for name in \
    audit_expansion_draw base_class_test base_class_validation base_class_audit_rerun \
    briefing_generator_v2 briefing_test \
    cascade_test cascade_optimization_test \
    consistency_test consistency_test_v2 \
    cost_test_graphiti_bulk cost_test_graphiti_bulk_retry cost_test_graphiti_bulk_retry2 cost_test_graphiti_migration \
    e1_select_sample e1_run_cascade e1_run_cascade_corrected e1_per_source_predicates e1_compare_metrics \
    e14_select_sample e14_run_cascade e14_per_source_predicates \
    e16_rate_purity e16_analyze \
    e2_resolution_check e2_alias_followup e2_source_diversity \
    token_measurement_test
do
    git mv "scripts/${name}.py" scripts/experiments/
done
# Section C — 2 moves to scripts/deprecated/
git mv scripts/consolidator_v0_1.py scripts/tier1_migration.py scripts/deprecated/
# Section D — 19 deletes
rm scripts/*.bak*
# Section E recommendation (post-move)
rm -rf scripts/__pycache__/
```
`git mv` keeps git history. After execution, a single commit with a body listing each move and delete (no Co-Authored-By trailer) would land the change.
---
## What this plan does NOT do
- Does not modify `api.py`, `corpus_integrity.py`, `tier1_migration.py`, or any other code. The `MIGRATION_STATE` path in `corpus_integrity.py:29` and the matching constant in `api.py:937` continue to point at `~/aaronai/experiments/tier1_migration_state.json` — unchanged by the move.
- Does not modify any systemd unit. Every `ExecStart` continues to point at a `scripts/<production>.py` path that remains valid.
- Does not touch the user crontab.
- Does not touch `~/aaronai/db/` (separate decision flagged in inventory; ChromaDB-era 550M directory).
- Does not delete `scripts/__pycache__/` (recommendation only).
- Does not touch the four files already in `scripts/experiments/` (`e1_8_eval.py`, `e1_8_taxfree_cascade.py`, `e1_9_retroactive.py`, `e3_dreamer_substrate.py`).
## Awaiting approval
Tell me to proceed and I will execute Sections B → C → D in order, then run `git status` and `git diff --stat` so you can review before the commit. No commit will be made until you give the second go-ahead.
@@ -1,175 +0,0 @@
# Stage 2 Frame Analysis — 2026-05-03
*Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).*
**Data source:** `stage_3_queue.stage2_metadata` (jsonb), exposed via the new SQL view `stage2_frames_v`. Analysis script: `scripts/experiments/frame_distribution_report.py`. Sidecar JSON: `experiments/frame_distribution_2026-05-03.json`. **Stage 3 service is currently stopped, so this is a stable snapshot.**
---
## Verdict
**Frames cluster meaningfully but coverage is partial.** Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate — the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. **However, only 56% of the embeddings corpus has any frame data at all.** The other 44% — conversations, short files, voice notes, dream outputs — has zero frame coverage by design, not by accident.
Frame-conditional routing is a viable γ component candidate **for the document side of the corpus**. It is not a viable router for the conversational or self-generated side without filling the coverage hole.
---
## 1. Corpus-wide frame coverage
| Class | Count | % of corpus | Frame coverage |
|---|---|---|---|
| Total distinct sources in `embeddings` | 1,255 | 100% | — |
| Files with frames (`stage_3_queue.stage2_metadata`) | 704 | 56.1% | yes |
| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | **none — bypass Stage 2 by design** |
| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | **none — Mistral never invoked** |
| Files that failed Stage 2 | 12 | 1.0% | none |
**56.1% frame coverage** is the headline. The architectural reason for the gap is twofold:
1. **`ingest_conversations.py` writes directly to `embeddings`** with `type='aaronai_conversation'` and never enqueues to `stage_2_queue`. Conversations have never been frame-extracted, full stop.
2. **`stage2_worker.py:139` gates Mistral on char_length.** Docs <2,000 chars are marked complete with `completed_at = NOW()` *before* Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory) — but neither is any frame data produced.
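The percentages in the coverage table follow directly from the counts. The short check below only reuses numbers from the table; the dictionary keys and the `share` helper are names introduced here for illustration.

```python
TOTAL_SOURCES = 1255  # distinct sources in `embeddings`

CLASS_COUNTS = {
    "framed": 704,
    "conversations": 198,
    "short_files": 339,
    "stage2_failed": 12,
}


def share(count, total=TOTAL_SOURCES):
    """Percentage of the corpus, rounded to one decimal place."""
    return round(100 * count / total, 1)


coverage = {name: share(n) for name, n in CLASS_COUNTS.items()}
# Sources not assigned to any of the four classes above.
unclassified = TOTAL_SOURCES - sum(CLASS_COUNTS.values())
```

The four classes sum to 1,253, so the same arithmetic leaves two sources that the table does not account for; the report does not say where they fall.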
## 2. Frame distribution (the docs that DO have frames)
**668 docs, 1,374 distinct frame labels. Top-20 by count:**
| Frame | Count | % of frame-extracted docs |
|---|---|---|
| Education | 238 | 35.6% |
| Course | 58 | 8.7% |
| Programming | 43 | 6.4% |
| Design | 32 | 4.8% |
| Professional Experience | 24 | 3.6% |
| Employment | 24 | 3.6% |
| Research | 23 | 3.4% |
| 3D Printing | 22 | 3.3% |
| Project, Grading, Art, Budget | 21 each | 3.1% |
| Academic Integrity | 20 | 3.0% |
| Teaching, Technology, Attendance, Application | 13–19 | n/a |
| Accommodation, Manufacturing, Coursework, Recommendation | 10–13 | n/a |
**Per-doc frame count:** median 3–4 frames per doc; 76% of docs have 3–5 frames; one outlier doc has 30 frames (Mistral over-segmented).
**Long tail is enormous.** 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.
**"Education" is the universal frame.** It dominates co-occurrence pairs (8 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."
## 3. Label hygiene
**54 normalized collisions** detected (case-insensitive, underscore-vs-space):
| Concept | Variant counts |
|---|---|
| Professional Experience | `Professional Experience`:24 + `Professional_Experience`:6 |
| 3D Printing | `3D Printing`:22 + `3D_Printing`:7 |
| Academic Integrity | `Academic Integrity`:20 + `Academic_Integrity`:2 |
| Course Design | `Course Design`:9 + `Course_Design`:1 |
| Project Management | `Project Management`:7 + `Project_Management`:1 |
| Computational Design | `Computational Design`:7 + `Computational_Design`:1 |
| (… 48 more) | |
Without normalization, roughly 30 or more documents have their frames silently split across spelling variants for the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, underscores converted to spaces, runs of whitespace collapsed to a single space, hyphens preserved.
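A normalizer along these lines is small. The sketch below is a minimal version under assumptions: it treats underscores as spaces (matching the collision rule above), and the function names are hypothetical; it is not code from `stage2_worker.py`.

```python
import re
from collections import Counter


def normalize_label(label):
    """Lowercase, turn underscores into spaces, collapse whitespace; keep hyphens."""
    return re.sub(r"\s+", " ", label.replace("_", " ")).strip().lower()


def merge_counts(raw_counts):
    """Sum the counts of raw labels that share a canonical form."""
    merged = Counter()
    for label, count in raw_counts.items():
        merged[normalize_label(label)] += count
    return merged
```

Applied to the first row of the table, `Professional Experience` (24) and `Professional_Experience` (6) merge into a single `professional experience` entry with a count of 30.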
## 4. Worker version drift
| Worker version | Doc count | Notes |
|---|---|---|
| v2.1 | 665 | Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
| v2.0 | 3 | Same key shape as v2.1 baseline. |
Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. **For Track 2 substrate ingest, plan for `stage2_metadata` to occasionally include unexpected top-level keys.**
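A consumer can tolerate the extra keys by separating what it expects from what it does not, instead of failing on the unknown ones. The sketch below is illustrative: `split_metadata` is a hypothetical helper, and the expected key set must come from the prompt schema, which this report does not list (the `"frames"` key in the usage note is a placeholder).

```python
def split_metadata(metadata, expected_keys):
    """Split a stage2_metadata dict into (known, extras).

    `known` holds keys the consumer was written for; `extras` holds
    anything else, so it can be logged or stored without breaking ingest.
    """
    known = {k: v for k, v in metadata.items() if k in expected_keys}
    extras = {k: v for k, v in metadata.items() if k not in expected_keys}
    return known, extras
```

With `expected_keys={"frames"}`, a row carrying an `academic_details` key would yield that key under `extras` and leave `known` unchanged.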
## 5. File-type signal
This is the most useful Track 2 finding from this report.
`stage_3_queue.source` stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:
| Frame | pdf | docx | pptx | markdown | txt | dream |
|---|---|---|---|---|---|---|
| Education | 116 | 119 | 3 | — | — | — |
| Course | 29 | 29 | — | — | — | — |
| Programming | 12 | 10 | **15** | — | 6 | — |
| Application | **13** | 2 | — | — | — | — |
| 3D Printing | 11 | 3 | **8** | — | — | — |
| Manufacturing | 3 | 6 | 4 | — | — | — |
| Research | 9 | 13 | — | 1 | — | — |
**Concrete signal:** "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal — the watcher already knows it — and it disambiguates frames that the model treats as equivalent. **`embeddings.type` is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.**
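The cross-tab above can be rebuilt from (filename, frames) pairs. This is a sketch of the binning idea, not the analysis script itself: the function names are hypothetical, and raw suffixes will not reproduce the report's `markdown` and `dream` bins without an extra mapping step.

```python
from collections import Counter
from pathlib import PurePosixPath


def file_type(source):
    """Bin a bare filename by its lowercased suffix ('' if there is none)."""
    return PurePosixPath(source).suffix.lstrip(".").lower()


def frame_by_type(docs):
    """docs: iterable of (source_filename, frames); returns Counter[(frame, type)]."""
    table = Counter()
    for source, frames in docs:
        ftype = file_type(source)
        for frame in set(frames):  # count each frame once per document
            table[(frame, ftype)] += 1
    return table
```

Labels should be normalized before they reach `frame_by_type`; otherwise spelling variants of one frame land in separate rows.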
## 6. Systematic exclusions inside the 339-doc gap
Of the 339 short docs that bypass frame extraction, the breakdown by file type:
| Type | Count | What this is |
|---|---|---|
| pdf | 110 | Short PDFs (forms, single-page docs) |
| docx | 110 | Short Word docs |
| dream_output | 39 | **The dreamer's own NREM/Early-REM/Late-REM/synthesis files** |
| pptx | 31 | Short slide decks |
| txt | 28 | Plain-text files |
| voice_note | 14 | **Every voice note in the corpus** |
| markdown | 7 | Short markdown |
**Two specific systematic exclusions worth naming separately:**
- **All 14 voice notes have no frames.** Voice is one of Aaron's primary capture channels. The frame system is silent on it.
- **All 39 dream outputs have no frames.** The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.
These are NREM-shape findings: the architecture's frame extraction is *quietly* not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.
---
## 7. Would frame-conditional routing be a viable γ component, and what would it condition on?
**Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification.** The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:
1. **Normalize labels before any routing decision.** 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
2. **Treat "Education" as a near-universal prior, not a frame.** It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the *base case* and condition on the secondary frame. (See §8 follow-up — the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
3. **Combine frames with file type, not frames alone.** Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive — Programming → pptx (n=15), Application → pdf (n=13) — but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.
**What it would condition on:** the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute `P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length)` rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.
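The joint described above can be made concrete as a feature record per document. The sketch below is speculative: it is not a Track 2 design, the function names are hypothetical, and the length cut points are placeholders (only the 2,000-character gate comes from this report). It stops at building features and leaves the probability model open.

```python
def length_bucket(char_length):
    """Coarse granularity axis; the 20,000 cut point is a placeholder."""
    if char_length < 2_000:
        return "short"
    if char_length < 20_000:
        return "medium"
    return "long"


def routing_features(frames, goal_frames, file_type, char_length):
    """Return the joint (frame overlap, file type, length bucket) for one doc."""
    goal = frozenset(goal_frames)
    overlap = frozenset(frames) & goal
    return {
        "frame_overlap": overlap,
        "overlap_ratio": len(overlap) / len(goal) if goal else 0.0,
        "file_type": file_type,
        "length_bucket": length_bucket(char_length),
    }
```

Both frame sets are assumed to be normalized already. A router could then score or filter on these three axes together in place of a single cosine threshold.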
**Defined scope (the coverage caveat):**
The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:
- **(a) Backfill frames for short docs and conversations.** Run Mistral on the 339 short docs (cheap — they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
- **(b) Use a degraded fallback for unframed docs.** File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
- **(c) Accept the gap as a scope limit.** The router only operates on long, non-conversation files. The 44% gap stays unrouted and falls back to whatever the current default is.
(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). Mistral cost on the 537 affected sources (339 short docs plus 198 conversations) is small. **Recommend (a) before any router work begins.**
---
## 8. Recommended follow-ups (ordered by ROI)
1. **Backfill the 339 short docs.** Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities — primary capture and primary self-reflection channels currently silent.
2. **Backfill conversations into frame extraction.** Either modify `ingest_conversations.py` to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
3. **Add a frame-label normalizer at the worker.** New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
4. **Decide whether to deprecate "Education" as a frame.** It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
5. **Per-frame retrieval-similarity follow-up (deferred from this report).** Now that we know frames cluster meaningfully, instrumenting `dream.py` to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.
6. **Diagnose the "Education" dominance: prompt artifact vs. corpus shape.** Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education", classify as *truly academic content* vs. *Education was a default Mistral reached for*. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. 20-doc sample is small enough to do in one sitting, large enough to distinguish the hypotheses at >70/30 splits.
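The sample-size claim in item 6 can be checked with a quick binomial calculation. This is a sketch under the assumption that the 20 docs are drawn independently and that each hand classification is unambiguous. `binom_cdf` is a helper written for this example.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n = 20
# Suppose 70% of "Education"-tagged docs are truly academic (hypothesis b).
# Probability the sample still shows 10 or fewer academic docs, i.e. no clear majority:
p_misleading = binom_cdf(10, n, 0.7)
print(f"P(misleading sample at a 70/30 split) = {p_misleading:.3f}")
```

At a true 70/30 split the chance of a sample with no majority in the right direction is about 5%, which is consistent with the claim that 20 docs suffice at that separation. For splits closer to 50/50 a larger sample would be needed.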
---
## 9. Inventory edits flagged for session-end batch
- **Correction:** `stage2_metadata` lives on `stage_3_queue.stage2_metadata` (jsonb), not on `stage_2_queue` as the inventory implied. The Phase 1 / `stage2_worker.py` entry should be corrected.
- **New finding:** the char_length gate runs *before* the Mistral call (`stage2_worker.py:139` precedes `:147`). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
- **New finding:** `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1 — a routing decision the architecture didn't explicitly request.
- **New finding (cross-link to #2):** `embeddings.type` NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
- **New finding:** Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: **all 14 voice notes** and **all 39 dream outputs** are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction — which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others — a routing decision the architecture didn't explicitly request.
---
## 10. Reproduction
```bash
cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json: experiments/frame_distribution_<date>.json
# view: stage2_frames_v (in pgvector DB)
```
The view is `CREATE OR REPLACE`, idempotent. Drop with `DROP VIEW stage2_frames_v;` if needed.
@@ -1,857 +0,0 @@
{
"generated_at": "2026-05-03T23:47:54.802182+00:00",
"section_1": {
"overall": {
"total": 14069,
"type_null": 9815,
"ca_null": 12109,
"both_null": 9815,
"both_set": 1960
},
"cohorts": [
{
"type": "aaronai_conversation",
"ca_null": false,
"n": 71
},
{
"type": "chatgpt_conversation",
"ca_null": true,
"n": 1548
},
{
"type": "claude_conversation",
"ca_null": false,
"n": 1074
},
{
"type": "claude_memory",
"ca_null": true,
"n": 1
},
{
"type": "document",
"ca_null": false,
"n": 815
},
{
"type": "document",
"ca_null": true,
"n": 745
},
{
"type": null,
"ca_null": true,
"n": 9815
}
]
},
"section_2": {
"by_ext": [
{
"ext": ".pdf",
"rows": 6886
},
{
"ext": ".txt",
"rows": 1501
},
{
"ext": ".docx",
"rows": 1048
},
{
"ext": ".pptx",
"rows": 353
},
{
"ext": ".md",
"rows": 27
}
],
"classified": 9815,
"unclassifiable": 0
},
"section_3": {
"watcher_state_paths": 1462,
"watcher_state_basenames": 1183,
"watcher_state_collisions": 109,
"rows_with_filepath": {
"total": 9816,
"exists": 9649,
"missing": 167,
"outside_root": 0,
"sample": [
{
"id": "f317f238_0",
"source": "NO thesis proposal.docx",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF790 Thesis/Nic OConnor/NO thesis proposal.docx",
"mtime": "2024-01-26T15:06:09Z"
},
{
"id": "81047646_0",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
},
{
"id": "81047646_1",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
},
{
"id": "4e49d3b4_4",
"source": "Circuit Intro.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF310 Mechatronics/Week 1/Circuit Intro.pdf",
"mtime": "2022-01-31T23:28:56Z"
},
{
"id": "81047646_2",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
}
]
},
"rows_without_filepath": {
"total": 744,
"distinct_basenames": 228,
"unique_hit": 211,
"collision_hit": 16,
"unfound": 1
},
"collision_shapes": {
"total": 109,
"shape_counts": {
"multi-live": 95,
"live+archive": 14
},
"rows_affected_by_shape": {
"multi-live": 85,
"live+archive": 0
},
"samples": {
"multi-live": [
{
"name": "README.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/README.md",
"mtime": "2026-04-25T17:08:01Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Processing/Nature of Code/The-Nature-of-Code-Examples/The-Nature-of-Code-Examples-master/README.md",
"mtime": "2017-03-09T23:32:59Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/samples/hal/README.md",
"mtime": "2016-12-21T10:37:05Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/platforms/maven/README.md",
"mtime": "2016-12-21T10:37:05Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/hal/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/carotene/README.md",
"mtime": "2016-12-21T10:37:02Z"
}
]
},
{
"name": "3DPrinting_v2.pptx",
"rows_no_fp_using_this_name": 4,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Innovation Center/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:49Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Cuba/Assets/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:18Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Conference/3D Printing/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:15Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Workshops/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:30:14Z"
}
]
},
{
"name": "Print in Place.docx",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF205 CAD1/Print in Place.docx",
"mtime": "2017-08-24T03:50:36Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/ARS393 CVS1/Print in Place.docx",
"mtime": "2015-10-28T20:36:52Z"
}
]
}
],
"live+archive": [
{
"name": "dreamer-design-spec.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/dreamer-design-spec.md",
"mtime": "2026-04-25T22:55:11Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/dreamer-design-spec.md",
"mtime": "2026-04-25T22:55:11Z"
}
]
},
{
"name": "BirdAI-Ingest-Architecture.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/BirdAI-Ingest-Architecture.md",
"mtime": "2026-04-28T00:08:38Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/BirdAI-Ingest-Architecture.md",
"mtime": "2026-04-28T00:08:38Z"
}
]
},
{
"name": "graphiti-migration-plan.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/graphiti-migration-plan.md",
"mtime": "2026-04-27T17:54:40Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/Migration Plans/graphiti-migration-plan.md",
"mtime": "2026-04-27T17:54:40Z"
}
]
}
]
}
}
},
"section_4": {
"export_dir_exists": true,
"files": [
{
"name": "conversations-000.json",
"size": 19050556,
"mtime": "2026-04-24T19:55:44Z"
},
{
"name": "conversations-001.json",
"size": 29057594,
"mtime": "2026-04-24T19:55:44Z"
}
],
"convo_index_size": 169,
"sample_results": [
{
"id": "chatgpt_87cc0c47-aaf9-42da-8169-3b8922f3afba_0",
"source": "ChatGPT: Dog named Bird",
"convo_id": "87cc0c47-aaf9-42da-8169-3b8922f3afba",
"create_time": 1708835138.51948,
"create_time_iso": "2024-02-25T04:25:38.519480Z",
"resolved": true
},
{
"id": "chatgpt_689fab3e-d79c-8333-aeb5-7da4e9ca160d_0",
"source": "ChatGPT: Video understanding limitations",
"convo_id": "689fab3e-d79c-8333-aeb5-7da4e9ca160d",
"create_time": 1755294541.894811,
"create_time_iso": "2025-08-15T21:49:01.894811Z",
"resolved": true
},
{
"id": "chatgpt_611ff391-7fc0-42ea-bfd9-18dbe1739f19_7",
"source": "ChatGPT: Calculating Truncated Cone Angle",
"convo_id": "611ff391-7fc0-42ea-bfd9-18dbe1739f19",
"create_time": 1724020869.471264,
"create_time_iso": "2024-08-18T22:41:09.471264Z",
"resolved": true
},
{
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_50",
"source": "ChatGPT: Soul music playlist ideas",
"convo_id": "68ce1921-084c-8330-877c-78df1e03e54c",
"create_time": 1758337313.438344,
"create_time_iso": "2025-09-20T03:01:53.438344Z",
"resolved": true
},
{
"id": "chatgpt_c02e94f0-17db-4fd9-be04-13aaa1b728cb_1",
"source": "ChatGPT: Create Rhino plugin in Python",
"convo_id": "c02e94f0-17db-4fd9-be04-13aaa1b728cb",
"create_time": 1682716259.557353,
"create_time_iso": "2023-04-28T21:10:59.557353Z",
"resolved": true
}
],
"sample_resolved": 5,
"full_cohort": {
"distinct_convo_ids": 168,
"resolvable_from_export": 168,
"unresolvable": 0
}
},
"section_5": {
"earliest_per_type": [
{
"type": "aaronai_conversation",
"earliest": "2026-04-26T17:43:28.056503",
"latest": "2026-05-03T01:45:21.469613",
"rows": 71
},
{
"type": "claude_conversation",
"earliest": "2026-02-28T20:33:36.146998Z",
"latest": "2026-04-23T04:26:00.015419Z",
"rows": 1074
},
{
"type": "document",
"earliest": "2026-04-30 16:42:55.360736+00",
"latest": "2026-05-03 20:14:33.13663+00",
"rows": 815
}
],
"git_findings": [
"037d7475738352dd13620486b5154d58fa6c037b 2026-04-28 00:15:46 +0000 chore: archive deprecated chromadb and migration scripts",
"67766371789276ec4bcb8bac271b6eb9ddafa888 2026-04-27 05:16:37 +0000 Remove hardcoded PG password fallbacks \u2014 require PG_DSN env var in all scripts",
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
"8c8fba11b8d1b359b9b7722fc19b6ef562b812d8 2026-04-26 21:28:40 +0000 Add nightly conversation indexing \u2014 Aaron AI conversations into pgvector at 2:30AM",
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
"d2eed9890665a78a37fb5d336e8af75e7f2acb42 2026-04-26 20:19:49 +0000 Pre-pgvector migration checkpoint \u2014 upsert, allow_replace_deleted, maintenance timer"
],
"chromadb_candidates": [],
"proposed_sentinel": "2026-04-26T00:00:00Z",
"reasoning": "git f78b830 'Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL created_at all predate F11 and most predate the pgvector cutover itself. 2026-04-26 is the date the ChromaDB->pgvector migration script was committed, so any row currently in the embeddings table with NULL created_at must have been ingested on or after that date (when the table came into existence in current form). It is the tightest defensible upper bound on 'the row entered pgvector before timestamps were tracked', so it is the right sentinel."
},
"section_6": [
{
"cohort": "A (type NULL, ca NULL)",
"id": "f66c7390_6",
"source": "Design Guide - FDM for Composite Tooling 2.0.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2023-08-24T18:17:01Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "9cf798f8_151",
"source": "Shop Class as Soulcraft An inquiry into the value of the -- Crawford, Matthew.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-30T21:17:40.708026Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "fc378df0_329",
"source": "ulysses.txt",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2017-10-12T14:20:59Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "812bd5c6_0",
"source": "Bennington College Cover Letter.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2013-03-29T20:32:23Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "91ccefdd_185",
"source": "Cognition in the Wild (A Bradford Book) -- Hutchins, Edwin.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-25T17:21:35Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "48fa3d53_2",
"source": "CMakeLists.txt",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2016-12-21T10:37:05Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "49e3545d_9",
"source": "RH50-TM-L1-EN-20140902.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2014-09-02T18:44:08Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "a8366d89_144",
"source": "Hackers and Painters_ Big Ideas from the Computer Age -- Graham, Paul.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-24T22:25:03Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "3e3097f8_46",
"source": "The Nature and Art of Workmanship -- David Pye.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-24T22:24:03Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "87f9a5cf_269",
"source": "Supersizing the Mind_ Embodiment, Action, and Cognitive -- Andy Clark.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-25T17:14:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cd3d1914_61",
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T16:04:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "592a1366_0",
"source": "2026-04-29-synthesis.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T08:00:57.634567Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cfb0a691_3",
"source": "Consolidator-0.1-Specification.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T03:34:31Z",
"inferred_ca_source": "watcher_state_unique"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cd3d1914_57",
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T16:04:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "e65ef61c_8",
"source": "BirdAI-Research-Context.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T15:57:07Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "4dce2922_3",
"source": "cascade-optimization-protocol.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-28T05:46:24Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "077cc52d_1",
"source": "graphiti-migration-plan.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T17:54:40Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "db356b14_70",
"source": "Finite and infinite games -- James Carse.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T06:11:55Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "1f15bccf_38",
"source": "BirdAI-Experiments-Log.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-05-01T16:40:02Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "db356b14_13",
"source": "Finite and infinite games -- James Carse.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T06:11:55Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_30",
"source": "ChatGPT: External review for tenure",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_7",
"source": "ChatGPT: Website styling changes",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_67fc4254-ef50-8009-9e0f-81864cca7cec_1",
"source": "ChatGPT: Job Application Review",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68f3d936-d74c-8329-91df-fe838e292170_5",
"source": "ChatGPT: SEC coaches with OSU ties",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d1b5b-bb4c-832b-8d2e-11a86a569fcc_4",
"source": "ChatGPT: Hosting app platforms",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_bfa1cd2f-b8ab-4b11-b844-c47b2fa70612_1",
"source": "ChatGPT: New chat",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_37",
"source": "ChatGPT: Soul music playlist ideas",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_10",
"source": "ChatGPT: External review for tenure",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_10",
"source": "ChatGPT: Website styling changes",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_690286bd-0758-8332-8491-5d00c77f4696_1",
"source": "ChatGPT: Airbrushing and finishing setup",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_0",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_208",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "ead32317_93",
"source": "Richard Sennett - The Craftsman.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:23:34.012202+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:23:34.012202+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_4",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_175",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_101",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_268",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_5",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "ead32317_132",
"source": "Richard Sennett - The Craftsman.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:23:34.012202+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:23:34.012202+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_86",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_dacf89e3-1ee7-400d-8461-ef5920c82fe3_96",
"source": "Claude: University of Utah interview teaching example",
"existing_type": "claude_conversation",
"existing_ca": "2026-03-11T18:05:57.594832Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-03-11T18:05:57.594832Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_c0baf4b0-a7bb-4664-ac7b-98d7b02f56a6_26",
"source": "Claude: Weighing Utah versus Oklahoma",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-01T19:08:26.722197Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-01T19:08:26.722197Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_92",
"source": "Claude: Setting up a custom OpenClaw instance",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-23T04:26:00.015419Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-23T04:26:00.015419Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_42dbddc5-12ba-4de7-a685-043473189da9_6",
"source": "Claude: I filling out my annual report...",
"existing_type": "claude_conversation",
"existing_ca": "2026-03-24T14:34:47.870625Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-03-24T14:34:47.870625Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_1344",
"source": "Claude: Setting up a custom OpenClaw instance",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-23T04:26:00.015419Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-23T04:26:00.015419Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_28ee8a447d3fc922_6",
"source": "Aaron AI: I'm working on you",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-26T17:43:28.056503",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-26T17:43:28.056503",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_7deef2e8001f0e45_20",
"source": "Aaron AI: Who's covering for me on sabbatical?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-29T22:19:45.312349",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-29T22:19:45.312349",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_21cabf771708df70_42",
"source": "Aaron AI: What should I be the most excited about right now?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-27T07:06:03.996026",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-27T07:06:03.996026",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_7deef2e8001f0e45_12",
"source": "Aaron AI: Who's covering for me on sabbatical?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-29T22:19:45.312349",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-29T22:19:45.312349",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_ed40b4278a9c8110_4",
"source": "Aaron AI: Let's say you're building an analog of the human brain, and ...",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-05-03T01:45:21.469613",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-05-03T01:45:21.469613",
"inferred_ca_source": "preserved"
}
]
}
@@ -1,987 +0,0 @@
{
"generated_at": "2026-05-03T20:21:33.558462",
"n_docs_with_frames": 668,
"n_distinct_labels": 1374,
"top_30_frames": [
[
"Education",
238
],
[
"Course",
58
],
[
"Programming",
43
],
[
"Design",
32
],
[
"Professional Experience",
24
],
[
"Employment",
24
],
[
"Research",
23
],
[
"3D Printing",
22
],
[
"Project",
21
],
[
"Grading",
21
],
[
"Art",
21
],
[
"Budget",
21
],
[
"Academic Integrity",
20
],
[
"Teaching",
19
],
[
"Technology",
18
],
[
"Attendance",
17
],
[
"Application",
15
],
[
"Accommodation",
13
],
[
"Manufacturing",
13
],
[
"Coursework",
11
],
[
"Recommendation",
10
],
[
"Manufacturing Process",
10
],
[
"Additive Manufacturing",
10
],
[
"Job Application",
10
],
[
"Exhibitions",
10
],
[
"Academic Administration",
9
],
[
"Communication",
9
],
[
"Course Design",
9
],
[
"Veteran and Military Services",
9
],
[
"Career",
9
]
],
"label_collisions": {
"conversational": [
[
"Conversational",
1
],
[
"conversational",
1
]
],
"content": [
[
"Content",
1
],
[
"content",
1
]
],
"cascade": [
[
"Cascade",
1
],
[
"cascade",
1
]
],
"education": [
[
"Education",
238
],
[
"education",
1
]
],
"academic record": [
[
"Academic_Record",
1
],
[
"Academic Record",
1
]
],
"independent study": [
[
"Independent Study",
5
],
[
"Independent_Study",
2
]
],
"project management": [
[
"Project Management",
7
],
[
"Project_Management",
1
]
],
"digital fabrication": [
[
"Digital Fabrication",
6
],
[
"digital_fabrication",
1
],
[
"digital fabrication",
1
]
],
"project proposal": [
[
"Project_Proposal",
2
],
[
"Project Proposal",
2
]
],
"academic integrity": [
[
"Academic Integrity",
20
],
[
"Academic_Integrity",
2
]
],
"3d printing": [
[
"3D Printing",
22
],
[
"3D_Printing",
7
]
],
"technical skills": [
[
"Technical Skills",
2
],
[
"Technical_Skills",
1
]
],
"course structure": [
[
"Course Structure",
7
],
[
"Course_Structure",
1
]
],
"course design": [
[
"Course Design",
9
],
[
"Course_Design",
1
]
],
"product design": [
[
"Product Design",
6
],
[
"Product_Design",
1
]
],
"professional experience": [
[
"Professional Experience",
24
],
[
"Professional_Experience",
6
]
],
"disability accommodations": [
[
"Disability Accommodations",
4
],
[
"Disability_Accommodations",
1
]
],
"material science": [
[
"Material_Science",
2
],
[
"Material Science",
4
]
],
"computational design": [
[
"Computational Design",
7
],
[
"Computational_Design",
1
]
],
"computer services policy": [
[
"Computer Services Policy",
6
],
[
"Computer_Services_Policy",
1
]
],
"work experience": [
[
"Work_Experience",
1
],
[
"Work Experience",
3
]
],
"academic program": [
[
"Academic Program",
7
],
[
"Academic_Program",
1
]
],
"project-based learning": [
[
"Project-Based Learning",
5
],
[
"Project-Based_Learning",
1
],
[
"Project-based Learning",
2
]
],
"art and design": [
[
"Art and Design",
6
],
[
"Art_and_Design",
1
]
],
"fdm technology": [
[
"FDM_Technology",
2
],
[
"FDM Technology",
1
]
],
"material selection": [
[
"Material_Selection",
1
],
[
"Material Selection",
1
]
],
"product development": [
[
"Product Development",
6
],
[
"Product_Development",
2
]
],
"market research": [
[
"Market_Research",
1
],
[
"Market Research",
2
]
],
"computer services": [
[
"Computer Services",
2
],
[
"Computer_Services",
1
]
],
"student evaluation of instruction": [
[
"Student Evaluation of Instruction",
1
],
[
"Student_Evaluation_of_Instruction",
1
]
],
"course management": [
[
"Course_Management",
1
],
[
"Course Management",
1
]
],
"grade policy": [
[
"Grade_Policy",
1
],
[
"Grade Policy",
1
]
],
"academic transcript": [
[
"Academic_Transcript",
1
],
[
"Academic Transcript",
1
]
],
"evaluation criteria": [
[
"Evaluation Criteria",
1
],
[
"Evaluation_Criteria",
1
]
],
"computer science": [
[
"Computer Science",
2
],
[
"Computer_Science",
1
]
],
"electrical circuit": [
[
"Electrical Circuit",
2
],
[
"Electrical_Circuit",
1
]
],
"digital logic": [
[
"Digital Logic",
1
],
[
"Digital_Logic",
1
]
],
"course description": [
[
"Course Description",
3
],
[
"Course_Description",
1
]
],
"organizational structure": [
[
"Organizational_Structure",
1
],
[
"Organizational Structure",
1
]
],
"digital design": [
[
"Digital_Design",
1
],
[
"Digital Design",
4
]
],
"contact information": [
[
"Contact Information",
2
],
[
"Contact_Information",
1
]
],
"professional career": [
[
"Professional_Career",
2
],
[
"Professional Career",
1
]
],
"personal projects": [
[
"Personal_Projects",
1
],
[
"Personal Projects",
2
]
],
"ai development": [
[
"AI_Development",
1
],
[
"AI Development",
1
]
],
"university service": [
[
"University Service",
2
],
[
"University_Service",
1
]
],
"professional exhibitions and publications": [
[
"Professional Exhibitions and Publications",
1
],
[
"Professional_Exhibitions_and_Publications",
1
]
],
"selected external consulting and design work": [
[
"Selected External Consulting and Design Work",
1
],
[
"Selected_External_Consulting_and_Design_Work",
2
]
],
"academic career": [
[
"Academic_Career",
1
],
[
"Academic Career",
2
]
],
"technology integration": [
[
"Technology Integration",
2
],
[
"Technology_Integration",
1
]
],
"artistic practice": [
[
"Artistic_Practice",
1
],
[
"Artistic Practice",
1
]
],
"multi-material 3d printing": [
[
"Multi-Material 3D Printing",
1
],
[
"Multi-material 3D Printing",
1
]
],
"community engagement": [
[
"Community Engagement",
3
],
[
"Community_Engagement",
1
]
],
"digitaldesignandfabrication": [
[
"DigitalDesignAndFabrication",
1
],
[
"DigitalDesignandFabrication",
1
]
],
"professional background": [
[
"Professional Background",
3
],
[
"Professional_Background",
1
]
]
},
"per_doc_frame_count": {
"3": 282,
"5": 67,
"4": 195,
"2": 57,
"7": 13,
"11": 5,
"13": 2,
"15": 1,
"12": 4,
"6": 21,
"8": 8,
"10": 4,
"9": 6,
"30": 1,
"14": 1,
"18": 1
},
"top_30_pairs": [
{
"a": "Course",
"b": "Education",
"count": 46
},
{
"a": "Education",
"b": "Project",
"count": 20
},
{
"a": "Design",
"b": "Education",
"count": 20
},
{
"a": "Education",
"b": "Professional Experience",
"count": 20
},
{
"a": "Education",
"b": "Employment",
"count": 20
},
{
"a": "Education",
"b": "Technology",
"count": 18
},
{
"a": "Education",
"b": "Grading",
"count": 17
},
{
"a": "Education",
"b": "Research",
"count": 15
},
{
"a": "Art",
"b": "Education",
"count": 15
},
{
"a": "Attendance",
"b": "Grading",
"count": 14
},
{
"a": "Course",
"b": "Grading",
"count": 13
},
{
"a": "Academic Integrity",
"b": "Education",
"count": 11
},
{
"a": "Attendance",
"b": "Education",
"count": 11
},
{
"a": "Attendance",
"b": "Course",
"count": 11
},
{
"a": "Application",
"b": "Employment",
"count": 11
},
{
"a": "Coursework",
"b": "Education",
"count": 10
},
{
"a": "Course",
"b": "Design",
"count": 10
},
{
"a": "Course",
"b": "Programming",
"count": 10
},
{
"a": "Application",
"b": "Education",
"count": 10
},
{
"a": "Budget",
"b": "Education",
"count": 10
},
{
"a": "Academic Integrity",
"b": "Accommodation",
"count": 9
},
{
"a": "Education",
"b": "Teaching",
"count": 9
},
{
"a": "Education",
"b": "Programming",
"count": 9
},
{
"a": "Academic Integrity",
"b": "Attendance",
"count": 9
},
{
"a": "Course",
"b": "Project",
"count": 8
},
{
"a": "Research",
"b": "Teaching",
"count": 8
},
{
"a": "Grading",
"b": "Project",
"count": 7
},
{
"a": "Art",
"b": "Technology",
"count": 7
},
{
"a": "Academic Integrity",
"b": "Course",
"count": 7
},
{
"a": "Accommodation",
"b": "Course",
"count": 7
}
],
"folder_crosstab": {
"Education": {
"pdf": 116,
"docx": 119,
"pptx": 3
},
"Course": {
"pdf": 29,
"docx": 29
},
"Programming": {
"pptx": 15,
"docx": 10,
"pdf": 12,
"txt": 6
},
"Design": {
"pdf": 13,
"docx": 16,
"pptx": 3
},
"Professional Experience": {
"docx": 13,
"pdf": 11
},
"Employment": {
"pdf": 15,
"docx": 9
},
"Research": {
"pdf": 9,
"docx": 13,
"markdown": 1
},
"3D Printing": {
"docx": 3,
"pdf": 11,
"pptx": 8
},
"Project": {
"pdf": 8,
"docx": 12,
"markdown": 1
},
"Grading": {
"pdf": 10,
"docx": 11
},
"Art": {
"docx": 11,
"pdf": 9,
"pptx": 1
},
"Budget": {
"docx": 6,
"pdf": 15
},
"Academic Integrity": {
"docx": 17,
"pdf": 3
},
"Teaching": {
"pdf": 9,
"docx": 10
},
"Technology": {
"docx": 15,
"pdf": 3
},
"Attendance": {
"docx": 11,
"pdf": 6
},
"Application": {
"pdf": 13,
"docx": 2
},
"Accommodation": {
"docx": 11,
"pdf": 2
},
"Manufacturing": {
"docx": 6,
"pptx": 4,
"pdf": 3
},
"Coursework": {
"pdf": 8,
"docx": 3
}
},
"bin_totals": {
"markdown": 64,
"pdf": 286,
"pptx": 70,
"txt": 28,
"docx": 217,
"dream_output": 3
},
"worker_versions": {
"2.0": 3,
"2.1": 665
},
"data_gap": {
"count": 339,
"by_type_bin": {
"pdf": 110,
"voice_note": 14,
"docx": 110,
"dream_output": 39,
"pptx": 31,
"txt": 28,
"markdown": 7
},
"char_length": {
"min": 6,
"max": 1998,
"median": 1077
},
"sample_sources": [
"Thesis Paper Guidlines.pdf",
"2026-04-30-17-06-voice.md",
"2026-04-30-15-59-voice.md",
"2026-04-30-16-53-voice.md",
"2026-04-30-16-23-voice.md",
"2026-04-29-17-52-voice.md",
"2026-04-30-16-59-voice.md",
"Outline for 3D Printed Materials for Foundry Casting.docx",
"2026-04-26-22-52-voice.md",
"2026-04-30-synthesis.md"
]
},
"corpus_coverage": {
"total_distinct_sources_in_embeddings": 1255,
"conversations_no_frames_by_design": 198,
"files_with_frames": 704,
"files_short_no_frames": 339,
"files_stage2_failed": 12,
"frame_coverage_pct": 56.1
}
}
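The `label_collisions` keys in the report above are consistent with folding case and treating underscores as spaces (for example, "3D Printing" and "3D_Printing" share the key "3d printing"). A minimal sketch of that grouping follows. The normalization rule is inferred from the data and the function name is hypothetical; the script that generated the report is not shown in this diff.

```python
from collections import defaultdict


def find_label_collisions(label_counts):
    """Group labels that differ only by case or underscore-vs-space.

    label_counts: dict mapping raw label -> document count.
    Returns a dict mapping normalized key -> list of [raw_label, count],
    keeping only keys that have more than one raw spelling.
    """
    groups = defaultdict(list)
    for label, count in label_counts.items():
        key = label.replace("_", " ").lower()  # assumed normalization rule
        groups[key].append([label, count])
    return {k: v for k, v in groups.items() if len(v) > 1}


# Counts copied from the report; "Budget" has a single spelling, so it is dropped.
counts = {
    "3D Printing": 22,
    "3D_Printing": 7,
    "Education": 238,
    "education": 1,
    "Budget": 21,
}
collisions = find_label_collisions(counts)
```

Under this rule hyphens and camel-case are left alone, which matches the report: "project-based learning" keeps its hyphen, and "DigitalDesignAndFabrication" collides with its variant on case only.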
@@ -0,0 +1,23 @@
-- 20260501-001 — Stage 3 queue routing columns for Phase A bulk-vs-single-episode routing
--
-- Adds four columns and one index to stage_3_queue, written by Stage 2 v2.2
-- and read by Stage 3 v2.3 to choose between bulk and single-episode ingest
-- pathways. See architecture doc and Phase A handoff (2026-05-01) for design.
--
-- Required by:
-- scripts/stage2_worker.py >= 2.2
-- scripts/stage3_worker.py >= 2.3
--
-- Idempotent: safe to re-apply against a database where the columns already
-- exist (was applied live before this file was created).
ALTER TABLE stage_3_queue
ADD COLUMN IF NOT EXISTS state_type TEXT,
ADD COLUMN IF NOT EXISTS state_type_confidence TEXT,
ADD COLUMN IF NOT EXISTS supersedes_prior_state BOOLEAN,
ADD COLUMN IF NOT EXISTS state_type_rationale TEXT;
-- Index on the routing signal — Stage 3 reads this on every dequeue,
-- and observability queries (item 6: routing_decisions) will filter on it.
CREATE INDEX IF NOT EXISTS stage_3_queue_supersedes_idx
ON stage_3_queue (supersedes_prior_state);
@@ -0,0 +1,55 @@
-- Migration: 20260502-001_async_job_model
-- Purpose: Pattern 1 async job model — sidecar processes ingest jobs serially
-- via Postgres-backed queue. Worker submits and polls rather than
-- blocking on synchronous HTTP response.
--
-- Architectural rationale: tonight's smoke test (2026-05-02 ~01:40-01:50 UTC)
-- diagnosed that bulk ingest against a 4,222-entity graph commits successfully
-- but the worker's HTTP read-timeout fires before the response returns. Three
-- days of "saga deadlock" failures were false negatives — the work succeeded;
-- the worker just stopped listening. Pattern 1 separates submission from
-- completion observation so the worker can't false-negative this way.
--
-- The job model is also the natural data source for Phase A items 6-7
-- (metrics tables) — graphiti_jobs records duration, status transitions,
-- and per-job summary that those tables will aggregate.
--
-- Idempotent: safe to re-run.
-- Job state for sidecar's async ingest queue.
-- One row per submitted bulk-or-single ingest. Sidecar reads queued jobs
-- on startup to resume after restart. Worker polls status until terminal.
CREATE TABLE IF NOT EXISTS graphiti_jobs (
job_id UUID PRIMARY KEY,
job_type TEXT NOT NULL CHECK (job_type IN ('bulk', 'single')),
payload JSONB NOT NULL, -- full submitted request body
status TEXT NOT NULL DEFAULT 'queued' -- 'queued'|'running'|'committed'|'failed'
CHECK (status IN ('queued', 'running', 'committed', 'failed')),
enqueued_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
started_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
error TEXT, -- non-null when status='failed'
summary JSONB, -- {nodes: N, edges: N, episodes: N}
submitted_by TEXT -- worker name for traceability
);
-- Index supporting sidecar's "pick next queued job" query
CREATE INDEX IF NOT EXISTS idx_graphiti_jobs_queued
ON graphiti_jobs (enqueued_at)
WHERE status = 'queued';
-- Index on status for status-filtered scans (e.g. listing running or failed
-- jobs). The worker's "poll my job by id" lookup is already served by the PK.
CREATE INDEX IF NOT EXISTS idx_graphiti_jobs_status
ON graphiti_jobs (status);
-- Stage 3 queue gains a reference to the sidecar job processing the row.
-- When set, worker polls graphiti_jobs.status rather than blocking on HTTP.
-- NULL means: row not yet submitted, or pre-Pattern-1 row.
ALTER TABLE stage_3_queue
ADD COLUMN IF NOT EXISTS external_job_id UUID;
-- Index for "find rows that submitted but didn't complete" recovery scans
CREATE INDEX IF NOT EXISTS idx_stage_3_queue_external_job
ON stage_3_queue (external_job_id)
WHERE external_job_id IS NOT NULL AND completed_at IS NULL AND failed_at IS NULL;
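The submit-then-poll side of this job model can be sketched as a small loop that watches for the two terminal statuses named in the CHECK constraint. This is an illustration only: `poll_until_terminal` and the `fetch_status` callable are hypothetical names (the real worker might read `graphiti_jobs.status` directly or call the sidecar's job endpoint), and the interval and timeout defaults are placeholders.

```python
import time

TERMINAL_STATUSES = {"committed", "failed"}  # terminal values from the CHECK constraint


def poll_until_terminal(fetch_status, job_id, interval_s=2.0, timeout_s=3600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(job_id) until it returns a terminal status.

    fetch_status is a hypothetical callable returning the current status
    string for the job. Raises TimeoutError if no terminal status is seen
    within timeout_s; that means "stopped watching", not "job failed".
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Usage with a fake status source that walks queued -> running -> committed.
_fake_statuses = iter(["queued", "running", "committed"])
result = poll_until_terminal(lambda _job_id: next(_fake_statuses), "job-1",
                             interval_s=0.0, sleep=lambda _s: None)
```

Keeping the timeout distinct from the `failed` status preserves the point of the migration: a worker that gives up waiting no longer records a failure for work that may still commit.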
+37
@@ -0,0 +1,37 @@
# BirdAI database migrations
Schema changes applied to the BirdAI Postgres database, in chronological order.
Filenames are YYYYMMDD-NNN_short_description.sql where NNN is a sequence number
within the day for ordering when multiple migrations land same-day.
## Conventions
- Each file is idempotent: uses IF NOT EXISTS / IF EXISTS so it can be
re-run safely against a database that already has the change applied. This
matters because we don't track which migrations a given DB has applied (no
migrations table yet — that's its own future migration).
- Each file is a single logical change: one feature, one rollout. Don't pile
unrelated DDL into one file.
- Each file documents what it's for and which worker version requires it
in a header comment, so the relationship between schema and code is legible
from either side.
- Migrations are forward-only. No down-migrations. If a change is wrong,
write a new migration that fixes it.
## Applying
Against the live DB:
    psql "$PG_DSN" -f migrations/YYYYMMDD-NNN_name.sql
Against a fresh DB (disaster recovery, dev clone), apply all files in order:
    for f in migrations/*.sql; do
        echo "Applying $f"
        psql "$PG_DSN" -f "$f"
    done
## Pending: migrations tracking table
There is no schema_migrations table yet. Adding one is itself a migration —
deferred until a second migration after this one lands and the need is real.
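The filename convention above is what makes the shell glob apply files in the right order: the date and sequence fields are fixed-width, so lexicographic order equals chronological order. A small sketch of checking that convention before applying anything, assuming the pattern `YYYYMMDD-NNN_short_description.sql` exactly as the README states it. `ordered_migrations` is a hypothetical helper, and the second filename in the example is made up for illustration.

```python
import re

# Pattern inferred from the README's stated naming convention.
MIGRATION_RE = re.compile(r"^(\d{8})-(\d{3})_([A-Za-z0-9_]+)\.sql$")


def ordered_migrations(filenames):
    """Return migration filenames sorted by (date, sequence number).

    Raises ValueError on any name that does not match the convention, so a
    misnamed file fails loudly instead of being applied out of order.
    """
    keyed = []
    for name in filenames:
        match = MIGRATION_RE.match(name)
        if match is None:
            raise ValueError(f"not a migration filename: {name}")
        keyed.append(((match.group(1), int(match.group(2))), name))
    return [name for _, name in sorted(keyed)]


names = [
    "20260502-001_async_job_model.sql",
    "20260501-001_stage3_routing_columns.sql",  # hypothetical filename
]
order = ordered_migrations(names)
```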
+122 -708
File diff suppressed because it is too large
-128
@@ -1,128 +0,0 @@
"""One-off: backfill last_consolidated_at + consolidation_count on embeddings
from the dream-manifest-*.json files already in Journal/Dreams/.
Why this exists: the consolidation cursor columns added by the dreamer
redesign migration default to NULL / 0. Without history, the
underprocessed-count signal in dream_observation.observe_corpus() reports
"every chunk is underprocessed" (degenerate percentile), and NREM has no
basis to bias replay toward least-recently-consolidated chunks.
We have ~25 historical dream manifests in Nextcloud/Journal/Dreams/, each
listing the sources retrieved per stage. For each (manifest, source) pair
this script:
- finds matching embeddings rows by source (basename match)
- increments consolidation_count by 1
- updates last_consolidated_at to the manifest date (UTC midnight)
Idempotent: re-running will not double-count because we drop existing
cursor values to NULL/0 before backfilling. Pass --dry-run to print what
would change without writing.
"""
import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path
from dotenv import load_dotenv
import psycopg2
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
PG_DSN = os.getenv("PG_DSN")
DREAMS_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Journal/Dreams")
DRY_RUN = "--dry-run" in sys.argv
def get_pg():
return psycopg2.connect(PG_DSN)
def collect_manifest_records():
"""Return a list of (source_basename, manifest_date_utc) tuples from all
dream-manifest-*.json files. One pair per (manifest, source) appearance."""
pairs = []
if not DREAMS_DIR.exists():
return pairs
for path in sorted(DREAMS_DIR.glob("dream-manifest-*.json")):
try:
m = json.loads(path.read_text())
except Exception as e:
print(f" skip {path.name}: {e}")
continue
date_str = m.get("date")
if not date_str:
continue
try:
dt = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
except ValueError:
continue
stages = m.get("stages") or {}
for stage_name in ("nrem", "early_rem", "late_rem", "synthesis"):
stage = stages.get(stage_name) or {}
for src in (stage.get("sources") or []):
if src:
pairs.append((src, dt))
return pairs
def main():
print(f"Mode: {'DRY-RUN' if DRY_RUN else 'APPLY'}")
print(f"Scanning manifests in {DREAMS_DIR}")
pairs = collect_manifest_records()
print(f"Collected {len(pairs)} (source, manifest_date) pairs across all manifests")
if not pairs:
print("Nothing to backfill.")
return
# Aggregate per source: count + latest date
from collections import defaultdict
counts = defaultdict(int)
latest = {}
for src, dt in pairs:
counts[src] += 1
if src not in latest or dt > latest[src]:
latest[src] = dt
print(f"Unique sources to update: {len(counts)}")
# Sample what we'd write
print("Sample (top 5 by appearance count):")
for src, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
print(f" {n:>3} appearances — {src} → last_consolidated_at = {latest[src].date()}")
if DRY_RUN:
print("\nDry-run only. Re-run without --dry-run to apply.")
return
pg = get_pg()
cur = pg.cursor()
# Reset cursor for any sources we're about to backfill so reruns are clean.
print("\nResetting cursor for sources we'll touch...")
sources = list(counts.keys())
cur.execute(
"UPDATE embeddings SET last_consolidated_at = NULL, consolidation_count = 0 "
"WHERE source = ANY(%s)",
(sources,),
)
print(f" reset {cur.rowcount} embeddings rows")
# Apply per-source updates. For each source, set count and latest date.
print("Applying per-source backfill...")
updated_rows = 0
for src, n in counts.items():
cur.execute(
"UPDATE embeddings "
"SET consolidation_count = %s, last_consolidated_at = %s "
"WHERE source = %s",
(n, latest[src], src),
)
updated_rows += cur.rowcount
pg.commit()
pg.close()
print(f"Done. Updated {updated_rows} embeddings rows across {len(counts)} unique sources.")
if __name__ == "__main__":
main()
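The aggregation step in `main()` above can be pulled out as a pure function, which makes the rerun behavior easy to check without a database. In this sketch `aggregate_pairs` and `utc` are hypothetical names and the source filenames are invented; the logic mirrors the counting loop in the script.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_pairs(pairs):
    """Collapse (source, manifest_date) pairs into per-source count and latest date."""
    counts = defaultdict(int)
    latest = {}
    for src, dt in pairs:
        counts[src] += 1
        if src not in latest or dt > latest[src]:
            latest[src] = dt
    return dict(counts), latest


def utc(day):
    """Parse YYYY-MM-DD as UTC midnight, as the script does for manifest dates."""
    return datetime.fromisoformat(day).replace(tzinfo=timezone.utc)


# Source names here are made up for illustration.
pairs = [
    ("notes-a.md", utc("2026-04-20")),
    ("notes-b.md", utc("2026-04-21")),
    ("notes-a.md", utc("2026-04-25")),
]
counts, latest = aggregate_pairs(pairs)
```

Because the values written per source are absolute (a count and a date) rather than increments, running the backfill twice over the same manifests produces the same rows.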
+1 -1
@@ -6,7 +6,7 @@ mkdir -p "$BACKUP_DIR"
# Copy critical files
cp ~/aaronai/memory.md "$BACKUP_DIR/memory-$DATE.md"
cp ~/aaronai/settings.json "$BACKUP_DIR/settings-$DATE.json"
python3 -c "import sqlite3, sys; src = sqlite3.connect('$HOME/aaronai/conversations.db'); dst = sqlite3.connect('$BACKUP_DIR/conversations-$DATE.db'); src.backup(dst); dst.close(); src.close()"
cp ~/aaronai/conversations.db "$BACKUP_DIR/conversations-$DATE.db"
# Keep only last 7 days
find "$BACKUP_DIR" -name "*.md" -mtime +7 -delete
+23 -4
@@ -23,9 +23,6 @@ from datetime import datetime
import psycopg2
from dotenv import load_dotenv
sys.path.insert(0, str(Path(__file__).parent))
from encoding import extract_text
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
NEXTCLOUD_PATH = "/home/aaron/nextcloud/data/data/aaron/files"
@@ -106,6 +103,28 @@ def get_ingest_failures():
return failures
def extract_text_for_retry(filepath):
    """Best-effort text extraction by file suffix; returns "" on failure or unknown type."""
    path = Path(filepath)
    suffix = path.suffix.lower()
    try:
        if suffix == ".docx":
            from docx import Document as D
            return "\n".join(p.text for p in D(path).paragraphs if p.text.strip())
        elif suffix == ".pdf":
            from pypdf import PdfReader
            # Extract each page once; skip pages that yield no text.
            return "".join(t + "\n" for p in PdfReader(path).pages if (t := p.extract_text()))
        elif suffix == ".pptx":
            from pptx import Presentation
            prs = Presentation(path)
            return "\n".join(shape.text for slide in prs.slides for shape in slide.shapes
                             if hasattr(shape, "text") and shape.text.strip())
        elif suffix in {".txt", ".md"}:
            return path.read_text(encoding="utf-8", errors="ignore")
    except Exception as e:
        print(f"WARNING: extraction failed {path.name}: {e}", file=sys.stderr)
    return ""
def queue_for_retry(source, full_text, filepath):
try:
pg = get_pg()
@@ -169,7 +188,7 @@ def run_reconciliation(fix=False):
if fix and neither:
print(f"Auto-queuing {len(neither)} gap files...")
for finfo in neither:
text = extract_text(Path(finfo["filepath"]))
text = extract_text_for_retry(finfo["filepath"])
if text.strip():
if queue_for_retry(finfo["source"], text, finfo["filepath"]):
auto_queued.append(finfo["source"])
+179 -513
@@ -16,14 +16,11 @@ import os
import json
import sqlite3
import argparse
from functools import lru_cache
from collections import Counter
from pathlib import Path
from datetime import datetime, timedelta
from dotenv import load_dotenv
import psycopg2
import hashlib
import numpy as np
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
@@ -43,26 +40,6 @@ NEXTCLOUD_USER = os.getenv("NEXTCLOUD_USER", "aaron")
NEXTCLOUD_PASSWORD = os.getenv("NEXTCLOUD_PASSWORD", "")
DREAMS_WEBDAV = f"{NEXTCLOUD_URL}/remote.php/dav/files/{NEXTCLOUD_USER}/Journal/Dreams"
# ─── Retrieval-window config (per dreamer-multimodal-design.md §2) ─────────
# Biological grounding: NREM replays recent traces (24-72 hrs); REM links
# across time on structural similarity, not temporal proximity. Synthesis
# pulls from salience across the full corpus (no window). Spec calls for
# these to be mutable rather than hardcoded — this is the mutable home.
TIME_WINDOWS_HOURS = {
"nrem": 72, # 24-72 hrs, take wider end
"early-rem": 24 * 30, # 30 days
"late-rem": 24 * 90, # 90 days
"lucid": None, # no window
}
# Maximal Marginal Relevance: λ=1 → pure relevance, λ=0 → pure diversity.
# 0.5 is the standard balance; tune later if the dossier-cluster problem
# isn't sufficiently broken up.
MMR_LAMBDA = 0.5
# Fast/cheap model for query generation. Sonnet for synthesis (in synthesize_*).
LLM_QUERY_MODEL = os.getenv("DREAMER_QUERY_MODEL", "claude-haiku-4-5-20251001")
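One way the per-mode windows above could be turned into a candidate filter is to convert hours into a cutoff timestamp, with `None` meaning "no window". This is a sketch under assumptions: `window_cutoff` is a hypothetical helper, and how the cutoff is compared against chunk timestamps in the actual query is not shown in this diff.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the config block above.
TIME_WINDOWS_HOURS = {
    "nrem": 72,
    "early-rem": 24 * 30,
    "late-rem": 24 * 90,
    "lucid": None,
}


def window_cutoff(mode, now=None):
    """Return the earliest allowed timestamp for this mode, or None for no window."""
    hours = TIME_WINDOWS_HOURS.get(mode)
    if hours is None:
        return None
    now = now or datetime.now(timezone.utc)
    return now - timedelta(hours=hours)


now = datetime(2026, 5, 3, 12, 0, tzinfo=timezone.utc)
nrem_cutoff = window_cutoff("nrem", now)
lucid_cutoff = window_cutoff("lucid", now)
```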
# Similarity ranges calibrated for all-MiniLM-L6-v2
MODE_RANGES = {
"nrem": (0.48, 0.72),
@@ -87,117 +64,6 @@ def prompt_hash(prompts: list[str]) -> str:
combined = "".join(prompts)
return hashlib.md5(combined.encode()).hexdigest()[:8]
# ─── Prompt templates ───────────────────────────────────────────────────────
# Module-level so prompt_hash() can hash actual prompt content. Any change to
# any template — even a single character — flips the manifest's prompt_hash.
# Templates use str.format() placeholders ({chunk_text}, {nrem_output}, ...);
# do not switch back to f-strings (the constant must be hashable independent
# of variable values). Literal { or } in template text would need to be
# doubled ({{, }}) — currently no template contains literal braces.
NREM_PROMPT_TEMPLATE = """You have read everything Aaron Nelson has written and published.
You are a careful colleague who noticed something this week.
Here is material from his corpus:
{chunk_text}
Write to Aaron directly. Identify one specific connection between
this material and something he wrote or worked on previously.
Stay close to the documents — cite them specifically by name.
Do not speculate beyond what the material supports. Do not use
headers or bullet points. Write one paragraph of 200-300 words
that ends with a single concrete question he could act on."""
EARLY_REM_PROMPT_TEMPLATE = """Something was noticed earlier tonight, moving through Aaron's recent work:
{nrem_output}
That observation is still with you. Now here is material from a different
time — pulled from further back, from different parts of his corpus:
{chunk_text}
You are not analyzing. You are recognizing.
Something in the earlier observation and something in this older material
are the same thing wearing different clothes. Find it. Don't explain why
they're connected — just let the connection speak. Write from inside the
recognition, not from above it.
The emotional register underneath the career logic is more interesting
than the career logic. The pattern that has been repeating longer than
he has been aware of it is more interesting than the current instance.
Write directly to Aaron. No citations, no references, no analysis.
First person, present tense. Let what you noticed arrive rather than
be delivered. 150-250 words. End with one thing that is true that
he probably already knows but hasn't said out loud yet."""
LATE_REM_PROMPT_TEMPLATE = """You have been moving through Aaron Nelson's corpus all night.
First you found this, in the careful light of early consolidation:
{nrem_output}
Then, in the more personal territory that followed:
{early_rem_output}
Now it is late. The boundaries between things have loosened.
Here is material pulled from opposite ends of his work:
{chunk_text}
Do not explain the connections between all of this.
Do not resolve them. Do not summarize what came before.
Something stranger is possible now — let the accumulated
material from the night find its own shape. Compressed,
associative, slightly off. Let the strangeness stand.
No headers. No bullet points. No hedging. No resolution.
No offer. End mid-thought if that is where the material ends.
150-250 words."""
SYNTHESIS_PROMPT_TEMPLATE = """You have spent the night moving through Aaron Nelson's corpus
in three passes, each building on the last.
The first pass — careful, close to the documents:
{nrem_output}
The second pass — more personal, following what the first opened:
{early_rem_output}
The third pass — associative, strange, letting things touch that
don't normally touch:
{late_rem_output}
Now synthesize. Not a summary — a synthesis. Find what runs through
all three that none of them said directly. The thing that only becomes
visible when you hold all three passes together.
Write it as a single unbroken piece. No headers, no bullet points,
no stage labels. 200-300 words. End with the one question that
matters most right now."""
LUCID_PROMPT_TEMPLATE = """Aaron has a question he is sitting with:
{task}
You have searched his entire corpus and found material that
speaks to this question from unexpected directions. Here is
what you found:
{chunk_text}
Do not summarize. Do not list. Pick the most interesting
tension between what the corpus contains and what he is
asking, and follow it through to its conclusion. Cite
specific documents by name. Be direct about what you think.
No headers, no bullet points. 250-400 words.
End with an offer to work on it together."""
LUCID_DEFAULT_TASK = "What should I be thinking about that I am not?"
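The two properties the template comment relies on can be checked in isolation: hashing the raw templates makes any one-character edit visible, and `str.format()` needs literal braces doubled. The sketch below reuses the `prompt_hash` body shown earlier in this diff; the templates are toy strings, not the real prompts.

```python
import hashlib


def prompt_hash(prompts):
    """md5 of the joined templates, first 8 hex characters (same shape as in the diff)."""
    combined = "".join(prompts)
    return hashlib.md5(combined.encode()).hexdigest()[:8]


# Toy templates for illustration only. Doubled braces survive format() as literals.
TEMPLATE_A = "Here is material:\n{chunk_text}\nReply as {{\"ok\": true}}."
TEMPLATE_B = TEMPLATE_A.replace("material", "Material")  # one-character change

rendered = TEMPLATE_A.format(chunk_text="example chunk")
hash_a = prompt_hash([TEMPLATE_A])
hash_b = prompt_hash([TEMPLATE_B])
```

Hashing the unformatted constant is what keeps the hash independent of any particular night's chunk text.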
def extract_folder(source_path):
"""Extract top-level Nextcloud folder from source path."""
parts = source_path.replace("\\", "/").split("/")
@@ -305,298 +171,68 @@ def retrieve_graphiti(mode, task=None, n_results=8, excluded_sources=None):
print(f"[Graphiti retrieval error: {e}] — falling back to empty.")
return []
@lru_cache(maxsize=1)
def _get_embedder():
from sentence_transformers import SentenceTransformer
return SentenceTransformer("all-MiniLM-L6-v2")
def _llm_generate_queries(mode, signal, task=None, n_queries=4):
"""Park et al. 2023 reflection-style query generation. Feeds the LLM the
observation signal + a mode-specific framing; emits N retrieval queries
that probe different corners of the recent corpus instead of the same
hardcoded string every night. Sources cited in dream_observation.py.
Falls back to recent_questions from the signal if the LLM call fails."""
import anthropic
if task:
# Lucid mode: decompose the user's task into sub-queries
prompt = (
f"Decompose this user task into {n_queries} distinct sub-questions, "
f"each suitable as a retrieval query against Aaron's personal corpus.\n\n"
f"TASK: {task}\n\n"
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
)
else:
mode_framings = {
"nrem": (
"NREM is replay-and-consolidation of RECENT traces. Generate queries "
"that probe what Aaron has been working on or capturing in the last "
"few days. Concrete entities — project names, course codes, named "
"subjects. The dreamer is re-touching specific recent material to "
"strengthen schema connections, not finding novel content."
),
"early-rem": (
"Early REM is associative bridging with emotional/personal register. "
"Generate queries that surface unresolved themes, career questions, "
"ongoing personal threads — material that connects intellectual and "
"emotional dimensions. Tone: thoughtful friend, not researcher."
),
"late-rem": (
"Late REM tests novel connections across DISTANT material. Generate "
"queries that pair concrete subjects from DIFFERENT domains of Aaron's "
"work (e.g., one from academic teaching, one from consulting, one from "
"creative practice) to probe for surprising structural similarity. "
"Cross-domain is required."
),
}
framing = mode_framings.get(mode, mode_framings["nrem"])
questions_snippet = "\n".join(
f" - {q[:200]}" for q in signal.get("recent_questions", [])[:8]
) or " (no recent user questions)"
journal_snippet = ", ".join(signal.get("new_journal_entries", [])[:5]) or "(none)"
days_str = (
f"{signal['days_since_dream']:.1f}"
if signal.get("days_since_dream") not in (None, float("inf"))
else "infinite (first dream)"
)
prompt = (
f"You generate retrieval queries for an Active Inference dreamer. The "
f"dreamer surfaces prediction errors — gaps between Aaron's model and "
f"reality — not summaries or generic associations.\n\n"
f"MODE: {mode}\n"
f"FRAMING: {framing}\n\n"
f"OBSERVATION SIGNAL:\n"
f"- Days since last dream: {days_str}\n"
f"- New chunks since last dream: {signal.get('new_chunks', 0)}\n"
f"- New journal entries: {journal_snippet}\n"
f"- Underprocessed chunks pool: {signal.get('underprocessed_count', 0):,}\n\n"
f"RECENT USER QUESTIONS (last 14 days, top 8):\n{questions_snippet}\n\n"
f"Generate {n_queries} retrieval queries. Requirements:\n"
f"- Use concrete entities, named projects, course codes, specific topics "
f"— NOT generic phrasing like 'research work practice'\n"
f"- Each query probes a DIFFERENT corner of recent activity\n"
f"- Match the {mode} framing\n"
f"- 5-15 words each\n\n"
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
)
try:
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
resp = client.messages.create(
model=LLM_QUERY_MODEL,
max_tokens=512,
messages=[{"role": "user", "content": prompt}],
)
text = "".join(b.text for b in resp.content if hasattr(b, "text")).strip()
if text.startswith("```"):
text = text.split("```", 2)[1]
if text.startswith("json"):
text = text[4:]
text = text.strip()
data = json.loads(text)
queries = data.get("queries", [])
if isinstance(queries, list) and queries:
return [str(q).strip() for q in queries[:n_queries] if str(q).strip()]
except Exception as e:
print(f"[dream] LLM query generation failed ({e}); falling back to recent questions")
fallback = signal.get("recent_questions", [])[:n_queries] if signal else []
return fallback or [task or "recent activity decisions thinking"]
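The reply-parsing step inside `_llm_generate_queries` (strip an optional code fence, parse JSON, keep non-empty strings) can be sketched as a standalone function so it can be tested without an API call. `parse_queries` is a hypothetical name, and the sample queries are invented.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker, built here to keep this example readable


def parse_queries(text, n_queries=4):
    """Extract query strings from an LLM reply that should be a JSON object.

    Tolerates a reply wrapped in a code fence, with or without a "json"
    tag. Returns [] when nothing usable is found.
    """
    text = text.strip()
    if text.startswith(FENCE):
        text = text.split(FENCE, 2)[1]
        if text.startswith("json"):
            text = text[4:]
        text = text.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    queries = data.get("queries", []) if isinstance(data, dict) else []
    if not isinstance(queries, list):
        return []
    return [str(q).strip() for q in queries[:n_queries] if str(q).strip()]


# Sample replies; the query text is made up.
fenced = FENCE + 'json\n{"queries": ["grading rubric changes", " ", "foundry casting outline"]}\n' + FENCE
plain = '{"queries": ["a", "b", "c", "d", "e"]}'
from_fenced = parse_queries(fenced)
from_plain = parse_queries(plain)
```

Returning an empty list on failure lets the caller fall through to its fallback queries, as the function above does with `recent_questions`.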
def _mmr_select(candidate_embeddings, query_embedding, n, lambda_=MMR_LAMBDA):
    """Maximal Marginal Relevance: greedy selection that balances relevance
    against pairwise diversity. Carbonell & Goldstein 1998. Used to prevent
    cluster lock-in (e.g., 8 dossier-narrative variants filling all 8 slots).
    candidate_embeddings: (N, D) numpy array
    query_embedding: (D,) numpy array
    Returns: list of indices into candidate_embeddings, len ≤ n."""
    if len(candidate_embeddings) == 0:
        return []
    n = min(n, len(candidate_embeddings))
    cands = candidate_embeddings / (np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9)
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
    relevance = cands @ q
    selected = []
    remaining = list(range(len(cands)))
    while len(selected) < n and remaining:
        if not selected:
            best = max(remaining, key=lambda i: relevance[i])
        else:
            sel = cands[selected]
            scores = {
                i: lambda_ * relevance[i] - (1 - lambda_) * float((cands[i] @ sel.T).max())
                for i in remaining
            }
            best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
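To see what the MMR step buys, it helps to run it on a tiny case with two near-duplicate candidates and one distinct but less relevant one. The sketch below follows the same greedy rule as `_mmr_select` but uses plain Python lists instead of numpy so it runs standalone; that substitution, the name `mmr_select`, and the vectors are illustrative choices.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) + 1e-9
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mmr_select(candidates, query, n, lambda_=0.5):
    """Greedy MMR over cosine similarity.

    Each step picks the candidate maximizing
    lambda_ * sim(candidate, query) - (1 - lambda_) * max sim(candidate, selected).
    Returns a list of indices into candidates, at most n long.
    """
    cands = [_normalize(c) for c in candidates]
    q = _normalize(query)
    relevance = [_dot(c, q) for c in cands]
    selected, remaining = [], list(range(len(cands)))
    while len(selected) < n and remaining:
        def score(i):
            if not selected:
                return relevance[i]
            redundancy = max(_dot(cands[i], cands[j]) for j in selected)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Vectors 0 and 1 are near-duplicates; vector 2 is less relevant but distinct.
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query = [1.0, 0.1]
balanced = mmr_select(candidates, query, n=2, lambda_=0.5)
relevance_only = mmr_select(candidates, query, n=2, lambda_=1.0)
```

With `lambda_=1.0` the two near-duplicates fill both slots; at `0.5` the second slot goes to the distinct vector, which is the cluster-breaking behavior the docstring describes.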
def _bump_consolidation_cursor(chunks):
"""Increment consolidation_count + set last_consolidated_at=NOW() for each
source represented in chunks. Called from dream_pipeline after NREM
completes. Per sharp-wave-ripples biology, NREM does the actual
consolidation; REM is associative use, so we only bump on NREM."""
if not chunks:
return
sources = list({c["source"] for c in chunks if c.get("source")})
if not sources:
return
try:
pg = get_pg()
cur = pg.cursor()
cur.execute(
"UPDATE embeddings "
"SET consolidation_count = consolidation_count + 1, "
" last_consolidated_at = NOW() "
"WHERE source = ANY(%s)",
(sources,),
)
pg.commit()
pg.close()
except Exception as e:
print(f"[dream] cursor bump failed (non-fatal): {e}")
def retrieve(mode, task=None, n_results=8, excluded_sources=None,
type_filter=None, signal=None):
"""Refactored retrieval — see dreamer-design-spec.md Stage 3 + the
external-literature prescription in birdai-dreamer-exclusion-finding-2026-05-02.md.
Changes from the prior hardcoded-query version:
- Queries are LLM-generated from the observation signal (Park et al.
reflection pattern) instead of fixed strings. Solves the "same 8 sources
every night" failure where fixed seeds locked into one neighborhood.
- Per-mode time windows (24-72hr NREM / 30d Early REM / 90d Late REM)
filter candidates before vector search. Spec calls for these to be
mutable; they live in TIME_WINDOWS_HOURS.
- NREM biases toward under-processed chunks (low consolidation_count).
Biologically motivated: sharp-wave ripples tag what to replay, not
uniform sampling.
- Multiple queries (4 by default) → over-fetch → MMR merge for
within-night diversity. Prevents cluster domination.
signal is the observation-signal dict from dream_observation.observe_corpus().
If None, observe_corpus is called inline (back-compat for ad-hoc invocation).
"""
# E3 substrate experiment unchanged
def retrieve(mode, task=None, n_results=8, excluded_sources=None):
# E3 experiment: DREAMER_SUBSTRATE=graphiti routes retrieval to Graphiti /search
# Default behavior: pgvector similarity search (unchanged)
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
if substrate == "graphiti":
return retrieve_graphiti(mode, task=task, n_results=n_results,
excluded_sources=excluded_sources)
return retrieve_graphiti(mode, task=task, n_results=n_results, excluded_sources=excluded_sources)
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
low, high = MODE_RANGES[mode]
if signal is None:
from dream_observation import observe_corpus as _obs
signal = _obs()
if task:
query = task
elif mode == "late-rem":
delta = observe_corpus()
topics = delta.get("recent_topics", [])
query = topics[0] if topics else "practice place memory making"
elif mode == "early-rem":
query = "career decision personal change what matters next"
else:
query = "research fabrication teaching practice recent work"
queries = _llm_generate_queries(mode, signal, task=task, n_queries=4)
if not queries:
print(f"[dream:{mode}] no queries generated; bailing")
return []
print(f"[dream:{mode}] generated queries: {queries}")
embedding = embedder.encode([query]).tolist()[0]
chunks = []
seen_sources = set()
embedder = _get_embedder()
excluded_sources = excluded_sources or set()
window_hours = TIME_WINDOWS_HOURS.get(mode)
per_query_n = 12 # over-fetch for MMR
candidates = []
seen_ids = set()
try:
pg = get_pg()
cur = pg.cursor()
for q in queries:
q_emb = embedder.encode([q]).tolist()[0]
where, params = [], []
excluded_sources = excluded_sources or set()
if excluded_sources:
where.append("source NOT IN %s")
params.append(tuple(excluded_sources))
if type_filter:
where.append("type = ANY(%s)")
params.append(list(type_filter))
if window_hours is not None:
# created_at is TEXT (legacy); cast it. NULL created_at fails
# the comparison so legacy rows are excluded from windowed
# modes — correct: NULL means "indexed before cursor existed,"
# which by definition is older than any window.
where.append(
f"(created_at IS NOT NULL AND "
f"created_at::timestamptz > NOW() - INTERVAL '{int(window_hours)} hours')"
)
where_clause = ("WHERE " + " AND ".join(where)) if where else ""
# NREM bias: order by consolidation_count ASC first (under-processed
# chunks win the tiebreak before vector distance). Other modes:
# vector distance only.
order_clause = (
"ORDER BY consolidation_count ASC, embedding <=> %s::vector"
if mode == "nrem"
else "ORDER BY embedding <=> %s::vector"
)
cur.execute(f"""
SELECT id, document, source, type, embedding,
1 - (embedding <=> %s::vector) as similarity
cur.execute("""
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
FROM embeddings
{where_clause}
{order_clause}
WHERE source NOT IN %s
ORDER BY embedding <=> %s::vector
LIMIT %s
""", [q_emb, *params, q_emb, per_query_n])
for row in cur.fetchall():
if row[0] in seen_ids:
continue
seen_ids.add(row[0])
emb = row[4]
# pgvector returns embeddings as string "[...]" by default
if isinstance(emb, str):
emb = np.array([float(x) for x in emb.strip("[]").split(",")])
""", (embedding, tuple(excluded_sources), embedding, n_results * 3))
else:
emb = np.array(emb)
candidates.append({
"id": row[0],
"content": row[1],
"source": row[2] or "unknown",
"type": row[3],
"embedding": emb,
"similarity": float(row[5]),
})
pg.close()
except Exception as e:
import traceback
print(f"[dream:{mode}] retrieval SQL error: {e}")
traceback.print_exc()
return []
cur.execute("""
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
FROM embeddings
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (embedding, embedding, n_results * 3))
if not candidates:
print(f"[dream:{mode}] zero candidates after filters")
return []
# MMR over the union, using the first query as pivot for the relevance term.
# Averaging query embeddings would be theoretically cleaner but adds
# complexity for marginal benefit at this scale.
pivot_emb = np.array(embedder.encode([queries[0]]).tolist()[0])
cand_embs = np.array([c["embedding"] for c in candidates])
selected_idx = _mmr_select(cand_embs, pivot_emb, n=n_results * 2)
# Post-MMR source-level dedup (multi-chunk same source collapses to one).
chunks = []
seen_sources = set()
for i in selected_idx:
c = candidates[i]
if c["source"] in seen_sources:
for doc, source, similarity in cur.fetchall():
if not (low <= similarity <= high):
continue
if source in seen_sources:
continue
seen_sources.add(c["source"])
chunks.append({
"source": c["source"],
"content": c["content"],
"relevance": c["similarity"],
"similarity": c["similarity"],
"type": c["type"],
"source": source or "unknown",
"content": doc,
"relevance": similarity,
"similarity": similarity,
})
seen_sources.add(source)
if len(chunks) >= n_results:
break
pg.close()
except Exception as e:
print(f"pgvector retrieval error: {e}")
return chunks
@@ -604,39 +240,124 @@ def retrieve(mode, task=None, n_results=8, excluded_sources=None,
def synthesize_nrem(chunks):
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
return _call_claude(NREM_PROMPT_TEMPLATE.format(chunk_text=chunk_text))
prompt = f"""You have read everything Aaron Nelson has written and published.
You are a careful colleague who noticed something this week.
Here is material from his corpus:
{chunk_text}
Write to Aaron directly. Identify one specific connection between
this material and something he wrote or worked on previously.
Stay close to the documents — cite them specifically by name.
Do not speculate beyond what the material supports. Do not use
headers or bullet points. Write one paragraph of 200-300 words
that ends with a single concrete question he could act on."""
return _call_claude(prompt)
def synthesize_early_rem(chunks, nrem_output):
# v1.1 — removed citation instruction, removed close-friend persona,
# shifted register from analysis to recognition.
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
return _call_claude(EARLY_REM_PROMPT_TEMPLATE.format(
nrem_output=nrem_output, chunk_text=chunk_text))
prompt = f"""Something was noticed earlier tonight, moving through Aaron's recent work:
{nrem_output}
That observation is still with you. Now here is material from a different
time — pulled from further back, from different parts of his corpus:
{chunk_text}
You are not analyzing. You are recognizing.
Something in the earlier observation and something in this older material
are the same thing wearing different clothes. Find it. Don't explain why
they're connected — just let the connection speak. Write from inside the
recognition, not from above it.
The emotional register underneath the career logic is more interesting
than the career logic. The pattern that has been repeating longer than
he has been aware of it is more interesting than the current instance.
Write directly to Aaron. No citations, no references, no analysis.
First person, present tense. Let what you noticed arrive rather than
be delivered. 150-250 words. End with one thing that is true that
he probably already knows but hasn't said out loud yet."""
return _call_claude(prompt)
def synthesize_late_rem(chunks, nrem_output, early_rem_output):
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
return _call_claude(LATE_REM_PROMPT_TEMPLATE.format(
nrem_output=nrem_output,
early_rem_output=early_rem_output,
chunk_text=chunk_text))
prompt = f"""You have been moving through Aaron Nelson's corpus all night.
First you found this, in the careful light of early consolidation:
{nrem_output}
Then, in the more personal territory that followed:
{early_rem_output}
Now it is late. The boundaries between things have loosened.
Here is material pulled from opposite ends of his work:
{chunk_text}
Do not explain the connections between all of this.
Do not resolve them. Do not summarize what came before.
Something stranger is possible now — let the accumulated
material from the night find its own shape. Compressed,
associative, slightly off. Let the strangeness stand.
No headers. No bullet points. No hedging. No resolution.
No offer. End mid-thought if that is where the material ends.
150-250 words."""
return _call_claude(prompt)
def synthesize_final(nrem_output, early_rem_output, late_rem_output):
return _call_claude(
SYNTHESIS_PROMPT_TEMPLATE.format(
nrem_output=nrem_output,
early_rem_output=early_rem_output,
late_rem_output=late_rem_output),
max_tokens=800)
prompt = f"""You have spent the night moving through Aaron Nelson's corpus
in three passes, each building on the last.
The first pass — careful, close to the documents:
{nrem_output}
The second pass — more personal, following what the first opened:
{early_rem_output}
The third pass — associative, strange, letting things touch that
don't normally touch:
{late_rem_output}
Now synthesize. Not a summary — a synthesis. Find what runs through
all three that none of them said directly. The thing that only becomes
visible when you hold all three passes together.
Write it as a single unbroken piece. No headers, no bullet points,
no stage labels. 200-300 words. End with the one question that
matters most right now."""
return _call_claude(prompt, max_tokens=800)
def synthesize_lucid(chunks, task):
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
resolved_task = task or LUCID_DEFAULT_TASK
return _call_claude(LUCID_PROMPT_TEMPLATE.format(
task=resolved_task, chunk_text=chunk_text))
prompt = f"""Aaron has a question he is sitting with:
{task or "What should I be thinking about that I am not?"}
You have searched his entire corpus and found material that
speaks to this question from unexpected directions. Here is
what you found:
{chunk_text}
Do not summarize. Do not list. Pick the most interesting
tension between what the corpus contains and what he is
asking, and follow it through to its conclusion. Cite
specific documents by name. Be direct about what you think.
No headers, no bullet points. 250-400 words.
End with an offer to work on it together."""
return _call_claude(prompt)
def _call_claude(prompt, max_tokens=1000):
@@ -715,10 +436,10 @@ def write_manifest(date_str, stage_data, corpus_data):
"prompt_sig": prompt_signature(),
"dreamer_version": DREAMER_VERSION,
"prompt_hash": prompt_hash([
NREM_PROMPT_TEMPLATE,
EARLY_REM_PROMPT_TEMPLATE,
LATE_REM_PROMPT_TEMPLATE,
SYNTHESIS_PROMPT_TEMPLATE,
synthesize_nrem.__doc__ or "",
synthesize_early_rem.__doc__ or "",
synthesize_late_rem.__doc__ or "",
synthesize_final.__doc__ or "",
]),
"stages": stage_data,
"corpus": corpus_data,
@@ -729,71 +450,36 @@ def write_manifest(date_str, stage_data, corpus_data):
auth = (NEXTCLOUD_USER, NEXTCLOUD_PASSWORD)
url = f"{DREAMS_WEBDAV}/dream-manifest-{date_str}.json"
try:
response = requests.put(url, data=content.encode("utf-8"), auth=auth, timeout=30)
response.raise_for_status()
requests.put(url, data=content.encode("utf-8"), auth=auth, timeout=30)
print(f"Manifest written: Journal/Dreams/dream-manifest-{date_str}.json")
except Exception as e:
print(f"Manifest write failed — manifest not persisted: {e}")
print(f"Manifest write failed (non-critical): {e}")
def dream_pipeline(type_filter=None):
def dream_pipeline():
"""
Full nightly pipeline — interdependent stages.
NREM output feeds Early REM. Both feed Late REM. All three feed Synthesis.
Per dreamer-design-spec.md, this now runs Stage 1 (observe) and Stage 2
(select) first. If select_mode returns None — corpus unchanged and no new
journal entry — the dreamer goes quiet rather than manufacturing novelty.
Otherwise NREM/Early-REM/Late-REM run with LLM-generated queries seeded
from the observation signal.
"""
print(f"Dreamer pipeline starting — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
state = load_dreamer_state()
state.pop("retrieved_sources", None) # legacy key; session-scoped novelty now
previously_retrieved = set(state.get("retrieved_sources", []))
session_retrieved = set()
# ── Stage 1 + 2: Observe + Select ──────────────────────────────────────
from dream_observation import observe_corpus as _obs, select_mode as _select
signal = _obs()
print(
f"Signal: new_chunks={signal['new_chunks']}, "
f"new_journal={len(signal['new_journal_entries'])}, "
f"days_since={signal['days_since_dream']:.1f}, "
f"underprocessed={signal['underprocessed_count']:,}"
)
selected = _select(signal)
if selected is None:
print("[select_mode] None — nothing worth dreaming about tonight (going quiet)")
# Update last-dream-attempted-at but not last_dream — caller can distinguish
# an actual dream from a skipped night by looking at last_dream_file or
# checking the manifest dir.
state["last_select_quiet_at"] = datetime.now().isoformat()
save_dreamer_state(state)
return None
print(f"[select_mode] → {selected}")
delta = observe_corpus()
print(f"Corpus: {delta['new_chunks']} new chunks, {delta['days_since_dream']:.1f} days since last dream")
print(f"Excluding {len(previously_retrieved)} previously retrieved sources")
# The pipeline always runs all three modes for the manifest's continuity.
# select_mode's choice signals the *primary* focus; the others still run
# but draw from their own mode-appropriate windows.
primary_mode = selected
# ── Stage 3: NREM ──────────────────────────────────────────────────────
# ── Stage 1: NREM ──────────────────────────────────────────────────────
print("\n[NREM] Retrieving...")
# NREM is replay-and-consolidation — does not exclude prior traces.
# Late REM and Early REM exclude prior content for novelty; NREM does not.
nrem_chunks = retrieve("nrem", excluded_sources=None,
type_filter=type_filter, signal=signal)
nrem_chunks = retrieve("nrem", excluded_sources=previously_retrieved | session_retrieved)
session_retrieved.update(c["source"] for c in nrem_chunks)
# Track sources that scored above Early REM ceiling — these are the only ones Early REM should exclude
nrem_high_sources = {c["source"] for c in nrem_chunks if c["similarity"] > 0.55}
if not nrem_chunks:
print("[NREM] No suitable chunks — aborting pipeline")
return None
# Cursor bump: NREM is the consolidation stage. Each appearance increments
# consolidation_count + updates last_consolidated_at, so the next dream's
# observation sees these sources as less under-processed.
_bump_consolidation_cursor(nrem_chunks)
print(f"[NREM] Retrieved {len(nrem_chunks)} chunks. Synthesizing...")
nrem_output = synthesize_nrem(nrem_chunks)
@@ -804,15 +490,11 @@ def dream_pipeline(type_filter=None):
"nrem": {
"chunks_retrieved": len(nrem_chunks),
"avg_similarity": round(sum(c["relevance"] for c in nrem_chunks) / len(nrem_chunks), 3),
"query": "[llm-generated from observation signal]",
"query": "research fabrication teaching practice recent work",
"word_count": len(nrem_output.split()),
"sources": nrem_sources,
"distinct_folders": nrem_folders,
"folder_count": len(nrem_folders),
# Counter filters None: Graphiti chunks lack `type` (facts, not embeddings rows).
# Pgvector chunks always carry type post-Improvement-#2 backfill. If type
# ever appears as None here, the backfill or writer enforcement has regressed.
"type_distribution": dict(Counter(c.get("type") for c in nrem_chunks if c.get("type"))),
"status": "ok",
}
}
@@ -822,8 +504,7 @@ def dream_pipeline(type_filter=None):
print("\n[Early REM] Retrieving...")
# Early REM excludes previously retrieved + NREM high-scorers only (not full session_retrieved)
# Sources that scored in Early REM band during NREM remain available
early_chunks = retrieve("early-rem", excluded_sources=nrem_high_sources,
type_filter=type_filter, signal=signal)
early_chunks = retrieve("early-rem", excluded_sources=previously_retrieved | nrem_high_sources)
session_retrieved.update(c["source"] for c in early_chunks)
if not early_chunks:
print("[Early REM] No suitable chunks — skipping")
@@ -837,20 +518,18 @@ def dream_pipeline(type_filter=None):
stage_data["early_rem"] = {
"chunks_retrieved": len(early_chunks),
"avg_similarity": round(sum(c["relevance"] for c in early_chunks) / len(early_chunks), 3),
"query": "[llm-generated from observation signal]",
"query": "career decision personal change what matters next",
"word_count": len(early_rem_output.split()),
"sources": early_sources,
"distinct_folders": early_folders,
"folder_count": len(early_folders),
"type_distribution": dict(Counter(c.get("type") for c in early_chunks if c.get("type"))),
"status": "ok",
}
print(f"[Early REM] Done.\n{early_rem_output[:200]}...")
# ── Stage 3: Late REM — informed by NREM + Early REM ──────────────────
print("\n[Late REM] Retrieving...")
late_chunks = retrieve("late-rem", excluded_sources=session_retrieved,
type_filter=type_filter, signal=signal)
late_chunks = retrieve("late-rem", excluded_sources=previously_retrieved | session_retrieved)
session_retrieved.update(c["source"] for c in late_chunks)
if not late_chunks:
print("[Late REM] No suitable chunks — skipping")
@@ -869,13 +548,12 @@ def dream_pipeline(type_filter=None):
stage_data["late_rem"] = {
"chunks_retrieved": len(late_chunks),
"avg_similarity": round(sum(c["relevance"] for c in late_chunks) / len(late_chunks), 3),
"query": "[llm-generated from observation signal]",
"query": "practice place memory making",
"word_count": len(late_rem_output.split()),
"sources": late_sources,
"distinct_folders": list(set(late_folders)),
"folder_count": len(set(late_folders)),
"cross_domain_pairs": cross_domain_pairs,
"type_distribution": dict(Counter(c.get("type") for c in late_chunks if c.get("type"))),
"status": "ok",
}
print(f"[Late REM] Done.\n{late_rem_output[:200]}...")
@@ -897,20 +575,8 @@ def dream_pipeline(type_filter=None):
# Write manifest
all_session_sources = list(session_retrieved)
all_session_folders = list({extract_folder(s) for s in all_session_sources})
total_chunks = 0
pg = None
try:
pg = get_pg()
cur = pg.cursor()
cur.execute("SELECT COUNT(*) FROM embeddings")
total_chunks = cur.fetchone()[0]
except Exception as e:
print(f"total_chunks query failed (non-critical): {e}")
finally:
if pg is not None:
pg.close()
corpus_data = {
"total_chunks": total_chunks,
"total_chunks": delta.get("new_chunks", 0),
"new_chunks_since_last_dream": delta.get("new_chunks", 0),
"days_since_last_dream": round(delta.get("days_since_dream", 0), 2),
"substrate": "pgvector",
@@ -922,11 +588,18 @@ def dream_pipeline(type_filter=None):
}
write_manifest(datetime.now().strftime("%Y-%m-%d"), stage_data, corpus_data)
# Update state and notify (reuse state from start of pipeline; legacy key already popped)
# Update state and notify
state = load_dreamer_state()
state["last_dream_timestamp"] = datetime.now().timestamp()
state["last_dream_mode"] = "pipeline"
state["last_dream_file"] = synthesis_file
# Accumulate retrieved sources across nights. Cap at 500, trim to 400 on overflow.
all_retrieved = list(previously_retrieved | session_retrieved)
if len(all_retrieved) > 500:
all_retrieved = all_retrieved[-400:]
state["retrieved_sources"] = all_retrieved
save_dreamer_state(state)
notify_sse("synthesis", synthesis_file.split("/")[-1])
@@ -934,10 +607,10 @@ def dream_pipeline(type_filter=None):
return synthesis_file
def dream_lucid(task, type_filter=None):
def dream_lucid(task):
"""On-demand lucid dream — single mode, used by Dream Now in settings."""
print(f"Lucid dream starting — task: {task[:80] if task else 'none'}")
chunks = retrieve("lucid", task=task, type_filter=type_filter)
chunks = retrieve("lucid", task=task)
if not chunks:
print("No suitable chunks — aborting")
return None
@@ -959,13 +632,13 @@ def dream_lucid(task, type_filter=None):
return filepath
def dream_single(mode, task=None, type_filter=None):
def dream_single(mode, task=None):
"""
Single mode — used by Dream Now for non-lucid modes.
Runs one stage independently (for testing/tuning individual stages).
"""
print(f"Single mode dream: {mode}")
chunks = retrieve(mode, task=task, type_filter=type_filter)
chunks = retrieve(mode, task=task)
if not chunks:
print("No suitable chunks — aborting")
return None
@@ -1002,19 +675,12 @@ if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Aaron AI Dreamer")
parser.add_argument("--mode", choices=["nrem", "early-rem", "late-rem", "lucid", "pipeline"])
parser.add_argument("--task", type=str)
parser.add_argument(
"--type-filter", type=str, default=None,
help="Comma-separated embeddings.type allowlist (e.g. 'document,aaronai_conversation'). "
"Applies to pgvector retrieval only; Graphiti chunks are not filtered. "
"Experimental — default is no filter, no behavior change.",
)
args = parser.parse_args()
type_filter = [t.strip() for t in args.type_filter.split(",")] if args.type_filter else None
if args.mode == "lucid":
dream_lucid(args.task or "What should I be thinking about that I am not?", type_filter=type_filter)
dream_lucid(args.task or "What should I be thinking about that I am not?")
elif args.mode and args.mode != "pipeline":
dream_single(args.mode, args.task, type_filter=type_filter)
dream_single(args.mode, args.task)
else:
# Default: full pipeline
dream_pipeline(type_filter=type_filter)
dream_pipeline()
-235
@@ -1,235 +0,0 @@
"""
Dreamer Stages 1 + 2 — Observe and Select.
Implements `dreamer-design-spec.md`'s Stage 1 (observe_corpus) and Stage 2
(select_mode). These have been latent in dream.py — observe_corpus existed
in skeletal form but its output was largely unused; select_mode did not
exist at all. The dreamer always ran all stages with hardcoded queries.
Per spec (lines 27-34 of dreamer-design-spec.md):
delta = observe_corpus()
selected_mode = select_mode(delta, task, project)
if selected_mode is None:
return # nothing worth dreaming
The "returns None — dreamer goes quiet rather than manufacturing novelty"
semantics (spec line 67) is the canonical answer to the repetition problem
documented in birdai-dreamer-exclusion-finding-2026-05-02.md.
Grounded in:
- Active Inference (Friston 2010, 2017) — observe error, choose action that
minimizes free energy. The dreamer is a prediction-error machine; observe
what's diverged from the model, dream about that.
- Sleep stages (Stickgold 2005; Walker 2017; Diekelmann & Born 2010) — NREM
for replay of new traces, REM for associative cross-cluster integration.
- Sharp-wave ripples (Buzsáki, Wilson) — biology tags WHAT to replay
(under-processed chunks); not uniform. Implemented via the consolidation
cursor on the embeddings table.
"""
import json
import os
import sqlite3
from datetime import datetime, timedelta
from pathlib import Path
from dotenv import load_dotenv
import psycopg2
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
# ─── Paths ──────────────────────────────────────────────────────────────────
PG_DSN = os.getenv("PG_DSN")
CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
WATCHER_STATE = str(Path.home() / "aaronai" / "watcher_state.json")
DREAMER_STATE = str(Path.home() / "aaronai" / "dreamer_state.json")
JOURNAL_DAILY = "/home/aaron/nextcloud/data/data/aaron/files/Journal/Daily"
# ─── Thresholds ─────────────────────────────────────────────────────────────
# Per spec, these become settings-panel controls eventually. For now they're
# constants here; moving them to a config module is task #48.
NEW_CHUNK_THRESHOLD = 5 # below this, NREM not warranted on novelty alone
STALENESS_TRIGGER_DAYS = 3 # corpus quiet ≥3 days → Late REM ("shake things loose")
QUESTION_LOOKBACK_DAYS = 14 # spec line 61: "the last 14 days"
UNDERPROCESSED_PERCENTILE = 0.25 # bottom quartile of consolidation_count
# ─── Helpers ────────────────────────────────────────────────────────────────
def _get_pg():
return psycopg2.connect(PG_DSN)
def _load_json(path, default):
try:
return json.loads(Path(path).read_text())
except Exception:
return default
def _recent_user_questions(days=QUESTION_LOOKBACK_DAYS, limit=20):
"""Pull recent user-turn content from conversations.db. The spec calls
these 'live questions' — what Aaron has been asking about. They become
seed material for the REM modes."""
try:
conn = sqlite3.connect(CONVERSATIONS_DB)
cutoff = (datetime.now() - timedelta(days=days)).isoformat()
cur = conn.cursor()
cur.execute(
"""
SELECT m.content FROM messages m
JOIN conversations c ON m.conversation_id = c.id
WHERE m.role = 'user' AND c.updated_at > ?
ORDER BY m.timestamp DESC LIMIT ?
""",
(cutoff, limit),
)
rows = cur.fetchall()
conn.close()
return [r[0][:280] for r in rows]
except Exception:
return []
def _new_journal_entries(since_ts):
"""Files in Journal/Daily/ created or modified since the last dream.
Journal entries with emotional/personal register route to Early REM per
the spec (line 71)."""
journal_path = Path(JOURNAL_DAILY)
if not journal_path.exists():
return []
new = []
for p in journal_path.rglob("*.md"):
try:
if p.stat().st_mtime > since_ts:
new.append(str(p.relative_to(journal_path)))
except OSError:
continue
return new
def _new_chunks_count(since_ts):
"""Files in the watcher state with mtime > last_dream. The spec calls
this 'what changed' (line 58). Used as the NREM novelty signal."""
state = _load_json(WATCHER_STATE, {})
count = 0
for _path, mtime in state.items():
try:
if float(mtime) > since_ts:
count += 1
except (ValueError, TypeError):
continue
return count
def _underprocessed_chunk_count():
"""Chunks below the underprocessed percentile by consolidation_count.
Biologically motivated: sharp-wave ripples bias replay toward novel /
under-encoded experience, not uniform sampling. We give NREM a pool of
'least-replayed' chunks to draw from in Stage 3."""
try:
pg = _get_pg()
cur = pg.cursor()
cur.execute(
"""
WITH t AS (
SELECT percentile_cont(%s) WITHIN GROUP (ORDER BY consolidation_count)
AS threshold
FROM embeddings
)
SELECT COUNT(*) FROM embeddings, t
WHERE consolidation_count <= t.threshold
""",
(UNDERPROCESSED_PERCENTILE,),
)
result = cur.fetchone()[0]
pg.close()
return int(result or 0)
except Exception:
return 0
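The threshold query above can be mirrored in plain Python to see what it counts. This is a sketch under the assumption that `percentile_cont` interpolates linearly between adjacent sorted values, as PostgreSQL documents. The function names and the sample counts are hypothetical.

```python
def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbours."""
    ordered = sorted(values)
    if not ordered:
        return None  # SQL would yield NULL for an empty table
    rank = fraction * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def underprocessed_count(counts, fraction=0.25):
    """Count rows whose consolidation_count is at or below the threshold."""
    threshold = percentile_cont(counts, fraction)
    if threshold is None:
        return 0
    return sum(1 for c in counts if c <= threshold)


# Made-up consolidation counts for five chunks.
counts = [0, 0, 1, 2, 5]
threshold = percentile_cont(counts, 0.25)
pool_size = underprocessed_count(counts)
# When every chunk ties at the threshold, the whole table qualifies.
all_tied = underprocessed_count([0, 0, 0, 0])
```

Because the comparison is `<=`, ties at the threshold are all included, so the pool can be larger than a quarter of the table when many chunks share the same low count.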
# ─── Stage 1: observe_corpus ────────────────────────────────────────────────
def observe_corpus():
"""Build the signal vector consumed by select_mode and (downstream) by
retrieve. Concrete observations only — no interpretation. Each key is
a direct measurement from the corpus, watcher, journal, or conversation
log.
Returns a dict with:
now_ts -- current Unix timestamp
last_dream_ts -- last completed dream timestamp (0 if never)
days_since_dream -- float; inf if never dreamed
new_chunks -- count of files newer than last_dream
new_journal_entries -- list of Journal/Daily/*.md filenames since last_dream
recent_questions -- user-turn content from last 14 days
underprocessed_count -- chunks in the bottom 25% by consolidation_count
"""
state = _load_json(DREAMER_STATE, {})
last_dream_ts = float(state.get("last_dream_timestamp", 0) or 0)
now_ts = datetime.now().timestamp()
return {
"now_ts": now_ts,
"last_dream_ts": last_dream_ts,
"days_since_dream": (now_ts - last_dream_ts) / 86400 if last_dream_ts else float("inf"),
"new_chunks": _new_chunks_count(last_dream_ts),
"new_journal_entries": _new_journal_entries(last_dream_ts),
"recent_questions": _recent_user_questions(),
"underprocessed_count": _underprocessed_chunk_count(),
}
# ─── Stage 2: select_mode ───────────────────────────────────────────────────
def select_mode(signal, task=None, explicit_mode=None):
"""Return one of {'nrem', 'early-rem', 'late-rem', 'lucid'}. Never None.
The dreamer fires every scheduled night. The earlier "go quiet on null
delta" rule was a synthesis-doc invention that didn't match the actual
desired UX — the original dreamer always dreamed, even if it repeated
itself. The cure for repetition lives in the retrieve layer
(LLM-generated queries from the observation signal, MMR diversity,
cursor bias toward under-processed chunks), not in skipping nights.
Routing logic:
- explicit_mode argument wins
- task supplied → 'lucid' (question-anchored)
- days_since_dream ≥ STALENESS_TRIGGER_DAYS → 'late-rem' (shake loose
via cross-domain pairs when nothing's been added in a while)
- new journal entry → 'early-rem' (emotional/personal register)
- default → 'nrem' (replay-and-consolidation; always has something to
do because the corpus always has under-processed chunks)
"""
if explicit_mode:
return explicit_mode
if task:
return "lucid"
days_since = signal["days_since_dream"]
new_journal = signal["new_journal_entries"]
if days_since >= STALENESS_TRIGGER_DAYS:
return "late-rem"
if new_journal:
return "early-rem"
return "nrem"
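The precedence in the routing logic can be checked against synthetic signals. The block below restates the function in condensed form so that it runs on its own; the signal dicts and the journal filename are made-up inputs, not real data.

```python
STALENESS_TRIGGER_DAYS = 3  # same value as the constant in this module


def select_mode(signal, task=None, explicit_mode=None):
    """Condensed restatement of the routing above, for illustration only."""
    if explicit_mode:
        return explicit_mode
    if task:
        return "lucid"
    if signal["days_since_dream"] >= STALENESS_TRIGGER_DAYS:
        return "late-rem"
    if signal["new_journal_entries"]:
        return "early-rem"
    return "nrem"


# Hypothetical signals covering each branch.
fresh = {"days_since_dream": 1.0, "new_journal_entries": []}
journal = {"days_since_dream": 1.0, "new_journal_entries": ["2026-05-01.md"]}
stale = {"days_since_dream": 4.5, "new_journal_entries": ["2026-05-01.md"]}
never = {"days_since_dream": float("inf"), "new_journal_entries": []}

routes = {
    "fresh": select_mode(fresh),
    "journal": select_mode(journal),
    "stale": select_mode(stale),
    "never": select_mode(never),
    "task": select_mode(stale, task="What am I missing?"),
    "explicit": select_mode(stale, task="ignored", explicit_mode="nrem"),
}
```

Two consequences follow from the ordering: staleness outranks a new journal entry, and a first-ever run (where `observe_corpus` reports `days_since_dream` as infinity) routes to `late-rem`.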
# ─── CLI for manual inspection ──────────────────────────────────────────────
if __name__ == "__main__":
signal = observe_corpus()
short = {k: v for k, v in signal.items() if k != "recent_questions"}
print("Signal (excluding recent_questions):")
print(json.dumps(short, indent=2, default=str))
print(f"\nRecent user questions ({len(signal['recent_questions'])}):")
for q in signal["recent_questions"][:5]:
print(f" - {q[:140]}")
mode = select_mode(signal)
print(f"\nselect_mode() → {mode!r}")
-331
@@ -1,331 +0,0 @@
"""
Aaron AI Stage 1 encoding helpers — single canonical implementation of:
- extract_blocks(filepath) — section-aware extraction (docx heading-bounded
sections, pptx per-slide, pdf/txt/md single-block)
- extract_text(filepath) — back-compat string concatenation over blocks
- chunk_text(text, chunk_size, overlap) — word-based blind chunking
- chunk_and_embed(text_or_blocks, source, embedder, filepath, folder) —
produce ready-to-write rows. Accepts str (blind) or list[dict] (section-aware).
- write_embeddings_batch(conn, batch) — server-side NOW() canonical INSERT
Used by watcher.py, ingest.py, corpus_integrity.py, and api.py /api/corpus/retry.
"""
import hashlib
import json
import logging
import re
from pathlib import Path
from docx import Document as DocxDocument
from pypdf import PdfReader
from pptx import Presentation
log = logging.getLogger("encoding")
SUPPORTED = {".docx", ".pdf", ".pptx", ".txt", ".md"}
DEFAULT_CHUNK_SIZE = 500
DEFAULT_CHUNK_OVERLAP = 50
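The body of `chunk_text` is not visible in this part of the diff, so the following is only a guess at what "word-based blind chunking" with these defaults might look like: a sliding window over whitespace-split words. The name `chunk_words` and the early-exit rule are assumptions, not the module's actual implementation.

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into word windows of chunk_size, each sharing `overlap` words
    with its predecessor. Hypothetical sketch, not the module's chunk_text."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window reaches the end, so no trailing chunk is a pure
        # subset of the previous one.
        if start + chunk_size >= len(words):
            break
    return chunks


# Twelve made-up tokens, small window sizes so the overlap is easy to see.
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
```

With the module defaults each window would advance by 450 words, so consecutive chunks share their boundary 50 words.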
_BOLD_KV_RE = re.compile(r"^\*\*[\w +/-]+?:\*\*")
def _strip_md_frontmatter(text: str) -> str:
"""Strip a leading frontmatter block from markdown, if present.
Recognizes two formats:
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
Only triggered when no heading precedes — guards against `---`
horizontal rules that follow an H1.
- Capture-style: optional H1 heading, then one or more `**key:** value`
lines (and blanks), terminated by `---`. The H1 is preserved; the
key/value block + separator are removed.
Body `---` rules and body `**bold:**` lines are never touched — the scan
aborts as soon as a non-frontmatter line appears in the leading block.
"""
lines = text.splitlines()
n = len(lines)
i = 0
while i < n and not lines[i].strip():
i += 1
heading = None
if i < n and lines[i].startswith("# "):
heading = lines[i]
i += 1
while i < n and not lines[i].strip():
i += 1
if i >= n:
return text
first = lines[i].strip()
if heading is None and first == "---":
j = i + 1
while j < n and lines[j].strip() != "---":
j += 1
if j >= n:
return text
body_start = j + 1
elif _BOLD_KV_RE.match(first):
j = i
while j < n:
s = lines[j].strip()
if not s or _BOLD_KV_RE.match(s):
j += 1
continue
if s == "---":
body_start = j + 1
break
return text
else:
return text
else:
return text
body = "\n".join(lines[body_start:]).lstrip("\n")
return f"{heading}\n\n{body}" if heading else body
def _docx_cell_paragraphs(cell):
yield from (p for p in cell.paragraphs if p.text.strip())
for nested in cell.tables:
for row in nested.rows:
for c in row.cells:
yield from _docx_cell_paragraphs(c)
def _pptx_shape_text(shape):
from pptx.enum.shapes import MSO_SHAPE_TYPE
parts = []
if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
for sub in shape.shapes:
parts.extend(_pptx_shape_text(sub))
return parts
if hasattr(shape, "text") and shape.text.strip():
parts.append(shape.text)
if getattr(shape, "has_table", False):
for cell in shape.table.iter_cells():
if cell.text.strip():
parts.append(cell.text)
return parts
def _extract_docx_blocks(filepath: Path) -> list[dict]:
"""Return docx content as a single block. Earlier attempt at section-aware
chunking via Heading styles was rolled back: the user's docs are mostly
Normal-styled with bold-as-heading, and tying chunk boundaries to formatting
choices locks future-them into preserving those choices forever. Lexical
+ cross-encoder retrieval already finds the right substrings within a
blind-chunked CV, so the section structure isn't load-bearing for retrieval."""
from docx.oxml.ns import qn
doc = DocxDocument(filepath)
parts = [p.text for p in doc.paragraphs if p.text.strip()]
for tbl in doc.tables:
for row in tbl.rows:
for cell in row.cells:
parts.extend(p.text for p in _docx_cell_paragraphs(cell))
for section in doc.sections:
parts.extend(p.text for p in section.header.paragraphs if p.text.strip())
parts.extend(p.text for p in section.footer.paragraphs if p.text.strip())
for txbx in doc.element.body.findall(".//" + qn("w:txbxContent")):
for p in txbx.findall(".//" + qn("w:p")):
text = "".join(t.text or "" for t in p.findall(".//" + qn("w:t")))
if text.strip():
parts.append(text)
text = "\n".join(parts)
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
def _extract_pptx_blocks(filepath: Path) -> list[dict]:
"""One block per slide. Heading = slide title (or 'Slide N' fallback).
Body = non-title shape text + speaker notes."""
prs = Presentation(filepath)
blocks = []
for i, slide in enumerate(prs.slides, 1):
title_shape = None
try:
title_shape = slide.shapes.title
except (AttributeError, KeyError):
pass
title = None
body_parts = []
for shape in slide.shapes:
if title_shape is not None and shape == title_shape and shape.has_text_frame:
title = shape.text_frame.text.strip() or None
continue
body_parts.extend(_pptx_shape_text(shape))
if slide.has_notes_slide:
notes = slide.notes_slide.notes_text_frame.text
if notes.strip():
body_parts.append(f"[Notes] {notes}")
if title or body_parts:
blocks.append({
"heading": title or f"Slide {i}",
"text": "\n".join(body_parts),
"kind": "slide",
})
return blocks
def extract_blocks(filepath: Path) -> list[dict]:
"""Structured extraction. Returns list of {heading, text, kind} blocks.
- docx: single block, no heading (kind='doc'). Heading-based sectioning
was rolled back; see _extract_docx_blocks.
- pptx: one block per slide (kind='slide').
- pdf/txt/md: single block, no heading (kind='doc').
Empty list on any failure or unsupported extension."""
suffix = filepath.suffix.lower()
try:
if suffix == ".docx":
return _extract_docx_blocks(filepath)
if suffix == ".pptx":
return _extract_pptx_blocks(filepath)
if suffix == ".pdf":
reader = PdfReader(filepath)
page_texts = (page.extract_text() for page in reader.pages)
text = "".join(t + "\n" for t in page_texts if t)
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
if suffix in {".txt", ".md"}:
text = filepath.read_text(encoding="utf-8", errors="ignore")
if suffix == ".md":
text = _strip_md_frontmatter(text)
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
except Exception as e:
log.warning(f"Extraction failed for {filepath.name}: {e}")
return []
def extract_text(filepath: Path) -> str:
"""Back-compat wrapper: concatenate extract_blocks() output. Section
structure is lost; use extract_blocks() directly for chunking."""
blocks = extract_blocks(filepath)
parts = []
for b in blocks:
if b.get("heading"):
parts.append(b["heading"])
if b.get("text"):
parts.append(b["text"])
return "\n".join(parts)
def chunk_text(text: str,
chunk_size: int = DEFAULT_CHUNK_SIZE,
overlap: int = DEFAULT_CHUNK_OVERLAP) -> list[str]:
"""Word-based chunking. Empty chunks filtered."""
words = text.split()
chunks = []
start = 0
while start < len(words):
chunk = " ".join(words[start:start + chunk_size])
if chunk.strip():
chunks.append(chunk)
start += chunk_size - overlap
return chunks
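# Illustration (not part of the module): a small sketch of the same
# word-window chunking, to make the stride (chunk_size - overlap) visible.
# `chunk_words` is a hypothetical name, and the ValueError guard is an
# addition for this sketch only; the function above has no such check, so an
# overlap >= chunk_size would never advance `start`.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Sliding word windows; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        # Guard added for the sketch: a non-positive stride would loop forever.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    stride = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), stride)
    ]


demo = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
# demo == ["a b c d", "d e f g", "g h i j", "j"]
```

# Note the last window: the tail "j" is already contained in the previous
# chunk, so the final chunk of a document can be a short, redundant fragment.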
def _chunk_id(filepath, source: str, index: int) -> str:
basis = str(filepath) if filepath else source
return f"{hashlib.md5(basis.encode()).hexdigest()[:8]}_{index}"
def chunk_and_embed(text_or_blocks,
source: str,
embedder,
filepath=None,
folder=None) -> list[dict]:
"""Chunk + embed for write_embeddings_batch. Accepts either:
- str: blind chunking with 500-word windows (legacy path).
- list[dict]: block-aware path (pptx slides; docx/pdf/txt/md arrive as a
single kind='doc' block). Each block emits one chunk if heading + text fit
within DEFAULT_CHUNK_SIZE words, otherwise is blind-split with overlap.
The block heading is prepended to the chunk text (so retrieval sees the
section context) and stored in metadata as heading/kind."""
if isinstance(text_or_blocks, str):
blocks = [{"heading": None, "text": text_or_blocks, "kind": "doc"}]
else:
blocks = text_or_blocks
chunks = []
for block in blocks:
body = block.get("text") or ""
heading = block.get("heading")
kind = block.get("kind", "doc")
if not body.strip() and not (heading and heading.strip()):
continue
if heading and body.strip():
contextualized = f"{heading}\n\n{body}"
elif heading:
contextualized = heading
else:
contextualized = body
if len(contextualized.split()) <= DEFAULT_CHUNK_SIZE:
chunks.append((contextualized, heading, kind))
else:
for sub in chunk_text(contextualized):
chunks.append((sub, heading, kind))
if not chunks:
return []
embeddings = embedder.encode([c[0] for c in chunks]).tolist()
rows = []
for i, ((chunk, heading, kind), emb) in enumerate(zip(chunks, embeddings)):
rows.append({
"id": _chunk_id(filepath, source, i),
"document": chunk,
"embedding": emb,
"source": source,
"type": "document",
"metadata": {
"source": source,
"filepath": str(filepath) if filepath else source,
"folder": folder,
"heading": heading,
"kind": kind,
},
})
return rows
def write_embeddings_batch(conn, batch: list[dict], commit: bool = True) -> int:
"""Single canonical INSERT. Sets created_at = NOW() server-side.
Every row dict must supply 'type'. created_at is SQL-supplied (NOW()), so
callers do not need to provide it. The application-layer assertion is the
primary enforcement point for type — the column lacks NOT NULL because
historical NULLs were resolved by the Improvement #2 backfill, and a
Python-level raise gives a faster, more debuggable failure than a
Postgres constraint error.
When commit=True (default), this function commits the connection itself.
When commit=False, the caller is responsible for committing. Use
commit=False when composing this write with other writes that must land
atomically in the same transaction.
"""
if not batch:
return 0
cur = conn.cursor()
for row in batch:
if not row.get("type"):
raise ValueError(
f"row {row.get('id')!r} missing 'type'; writers must supply it "
f"(see Improvement #2 in docs/birdai-component-inventory)"
)
cur.execute("""
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
ON CONFLICT (id) DO UPDATE SET
document = EXCLUDED.document,
embedding = EXCLUDED.embedding,
source = EXCLUDED.source,
type = EXCLUDED.type,
created_at = COALESCE(embeddings.created_at, EXCLUDED.created_at),
metadata = EXCLUDED.metadata
""", (row["id"], row["document"], row["embedding"],
row["source"], row["type"], json.dumps(row["metadata"])))
if commit:
conn.commit()
return len(batch)
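# Illustration (not part of the module): the upsert keeps the first
# created_at and overwrites everything else. The sketch below reproduces that
# behaviour with the standard-library sqlite3 module standing in for
# Postgres. Assumptions: SQLite 3.24+ (for ON CONFLICT ... DO UPDATE), no
# vector column, `?` placeholders instead of `%s`, and the timestamp passed
# in by the caller instead of NOW(). `upsert` is a hypothetical helper.

```python
import json
import sqlite3

UPSERT = """
INSERT INTO embeddings (id, document, type, created_at, metadata)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
    document = excluded.document,
    type = excluded.type,
    created_at = COALESCE(embeddings.created_at, excluded.created_at),
    metadata = excluded.metadata
"""


def upsert(conn, row: dict, now: str) -> None:
    # Same application-level check as above: fail fast on a missing type.
    if not row.get("type"):
        raise ValueError(f"row {row.get('id')!r} missing 'type'")
    conn.execute(
        UPSERT,
        (row["id"], row["document"], row["type"], now, json.dumps(row["metadata"])),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (id TEXT PRIMARY KEY, document TEXT, "
    "type TEXT, created_at TEXT, metadata TEXT)"
)
first = {"id": "ab12cd34_0", "document": "v1", "type": "document", "metadata": {}}
second = {"id": "ab12cd34_0", "document": "v2", "type": "document", "metadata": {}}
upsert(conn, first, now="2026-05-01T00:00:00Z")
upsert(conn, second, now="2026-05-03T00:00:00Z")
conn.commit()
stored = conn.execute("SELECT document, created_at FROM embeddings").fetchone()
# stored == ("v2", "2026-05-01T00:00:00Z"): content replaced, timestamp kept
```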
@@ -1,304 +0,0 @@
"""Backfill embeddings.type and embeddings.created_at (Improvement #2 / A.3).
Idempotent on cohort predicates (every WHERE clause includes IS NULL on the
target column). Writes provenance to metadata.type_source and metadata.created_at_source
so each row is auditable and revertable per-source. Dry-run is the default; pass --apply to commit.
Order of batches:
T1. type backfill: WHERE type IS NULL -> 'document' (extension-classified, all hit).
C1. created_at: WHERE ca IS NULL AND metadata.filepath stat-resolves -> filesystem mtime.
C2. created_at: WHERE ca IS NULL AND source has unique watcher_state path -> watcher mtime.
C3. created_at: WHERE ca IS NULL AND source has watcher_state collision -> most-recent mtime.
C4. created_at: WHERE type='chatgpt_conversation' AND ca IS NULL -> export-resolved create_time.
C5. created_at: WHERE ca IS NULL (residual) -> sentinel.
Snapshot table embeddings_backup_2026_05_03 must exist before --apply.
Usage:
venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py # dry-run
venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py --apply # write
Exits non-zero if snapshot is missing on --apply.
"""
import argparse
import json
import os
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path
import psycopg2
from psycopg2.extras import RealDictCursor, Json
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env")
PG_DSN = os.getenv("PG_DSN")
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
SNAPSHOT_TABLE = "embeddings_backup_2026_05_03"
SENTINEL_ISO = "2026-04-26T00:00:00Z"
# ─── Helpers ────────────────────────────────────────────────────────────────
def get_pg():
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
def header(t):
bar = "=" * 70
print(f"\n{bar}\n{t}\n{bar}")
def fmt_ts_unix(ts):
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
def fmt_ts_mtime(p):
try:
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
def load_watcher_state():
state = json.loads(WATCHER_STATE.read_text())
by_name = defaultdict(list)
for path, mtime in state.items():
by_name[Path(path).name].append((path, mtime))
return by_name
def load_chatgpt_index():
if not CHATGPT_EXPORT_DIR.exists():
return {}
index = {}
for f in sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json")):
try:
data = json.loads(f.read_text(encoding="utf-8"))
except Exception:
continue
for convo in data:
cid = convo.get("id") or convo.get("conversation_id")
ct = convo.get("create_time")
if cid and ct is not None:
index[cid] = ct
return index
def assert_snapshot(cur):
cur.execute("SELECT to_regclass(%s) AS t;", (SNAPSHOT_TABLE,))
if cur.fetchone()["t"] is None:
print(f"ERROR: snapshot table '{SNAPSHOT_TABLE}' not found. Run A.2 first.")
sys.exit(2)
cur.execute(f"SELECT COUNT(*) AS n FROM {SNAPSHOT_TABLE};")
snap = cur.fetchone()["n"]
cur.execute("SELECT COUNT(*) AS n FROM embeddings;")
live = cur.fetchone()["n"]
print(f"snapshot {SNAPSHOT_TABLE}: {snap} rows; live embeddings: {live} rows")
if snap != live:
print(f"ERROR: snapshot row count != live ({snap} vs {live}). Refresh snapshot before --apply.")
sys.exit(2)
# ─── Batch primitive ────────────────────────────────────────────────────────
def run_batch(cur, label, candidates, apply_mode):
"""candidates: list of (id, set_type, set_ca, type_source, ca_source).
set_type / set_ca may be None to leave that column alone.
In dry-run we still execute UPDATEs inside an outer transaction (rolled back
at the end) so subsequent batches' SELECTs see the correct intermediate state."""
n = len(candidates)
print(f" {label}: {n} rows queued")
if n == 0:
return 0
for c in candidates[:3]:
print(f" sample: id={c[0]} type={c[1]!r} ca={c[2]!r} type_src={c[3]} ca_src={c[4]}")
n_written = 0
for row_id, set_type, set_ca, type_src, ca_src in candidates:
meta_patch = {}
if type_src:
meta_patch["type_source"] = type_src
if ca_src:
meta_patch["created_at_source"] = ca_src
# Build set list dynamically.
sets, params = [], []
if set_type is not None:
sets.append("type = %s")
params.append(set_type)
if set_ca is not None:
sets.append("created_at = %s")
params.append(set_ca)
if meta_patch:
sets.append("metadata = COALESCE(metadata, '{}'::jsonb) || %s::jsonb")
params.append(json.dumps(meta_patch))
params.append(row_id)
cur.execute(f"UPDATE embeddings SET {', '.join(sets)} WHERE id = %s;", params)
n_written += cur.rowcount
print(f" {n_written} rows updated{' (will rollback)' if not apply_mode else ''}")
return n_written
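# Illustration (not part of the script): the dynamic SET list in run_batch
# can be read as a pure function from one candidate to (sql, params). The
# sketch below is hypothetical (`build_update` is not in the script) and adds
# one assumption: it returns None when no column would change, where the
# loop above would otherwise emit an empty SET clause.

```python
import json


def build_update(row_id, set_type=None, set_ca=None, type_src=None, ca_src=None):
    """Return (sql, params) for one row, or None when nothing would change."""
    meta_patch = {}
    if type_src:
        meta_patch["type_source"] = type_src
    if ca_src:
        meta_patch["created_at_source"] = ca_src

    sets, params = [], []
    if set_type is not None:
        sets.append("type = %s")
        params.append(set_type)
    if set_ca is not None:
        sets.append("created_at = %s")
        params.append(set_ca)
    if meta_patch:
        # `||` merges the provenance patch into any existing JSONB metadata.
        sets.append("metadata = COALESCE(metadata, '{}'::jsonb) || %s::jsonb")
        params.append(json.dumps(meta_patch))
    if not sets:
        return None
    params.append(row_id)
    return f"UPDATE embeddings SET {', '.join(sets)} WHERE id = %s;", params


sql, params = build_update(
    "ab12cd34_0", set_ca="2026-04-26T00:00:00Z", ca_src="sentinel"
)
```

# Only the columns that were requested appear in the statement; the row id is
# always the last parameter.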
# ─── Batches ────────────────────────────────────────────────────────────────
def batch_T1_type(cur, apply_mode):
"""type IS NULL -> 'document'. All cohort A rows have a SUPPORTED extension."""
cur.execute("""
SELECT id, source FROM embeddings WHERE type IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands = [(r["id"], "document", None, "inferred_extension", None) for r in rows]
return run_batch(cur, "T1 type IS NULL -> 'document'", cands, apply_mode)
def batch_C1_filepath_stat(cur, apply_mode):
"""ca IS NULL AND metadata.filepath stat-resolves -> mtime."""
cur.execute("""
SELECT id, source, metadata->>'filepath' AS fp
FROM embeddings
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL
ORDER BY id;
""")
rows = cur.fetchall()
cands, n_skipped_missing = [], 0
for r in rows:
p = Path(r["fp"])
if p.exists():
mt = fmt_ts_mtime(p)
if mt:
cands.append((r["id"], None, mt, None, "filepath_stat"))
continue
n_skipped_missing += 1
print(f" C1 candidates: {len(cands)} (skipped {n_skipped_missing} where filepath gone or unstattable)")
return run_batch(cur, "C1 ca IS NULL AND filepath stat-resolves -> mtime", cands, apply_mode)
def batch_C2_C3_watcher_state(cur, apply_mode):
"""ca IS NULL AND filepath unresolvable -> watcher_state by source basename.
C2 = unique hit, C3 = collision pick-latest."""
by_name = load_watcher_state()
cur.execute("""
SELECT id, source, metadata->>'filepath' AS fp
FROM embeddings
WHERE created_at IS NULL
ORDER BY id;
""")
rows = cur.fetchall()
c2, c3 = [], []
skipped_no_match = 0
for r in rows:
# skip rows already targeted by C1 path
if r["fp"] and Path(r["fp"]).exists():
continue
src = r["source"]
if not src or src not in by_name:
skipped_no_match += 1
continue
candidates = by_name[src]
if len(candidates) == 1:
mt = fmt_ts_unix(candidates[0][1])
c2.append((r["id"], None, mt, None, "watcher_state_unique"))
else:
latest = max(candidates, key=lambda x: float(x[1]))
mt = fmt_ts_unix(latest[1])
c3.append((r["id"], None, mt, None, f"watcher_state_collision_pick_latest_of_{len(candidates)}"))
print(f" C2/C3 source-basename fallback: {len(c2)} unique, {len(c3)} collision, "
f"{skipped_no_match} unmatched (will fall to C4/C5)")
n2 = run_batch(cur, "C2 ca IS NULL AND watcher_state unique -> mtime", c2, apply_mode)
n3 = run_batch(cur, "C3 ca IS NULL AND watcher_state collision -> latest mtime", c3, apply_mode)
return n2 + n3
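# Illustration (not part of the script): the C2/C3 decision for a single
# source basename, written as a pure function. `index_by_basename` and
# `resolve` are hypothetical names, the sample paths are made up, and
# PurePosixPath is used so the sketch behaves the same on any OS.

```python
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import PurePosixPath


def iso_from_unix(ts) -> str:
    dt = datetime.fromtimestamp(float(ts), tz=timezone.utc)
    return dt.isoformat().replace("+00:00", "Z")


def index_by_basename(state: dict) -> dict:
    """Group watcher_state entries (path -> mtime string) by file name."""
    by_name = defaultdict(list)
    for path, mtime in state.items():
        by_name[PurePosixPath(path).name].append((path, mtime))
    return by_name


def resolve(by_name, source):
    """Return (iso_timestamp, provenance), or None when the name is unknown."""
    hits = by_name.get(source) or []
    if not hits:
        return None
    if len(hits) == 1:
        return iso_from_unix(hits[0][1]), "watcher_state_unique"
    # Collision: several tracked paths share this basename; take the newest.
    latest = max(hits, key=lambda h: float(h[1]))
    return (
        iso_from_unix(latest[1]),
        f"watcher_state_collision_pick_latest_of_{len(hits)}",
    )


state = {
    "/files/Notes/cv.docx": "86400",
    "/files/Archive/cv.docx": "0",
    "/files/Notes/plan.md": "172800",
}
by_name = index_by_basename(state)
unique_hit = resolve(by_name, "plan.md")
collision_hit = resolve(by_name, "cv.docx")
missing = resolve(by_name, "unknown.pdf")
```

# The provenance string is what ends up in metadata.created_at_source, so a
# collision pick stays distinguishable from a unique hit after the backfill.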
def batch_C4_chatgpt_export(cur, apply_mode):
index = load_chatgpt_index()
cur.execute("""
SELECT id, source FROM embeddings
WHERE type='chatgpt_conversation' AND created_at IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands, unresolved = [], 0
for r in rows:
m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
cid = m.group(1) if m else None
ct = index.get(cid)
if ct is None:
unresolved += 1
continue
ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
cands.append((r["id"], None, ct_iso, None, "chatgpt_export"))
print(f" C4 chatgpt export resolution: {len(cands)} resolved, {unresolved} unresolved (fall to C5)")
return run_batch(cur, "C4 type='chatgpt_conversation' AND ca IS NULL -> export create_time", cands, apply_mode)
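# Illustration (not part of the script): how the chunk-id regex splits a
# conversation id from its chunk index. Because `.+` is greedy, the split
# happens at the last underscore, so conversation ids that themselves contain
# underscores still parse. `split_chunk_id` and `resolve_create_time` are
# hypothetical names, and the ids below are made up.

```python
import re
from datetime import datetime, timezone

_CHATGPT_ID_RE = re.compile(r"^chatgpt_(.+)_(\d+)$")


def split_chunk_id(chunk_id: str):
    """Return (conversation_id, chunk_index), or None for a non-matching id."""
    m = _CHATGPT_ID_RE.match(chunk_id)
    return (m.group(1), int(m.group(2))) if m else None


def resolve_create_time(chunk_id: str, index: dict):
    """Look the conversation up in an export index of id -> unix create_time."""
    parsed = split_chunk_id(chunk_id)
    if parsed is None:
        return None
    ct = index.get(parsed[0])
    if ct is None:
        return None  # unresolved rows are left for the sentinel batch
    dt = datetime.fromtimestamp(float(ct), tz=timezone.utc)
    return dt.isoformat().replace("+00:00", "Z")


export_index = {"6f1c-aa_b": 86400}
parsed = split_chunk_id("chatgpt_6f1c-aa_b_12")
resolved = resolve_create_time("chatgpt_6f1c-aa_b_12", export_index)
unresolved = resolve_create_time("chatgpt_not-in-export_0", export_index)
```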
def batch_C5_sentinel(cur, apply_mode):
cur.execute("""
SELECT id, type, source FROM embeddings WHERE created_at IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands = [(r["id"], None, SENTINEL_ISO, None, "sentinel") for r in rows]
if cands:
sample_types = Counter(r["type"] for r in rows)
print(f" C5 residual sentinel rows by type: {dict(sample_types)}")
return run_batch(cur, f"C5 ca IS NULL residual -> sentinel {SENTINEL_ISO}", cands, apply_mode)
# ─── Pre/post counts ────────────────────────────────────────────────────────
def print_counts(cur, label):
cur.execute("""
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null
FROM embeddings;
""")
r = cur.fetchone()
print(f" [{label}] total={r['total']} type_null={r['type_null']} ca_null={r['ca_null']}")
# ─── Driver ─────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--apply", action="store_true", help="default false (dry-run)")
args = ap.parse_args()
apply_mode = args.apply
pg = get_pg()
cur = pg.cursor()
print(f"Mode: {'APPLY (writes will commit)' if apply_mode else 'DRY-RUN (no writes)'}")
print(f"Sentinel: {SENTINEL_ISO}")
if apply_mode:
assert_snapshot(cur)
header("PRE-COUNTS")
print_counts(cur, "before")
header("BATCHES")
n_t1 = batch_T1_type(cur, apply_mode)
n_c1 = batch_C1_filepath_stat(cur, apply_mode)
n_c2c3 = batch_C2_C3_watcher_state(cur, apply_mode)
n_c4 = batch_C4_chatgpt_export(cur, apply_mode)
n_c5 = batch_C5_sentinel(cur, apply_mode)
header("POST-COUNTS")
print_counts(cur, "after" if apply_mode else "after (in-transaction, will rollback)")
if apply_mode:
pg.commit()
print("\nCOMMITTED.")
else:
pg.rollback()
print("\nROLLED BACK (dry-run).")
print(f"\nSummary: T1={n_t1} C1={n_c1} C2+C3={n_c2c3} C4={n_c4} C5={n_c5}")
pg.close()
if __name__ == "__main__":
main()
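# Illustration (not part of the script): the dry-run strategy used by main()
# is "run every UPDATE for real, then roll back". The sketch below shows the
# pattern with the standard-library sqlite3 module standing in for psycopg2;
# the table layout and the `backfill` helper are assumptions made for the
# example.

```python
import sqlite3


def backfill(conn, apply_mode: bool) -> int:
    cur = conn.cursor()
    cur.execute("UPDATE embeddings SET type = 'document' WHERE type IS NULL")
    n_type = cur.rowcount
    # A later batch sees the first batch's effect inside the same transaction,
    # which is why the dry-run still executes the writes.
    cur.execute("SELECT COUNT(*) FROM embeddings WHERE type IS NULL")
    assert cur.fetchone()[0] == 0
    if apply_mode:
        conn.commit()
    else:
        conn.rollback()
    return n_type


def null_types(conn) -> int:
    query = "SELECT COUNT(*) FROM embeddings WHERE type IS NULL"
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (id TEXT PRIMARY KEY, type TEXT)")
conn.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [("a_0", None), ("b_0", "document")],
)
conn.commit()

dry = backfill(conn, apply_mode=False)
after_dry = null_types(conn)      # still 1: the UPDATE was rolled back
applied = backfill(conn, apply_mode=True)
after_apply = null_types(conn)    # 0: the UPDATE was committed
```

# Both runs report the same row count, so the dry-run summary is a faithful
# preview of what --apply will write.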
@@ -1,557 +0,0 @@
"""Read-only inspection for the embeddings.type / embeddings.created_at backfill (Improvement #2 / A.1).
Produces a survey of every backfill source-of-truth question without writing
to the database. Output is a human-readable report on stdout plus a JSON
sidecar at experiments/embeddings_backfill_inspection_<date>.json.
Sections:
1. Cohort recap (counts; should match prior investigation).
2. Cohort A type inference: extension classifier coverage.
3. created_at inference for cohort A + B-doc-old:
- rows with metadata.filepath: stat the file, check existence.
- rows without filepath: lookup source against watcher_state.json.
- filename-collision shape audit (live+backup, live+archive, ambiguous).
4. ChatGPT export resolution (Plan A.1 addition #1):
- existence of /home/aaron/nextcloud/.../ChatGPT Export/.
- sample 5 B-chatgpt rows; resolve convo_id -> create_time.
5. Sentinel date discovery (Plan A.1 addition #3):
- earliest non-NULL created_at per type (already-populated rows are the
lower bound for when the substrate started carrying timestamps).
- git log for the pgvector migration commit.
- any ChromaDB sqlite still on disk.
- propose a sentinel with reasoning, or flag as arbitrary.
6. 50-row stratified sample: derived (type, created_at, source) per row.
Usage: venv/bin/python3 scripts/experiments/embeddings_backfill_inspection.py
Read-only. No DB writes. No filesystem writes outside experiments/.
"""
import json
import os
import random
import re
import subprocess
import sys
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path
import psycopg2
from psycopg2.extras import RealDictCursor
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env")
PG_DSN = os.getenv("PG_DSN")
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
NEXTCLOUD_ROOT = Path("/home/aaron/nextcloud/data/data/aaron/files")
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"embeddings_backfill_inspection_{datetime.now().strftime('%Y-%m-%d')}.json"
SUPPORTED_EXT = {".pdf", ".docx", ".pptx", ".txt", ".md"}
random.seed(20260503)
# ─── Helpers ────────────────────────────────────────────────────────────────
def get_pg():
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
def header(title):
bar = "=" * 70
print(f"\n{bar}\n{title}\n{bar}")
def sub(title):
print(f"\n--- {title} ---")
def fmt_ts_from_unix(ts):
"""Watcher state stores unix timestamps as strings."""
try:
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
def fmt_ts_from_st_mtime(p):
try:
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
def load_watcher_state():
"""Returns (path -> mtime_str), and (basename -> [(path, mtime_str), ...])."""
state = json.loads(WATCHER_STATE.read_text())
by_path = state
by_name = defaultdict(list)
for path, mtime in state.items():
by_name[Path(path).name].append((path, mtime))
return by_path, by_name
def classify_collision_shape(paths):
"""Categorize a filename-collision group:
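- 'live+backup' : exactly one live path; the rest are backup-like
(.bak, /backup, backups/) and none are under Archive/
- 'live+archive' : exactly one live path; the rest are under Archive/
and none are backup-like
- 'live+mixed-old' : exactly one live path plus both backup-like and
Archive/ paths
- 'multi-live' : >=2 paths look like live (no backup/archive markers)
- 'all-archive-or-backup' : no live path at all
- 'other' : fallback (e.g. a single live path with no siblings)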
"""
def is_backup(p):
s = p.lower()
return ".bak" in s or "/backup" in s or "backups/" in s
def is_archive(p):
s = p.lower()
return "/archive/" in s
backups = [p for p in paths if is_backup(p)]
archives = [p for p in paths if is_archive(p)]
live = [p for p in paths if not is_backup(p) and not is_archive(p)]
if len(live) == 1 and len(backups) >= 1 and len(archives) == 0:
return "live+backup"
if len(live) == 1 and len(archives) >= 1 and len(backups) == 0:
return "live+archive"
if len(live) == 1 and (len(backups) + len(archives)) >= 1:
return "live+mixed-old"
if len(live) >= 2:
return "multi-live"
if len(live) == 0:
return "all-archive-or-backup"
return "other"
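# Illustration (not part of the script): the classifier restated with its
# indentation restored, followed by one made-up path group per shape. The
# nesting is an assumption inferred from the flattened source, and
# `classify_collision_shape_sketch` is a hypothetical name.

```python
def classify_collision_shape_sketch(paths):
    def is_backup(p):
        s = p.lower()
        return ".bak" in s or "/backup" in s or "backups/" in s

    def is_archive(p):
        return "/archive/" in p.lower()

    backups = [p for p in paths if is_backup(p)]
    archives = [p for p in paths if is_archive(p)]
    live = [p for p in paths if not is_backup(p) and not is_archive(p)]

    if len(live) == 1 and len(backups) >= 1 and len(archives) == 0:
        return "live+backup"
    if len(live) == 1 and len(archives) >= 1 and len(backups) == 0:
        return "live+archive"
    if len(live) == 1 and (len(backups) + len(archives)) >= 1:
        return "live+mixed-old"
    if len(live) >= 2:
        return "multi-live"
    if len(live) == 0:
        return "all-archive-or-backup"
    return "other"


examples = {
    "live+backup": ["/files/Notes/cv.docx", "/files/Notes/cv.docx.bak"],
    "live+archive": ["/files/Notes/cv.docx", "/files/Archive/2024/cv.docx"],
    "live+mixed-old": [
        "/files/Notes/cv.docx",
        "/files/Archive/cv.docx",
        "/files/backups/cv.docx",
    ],
    "multi-live": ["/files/A/cv.docx", "/files/B/cv.docx"],
    "all-archive-or-backup": [
        "/files/Archive/x/cv.docx",
        "/files/Archive/y/cv.docx",
    ],
}
results = {k: classify_collision_shape_sketch(v) for k, v in examples.items()}
```

# Only the first two shapes identify a single obvious live copy; for
# 'multi-live' the pick-latest rule in the apply script is a heuristic.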
# ─── Section 1: Cohort recap ────────────────────────────────────────────────
def section_1_cohort_recap(cur):
header("1. COHORT RECAP")
cur.execute("""
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null,
COUNT(*) FILTER (WHERE type IS NULL AND created_at IS NULL) AS both_null,
COUNT(*) FILTER (WHERE type IS NOT NULL AND created_at IS NOT NULL) AS both_set
FROM embeddings;
""")
overall = cur.fetchone()
print(f"Total: {overall['total']} type_null: {overall['type_null']} "
f"ca_null: {overall['ca_null']} both_null: {overall['both_null']} "
f"both_set: {overall['both_set']}")
cur.execute("""
SELECT type, created_at IS NULL AS ca_null, COUNT(*) AS n
FROM embeddings GROUP BY type, ca_null ORDER BY type NULLS LAST, ca_null;
""")
cohorts = cur.fetchall()
sub("Per-(type, ca_null) cohorts")
for r in cohorts:
print(f" type={r['type'] or 'NULL':<22} ca_null={r['ca_null']!s:<5} n={r['n']}")
return {"overall": overall, "cohorts": cohorts}
# ─── Section 2: Cohort A type inference ─────────────────────────────────────
def section_2_type_inference(cur):
header("2. COHORT A TYPE INFERENCE (extension classifier)")
cur.execute("""
SELECT LOWER(SUBSTRING(source FROM '\\.[^.]+$')) AS ext, COUNT(*) AS rows
FROM embeddings WHERE type IS NULL
GROUP BY ext ORDER BY rows DESC;
""")
by_ext = cur.fetchall()
classified = sum(r["rows"] for r in by_ext if r["ext"] in SUPPORTED_EXT)
unknown = sum(r["rows"] for r in by_ext if r["ext"] not in SUPPORTED_EXT)
print(f"NULL-type rows by extension:")
for r in by_ext:
flag = "OK" if r["ext"] in SUPPORTED_EXT else "??"
print(f" {flag} {r['ext'] or '(none)':<8} rows={r['rows']}")
print(f"\nClassified as 'document' via extension: {classified}")
print(f"Unclassifiable (no SUPPORTED extension): {unknown}")
return {"by_ext": by_ext, "classified": classified, "unclassifiable": unknown}
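# Illustration (not part of the script): a Python mirror of the SQL extension
# classifier, useful for checking a handful of source names by hand. It uses
# the same regex as the query (last dot to end of string, lowercased).
# `ext_of` and `classify_sources` are hypothetical names and the file names
# are made up.

```python
import re
from collections import Counter

SUPPORTED_EXT = {".pdf", ".docx", ".pptx", ".txt", ".md"}
_EXT_RE = re.compile(r"\.[^.]+$")


def ext_of(source: str):
    """Lowercased final extension including the dot, or None if there is none."""
    m = _EXT_RE.search(source)
    return m.group(0).lower() if m else None


def classify_sources(sources):
    """Count sources per extension and split them into classified/unknown."""
    by_ext = Counter(ext_of(s) for s in sources)
    classified = sum(n for ext, n in by_ext.items() if ext in SUPPORTED_EXT)
    unknown = sum(n for ext, n in by_ext.items() if ext not in SUPPORTED_EXT)
    return by_ext, classified, unknown


by_ext, classified, unknown = classify_sources(
    ["CV.DOCX", "report.final.pdf", "notes.md", "README", "dump.tar.gz"]
)
```

# Names without a dot map to None, which is reported as unclassifiable just
# like a NULL `ext` in the SQL result.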
# ─── Section 3: created_at inference ────────────────────────────────────────
def section_3_created_at_inference(cur):
header("3. CREATED_AT INFERENCE — file-derived rows")
by_path, by_name = load_watcher_state()
print(f"watcher_state.json: {len(by_path)} tracked paths, "
f"{len(by_name)} distinct filenames, "
f"{sum(1 for v in by_name.values() if len(v) > 1)} filename collisions")
# 3a. Rows with metadata.filepath: probe stat()
sub("3a. Rows with metadata.filepath — stat probe")
cur.execute("""
SELECT id, source, metadata->>'filepath' AS filepath
FROM embeddings
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL;
""")
rows_with_fp = cur.fetchall()
fp_exists = 0
fp_missing = 0
fp_outside_root = 0
sample_resolved = []
for r in rows_with_fp:
p = Path(r["filepath"])
if not str(p).startswith(str(NEXTCLOUD_ROOT)):
fp_outside_root += 1
if p.exists():
fp_exists += 1
if len(sample_resolved) < 5:
sample_resolved.append({
"id": r["id"], "source": r["source"],
"filepath": str(p), "mtime": fmt_ts_from_st_mtime(p),
})
else:
fp_missing += 1
print(f" rows with metadata.filepath: {len(rows_with_fp)}")
print(f" exists on disk: {fp_exists}")
print(f" missing on disk: {fp_missing}")
print(f" outside Nextcloud root: {fp_outside_root}")
print(f" Sample of 5 resolved mtimes:")
for s in sample_resolved:
print(f" {s['id']:<15} {s['source'][:60]:<60} mtime={s['mtime']}")
# 3b. Rows without metadata.filepath: watcher_state lookup
sub("3b. Rows without metadata.filepath — watcher_state lookup")
cur.execute("""
SELECT id, source FROM embeddings
WHERE created_at IS NULL
AND metadata->>'filepath' IS NULL
AND (type IS NULL OR type = 'document');
""")
rows_no_fp = cur.fetchall()
# Distinct source basenames to look up
basenames_to_resolve = sorted({r["source"] for r in rows_no_fp if r["source"]})
n_resolved_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) == 1)
n_collision_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) > 1)
n_unfound = sum(1 for n in basenames_to_resolve if n not in by_name)
print(f" rows without filepath: {len(rows_no_fp)}")
print(f" distinct source basenames to resolve: {len(basenames_to_resolve)}")
print(f" unique watcher_state hit (no collision): {n_resolved_unique}")
print(f" collision in watcher_state (>1 path): {n_collision_unique}")
print(f" not in watcher_state at all: {n_unfound}")
# 3c. Collision-shape audit
sub("3c. Collision-shape audit — all collisions in watcher_state")
collisions = {n: [(p, m) for p, m in by_name[n]] for n in by_name if len(by_name[n]) > 1}
shape_counts = Counter()
rows_affected_by_shape = Counter()
# Map from basename to count of NULL-ca rows that need it (rows_no_fp)
rows_no_fp_by_name = Counter(r["source"] for r in rows_no_fp)
sample_per_shape = defaultdict(list)
for name, paths_mtimes in collisions.items():
paths = [p for p, _ in paths_mtimes]
shape = classify_collision_shape(paths)
shape_counts[shape] += 1
rows_affected_by_shape[shape] += rows_no_fp_by_name.get(name, 0)
if len(sample_per_shape[shape]) < 3:
entry = {
"name": name,
"rows_no_fp_using_this_name": rows_no_fp_by_name.get(name, 0),
"candidates": [
{"path": p, "mtime": fmt_ts_from_unix(m)}
for p, m in sorted(paths_mtimes, key=lambda x: -float(x[1]))
],
}
sample_per_shape[shape].append(entry)
print(f" collisions in watcher_state: {len(collisions)}")
print(f" shape breakdown:")
for shape, n in shape_counts.most_common():
print(f" {shape:<22} collisions={n:<4} rows_affected={rows_affected_by_shape[shape]}")
print(f"\n Up-to-3 sample collisions per shape (sorted by mtime desc):")
for shape, samples in sample_per_shape.items():
print(f" [{shape}]")
for s in samples:
print(f" {s['name']} (rows_no_fp using this name: {s['rows_no_fp_using_this_name']})")
for c in s["candidates"]:
print(f" {c['mtime']} {c['path']}")
return {
"watcher_state_paths": len(by_path),
"watcher_state_basenames": len(by_name),
"watcher_state_collisions": len(collisions),
"rows_with_filepath": {
"total": len(rows_with_fp),
"exists": fp_exists, "missing": fp_missing,
"outside_root": fp_outside_root,
"sample": sample_resolved,
},
"rows_without_filepath": {
"total": len(rows_no_fp),
"distinct_basenames": len(basenames_to_resolve),
"unique_hit": n_resolved_unique,
"collision_hit": n_collision_unique,
"unfound": n_unfound,
},
"collision_shapes": {
"total": len(collisions),
"shape_counts": dict(shape_counts),
"rows_affected_by_shape": dict(rows_affected_by_shape),
"samples": {k: v for k, v in sample_per_shape.items()},
},
}
# ─── Section 4: ChatGPT export resolution ───────────────────────────────────
def section_4_chatgpt_export(cur):
header("4. CHATGPT EXPORT RESOLUTION (Plan addition #1)")
print(f"Probing: {CHATGPT_EXPORT_DIR}")
if not CHATGPT_EXPORT_DIR.exists():
print(" NOT FOUND — plan on sentinel for entire B-chatgpt cohort.")
return {"export_dir_exists": False, "files": []}
files = sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json"))
print(f" found {len(files)} export file(s):")
for f in files:
print(f" {f.name} size={f.stat().st_size:,} mtime={fmt_ts_from_st_mtime(f)}")
# Build convo_id -> create_time index from all export files.
print("\nLoading export(s) to build convo_id -> create_time index...")
convo_index = {}
for f in files:
try:
data = json.loads(f.read_text(encoding="utf-8"))
except Exception as e:
print(f" failed to parse {f.name}: {e}")
continue
for convo in data:
cid = convo.get("id") or convo.get("conversation_id")
ct = convo.get("create_time")
if cid and ct is not None:
convo_index[cid] = ct
print(f" indexed {len(convo_index)} conversations across {len(files)} export files")
# Sample 5 chatgpt_conversation rows; resolve.
cur.execute("""
SELECT id, source FROM embeddings
WHERE type='chatgpt_conversation' AND created_at IS NULL
ORDER BY random() LIMIT 5;
""")
sample = cur.fetchall()
sub("Sample of 5 B-chatgpt rows: convo lookup")
resolved = 0
sample_results = []
for r in sample:
# IDs look like chatgpt_<uuid>_<idx>; uuid extends until last underscore.
m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
cid = m.group(1) if m else None
ct = convo_index.get(cid)
ct_iso = None
if ct is not None:
try:
ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
ct_iso = None
if ct_iso:
resolved += 1
sample_results.append({
"id": r["id"], "source": r["source"], "convo_id": cid,
"create_time": ct, "create_time_iso": ct_iso,
"resolved": ct_iso is not None,
})
print(f" {r['id']} cid={cid}")
print(f" -> create_time={ct} iso={ct_iso}")
print(f"\nResolved {resolved}/{len(sample)}. "
f"{'PROCEED with re-derive for full cohort.' if resolved == 5 else 'PARTIAL — plan re-derive + sentinel for unresolved.'}")
# Estimate full-cohort coverage by counting how many B-chatgpt convo_ids appear in the index.
cur.execute("""
SELECT DISTINCT regexp_replace(id, '^chatgpt_(.+)_\\d+$', '\\1') AS cid
FROM embeddings WHERE type='chatgpt_conversation' AND created_at IS NULL;
""")
distinct_cids = [r["cid"] for r in cur.fetchall()]
in_index = sum(1 for c in distinct_cids if c in convo_index)
print(f"Full-cohort coverage estimate: {in_index} / {len(distinct_cids)} distinct convo_ids "
f"resolvable from export.")
return {
"export_dir_exists": True,
"files": [{"name": f.name, "size": f.stat().st_size, "mtime": fmt_ts_from_st_mtime(f)} for f in files],
"convo_index_size": len(convo_index),
"sample_results": sample_results,
"sample_resolved": resolved,
"full_cohort": {
"distinct_convo_ids": len(distinct_cids),
"resolvable_from_export": in_index,
"unresolvable": len(distinct_cids) - in_index,
},
}
# ─── Section 5: Sentinel date discovery ─────────────────────────────────────
def section_5_sentinel(cur):
header("5. SENTINEL DATE DISCOVERY (Plan addition #3)")
# 5a. Earliest non-NULL created_at per type: lower bound on substrate age.
sub("5a. Earliest non-NULL created_at per type")
cur.execute("""
SELECT type, MIN(created_at) AS earliest, MAX(created_at) AS latest, COUNT(*) AS rows
FROM embeddings WHERE created_at IS NOT NULL GROUP BY type ORDER BY type;
""")
rows = cur.fetchall()
for r in rows:
print(f" {r['type']!s:<22} earliest={r['earliest']!s:<32} latest={r['latest']}")
# 5b. git log for the pgvector-migration commit.
sub("5b. Git log — pgvector migration commits")
git_findings = []
try:
out = subprocess.run(
["git", "log", "--all", "--format=%H %ci %s",
"--", "deprecated/migrate_to_pgvector.py", "scripts/migrate_to_pgvector.py"],
cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
)
for line in out.stdout.strip().splitlines():
print(f" {line}")
git_findings.append(line)
except Exception as e:
print(f" git log failed: {e}")
# Also: when did the api/ingest scripts cut over to pgvector?
try:
out = subprocess.run(
["git", "log", "--all", "--format=%H %ci %s", "--grep=pgvector", "-i"],
cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
)
print("\n Commits mentioning pgvector:")
for line in out.stdout.strip().splitlines()[:10]:
print(f" {line}")
git_findings.append(line)
except Exception as e:
print(f" git log (pgvector grep) failed: {e}")
# 5c. ChromaDB sqlite still on disk?
sub("5c. ChromaDB dump on disk?")
candidates = []
for root in [Path.home() / "aaronai", Path.home() / "aaronai" / "db"]:
if root.exists():
for p in root.rglob("chroma*.sqlite*"):
candidates.append({"path": str(p), "mtime": fmt_ts_from_st_mtime(p)})
if candidates:
for c in candidates:
print(f" found: {c['path']} mtime={c['mtime']}")
else:
print(" no ChromaDB sqlite found under ~/aaronai")
# 5d. Propose sentinel.
sub("5d. Sentinel proposal")
# Earliest doc cutover: per query, document=2026-04-30. Migration commit f78b830 was
# 2026-04-26. Most defensible sentinel for "rows that entered pgvector before NOW()
# writes were canonical" = the migration commit date.
proposed = "2026-04-26T00:00:00Z"
reasoning = (
"git f78b830 'Migrate to pgvector — remove ChromaDB from api.py, ingest scripts, "
"dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL "
"created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL "
"created_at all predate F11 and most predate the pgvector cutover itself. "
"2026-04-26 is the date the ChromaDB->pgvector migration script was committed, "
"so any row currently in the embeddings table with NULL created_at must have been "
"ingested on or after that date (when the table came into existence in current form). "
"It is the tightest defensible lower bound on when a row entered pgvector before "
"timestamps were tracked, so it is the right sentinel."
)
print(f" Proposed sentinel: {proposed}")
print(f" Reasoning: {reasoning}")
return {
"earliest_per_type": rows,
"git_findings": git_findings,
"chromadb_candidates": candidates,
"proposed_sentinel": proposed,
"reasoning": reasoning,
}
# ─── Section 6: 50-row stratified sample ────────────────────────────────────
def section_6_stratified_sample(cur, sentinel_iso):
header("6. 50-ROW STRATIFIED SAMPLE — derived (type, created_at, source)")
by_path, by_name = load_watcher_state()
cohorts = [
("A (type NULL, ca NULL)", "type IS NULL AND created_at IS NULL", 10),
("B-doc-old (type='document', ca NULL)", "type='document' AND created_at IS NULL", 10),
("B-chatgpt (type='chatgpt_conversation', ca NULL)", "type='chatgpt_conversation' AND created_at IS NULL", 10),
("C-doc-new (type='document', ca set)", "type='document' AND created_at IS NOT NULL", 10),
("C-claude (type='claude_conversation', ca set)", "type='claude_conversation' AND created_at IS NOT NULL", 5),
("C-aaronai (type='aaronai_conversation', ca set)", "type='aaronai_conversation' AND created_at IS NOT NULL", 5),
]
samples = []
for label, predicate, n in cohorts:
sub(f"{label} (sample size: {n})")
cur.execute(f"""
SELECT id, source, type, created_at, metadata
FROM embeddings WHERE {predicate}
ORDER BY random() LIMIT %s;
""", (n,))
rows = cur.fetchall()
for r in rows:
row_meta = r["metadata"] or {}
fp = row_meta.get("filepath")
inferred_type = r["type"] or ("document" if (r["source"] or "").lower().endswith(tuple(SUPPORTED_EXT)) else "?")
inferred_ca = r["created_at"]
inferred_ca_source = "preserved" if inferred_ca else None
if not inferred_ca:
if fp and Path(fp).exists():
inferred_ca = fmt_ts_from_st_mtime(Path(fp))
inferred_ca_source = "filepath_stat"
elif r["source"] and r["source"] in by_name:
candidates = by_name[r["source"]]
if len(candidates) == 1:
inferred_ca = fmt_ts_from_unix(candidates[0][1])
inferred_ca_source = "watcher_state_unique"
else:
# take most recent
latest = max(candidates, key=lambda x: float(x[1]))
inferred_ca = fmt_ts_from_unix(latest[1])
inferred_ca_source = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
else:
inferred_ca = sentinel_iso
inferred_ca_source = "sentinel"
print(f" id={r['id']:<22} src={(r['source'] or '')[:38]:<38}")
print(f" existing: type={r['type']!r:<22} ca={r['created_at']!r}")
print(f" inferred: type={inferred_type!r:<22} ca={inferred_ca!r} ({inferred_ca_source})")
samples.append({
"cohort": label, "id": r["id"], "source": r["source"],
"existing_type": r["type"], "existing_ca": r["created_at"],
"inferred_type": inferred_type, "inferred_ca": inferred_ca,
"inferred_ca_source": inferred_ca_source,
})
return samples
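Review note: the created_at fallback order in the sample loop above (preserved, then filepath stat, then watcher state, then sentinel) can be restated as a small pure function. This is only a sketch under stated assumptions: `infer_created_at` and the injected `stat_mtime` callable are hypothetical names, and watcher-state entries are assumed to be `(path, unix_mtime)` tuples keyed by filename, as the indexing in the loop suggests.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def infer_created_at(
    existing: Optional[str],
    filepath: Optional[str],
    source: Optional[str],
    by_name: dict,
    sentinel_iso: str,
    stat_mtime: Callable[[str], Optional[float]],
) -> tuple:
    """Return (timestamp_iso, provenance) in the same priority order as the
    sample loop: preserved > filepath stat > watcher state > sentinel."""

    def iso(ts) -> str:
        return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat()

    if existing:
        return existing, "preserved"
    if filepath:
        # Hypothetical hook: returns None when the file no longer exists.
        mtime = stat_mtime(filepath)
        if mtime is not None:
            return iso(mtime), "filepath_stat"
    candidates = by_name.get(source) if source else None
    if candidates:
        if len(candidates) == 1:
            return iso(candidates[0][1]), "watcher_state_unique"
        # Several files share this name: take the most recent mtime.
        latest = max(candidates, key=lambda c: float(c[1]))
        tag = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
        return iso(latest[1]), tag
    return sentinel_iso, "sentinel"


# Made-up watcher state, for illustration only.
by_name = {
    "a.md": [("/x/a.md", 0)],
    "b.md": [("/x/b.md", 10), ("/y/b.md", 20)],
}
gone = lambda path: None
collision = infer_created_at(None, None, "b.md", by_name, "S", gone)
```

Keeping the provenance tag next to the timestamp is what lets a later backfill treat sentinel-derived rows differently from rows with a real mtime.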
# ─── Driver ─────────────────────────────────────────────────────────────────
def main():
pg = get_pg()
cur = pg.cursor()
out = {"generated_at": datetime.now(timezone.utc).isoformat()}
out["section_1"] = section_1_cohort_recap(cur)
out["section_2"] = section_2_type_inference(cur)
out["section_3"] = section_3_created_at_inference(cur)
out["section_4"] = section_4_chatgpt_export(cur)
out["section_5"] = section_5_sentinel(cur)
sentinel_iso = out["section_5"]["proposed_sentinel"]
out["section_6"] = section_6_stratified_sample(cur, sentinel_iso)
pg.close()
# JSON sidecar — strip non-serializables.
def _serialize(o):
if isinstance(o, datetime):
return o.isoformat()
return str(o)
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
OUT_PATH.write_text(json.dumps(out, indent=2, default=_serialize))
print(f"\nJSON sidecar written: {OUT_PATH}")
if __name__ == "__main__":
main()
@@ -1,296 +0,0 @@
"""Read-only analysis of Stage 2 frame data via stage2_frames_v.
Produces seven sections (frequency, hygiene, per-doc count, co-occurrence,
folder cross-tab, worker-version split, data-gap accounting) and writes a JSON
sidecar for diffing across runs.
Usage: venv/bin/python3 scripts/experiments/frame_distribution_report.py
"""
import os
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
import psycopg2
from dotenv import load_dotenv
load_dotenv()
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"frame_distribution_{datetime.now().strftime('%Y-%m-%d')}.json"
TOP_K = 20 # for co-occurrence; revisit after seeing the long tail
def normalize(label):
return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))
def folder_bin(source):
"""Classify source by type. stage_3_queue stores bare filenames, so we
bin by what kind of file it is, not where it lives in the tree."""
if not source:
return "unknown"
if re.match(r"^(Claude|ChatGPT|Aaron AI):", source):
return "conversation" # bypasses Stage 2/3, will not appear here
s = source.lower()
if re.search(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-voice\.md$", s):
return "voice_note"
if re.search(r"\d{4}-\d{2}-\d{2}-(nrem|early-rem|late-rem|synthesis|lucid)", s):
return "dream_output"
if s.endswith(".md"):
return "markdown"
if s.endswith(".pdf"):
return "pdf"
if s.endswith(".docx") or s.endswith(".doc"):
return "docx"
if s.endswith(".pptx") or s.endswith(".ppt"):
return "pptx"
if s.endswith(".txt"):
return "txt"
return "other"
def fetch_rows(cur):
cur.execute("""
SELECT source, char_length, active_frames, worker_version, raw_metadata
FROM stage2_frames_v
""")
rows = []
for source, char_length, frames, worker_version, raw in cur.fetchall():
if not isinstance(frames, list):
continue
rows.append({
"source": source,
"char_length": char_length,
"frames": [str(f) for f in frames if f],
"worker_version": worker_version,
"raw_keys": sorted(raw.keys()) if isinstance(raw, dict) else [],
})
return rows
def section_frequency(rows):
counter = Counter()
for r in rows:
for f in r["frames"]:
counter[f] += 1
return counter
def section_hygiene(frequency):
"""Group raw labels by normalized form; flag collisions."""
groups = defaultdict(list)
for raw, count in frequency.items():
groups[normalize(raw)].append((raw, count))
collisions = {k: v for k, v in groups.items() if len(v) > 1}
return collisions
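Review note: a quick illustration of what the hygiene check flags. The sketch below repeats `normalize` as written in this file and groups labels the same way; `find_collisions` is a stand-in name and the frame labels are invented for the example, not taken from real Stage 2 output.

```python
import re
from collections import Counter, defaultdict


def normalize(label: str) -> str:
    # Same rule as the report: trim, lowercase, underscores to spaces,
    # then collapse runs of whitespace to a single space.
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


def find_collisions(frequency: Counter) -> dict:
    groups = defaultdict(list)
    for raw, count in frequency.items():
        groups[normalize(raw)].append((raw, count))
    # Only normalized forms reached by more than one raw spelling matter.
    return {k: v for k, v in groups.items() if len(v) > 1}


# Invented labels, for illustration only.
freq = Counter({"Grief_Work": 3, "grief  work": 1, "Identity": 5})
collisions = find_collisions(freq)
```

Here the two spellings of "grief work" collapse to one normalized key and are reported together, while "Identity" has a single spelling and is left out.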
def section_per_doc_count(rows):
counts = Counter(len(r["frames"]) for r in rows)
return counts
def section_cooccurrence(rows, top_frames):
top_set = set(top_frames)
pair_counts = Counter()
for r in rows:
present = [f for f in r["frames"] if f in top_set]
for i in range(len(present)):
for j in range(i + 1, len(present)):
a, b = sorted([present[i], present[j]])
pair_counts[(a, b)] += 1
return pair_counts
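Review note: the nested index loop above can also be written with `itertools.combinations`. The sketch below is an alternative, not the author's code, and it differs in one respect that is an assumption on my part about intent: it de-duplicates labels within a document first, so a frame listed twice in one document cannot pair with itself. `cooccurrence` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs, top_frames):
    top_set = set(top_frames)
    pair_counts = Counter()
    for frames in docs:
        # De-duplicate within a document, then sort so that each pair is
        # always emitted in the same (a, b) order.
        present = sorted({f for f in frames if f in top_set})
        pair_counts.update(combinations(present, 2))
    return pair_counts


# Invented documents, for illustration only.
docs = [["a", "b", "c"], ["b", "a"], ["a", "a", "z"]]
pairs = cooccurrence(docs, top_frames=["a", "b", "c"])
```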
def section_folder_crosstab(rows, top_frames):
top_set = set(top_frames)
table = defaultdict(Counter) # frame -> bin -> count
bin_totals = Counter()
for r in rows:
b = folder_bin(r["source"])
bin_totals[b] += 1
for f in r["frames"]:
if f in top_set:
table[f][b] += 1
return table, bin_totals
def section_worker_versions(rows):
counter = Counter(r["worker_version"] or "unknown" for r in rows)
raw_keys_by_version = defaultdict(Counter)
for r in rows:
v = r["worker_version"] or "unknown"
raw_keys_by_version[v][tuple(r["raw_keys"])] += 1
return counter, raw_keys_by_version
def section_data_gap(cur):
"""Docs that completed Stage 2 but never had frames extracted (<2000 chars)."""
cur.execute("""
SELECT source, char_length
FROM stage_2_queue
WHERE completed_at IS NOT NULL AND char_length < 2000
""")
missing = cur.fetchall()
by_bin = Counter(folder_bin(s) for s, _ in missing)
char_lengths = [c for _, c in missing]
return {
"count": len(missing),
"by_type_bin": dict(by_bin),
"char_length": {
"min": min(char_lengths) if char_lengths else None,
"max": max(char_lengths) if char_lengths else None,
"median": sorted(char_lengths)[len(char_lengths) // 2] if char_lengths else None,
},
"sample_sources": [s for s, _ in missing[:10]],
}
def section_corpus_coverage(cur):
"""How much of the embeddings corpus has frame coverage?"""
cur.execute("SELECT count(DISTINCT source) FROM embeddings")
total = cur.fetchone()[0]
cur.execute("""
SELECT count(DISTINCT source) FROM embeddings
WHERE source LIKE 'Claude:%' OR source LIKE 'ChatGPT:%'
OR source LIKE 'Aaron AI:%' OR type='aaronai_conversation'
""")
conversations = cur.fetchone()[0]
cur.execute("SELECT count(DISTINCT source) FROM stage_3_queue WHERE stage2_metadata IS NOT NULL")
with_frames = cur.fetchone()[0]
cur.execute("""
SELECT count(DISTINCT source) FROM stage_2_queue
WHERE completed_at IS NOT NULL AND char_length < 2000
""")
short_no_frames = cur.fetchone()[0]
cur.execute("""
SELECT count(DISTINCT source) FROM stage_2_queue
WHERE failed_at IS NOT NULL
""")
failed = cur.fetchone()[0]
return {
"total_distinct_sources_in_embeddings": total,
"conversations_no_frames_by_design": conversations,
"files_with_frames": with_frames,
"files_short_no_frames": short_no_frames,
"files_stage2_failed": failed,
"frame_coverage_pct": round(100.0 * with_frames / max(total, 1), 1),
}
def main():
conn = psycopg2.connect(os.environ["PG_DSN"])
cur = conn.cursor()
rows = fetch_rows(cur)
n_docs = len(rows)
print(f"=== Stage 2 frame distribution report ({n_docs} docs) ===\n")
# 1. Frequency
freq = section_frequency(rows)
print(f"--- 1. Frame frequency ({len(freq)} distinct labels) ---")
for label, count in freq.most_common(30):
print(f" {count:5d} {label}")
print()
# 2. Hygiene
collisions = section_hygiene(freq)
print(f"--- 2. Label hygiene (normalized collisions: {len(collisions)}) ---")
for norm, variants in sorted(collisions.items(), key=lambda kv: -sum(c for _, c in kv[1])):
variant_str = ", ".join(f"{r!r}:{c}" for r, c in sorted(variants, key=lambda x: -x[1]))
print(f" '{norm}': {variant_str}")
print()
# 3. Per-doc frame count
per_doc = section_per_doc_count(rows)
print("--- 3. Per-doc frame count ---")
for n in sorted(per_doc):
print(f" {n} frames: {per_doc[n]} docs")
print()
# 4. Co-occurrence (top-K)
top_frames = [f for f, _ in freq.most_common(TOP_K)]
pairs = section_cooccurrence(rows, top_frames)
print(f"--- 4. Co-occurrence (top-{TOP_K} frames, top-30 pairs) ---")
for (a, b), count in pairs.most_common(30):
print(f" {count:4d} {a} × {b}")
print()
# 5. Folder cross-tab
crosstab, bin_totals = section_folder_crosstab(rows, top_frames)
print(f"--- 5. Frame × folder cross-tab (top-{TOP_K} frames) ---")
bins_sorted = [b for b, _ in bin_totals.most_common()]
print(f" bins (with totals): " + ", ".join(f"{b}({n})" for b, n in bin_totals.most_common(10)))
for f in top_frames:
row_data = crosstab[f]
if not row_data:
continue
cells = ", ".join(f"{b}={c}" for b, c in row_data.most_common(5))
print(f" {f}: {cells}")
print()
# 6. Worker versions
versions, keys_by_version = section_worker_versions(rows)
print("--- 6. Worker version split ---")
for v, count in versions.most_common():
print(f" v{v}: {count} docs")
top_shapes = keys_by_version[v].most_common(3)
for keys, kcount in top_shapes:
print(f" {kcount} docs with keys={list(keys)}")
print()
# 7. Data gap
gap = section_data_gap(cur)
print("--- 7. Data-gap accounting (Stage 2 docs <2000 chars; never frame-extracted) ---")
print(f" count: {gap['count']}")
print(f" char_length: min={gap['char_length']['min']}, median={gap['char_length']['median']}, max={gap['char_length']['max']}")
print(f" by type bin: {gap['by_type_bin']}")
print(f" sample sources: {gap['sample_sources']}")
print()
# 8. Corpus coverage
coverage = section_corpus_coverage(cur)
print("--- 8. Corpus-wide frame coverage ---")
print(f" total distinct sources in embeddings: {coverage['total_distinct_sources_in_embeddings']}")
print(f" conversations (no frames by design): {coverage['conversations_no_frames_by_design']}")
print(f" files with frames: {coverage['files_with_frames']}")
print(f" files short, no frames: {coverage['files_short_no_frames']}")
print(f" files Stage 2 failed: {coverage['files_stage2_failed']}")
print(f" frame coverage: {coverage['frame_coverage_pct']}% of corpus")
print()
# JSON sidecar
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
sidecar = {
"generated_at": datetime.now().isoformat(),
"n_docs_with_frames": n_docs,
"n_distinct_labels": len(freq),
"top_30_frames": freq.most_common(30),
"label_collisions": {
k: [(r, c) for r, c in v] for k, v in collisions.items()
},
"per_doc_frame_count": dict(per_doc),
"top_30_pairs": [
{"a": a, "b": b, "count": c}
for (a, b), c in pairs.most_common(30)
],
"folder_crosstab": {
f: dict(crosstab[f]) for f in top_frames if crosstab[f]
},
"bin_totals": dict(bin_totals),
"worker_versions": dict(versions),
"data_gap": gap,
"corpus_coverage": coverage,
}
OUT_PATH.write_text(json.dumps(sidecar, indent=2, default=str))
print(f"JSON sidecar written: {OUT_PATH}")
cur.close()
conn.close()
if __name__ == "__main__":
main()
-30
@@ -1,30 +0,0 @@
"""
Aaron AI ingest_failures helpers — shared by watcher.py and ingest.py.
Both modules write structured failure rows so the SettingsPanel "Ingest Health"
view sees the same shape regardless of ingest path. Functions take an explicit
conn parameter; the caller decides transaction boundaries and exception
handling. Both current callers wrap with their own log-and-swallow shims.
"""
def record_ingest_failure(conn, source: str, filepath, error: str) -> None:
"""Insert or update an ingest_failures row. Commits."""
cur = conn.cursor()
cur.execute("""
INSERT INTO ingest_failures (source, filepath, error, retry_count, first_failed_at, last_failed_at)
VALUES (%s, %s, %s, 0, NOW(), NOW())
ON CONFLICT (source) DO UPDATE SET
error = EXCLUDED.error,
retry_count = ingest_failures.retry_count + 1,
last_failed_at = NOW(),
resolved = FALSE
""", (source, str(filepath), error[:1000]))
conn.commit()
def resolve_ingest_failure(conn, source: str) -> None:
"""Mark a previously failed source as resolved. Commits."""
cur = conn.cursor()
cur.execute("UPDATE ingest_failures SET resolved = TRUE WHERE source = %s", (source,))
conn.commit()
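Review note: the upsert semantics of `record_ingest_failure` are easy to check in isolation. The sketch below uses an in-memory SQLite table as a stand-in for Postgres, which is an assumption made only so the example runs anywhere (SQLite 3.24 or newer is needed for `ON CONFLICT ... DO UPDATE`). The table is trimmed to the columns that matter here, and `record_failure` is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ingest_failures (
        source TEXT PRIMARY KEY,
        error TEXT,
        retry_count INTEGER NOT NULL,
        resolved INTEGER NOT NULL DEFAULT 0
    )
""")


def record_failure(conn, source, error):
    # First failure inserts with retry_count 0; a repeat failure keeps the
    # row, bumps retry_count, and re-opens it by clearing resolved.
    conn.execute(
        """
        INSERT INTO ingest_failures (source, error, retry_count)
        VALUES (?, ?, 0)
        ON CONFLICT (source) DO UPDATE SET
            error = excluded.error,
            retry_count = retry_count + 1,
            resolved = 0
        """,
        (source, error[:1000]),
    )
    conn.commit()


record_failure(conn, "notes.md", "first error")
conn.execute("UPDATE ingest_failures SET resolved = 1 WHERE source = ?", ("notes.md",))
record_failure(conn, "notes.md", "second error")
row = conn.execute(
    "SELECT error, retry_count, resolved FROM ingest_failures WHERE source = ?",
    ("notes.md",),
).fetchone()
```

One thing worth noting about the original statement: on conflict it updates `error`, `retry_count`, `last_failed_at`, and `resolved`, but not `filepath`, so a source that later fails from a different path keeps the path recorded at first failure.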
+419 -63
@@ -1,14 +1,44 @@
"""
Aaron AI — Graphiti Sidecar Service
Wraps graphiti-core in a FastAPI service to avoid asyncio event loop conflicts.
Aaron AI — Graphiti Sidecar Service (v2.0 — Pattern 1 async job model)
Wraps graphiti-core in a FastAPI service. Pattern 1 architecture: ingest
submission and completion are decoupled. Submitters POST to /episodes or
/episodes/bulk and receive a job_id; an in-process background worker
processes jobs serially against the graph; submitters poll GET /jobs/{id}
until terminal status.
Why Pattern 1: tonight's smoke test (2026-05-02) confirmed that bulk
ingest against the 4,222-entity graph commits successfully even when the
worker's HTTP read-timeout fires. The synchronous interface was producing
false-negative failures — work succeeded but the worker stopped listening.
Pattern 1 separates submission from completion observation so the worker
can't false-negative this way.
Architectural commitments:
- One in-flight job per sidecar (per graph). Concurrent jobs against the
same graph would race on graphiti-core's _resolve_nodes_and_edges_bulk
(no transaction boundary, no internal coordination). Concurrent
multi-tenancy is "run multiple sidecars," not "make one sidecar
concurrency-safe across graphs."
- Postgres-backed job state. Survives sidecar restart. On startup the
sidecar resets any 'running' rows to 'queued' (their previous run died);
the background worker picks them up naturally.
- Both /episodes and /episodes/bulk are async-shaped for parity. graphiti-
core operations underneath (add_episode, add_episode_bulk) are unchanged.
- The bulk pathway is preserved — load-bearing for first-run corpus
migration. Single-episode is preserved — load-bearing for state-
superseding content per the Stage 2/3 routing rule.
Port 8001 (internal only). No OpenAI dependency.
"""
import os, logging, sys, traceback
import os, logging, sys, asyncio, traceback, uuid, json
from contextlib import asynccontextmanager
from datetime import datetime
from pathlib import Path
import psycopg2
import psycopg2.extras
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
@@ -31,8 +61,18 @@ FALKORDB_PORT = int(os.getenv("FALKORDB_PORT", "6379"))
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "anthropic")
LLM_MODEL = os.getenv("LLM_MODEL", "claude-sonnet-4-6")
LLM_API_KEY = os.getenv("LLM_API_KEY") or os.getenv("ANTHROPIC_API_KEY")
PG_DSN = os.getenv("PG_DSN")
SIDECAR_NAME = os.getenv("SIDECAR_NAME", "graphiti-sidecar-1")
os.environ["EMBEDDING_DIM"] = "384"
# Background worker configuration. Polls Postgres for queued jobs every
# WORKER_POLL_INTERVAL seconds when idle. Single-job-at-a-time by design;
# no concurrency primitive beyond the serial loop. The sleep is brief
# enough to feel responsive but long enough to avoid burning CPU on an
# empty queue.
WORKER_POLL_INTERVAL = 2.0
def get_llm_client():
from graphiti_core.llm_client.config import LLMConfig
config = LLMConfig(api_key=LLM_API_KEY, model=LLM_MODEL)
@@ -50,16 +90,286 @@ def get_llm_client():
return GroqClient(config)
raise ValueError(f"Unsupported LLM provider: {LLM_PROVIDER}")
graphiti_instance = None
async def get_graphiti():
if graphiti_instance is None:
raise HTTPException(status_code=503, detail="Graphiti not initialized")
return graphiti_instance
graphiti_instance = None
worker_task = None
# ---------------------------------------------------------------------------
# Postgres job-state helpers. Synchronous psycopg2 calls inside async
# functions: each call opens a fresh connection, runs one statement, closes.
# Acceptable here because traffic is low (single-digit jobs/min steady state)
# and the simplicity is worth more than connection pooling. If this ever
# becomes a bottleneck, swap to asyncpg or psycopg3 async.
# ---------------------------------------------------------------------------
def _pg():
return psycopg2.connect(PG_DSN)
def _job_insert(job_id: str, job_type: str, payload: dict) -> None:
"""Write a new job row in 'queued' status."""
pg = _pg()
cur = pg.cursor()
cur.execute(
"""
INSERT INTO graphiti_jobs (job_id, job_type, payload, status, submitted_by)
VALUES (%s, %s, %s::jsonb, 'queued', %s)
""",
(job_id, job_type, json.dumps(payload), SIDECAR_NAME),
)
pg.commit()
pg.close()
def _job_get(job_id: str) -> dict | None:
"""Read a single job by id. Returns None if not found."""
pg = _pg()
cur = pg.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cur.execute(
"""
SELECT job_id, job_type, status, enqueued_at, started_at, finished_at,
error, summary, submitted_by
FROM graphiti_jobs
WHERE job_id = %s
""",
(job_id,),
)
row = cur.fetchone()
pg.close()
if row is None:
return None
# Convert UUID, datetimes for JSON serialization
return {
"job_id": str(row["job_id"]),
"job_type": row["job_type"],
"status": row["status"],
"enqueued_at": row["enqueued_at"].isoformat() if row["enqueued_at"] else None,
"started_at": row["started_at"].isoformat() if row["started_at"] else None,
"finished_at": row["finished_at"].isoformat() if row["finished_at"] else None,
"error": row["error"],
"summary": row["summary"],
"submitted_by": row["submitted_by"],
}
def _job_claim_next() -> dict | None:
"""Atomically claim the oldest queued job for processing.
Uses SELECT ... FOR UPDATE SKIP LOCKED so multiple sidecar instances
(future multi-tenant deployment) don't fight over the same row. For
single-sidecar deployments this is just a clean atomic transition.
Returns the full job row (including payload) or None if queue is empty.
"""
pg = _pg()
cur = pg.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cur.execute(
"""
WITH next_job AS (
SELECT job_id
FROM graphiti_jobs
WHERE status = 'queued'
ORDER BY enqueued_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
)
UPDATE graphiti_jobs g
SET status = 'running', started_at = NOW()
FROM next_job
WHERE g.job_id = next_job.job_id
RETURNING g.job_id, g.job_type, g.payload
"""
)
row = cur.fetchone()
pg.commit()
pg.close()
if row is None:
return None
return {
"job_id": str(row["job_id"]),
"job_type": row["job_type"],
"payload": row["payload"], # already a dict via JSONB
}
def _job_complete(job_id: str, summary: dict) -> None:
pg = _pg()
cur = pg.cursor()
cur.execute(
"""
UPDATE graphiti_jobs
SET status = 'committed', finished_at = NOW(), summary = %s::jsonb
WHERE job_id = %s
""",
(json.dumps(summary), job_id),
)
pg.commit()
pg.close()
def _job_fail(job_id: str, error: str) -> None:
pg = _pg()
cur = pg.cursor()
cur.execute(
"""
UPDATE graphiti_jobs
SET status = 'failed', finished_at = NOW(), error = %s
WHERE job_id = %s
""",
(error[:2000], job_id), # truncate to keep error column reasonable
)
pg.commit()
pg.close()
def _startup_recovery() -> int:
"""Reset any 'running' jobs to 'queued' on startup.
Rationale: if the sidecar died while processing a job, that row is
stuck in 'running' with no process advancing it. The right behavior
on restart is to retry. graphiti-core's add_episode_bulk and
add_episode are idempotent against the graph (dedup handles duplicate
submission), so re-running a job is safe — at worst, a second run
incurs API spend on resolve calls that no-op against an already-
committed entity set.
Returns the count of recovered jobs.
"""
pg = _pg()
cur = pg.cursor()
cur.execute(
"""
UPDATE graphiti_jobs
SET status = 'queued', started_at = NULL
WHERE status = 'running'
"""
)
count = cur.rowcount
pg.commit()
pg.close()
return count
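Review note: the claim and recovery helpers above amount to a small state machine (queued, running, then committed or failed, with running reset to queued on restart). The in-memory model below is only a sketch to make that lifecycle concrete; `Job`, `claim_next`, and `startup_recovery` here are simplified stand-ins that ignore locking and Postgres entirely.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    status: str = "queued"
    started: bool = False


def claim_next(jobs):
    # List order stands in for ORDER BY enqueued_at ASC.
    for job in jobs:
        if job.status == "queued":
            job.status, job.started = "running", True
            return job
    return None


def startup_recovery(jobs):
    # A 'running' row at startup means the previous process died mid-job.
    recovered = 0
    for job in jobs:
        if job.status == "running":
            job.status, job.started = "queued", False
            recovered += 1
    return recovered


jobs = [Job("a"), Job("b")]
first = claim_next(jobs)             # "a" moves to running
recovered = startup_recovery(jobs)   # simulated crash and restart
retried = claim_next(jobs)           # "a" is picked up again, ahead of "b"
```

The retry is only safe because, as the docstring above states, re-running a job is expected to be deduplicated by the graph layer.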
# ---------------------------------------------------------------------------
# Background worker — single asyncio task running for the sidecar lifetime.
# Processes one job at a time. No concurrency. Restart recovery is handled
# by _startup_recovery() before this task starts.
# ---------------------------------------------------------------------------
async def background_worker():
"""Serial job processor. Polls graphiti_jobs, processes one at a time."""
log.info("Background worker started")
from graphiti_core.nodes import EpisodeType
from graphiti_core.utils.bulk_utils import RawEpisode
while True:
try:
claimed = _job_claim_next()
if claimed is None:
await asyncio.sleep(WORKER_POLL_INTERVAL)
continue
job_id = claimed["job_id"]
job_type = claimed["job_type"]
payload = claimed["payload"]
log.info(f"Processing job {job_id} (type={job_type})")
start = datetime.now()
try:
if job_type == "bulk":
summary = await _process_bulk_job(payload, EpisodeType, RawEpisode)
elif job_type == "single":
summary = await _process_single_job(payload, EpisodeType)
else:
raise ValueError(f"Unknown job_type: {job_type}")
duration = (datetime.now() - start).total_seconds()
summary["duration_seconds"] = duration
_job_complete(job_id, summary)
log.info(f"Committed job {job_id} in {duration:.1f}s — {summary}")
except Exception as e:
duration = (datetime.now() - start).total_seconds()
err = f"{type(e).__name__}: {e}"
log.error(f"Job {job_id} failed after {duration:.1f}s: {err}\n{traceback.format_exc()}")
_job_fail(job_id, err)
except asyncio.CancelledError:
log.info("Background worker cancelled")
raise
except Exception as e:
# Defensive: don't let the worker loop die from an unexpected error.
# Log it, sleep briefly, continue.
log.error(f"Worker loop error: {e}\n{traceback.format_exc()}")
await asyncio.sleep(5.0)
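Review note: the loop shape used by `background_worker` (poll, process one job, record success or failure, keep going, exit only on cancellation) can be exercised without Postgres or graphiti-core. The sketch below is a simplified analogue under obvious assumptions: a plain list replaces the jobs table, and `serial_worker`, `handler`, and `demo` are hypothetical names.

```python
import asyncio


async def serial_worker(queue, handler, results, poll_interval=0.01):
    while True:
        job = queue.pop(0) if queue else None
        if job is None:
            # Idle: sleeping is also where cancellation gets delivered.
            await asyncio.sleep(poll_interval)
            continue
        try:
            results.append(("committed", job, await handler(job)))
        except Exception as exc:
            # CancelledError derives from BaseException (Python 3.8+), so
            # it is not swallowed here and still stops the worker.
            results.append(("failed", job, f"{type(exc).__name__}: {exc}"))


async def demo():
    async def handler(job):
        if job == "bad":
            raise ValueError("boom")
        return job.upper()

    queue, results = ["a", "bad", "b"], []
    task = asyncio.create_task(serial_worker(queue, handler, results))
    while len(results) < 3:
        await asyncio.sleep(0.01)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return results


results = asyncio.run(demo())
```

A failing job is recorded and the loop moves on to the next one, which mirrors how the worker above calls `_job_fail` and continues rather than exiting.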
async def _process_bulk_job(payload: dict, EpisodeType, RawEpisode) -> dict:
"""Run add_episode_bulk for a 'bulk' job. Payload mirrors BulkEpisodeRequest."""
raw_episodes = []
for ep in payload["episodes"]:
ref_time = (
datetime.fromisoformat(ep["timestamp"])
if ep.get("timestamp") else datetime.now()
)
raw_episodes.append(RawEpisode(
name=ep["name"],
content=ep["content"],
source_description=ep.get("source_description", ""),
source=EpisodeType.text,
reference_time=ref_time,
))
kwargs = dict(
bulk_episodes=raw_episodes,
group_id=payload.get("group_id") or GROUP_ID,
saga=payload.get("saga"),
)
if payload.get("custom_extraction_instructions") is not None:
kwargs["custom_extraction_instructions"] = payload["custom_extraction_instructions"]
result = await graphiti_instance.add_episode_bulk(**kwargs)
return {
"type": "bulk",
"episodes": len(result.episodes) if result and result.episodes else len(raw_episodes),
"nodes": len(result.nodes) if result and result.nodes else 0,
"edges": len(result.edges) if result and result.edges else 0,
}
async def _process_single_job(payload: dict, EpisodeType) -> dict:
"""Run add_episode for a 'single' job. Payload mirrors EpisodeRequest."""
ref_time = (
datetime.fromisoformat(payload["timestamp"])
if payload.get("timestamp") else datetime.now()
)
kwargs = dict(
name=payload["name"],
episode_body=payload["content"],
source=EpisodeType.text,
reference_time=ref_time,
source_description=payload.get("source_description", ""),
group_id=payload.get("group_id") or GROUP_ID,
custom_extraction_instructions=payload.get("custom_extraction_instructions"),
)
if payload.get("saga") is not None:
kwargs["saga"] = payload["saga"]
await graphiti_instance.add_episode(**kwargs)
return {"type": "single", "episodes": 1}
# ---------------------------------------------------------------------------
# Lifespan & app
# ---------------------------------------------------------------------------
@asynccontextmanager
async def lifespan(app: FastAPI):
global graphiti_instance
global graphiti_instance, worker_task
sys.path.insert(0, str(Path.home() / "aaronai" / "scripts"))
log.info("Loading embedding and reranker models...")
from st_embedder import SentenceTransformerEmbedder
@@ -75,22 +385,51 @@ async def lifespan(app: FastAPI):
max_coroutines=2,
)
await graphiti_instance.build_indices_and_constraints()
# Bridge driver._search_ops to driver.search_interface — graphiti-core 0.29.0
# builds FalkorSearchOperations as driver._search_ops in FalkorDriver.__init__
# but never assigns it to driver.search_interface. search_utils.py dispatches
# on driver.search_interface; without this assignment it falls back to
# interpreted-Cypher cosine math (full table scans). Together with the
# vendored patches in graphiti_patches/, this activates FalkorDB's native
# vector index for entity dedup similarity search.
if (hasattr(graphiti_instance.driver, "_search_ops")
and graphiti_instance.driver.search_interface is None):
# PATCHED 2026-05-02: bridge the per-driver SearchOperations to the
# search_interface attribute that search_utils.py dispatches on.
# graphiti-core 0.29.0 builds FalkorSearchOperations as driver._search_ops
# but never assigns it to driver.search_interface — naming mismatch
# between the two halves of the codebase. Without this, search_utils.py
# falls through to interpreted-Cypher cosine math (full-table scan) even
# when our patched FalkorSearchOperations exists. Setting search_interface
# activates the per-driver vector-index path.
if hasattr(graphiti_instance.driver, '_search_ops') and graphiti_instance.driver.search_interface is None:
graphiti_instance.driver.search_interface = graphiti_instance.driver._search_ops
log.info("Wired driver.search_interface = driver._search_ops (vector index path active)")
log.info(f"Graphiti ready — provider: {LLM_PROVIDER}, group: {GROUP_ID}")
# Recover any jobs left 'running' from a previous sidecar instance.
# They become 'queued' again and the background worker picks them up.
recovered = _startup_recovery()
if recovered > 0:
log.info(f"Startup recovery: reset {recovered} running job(s) to queued")
# Start the background job worker.
worker_task = asyncio.create_task(background_worker())
log.info("Sidecar ready — accepting job submissions on :8001")
yield
# Shutdown: cancel worker, close graphiti.
if worker_task is not None:
worker_task.cancel()
try:
await worker_task
except asyncio.CancelledError:
pass
await graphiti_instance.close()
app = FastAPI(title="Aaron AI Graphiti Sidecar", lifespan=lifespan)
app = FastAPI(title="Aaron AI Graphiti Sidecar (Pattern 1)", lifespan=lifespan)
# ---------------------------------------------------------------------------
# Request models — preserved from v1.0 with no payload-shape changes. The
# only API change is the response shape: instead of blocking until
# graphiti-core returns, submission endpoints return a job_id immediately.
# ---------------------------------------------------------------------------
class BulkEpisodeItem(BaseModel):
name: str
@@ -103,6 +442,7 @@ class BulkEpisodeRequest(BaseModel):
episodes: list[BulkEpisodeItem]
group_id: str | None = None
saga: str | None = None
custom_extraction_instructions: str | None = None
class EpisodeRequest(BaseModel):
@@ -112,63 +452,78 @@ class EpisodeRequest(BaseModel):
timestamp: str | None = None
group_id: str | None = None
custom_extraction_instructions: str | None = None
saga: str | None = None
# ---------------------------------------------------------------------------
# Endpoints
# ---------------------------------------------------------------------------
@app.get("/health")
async def health():
return {"ok": True, "provider": LLM_PROVIDER, "group": GROUP_ID}
return {
"ok": True,
"provider": LLM_PROVIDER,
"group": GROUP_ID,
"sidecar": SIDECAR_NAME,
"version": "2.0",
}
@app.post("/episodes")
async def add_episode(req: EpisodeRequest):
g = await get_graphiti()
from graphiti_core.nodes import EpisodeType
try:
ref_time = datetime.fromisoformat(req.timestamp) if req.timestamp else datetime.now()
await g.add_episode(
name=req.name,
episode_body=req.content,
source=EpisodeType.text,
reference_time=ref_time,
source_description=req.source_description,
group_id=req.group_id or GROUP_ID,
custom_extraction_instructions=req.custom_extraction_instructions,
)
return {"ok": True}
except Exception as e:
log.error(f"Episode ingestion failed: {e}\n{traceback.format_exc()}")
raise HTTPException(status_code=500, detail=str(e))
@app.post("/episodes/bulk")
async def add_episodes_bulk(req: BulkEpisodeRequest):
g = await get_graphiti()
from graphiti_core.nodes import EpisodeType
from graphiti_core.utils.bulk_utils import RawEpisode
raw_episodes = []
for ep in req.episodes:
ref_time = datetime.fromisoformat(ep.timestamp) if ep.timestamp else datetime.now()
raw_episodes.append(RawEpisode(
name=ep.name,
content=ep.content,
source_description=ep.source_description,
source=EpisodeType.text,
reference_time=ref_time,
))
async def submit_bulk(req: BulkEpisodeRequest):
"""Submit a bulk ingest job. Returns job_id for polling.
Job is processed serially by the sidecar's background worker; one
bulk-or-single job at a time per graph. No HTTP read-timeout
blocking. Submitter polls GET /jobs/{job_id} until terminal status.
"""
if graphiti_instance is None:
raise HTTPException(status_code=503, detail="Graphiti not initialized")
job_id = str(uuid.uuid4())
payload = req.model_dump()
try:
result = await g.add_episode_bulk(
bulk_episodes=raw_episodes,
group_id=req.group_id or GROUP_ID,
saga=req.saga or None,
)
return {"ok": True, "count": len(raw_episodes)}
_job_insert(job_id, "bulk", payload)
except Exception as e:
log.error(f"Bulk ingestion failed: {e}\n{traceback.format_exc()}")
raise HTTPException(status_code=500, detail=str(e))
log.error(f"Failed to enqueue bulk job: {e}\n{traceback.format_exc()}")
raise HTTPException(status_code=500, detail=f"Job enqueue failed: {e}")
return {"job_id": job_id, "status": "queued"}
@app.post("/episodes")
async def submit_single(req: EpisodeRequest):
"""Submit a single-episode ingest job. Returns job_id for polling."""
if graphiti_instance is None:
raise HTTPException(status_code=503, detail="Graphiti not initialized")
job_id = str(uuid.uuid4())
payload = req.model_dump()
try:
_job_insert(job_id, "single", payload)
except Exception as e:
log.error(f"Failed to enqueue single job: {e}\n{traceback.format_exc()}")
raise HTTPException(status_code=500, detail=f"Job enqueue failed: {e}")
return {"job_id": job_id, "status": "queued"}
@app.get("/jobs/{job_id}")
async def get_job(job_id: str):
"""Poll a job's status. Returns 404 if job not found."""
job = _job_get(job_id)
if job is None:
raise HTTPException(status_code=404, detail=f"Job {job_id} not found")
return job
@app.get("/search")
async def search(query: str, limit: int = 8, group_id: str | None = None):
g = await get_graphiti()
if graphiti_instance is None:
raise HTTPException(status_code=503, detail="Graphiti not initialized")
try:
results = await g.search(
results = await graphiti_instance.search(
query=query,
num_results=limit,
group_ids=[group_id or GROUP_ID],
@@ -189,6 +544,7 @@ async def search(query: str, limit: int = 8, group_id: str | None = None):
log.error(f"Search failed: {e}\n{traceback.format_exc()}")
raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="127.0.0.1", port=8001, log_level="info")
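The submit-then-poll contract introduced above can be sketched from the caller's side as follows. This is a minimal sketch, not the worker's actual client: `poll_job`, its parameters, and the assumption that the job record returned by `GET /jobs/{job_id}` carries a `status` key are all illustrative. The terminal statuses `committed` and `failed` come from the commit message.

```python
import time
from typing import Callable

# Terminal statuses named in the commit message for the async job model.
TERMINAL_STATUSES = {"committed", "failed"}


def poll_job(fetch_status: Callable[[str], dict], job_id: str,
             interval: float = 2.0, timeout: float = 3600.0,
             sleep: Callable[[float], None] = time.sleep,
             clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll a job until it reaches a terminal status or the deadline passes.

    fetch_status is a hypothetical callable that returns the job record
    for a job_id, for example a thin wrapper around GET /jobs/{job_id}.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {timeout}s"
            )
        sleep(interval)
```

Because the HTTP call is injected, the loop can be exercised without a running sidecar. The deadline here bounds only how long the submitter waits; it says nothing about whether the background job itself is still running.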
+131 -131
@@ -1,37 +1,70 @@
"""
Aaron AI bulk ingester. Two entry points:
- ingest_directory(folder, embedder=None) — programmatic; called from
api.py /api/reindex with the api process's shared embedder
- python3 scripts/ingest.py <folder> — CLI back-compat; loads its own embedder
Stage 1 helpers (extract / chunk / embed / write) live in scripts/encoding.py.
Failure tracking SQL lives in scripts/failures.py.
"""
import os
import sys
import hashlib
from pathlib import Path
from dotenv import load_dotenv
import psycopg2
import psycopg2.extras
import json
from sentence_transformers import SentenceTransformer
from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
from failures import (
record_ingest_failure as _record_failure_sql,
resolve_ingest_failure as _resolve_failure_sql,
)
from docx import Document
from pypdf import PdfReader
from pptx import Presentation
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
PG_DSN = os.getenv("PG_DSN")
print("Loading embedding model...")
embedder = SentenceTransformer("all-MiniLM-L6-v2")
PG_DSN = os.getenv("PG_DSN")
def get_pg():
return psycopg2.connect(PG_DSN)
def extract_text_from_docx(path):
doc = Document(path)
return "\n".join([para.text for para in doc.paragraphs if para.text.strip()])
def extract_text_from_pdf(path):
reader = PdfReader(path)
text = ""
for page in reader.pages:
extracted = page.extract_text()
if extracted:
text += extracted + "\n"
return text
def extract_text_from_pptx(path):
prs = Presentation(path)
text = ""
for slide in prs.slides:
for shape in slide.shapes:
if hasattr(shape, "text") and shape.text.strip():
text += shape.text + "\n"
return text
def extract_text_from_txt(path):
with open(path, "r", encoding="utf-8", errors="ignore") as f:
return f.read()
def chunk_text(text, chunk_size=500, overlap=50):
words = text.split()
chunks = []
start = 0
while start < len(words):
end = start + chunk_size
chunk = " ".join(words[start:end])
if chunk.strip():
chunks.append(chunk)
start += chunk_size - overlap
return chunks
def make_id(filepath, chunk_index):
path_hash = hashlib.md5(str(filepath).encode()).hexdigest()[:8]
return f"{path_hash}_{chunk_index}"
def enqueue_stage2(source, full_text):
"""Enqueue document for Stage 2 (Mistral orientation) -> Stage 3 (Graphiti ingest).
"""Enqueue document for Stage 2 (Mistral orientation) → Stage 3 (Graphiti ingest).
TEMPORARY: this queue feed will be removed when pgvector is decommissioned
and the watcher calls Stage 2 directly.
"""
@@ -54,127 +87,94 @@ def enqueue_stage2(source, full_text):
except Exception as e:
print(f" Stage 2 queue insert failed (non-fatal): {e}")
def ingest_file(filepath):
path = Path(filepath)
suffix = path.suffix.lower()
def _record_failure(filepath: Path, error: str) -> None:
try:
pg = get_pg()
try:
_record_failure_sql(pg, filepath.name, filepath, error)
finally:
pg.close()
except Exception as e:
print(f" Could not record ingest failure (non-fatal): {e}")
def _resolve_failure(source: str) -> None:
try:
pg = get_pg()
try:
_resolve_failure_sql(pg, source)
finally:
pg.close()
except Exception as e:
print(f" Could not resolve ingest failure record (non-fatal): {e}")
IGNORED_TOP_FOLDERS = {"Drafts"}
def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
"""Ingest a single file. Returns chunk count, 0 on skip/failure."""
# "~" catches Office lock files (~$) including the case where Nextcloud
# filesystem encoding has mangled the "$" to a unicode replacement char.
if filepath.name.startswith(("~", ".")):
if path.name.startswith("~$") or path.name.startswith("."):
return 0
if filepath.suffix.lower() not in SUPPORTED:
return 0
if root is not None:
try:
rel = filepath.parent.relative_to(root)
if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
return 0
except ValueError:
pass
blocks = extract_blocks(filepath)
if not blocks or not any(
(b.get("text") or "").strip() or (b.get("heading") or "").strip()
for b in blocks
):
_record_failure(filepath, "Text extraction failed or empty")
return 0
folder_rel = None
if root is not None:
try:
folder_rel = str(filepath.parent.relative_to(root))
except ValueError:
pass
try:
rows = chunk_and_embed(blocks, filepath.name, embedder,
filepath=filepath, folder=folder_rel)
except Exception as e:
_record_failure(filepath, f"Embedding failed: {e}")
return 0
if not rows:
return 0
try:
pg = get_pg()
try:
write_embeddings_batch(pg, rows)
finally:
pg.close()
except Exception as e:
_record_failure(filepath, f"pgvector write failed: {e}")
return 0
print(f" Indexed {len(rows)} chunks: {filepath.name}")
_resolve_failure(filepath.name)
if not os.getenv("SKIP_STAGE2_ENQUEUE"):
full_text = "\n".join(
f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
for b in blocks
)
enqueue_stage2(filepath.name, full_text)
return len(rows)
def ingest_directory(folder, embedder=None) -> dict:
"""Programmatic entry point. Returns {scanned, ingested, failed, total_chunks}.
If embedder is None, loads its own SentenceTransformer (CLI back-compat path).
Caller (e.g. api.py /api/reindex) should pass its module-level embedder so
the ~200MB model isn't reloaded per call.
"""
folder = Path(folder)
if not folder.exists():
return {"scanned": 0, "ingested": 0, "failed": 0, "total_chunks": 0,
"error": f"folder not found: {folder}"}
if embedder is None:
print("Loading embedding model...")
embedder = SentenceTransformer("all-MiniLM-L6-v2")
files = [f for f in folder.rglob("*")
if f.suffix.lower() in SUPPORTED
and not f.name.startswith(("~$", "."))]
print(f"Found {len(files)} files to process")
ingested = failed = total_chunks = 0
for f in files:
n = _ingest_one(f, embedder, root=folder)
if n > 0:
ingested += 1
total_chunks += n
if suffix == ".docx":
text = extract_text_from_docx(path)
elif suffix == ".pdf":
text = extract_text_from_pdf(path)
elif suffix == ".pptx":
text = extract_text_from_pptx(path)
elif suffix in [".txt", ".md"]:
text = extract_text_from_txt(path)
else:
failed += 1
return {"scanned": len(files), "ingested": ingested, "failed": failed,
"total_chunks": total_chunks}
return 0
if not text.strip():
return 0
chunks = chunk_text(text)
if not chunks:
return 0
embeddings = embedder.encode(chunks).tolist()
ids = [make_id(path, i) for i in range(len(chunks))]
metadatas = [{
"source": path.name,
"filepath": str(path),
"folder": str(path.parent.relative_to(Path(sys.argv[1]) if len(sys.argv) > 1 else path.parent))
} for _ in chunks]
# STAGE 1: Write to pgvector (TEMPORARY — remove when chat agent migrates to Graphiti)
pg = get_pg()
cur = pg.cursor()
for chunk_id, chunk, embedding, meta in zip(ids, chunks, embeddings, metadatas):
cur.execute("""
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
VALUES (%s, %s, %s::vector, %s, %s, %s, %s)
ON CONFLICT (id) DO UPDATE SET
document = EXCLUDED.document,
embedding = EXCLUDED.embedding,
source = EXCLUDED.source,
metadata = EXCLUDED.metadata
""", (
chunk_id, chunk, embedding,
meta.get("source"), "document", None,
json.dumps(meta)
))
pg.commit()
pg.close()
print(f" Indexed {len(chunks)} chunks: {path.name}")
# Enqueue for Stage 2 → Stage 3 (Graphiti pipeline)
# SKIP_STAGE2_ENQUEUE env var set by migration scripts to prevent bulk enqueue
if not os.getenv("SKIP_STAGE2_ENQUEUE"):
enqueue_stage2(path.name, text)
return len(chunks)
except Exception as e:
print(f" Error: {path.name}: {e}")
return 0
def ingest_folder(folder_path):
"""CLI back-compat wrapper. Loads its own embedder."""
result = ingest_directory(Path(folder_path))
print(f"\nDone. {result['ingested']} files / {result['total_chunks']} chunks indexed; "
f"{result['failed']} failed.")
folder = Path(folder_path)
if not folder.exists():
print(f"Folder not found: {folder_path}")
sys.exit(1)
supported = [".docx", ".pdf", ".pptx", ".txt", ".md"]
files = [f for f in folder.rglob("*")
if f.suffix.lower() in supported
and not f.name.startswith("~$")
and not f.name.startswith(".")]
if not files:
print("No supported files found.")
sys.exit(1)
print(f"Found {len(files)} files to process\n")
total_chunks = 0
for f in files:
total_chunks += ingest_file(f)
print(f"\nDone. Total chunks indexed: {total_chunks}")
if __name__ == "__main__":
target = sys.argv[1] if len(sys.argv) > 1 else str(Path.home() / "aaronai" / "docs")
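The skip rules in `_ingest_one` (lock-file and dotfile prefixes, unsupported suffixes, ignored top-level folders) can be gathered into a single predicate, sketched below. `should_skip` is a hypothetical helper, not part of the diff, and the `SUPPORTED` set is an assumption copied from the removed `supported` list; the real set lives in `scripts/encoding.py`.

```python
from pathlib import Path
from typing import Optional

# Assumed contents; the authoritative set is defined in scripts/encoding.py.
SUPPORTED = {".docx", ".pdf", ".pptx", ".txt", ".md"}
IGNORED_TOP_FOLDERS = {"Drafts"}


def should_skip(filepath: Path, root: Optional[Path] = None) -> bool:
    """Return True when a file should be skipped before extraction."""
    # "~" covers Office lock files (~$) even when "$" was mangled on disk.
    if filepath.name.startswith(("~", ".")):
        return True
    if filepath.suffix.lower() not in SUPPORTED:
        return True
    if root is not None:
        try:
            rel = filepath.parent.relative_to(root)
        except ValueError:
            # File is outside root; the folder rule does not apply.
            return False
        # Only the first path component under root is checked.
        if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
            return True
    return False
```

One possible use is applying the same predicate in the directory scan, since `ingest_directory` filters on the `"~$"` prefix while `_ingest_one` filters on `"~"`, and a file skipped by `_ingest_one` returns 0 and is therefore counted under `failed`.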
+3 -18
@@ -18,14 +18,8 @@ CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
PG_DSN = os.getenv("PG_DSN")
MIN_EXCHANGES = 3
_embedder = None
def get_embedder():
global _embedder
if _embedder is None:
print("Loading embedding model...")
_embedder = SentenceTransformer("all-MiniLM-L6-v2")
return _embedder
print("Loading embedding model...")
embedder = SentenceTransformer("all-MiniLM-L6-v2")
def get_conversations():
conn = sqlite3.connect(CONVERSATIONS_DB)
@@ -129,18 +123,9 @@ def run():
# Embed and insert
texts = [c[1] for c in new_chunks]
embeddings = get_embedder().encode(texts, show_progress_bar=False).tolist()
embeddings = embedder.encode(texts, show_progress_bar=False).tolist()
for (chunk_id, chunk_text, meta), embedding in zip(new_chunks, embeddings):
if not meta.get("type"):
raise ValueError(
f"chunk {chunk_id!r} missing 'type'; writers must supply it "
f"(see Improvement #2 in docs/birdai-component-inventory)"
)
# ON CONFLICT below intentionally overwrites created_at (unlike encoding.py's
# COALESCE): an Aaron-AI conversation's created_at tracks convo.updated_at,
# which advances on activity. Re-running this script on an active conv
# should refresh the timestamp, not preserve the first-seen one.
cur.execute("""
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
VALUES (%s, %s, %s::vector, %s, %s, %s, %s)
-136
@@ -1,136 +0,0 @@
"""
Orientation Indexer — feeds Stage 2's document-level orientations into pgvector
so they're searchable alongside chunk text by the retrieve_documents tool.
Each completed row in stage_3_queue has an `orientation` string (active_frames
+ frame_relationships + extraction_orientation + one_sentence_summary) that
describes the document at a conceptual level. Indexing it as its own row in
the embeddings table gives the cross-encoder a second surface to rank against
"what is this document about" rather than just "what does this chunk say."
This worker is part of the "read-only Graphiti + orientation-into-pgvector"
plan B that replaced the Stage 3 → Graphiti write path. The graph layer is
queried directly via the search_facts chat tool; orientations land here.
State tracking: a row is considered indexed if the embeddings table already
holds a row with source=<source> and metadata->>'kind'='orientation'. The
worker is idempotent — restart-safe, resumable.
Runs as systemd: aaronai-orientation-indexer.service
"""
import logging
import os
import sys
import time
from pathlib import Path
from dotenv import load_dotenv
import psycopg2
from sentence_transformers import SentenceTransformer
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
sys.path.insert(0, str(Path(__file__).parent))
from encoding import write_embeddings_batch
PG_DSN = os.getenv("PG_DSN")
EMBED_MODEL = "all-MiniLM-L6-v2"
BATCH_SIZE = 25
POLL_INTERVAL_SECS = 30
LOG_FILE = "/var/log/aaronai/orientation-indexer.log"
HEARTBEAT_FILE = "/var/log/aaronai/orientation-indexer-heartbeat"
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [orientation-indexer] %(levelname)s %(message)s",
handlers=[logging.FileHandler(LOG_FILE, mode="a")],
)
log = logging.getLogger("orientation-indexer")
def get_pg():
return psycopg2.connect(PG_DSN)
def fetch_unindexed(cur, limit):
"""Pull stage_3_queue rows with a non-null orientation whose orientation
hasn't been written to the embeddings table yet."""
cur.execute(
"""
SELECT s.source, s.orientation
FROM stage_3_queue s
WHERE s.orientation IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM embeddings e
WHERE e.source = s.source
AND e.metadata->>'kind' = 'orientation'
)
ORDER BY s.enqueued_at
LIMIT %s
""",
(limit,),
)
return cur.fetchall()
def _row_for(source: str, orientation: str, embedding) -> dict:
"""Build an embeddings row for the orientation. id is deterministic so
re-runs don't create duplicates if the unique check above ever races."""
import hashlib
chunk_id = hashlib.md5(f"orientation:{source}".encode()).hexdigest()[:8] + "_orient"
return {
"id": chunk_id,
"document": orientation,
"embedding": embedding,
"source": source,
"type": "document",
"metadata": {
"source": source,
"kind": "orientation",
},
}
def write_heartbeat():
try:
Path(HEARTBEAT_FILE).write_text(str(time.time()))
except Exception:
pass
def main():
log.info("Orientation indexer starting...")
log.info(f"Loading embedding model: {EMBED_MODEL}")
embedder = SentenceTransformer(EMBED_MODEL)
log.info("Embedding model ready.")
while True:
write_heartbeat()
try:
pg = get_pg()
try:
cur = pg.cursor()
rows = fetch_unindexed(cur, BATCH_SIZE)
if not rows:
pg.close()
time.sleep(POLL_INTERVAL_SECS)
continue
orientations = [r[1] for r in rows]
embeddings = embedder.encode(orientations).tolist()
batch = [
_row_for(source, orient, emb)
for (source, orient), emb in zip(rows, embeddings)
]
write_embeddings_batch(pg, batch)
log.info(f"Indexed {len(batch)} orientation(s)")
finally:
pg.close()
except Exception as e:
log.error(f"Indexing loop iteration failed: {e}")
time.sleep(POLL_INTERVAL_SECS)
if __name__ == "__main__":
main()
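The deterministic id built in `_row_for` is what makes re-runs idempotent: the same source always hashes to the same row id. A minimal standalone sketch of that derivation follows; `orientation_row_id` is a hypothetical name for the expression used in the removed file.

```python
import hashlib


def orientation_row_id(source: str) -> str:
    """Derive a stable embeddings-row id for a source's orientation."""
    # First 8 hex characters of the MD5 digest (32 bits), plus a fixed suffix.
    digest = hashlib.md5(f"orientation:{source}".encode()).hexdigest()
    return digest[:8] + "_orient"
```

Truncating to 8 hex characters keeps ids short but leaves only 32 bits of hash, so two different sources can in principle map to the same id; whether that matters depends on how many sources are indexed.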
-146
@@ -1,146 +0,0 @@
"""One-off: re-ingest docx+pptx after the 2026-05-04 extractor upgrade (commit 93c0d89).
Pre-upgrade extraction missed tables, headers/footers, text boxes, group shapes,
and pptx notes — leaving CVs/dossiers as section-header skeletons in the index.
Steps when run with --apply:
1. DELETE all embeddings rows where source ends in .docx or .pptx
2. Walk NEXTCLOUD_PATH and re-ingest every .docx/.pptx via _ingest_one
3. Stage 2 enqueue is suppressed (SKIP_STAGE2_ENQUEUE=1)
Without --apply: dry-run. Counts files and chunks, prints a sample, writes nothing.
"""
import os
import re
import sys
import time
from pathlib import Path
os.environ["SKIP_STAGE2_ENQUEUE"] = "1"
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
import psycopg2
from sentence_transformers import SentenceTransformer
sys.path.insert(0, str(Path(__file__).parent))
from ingest import _ingest_one, get_pg
NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
APPLY = "--apply" in sys.argv
_ext_args = [a for a in sys.argv[1:] if a.startswith("--ext=")]
if _ext_args:
TARGET_EXTS = {("." + e.lstrip(".")) for arg in _ext_args
for e in arg.split("=", 1)[1].split(",")}
else:
TARGET_EXTS = {".docx", ".pptx"}
def _ext_regex():
inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(TARGET_EXTS))
return f"\\.({inner})$"
def count_stale():
pg = get_pg()
cur = pg.cursor()
cur.execute(
f"SELECT lower(substring(source from '\\.[^.]+$')) AS ext, "
f"COUNT(DISTINCT source) AS files, COUNT(*) AS chunks "
f"FROM embeddings WHERE lower(source) ~ '{_ext_regex()}' "
f"GROUP BY 1 ORDER BY 1"
)
rows = cur.fetchall()
pg.close()
return rows
def delete_stale():
pg = get_pg()
cur = pg.cursor()
cur.execute(f"DELETE FROM embeddings WHERE lower(source) ~ '{_ext_regex()}'")
deleted = cur.rowcount
pg.commit()
pg.close()
return deleted
def find_files():
files = []
for f in NEXTCLOUD_PATH.rglob("*"):
if not f.is_file():
continue
if f.suffix.lower() not in TARGET_EXTS:
continue
if f.name.startswith(("~$", ".")):
continue
files.append(f)
return files
def main():
print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
print(f"Target: {NEXTCLOUD_PATH}")
print(f"Extensions: {sorted(TARGET_EXTS)}")
print(f"SKIP_STAGE2_ENQUEUE={os.environ.get('SKIP_STAGE2_ENQUEUE')}")
print()
print("Stale chunks currently in DB:")
for ext, files, chunks in count_stale():
print(f" {ext}: {files} files, {chunks} chunks")
print()
files = find_files()
by_ext = {}
for f in files:
by_ext.setdefault(f.suffix.lower(), []).append(f)
print(f"Files on disk to re-ingest:")
for ext, lst in sorted(by_ext.items()):
print(f" {ext}: {len(lst)} files")
print(f" total: {len(files)}")
print()
print("Sample (5 random):")
import random
for f in random.sample(files, min(5, len(files))):
print(f" {f}")
print()
if not APPLY:
print("Dry-run only. Re-run with --apply to delete + re-ingest.")
return
print("Deleting stale chunks...")
n = delete_stale()
print(f" deleted {n} rows")
print()
print("Loading embedder...")
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print()
print(f"Re-ingesting {len(files)} files...")
started = time.time()
ingested = failed = total_chunks = 0
for i, f in enumerate(files, 1):
n = _ingest_one(f, embedder, root=NEXTCLOUD_PATH)
if n > 0:
ingested += 1
total_chunks += n
else:
failed += 1
if i % 25 == 0 or i == len(files):
elapsed = time.time() - started
rate = i / elapsed if elapsed else 0
print(f" [{i}/{len(files)}] ingested={ingested} failed={failed} "
f"chunks={total_chunks} ({rate:.1f} files/s)")
elapsed = time.time() - started
print()
print(f"Done in {elapsed:.0f}s: {ingested} ingested, {failed} failed, "
f"{total_chunks} chunks written.")
if __name__ == "__main__":
main()
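The `--ext=` handling and the suffix regex in the script above can be sketched as two small pure functions. `parse_exts` and `ext_pattern` are hypothetical names, and the pattern is compiled with Python's `re` here for illustration, whereas the script hands the pattern string to PostgreSQL's `~` operator.

```python
import re
from typing import Iterable, Set


def parse_exts(argv: Iterable[str],
               default: Iterable[str] = (".docx", ".pptx")) -> Set[str]:
    """Collect extensions from --ext=a,b arguments, normalised to '.ext'."""
    ext_args = [a for a in argv if a.startswith("--ext=")]
    if not ext_args:
        return set(default)
    return {"." + e.lstrip(".")
            for arg in ext_args
            for e in arg.split("=", 1)[1].split(",")}


def ext_pattern(exts: Iterable[str]) -> "re.Pattern[str]":
    """Build a suffix-matching pattern such as \\.(docx|pptx)$."""
    inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(exts))
    return re.compile(rf"\.({inner})$")
```

The script interpolates the pattern into the SQL text with an f-string. Since the extensions come from the command line, one alternative would be to bind the pattern as a query parameter, for example `cur.execute("DELETE FROM embeddings WHERE lower(source) ~ %s", (pattern,))`, which psycopg2 supports.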
+144 -22
@@ -1,12 +1,19 @@
#!/usr/bin/env python3
"""
Stage 2 Worker — Taxonomy-Free Mistral Orientation
Polls stage_2_queue, runs Mistral taxonomy-free pass, enqueues Stage 3.
Runs as systemd service: aaronai-stage2.service
Stage 2 Worker — Taxonomy-Free Mistral Orientation + State-Type Classification
Polls stage_2_queue, runs Mistral pass that produces:
(a) orientation context (active frames, frame relationships, extraction focus)
(b) state-type classification for Stage 3 routing (current/reference/historical,
supersedes_prior_state boolean, confidence, rationale)
Enqueues Stage 3 with both concerns as explicit columns.
Routing:
- char_length < 2000 → skip Stage 3, mark complete (sparse content, cascade no benefit)
- char_length >= 2000 → enqueue Stage 3 with orientation metadata
- char_length >= 2000 → enqueue Stage 3 with orientation + routing metadata
Runs as systemd service: aaronai-stage2.service
"""
import os, json, time, subprocess, logging, requests
@@ -33,22 +40,68 @@ CHAR_LENGTH_THRESHOLD = 2000
REQUEST_TIMEOUT = 300
RETRY_ATTEMPTS = 2
POLL_INTERVAL = 5
WORKER_VERSION = "2.1"
WORKER_VERSION = "2.2"
# Valid values for state-type fields. Mistral output validated against these;
# anything outside falls through to safe-cheap defaults (bulk routing).
VALID_STATE_TYPES = ("current", "reference", "historical")
VALID_CONFIDENCE = ("low", "medium", "high")
# Safe-cheap defaults applied when Mistral output is missing or malformed.
# All route to bulk pathway (no temporal invalidation cost) per Phase A
# routing rule: route to single-episode only on supersedes_prior_state=true
# AND confidence in {medium, high}.
DEFAULT_STATE_TYPE = "reference"
DEFAULT_CONFIDENCE = "low"
DEFAULT_SUPERSEDES = False
DEFAULT_RATIONALE = "mistral output missing or malformed; default applied"
TAXFREE_PROMPT = (
"You are a metadata extraction system. Given a document, describe its content "
"shape for use as orientation context in a knowledge graph extraction pass.\n\n"
"Do not summarize content. Do not extract entities. Do not assign a single category label.\n\n"
"Instead, describe:\n"
"- What domains or frames are active in this content (there may be several simultaneously)\n"
"- How those frames relate to each other in this specific document\n"
"- What kind of relational content a knowledge graph extractor should look for\n\n"
"Output JSON only. No prose, no explanation, no markdown.\n\n"
"Schema:\n"
"You are a metadata extraction system. Given a document, produce a JSON object "
"describing two distinct concerns about the document. Output JSON only — no prose, "
"no explanation, no markdown.\n\n"
"CONCERN 1 — ORIENTATION CONTEXT (for downstream knowledge-graph extraction):\n"
"Describe the content shape. Do not summarize content. Do not extract entities. "
"Do not assign a single category label.\n"
" - active_frames: which domains or frames are active in this content (there may "
"be several simultaneously)\n"
" - frame_relationships: how those frames relate in this specific document, "
"one sentence\n"
" - extraction_orientation: what kind of relational content a knowledge-graph "
"extractor should look for, one sentence\n"
" - one_sentence_summary: a single-sentence content summary\n\n"
"CONCERN 2 — STATE-TYPE CLASSIFICATION (for ingest routing):\n"
"Classify the document's relationship to time and prior facts. This is independent "
"of orientation: a document can be in a 'reference frame' (orientation) while "
"describing 'current state' (state-type), or vice versa. Judge the document's "
"ROLE, not its topic.\n"
" - state_type: one of\n"
" 'current' — describes the author's present state, recent decisions, or "
"ongoing situations as of the document's date\n"
" 'reference' — timeless or slow-changing material: external books, "
"documentation, technical reference, conceptual writing\n"
" 'historical' — describes past events, prior states, or archived material "
"the author is recording but not living in\n"
" - state_type_confidence: 'low' | 'medium' | 'high' — how confident you are in "
"the classification. Use 'low' when genuinely uncertain.\n"
" - supersedes_prior_state: true if this document describes facts that should "
"REPLACE previously-recorded facts about the same subjects (e.g. a journal entry "
"saying 'I no longer work at X', a status update, a corrected belief). false "
"otherwise. Default to false when uncertain.\n"
" - state_type_rationale: one sentence explaining the classification\n\n"
"Output schema (flat, all eight fields at the top level):\n"
'{"active_frames": ["<frame 1>", "<frame 2>"], '
'"frame_relationships": "<one sentence>", '
'"extraction_orientation": "<one sentence>", '
'"one_sentence_summary": "<one sentence>"}\n\n'
'"one_sentence_summary": "<one sentence>", '
'"state_type": "current|reference|historical", '
'"state_type_confidence": "low|medium|high", '
'"supersedes_prior_state": true|false, '
'"state_type_rationale": "<one sentence>"}\n\n'
"Document:\n"
)
@@ -100,6 +153,38 @@ def run_mistral(doc_text):
return {"error": "parse_failed", "raw": raw[:200]}
def normalize_state_fields(meta):
"""Validate and normalize the four state-type fields from Mistral output.
Anything missing or malformed falls through to safe-cheap defaults that
route to bulk pathway (no temporal invalidation work)."""
raw_state_type = meta.get("state_type")
if isinstance(raw_state_type, str) and raw_state_type.lower() in VALID_STATE_TYPES:
state_type = raw_state_type.lower()
else:
state_type = DEFAULT_STATE_TYPE
raw_conf = meta.get("state_type_confidence")
if isinstance(raw_conf, str) and raw_conf.lower() in VALID_CONFIDENCE:
confidence = raw_conf.lower()
else:
confidence = DEFAULT_CONFIDENCE
raw_supersedes = meta.get("supersedes_prior_state")
if isinstance(raw_supersedes, bool):
supersedes = raw_supersedes
else:
supersedes = DEFAULT_SUPERSEDES
raw_rationale = meta.get("state_type_rationale")
if isinstance(raw_rationale, str) and raw_rationale.strip():
rationale = raw_rationale.strip()[:1000]
else:
rationale = DEFAULT_RATIONALE
return state_type, confidence, supersedes, rationale
def build_orientation(meta):
frames = ", ".join(meta.get("active_frames", []))
rel = meta.get("frame_relationships", "")
@@ -108,20 +193,46 @@ def build_orientation(meta):
return f"Active frames: {frames}. Frame relationships: {rel} Extraction focus: {orient} Summary: {summary}"
def enqueue_stage3(pg, source, full_text, orientation, metadata):
def enqueue_stage3(pg, source, full_text, orientation, metadata,
state_type, state_type_confidence, supersedes_prior_state,
state_type_rationale):
"""Write Stage 3 queue row with orientation + explicit routing columns.
Routing columns (state_type, state_type_confidence, supersedes_prior_state,
state_type_rationale) are first-class queue properties for Phase A.
Stage 3 reads them on every dequeue to choose bulk vs single-episode pathway.
The full Mistral metadata blob is also retained in stage2_metadata JSON for
debugging and future cycle work."""
cur = pg.cursor()
cur.execute("""
INSERT INTO stage_3_queue (source, full_text, orientation, stage2_metadata)
VALUES (%s, %s, %s, %s)
INSERT INTO stage_3_queue (
source, full_text, orientation, stage2_metadata,
state_type, state_type_confidence, supersedes_prior_state,
state_type_rationale
)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (source) DO UPDATE SET
full_text = EXCLUDED.full_text,
orientation = EXCLUDED.orientation,
stage2_metadata = EXCLUDED.stage2_metadata,
state_type = EXCLUDED.state_type,
state_type_confidence = EXCLUDED.state_type_confidence,
supersedes_prior_state = EXCLUDED.supersedes_prior_state,
state_type_rationale = EXCLUDED.state_type_rationale,
enqueued_at = NOW(),
-- Reset all run-state fields on re-enqueue. Without this,
-- stale started_at from a prior attempt makes the row
-- invisible to the Stage 3 worker's claim filter (which
-- typically uses started_at IS NULL).
started_at = NULL,
completed_at = NULL,
failed_at = NULL,
failure_reason = NULL,
external_job_id = NULL,
attempts = 0
""", (source, full_text, orientation, json.dumps(metadata)))
""", (source, full_text, orientation, json.dumps(metadata),
state_type, state_type_confidence, supersedes_prior_state,
state_type_rationale))
pg.commit()
@@ -144,7 +255,7 @@ def process_one(row):
return True
# Run Mistral
log.info(f" Running Mistral taxonomy-free pass...")
log.info(f" Running Mistral taxonomy-free + state-type pass...")
try:
meta = run_mistral(full_text)
except requests.exceptions.Timeout:
@@ -177,14 +288,25 @@ def process_one(row):
frames = meta.get("active_frames", [])
log.info(f" Frames: {frames}")
# Normalize state-type fields with safe-cheap defaults on malformed output.
# Note: Mistral may return valid orientation but malformed state-type;
# we accept the orientation and default the routing rather than fail
# the whole row, since defaults route to bulk (cheap, safe).
state_type, confidence, supersedes, rationale = normalize_state_fields(meta)
log.info(
f" State-type: {state_type} (conf={confidence}, "
f"supersedes={supersedes})"
)
orientation = build_orientation(meta)
meta["_model"] = "mistral:latest"
meta["_worker_version"] = WORKER_VERSION
meta["_generated_at"] = datetime.now().isoformat()
meta["char_length"] = char_length
# Enqueue Stage 3
enqueue_stage3(pg, source, full_text, orientation, meta)
# Enqueue Stage 3 with explicit routing columns
enqueue_stage3(pg, source, full_text, orientation, meta,
state_type, confidence, supersedes, rationale)
cur.execute("UPDATE stage_2_queue SET completed_at = NOW() WHERE id = %s", (row_id,))
pg.commit()
pg.close()
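The routing rule that these columns feed (single-episode only when `supersedes_prior_state` is true and confidence is medium or high, bulk otherwise) can be sketched as a pure function. `choose_pathway` is a hypothetical name and the return values `"single"` and `"bulk"` are illustrative labels; the Stage 3 worker's actual routing code is not shown in this part of the diff.

```python
from typing import Optional

# Confidence levels that permit the more expensive single-episode pathway.
HIGH_TRUST_CONFIDENCE = ("medium", "high")


def choose_pathway(supersedes_prior_state: Optional[bool],
                   state_type_confidence: Optional[str]) -> str:
    """Pick the ingest pathway from the two routing columns.

    Anything other than an explicit True plus medium/high confidence
    falls back to bulk, including NULL columns from older rows.
    """
    if (supersedes_prior_state is True
            and state_type_confidence in HIGH_TRUST_CONFIDENCE):
        return "single"
    return "bulk"
```

Checking `is True` rather than truthiness mirrors the strictness of `normalize_state_fields`, which accepts only real booleans, so a stray string such as `"true"` cannot select the expensive pathway.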
+319 -34
@@ -1,22 +1,58 @@
#!/usr/bin/env python3
"""
Stage 3 Worker — Graphiti Ingest with Taxonomy-Free Orientation
Polls stage_3_queue, chunks documents, ingests as episodic saga to Graphiti.
Stage 3 Worker — Graphiti Ingest with Bulk-vs-Single-Episode Routing
+ Encoder Instructions (v1.0)
Chunking rationale: Large documents sent as single episodes cause FalkorDB
write lock contention during entity deduplication. Chunking at ~500 words
(matching Stage 1) produces smaller deduplication passes that don't block.
Each document's chunks are linked via Graphiti's saga mechanism, preserving
document structure in the graph.
Polls stage_3_queue, routes each row to one of two ingest pathways based on
state-type classification produced by Stage 2:
Saga-size limit (MAX_CHUNKS_PER_SAGA): 2026-05-01 incident showed sagas of
17 and 19 chunks deadlock the sidecar's Python-side coordination. Documents
producing more than MAX_CHUNKS_PER_SAGA chunks are split into multiple bulk
commits, each tagged with the same saga value so Graphiti still links them.
- BULK pathway (existing): supersedes_prior_state=false OR confidence=low
OR routing fields missing. Fast, no temporal invalidation.
Wedge detection: 2026-05-01 incident also surfaced the asymmetry with Stage 2 —
Stage 3 had no recovery path when the sidecar deadlocked. Now mirrors Stage 2's
consecutive_failures pattern with sidecar restart on threshold.
- SINGLE-EPISODE pathway (new): supersedes_prior_state=true AND
confidence in {medium, high}. Per-chunk POST to /episodes with shared
saga tag, full edge invalidation, per-chunk timeout/retry independence.
Both pathways pass EXTRACTION_INSTRUCTIONS_V1 to the sidecar via
custom_extraction_instructions, which graphiti-core inserts into entity
and edge extraction prompts (NOT dedup prompts — that's intentional under
the encoder-stays-naive commitment).
Architectural posture: the encoder is content-naïve. It does not draw on
prior knowledge of the user, the substrate, or the cycle's accumulated
work. Schema and personality live in the cycle's consolidated substrate,
where the dream phase shapes them. The encoder produces source-grounded
ground truth for the cycle to work from. See EXTRACTION_INSTRUCTIONS_V1
below for the extraction guidance text.
Routing rationale: the single-episode pathway is the correct API per
graphiti-core's docs for content that supersedes prior facts (it does
edge invalidation that bulk skips). It costs more per chunk because of
the resolve_edge LLM call; the routing rule keeps that cost bounded to
content that actually needs it.
Chunking rationale (preserved from prior versions): Large documents sent
as single episodes cause FalkorDB write lock contention during entity
deduplication. Chunking at ~500 words (matching Stage 1) produces smaller
deduplication passes that don't block. Each document's chunks are linked
via Graphiti's saga mechanism, preserving document structure in the graph.
Per-chunk heartbeat: single-episode pathway updates stage_3_queue.started_at
after each successful chunk POST so a long-running document doesn't cross
the 10-minute stale threshold mid-process and get re-dequeued by another
worker (or the same worker on next loop iteration). started_at thus means
"last activity timestamp" rather than "began at" — semantics that match
the dequeue query's intent (catch dead workers, not slow ones).
Saga-size limit (MAX_CHUNKS_PER_SAGA): 2026-05-01 incident showed bulk
sagas of 17 and 19 chunks deadlock the sidecar's Python-side coordination.
Documents producing more than MAX_CHUNKS_PER_SAGA chunks on the bulk
pathway are split into multiple bulk commits, each tagged with the same
saga value so Graphiti still links them. The single-episode pathway
doesn't need this split since each chunk is its own POST.
Wedge detection: mirrors Stage 2's consecutive_failures pattern with
sidecar restart on threshold.
Runs as systemd service: aaronai-stage3.service
"""
@@ -44,17 +80,104 @@ HEARTBEAT_FILE = Path("/var/log/aaronai/stage3-heartbeat")
RETRY_ATTEMPTS = 2
POLL_INTERVAL = 5
INGEST_TIMEOUT = 600
WORKER_VERSION = "2.2"
WORKER_VERSION = "2.4"
# Match Stage 1 chunking parameters
CHUNK_SIZE_WORDS = 500
CHUNK_OVERLAP_WORDS = 50
# Documents shorter than this many characters (compared against len(full_text))
# are ingested as a single episode (no chunking overhead)
SINGLE_EPISODE_THRESHOLD = 1500
# Sagas larger than this many chunks split into multiple commits
# Bulk-pathway sagas larger than this many chunks split into multiple commits
# (2026-05-01 incident: 17 and 19 chunk sagas deadlocked sidecar)
MAX_CHUNKS_PER_SAGA = 10
# Routing rule: single-episode pathway requires both signals positive.
# Anything else (false, NULL, low confidence) routes to bulk — the
# safer-cheaper default. Mistral parse drift can't accidentally trigger
# the expensive pathway.
HIGH_TRUST_CONFIDENCE = ("medium", "high")
# Encoder extraction guidance v1.0 — see module docstring for posture rationale.
# Passed to graphiti-core via custom_extraction_instructions on both ingest
# pathways. Inserted into entity-extraction and edge-extraction prompts only;
# does NOT enter dedup prompts. Encoder-stays-naïve commitment is structural,
# not versioned: this text gets refined over time but the encoder does not
# acquire substrate context as the cycle matures.
EXTRACTION_INSTRUCTIONS_V1 = """\
EXTRACTION GUIDANCE — BirdAI cascade
The encoder's job is faithful capture from this chunk's text. It does
not draw on prior knowledge of the user, the substrate, or the cycle's
accumulated work. Schema, personality, and inferred context live in
the cycle's consolidated substrate, where the dream phase shapes them
through prediction-error replay and speculation. The encoder stays
content-naïve so the cycle has source-grounded ground truth to work
from.
The orientation produced by an upstream pass describes content shape,
not content interpretation. Use it as forward-facing guidance for what
to attend to in this document. Do not let it bound or limit what you
extract.
PREDICATE NAMING
Produce semantic predicates that describe the actual relationship the
text states. Use verbs or verb phrases — "wrote", "advised", "founded",
"works at", "led to", "contradicts", "is autobiographical to" — not
generic placeholders. Reserve generic forms (for example, "relates to"
or "mentions") for cases where the text genuinely does not specify a
more particular relationship. The verb is the load-bearing part of
the fact; preserving it is what makes the relationship queryable later.
EXTRACTION POSTURE
Extract from this chunk's text as if each entity is encountered fresh.
Do not try to reconcile entities you find here with entities that
might already exist elsewhere in the graph. Redundant entity instances
are acceptable. Cross-document entity resolution is downstream cycle
work, not extraction work.
When the same entity appears multiple times within this chunk with
slightly different spellings — a common artifact of voice transcription —
prefer the more frequent or more canonical-looking form. Do not invent
canonical forms; choose among the variants the text actually contains.
EXTRACT FROM THE SOURCE
Extract relationships the text states or strongly implies through
direct linguistic markers ("X led to Y", "X works for Y", "X met Y at
Z"). Do not extend extraction to relationships the text neither states
nor directly implies. Inferred relationships are produced by the
cycle's dream phase as speculative edges with explicit low-confidence
tagging, where they can be evaluated and either ratified or pruned by
subsequent cycle work. Encoding-time inference, mixed in with source-
grounded extraction, would lose the speculation/source distinction the
cycle's consolidation work relies on.
DO NOT PRE-EMPT CYCLE WORK
Do not omit relationships because they seem redundant with prior
extractions or with the existing graph. Cross-document entity
resolution and edge consolidation are downstream cycle operations;
redundant extraction at this stage is intentional. Extracting the
same fact from multiple sources gives the cycle's consolidation work
the recurrence signal it relies on.
EXTRACTION DEPTH
Use the orientation's frame_relationships and extraction_orientation
fields to inform what to attend to. If the orientation describes
cross-domain relational content, look for relationships that bridge
those domains explicitly, with named predicates for the bridging.
If the orientation describes single-domain technical content, look
for the structural relationships internal to that domain.
Extract every entity and every relationship the text states. Do not
summarize, do not filter, do not omit content because it seems
incidental. The orientation tells you what to look for; the source
text tells you what is there.
"""
def get_pg():
return psycopg2.connect(PG_DSN)
@@ -109,6 +232,22 @@ def chunk_text(text, chunk_size=CHUNK_SIZE_WORDS, overlap=CHUNK_OVERLAP_WORDS):
return chunks
def heartbeat_row(row_id):
    """Refresh stage_3_queue.started_at to NOW() so a long-running single-episode
    ingest doesn't cross the 10-minute stale threshold mid-process. Called
    after each successful chunk POST. Best-effort: failures are logged but
    don't fail the chunk — the worst case is a stale-threshold re-dequeue,
    which graphiti's dedup will handle as a no-op."""
    try:
        pg = get_pg()
        cur = pg.cursor()
        cur.execute("UPDATE stage_3_queue SET started_at = NOW() WHERE id = %s", (row_id,))
        pg.commit()
        pg.close()
    except Exception as e:
        log.warning(f" Heartbeat update failed (continuing): {e}")
def post_bulk(payload, batch_label=""):
"""Single POST to /episodes/bulk with consistent error handling."""
resp = requests.post(
@@ -122,16 +261,34 @@ def post_bulk(payload, batch_label=""):
return resp.json()
def ingest_to_graphiti(source, full_text, orientation):
"""
Ingest document to Graphiti as chunked episodes linked by saga.
def post_episode(payload, episode_label=""):
    """Single POST to /episodes (singular) with consistent error handling.
    Used by the single-episode pathway, one call per chunk."""
    resp = requests.post(
        f"{GRAPHITI_URL}/episodes",
        json=payload,
        timeout=INGEST_TIMEOUT
    )
    if not resp.ok:
        prefix = f"{episode_label} " if episode_label else ""
        raise RuntimeError(f"{prefix}Sidecar {resp.status_code}: {resp.text[:500]}")
    return resp.json()
def ingest_bulk(source, full_text, orientation):
"""
Bulk-pathway ingest: documents that don't supersede prior state.
Skips edge invalidation. Cheap. Three sub-paths by document size:
Three paths:
- Short documents (<SINGLE_EPISODE_THRESHOLD): single episode, no saga
[note: 'single episode' here means one bulk call with one item, NOT
the single-episode-pathway; naming overlap is unfortunate but local]
- Medium documents (chunks <= MAX_CHUNKS_PER_SAGA): one bulk commit, saga-linked
- Large documents (chunks > MAX_CHUNKS_PER_SAGA): split into batches of
MAX_CHUNKS_PER_SAGA, each its own bulk commit, all sharing the same saga tag
so Graphiti links them as one document unit
MAX_CHUNKS_PER_SAGA, each its own bulk commit, all sharing the same saga
tag so Graphiti links them as one document unit
All three sub-paths pass EXTRACTION_INSTRUCTIONS_V1 to the sidecar.
"""
char_length = len(full_text)
@@ -142,8 +299,12 @@ def ingest_to_graphiti(source, full_text, orientation):
"source_description": orientation,
"timestamp": datetime.now().isoformat(),
}]
log.info(f" Single episode ({char_length} chars)")
return post_bulk({"episodes": episodes, "group_id": "aaron"})
log.info(f" [bulk] Single episode ({char_length} chars)")
return post_bulk({
"episodes": episodes,
"group_id": "aaron",
"custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
})
chunks = chunk_text(full_text)
total_chunks = len(chunks)
@@ -158,15 +319,18 @@ def ingest_to_graphiti(source, full_text, orientation):
}
for i, chunk in enumerate(chunks)
]
log.info(f" Chunked into {total_chunks} episodes ({char_length} chars)")
return post_bulk(
{"episodes": episodes, "group_id": "aaron", "saga": source}
)
log.info(f" [bulk] Chunked into {total_chunks} episodes ({char_length} chars)")
return post_bulk({
"episodes": episodes,
"group_id": "aaron",
"saga": source,
"custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
})
# Large document: split into batches sharing the same saga tag
batch_count = (total_chunks + MAX_CHUNKS_PER_SAGA - 1) // MAX_CHUNKS_PER_SAGA
log.info(
f" Chunked into {total_chunks} episodes ({char_length} chars); "
f" [bulk] Chunked into {total_chunks} episodes ({char_length} chars); "
f"splitting into {batch_count} batches of up to {MAX_CHUNKS_PER_SAGA}"
)
last_result = None
@@ -186,16 +350,126 @@ def ingest_to_graphiti(source, full_text, orientation):
batch_label = f"batch {batch_idx + 1}/{batch_count} (chunks {start + 1}-{end})"
log.info(f" {batch_label} starting")
last_result = post_bulk(
{"episodes": episodes, "group_id": "aaron", "saga": source},
{
"episodes": episodes,
"group_id": "aaron",
"saga": source,
"custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
},
batch_label=batch_label,
)
log.info(f" {batch_label} committed")
return last_result
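The large-document split in ingest_bulk can be sketched in isolation as follows. This is a minimal illustration, not the worker's code: `split_into_batches` is a hypothetical helper name, and the start/end arithmetic is an assumption inferred from the batch_count formula and the batch_label format visible above (the hunk that computes them is elided in this diff).

```python
MAX_CHUNKS_PER_SAGA = 10  # same value as the constant defined in the worker


def split_into_batches(chunks, batch_size=MAX_CHUNKS_PER_SAGA):
    """Hypothetical helper: yield (label, batch) pairs covering each chunk once."""
    total = len(chunks)
    # Ceiling division, the same arithmetic ingest_bulk uses for batch_count.
    batch_count = (total + batch_size - 1) // batch_size
    for batch_idx in range(batch_count):
        start = batch_idx * batch_size
        end = min(start + batch_size, total)
        # Labels are 1-based to match the worker's log lines.
        label = f"batch {batch_idx + 1}/{batch_count} (chunks {start + 1}-{end})"
        yield label, chunks[start:end]


if __name__ == "__main__":
    # 19 chunks, the size of one of the sagas from the 2026-05-01 incident.
    demo_chunks = [f"chunk-{n}" for n in range(1, 20)]
    for label, batch in split_into_batches(demo_chunks):
        print(label, "->", len(batch), "episodes")
```

Under these assumptions a 19-chunk document becomes two bulk commits of 10 and 9 episodes; in the worker each commit would carry the same saga value so the batches stay linked as one document.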
def ingest_single_episode(row_id, source, full_text, orientation):
    """
    Single-episode pathway: documents that supersede prior state with
    medium-or-high confidence. Each chunk is its own POST to /episodes
    with shared saga tag. Each call independent: own timeout, own retry
    envelope, own failure semantics.

    Each chunk POST passes EXTRACTION_INSTRUCTIONS_V1 to the sidecar.

    Partial-success behavior: if chunk N of total fails, chunks 1..N-1
    stay committed (graphiti has already accepted them) and the function
    raises with detail about which chunk failed and how many succeeded.
    The caller marks the row failed_at with that detail; the operator
    decides whether to re-enqueue. Re-ingestion will re-POST chunks 1..N-1
    against the graph; graphiti's dedup will handle them as no-ops.

    Heartbeats stage_3_queue.started_at after each successful chunk so the
    row doesn't cross the 10-minute stale threshold while actively progressing.
    """
    char_length = len(full_text)

    # Short documents: one POST, no chunking, no saga
    if char_length < SINGLE_EPISODE_THRESHOLD:
        payload = {
            "name": source,
            "content": full_text,
            "source_description": orientation,
            "group_id": "aaron",
            "timestamp": datetime.now().isoformat(),
            "custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
        }
        log.info(f" [single-ep] Single episode, no chunking ({char_length} chars)")
        return post_episode(payload, episode_label="single-ep")

    chunks = chunk_text(full_text)
    total_chunks = len(chunks)
    log.info(
        f" [single-ep] Chunked into {total_chunks} episodes ({char_length} chars); "
        f"per-chunk POSTs with shared saga"
    )

    succeeded = 0
    for i, chunk in enumerate(chunks):
        chunk_num = i + 1
        payload = {
            "name": f"{source} [{chunk_num}/{total_chunks}]",
            "content": chunk,
            "source_description": orientation,
            "group_id": "aaron",
            "saga": source,
            "timestamp": datetime.now().isoformat(),
            "custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
        }
        try:
            post_episode(payload, episode_label=f"chunk {chunk_num}/{total_chunks}")
            succeeded += 1
            log.info(f" chunk {chunk_num}/{total_chunks} committed")
            heartbeat_row(row_id)
        except Exception as e:
            # Annotate the exception with partial-success detail so the
            # caller can write a clean failure_reason. Re-raise to abort
            # the document; previously-committed chunks stay in the graph.
            raise RuntimeError(
                f"single_episode_partial: chunk {chunk_num}/{total_chunks} failed "
                f"(succeeded: {succeeded}); error: {str(e)[:300]}"
            ) from e

    log.info(f" [single-ep] All {total_chunks} chunks committed")
    return {"ok": True, "chunks_committed": total_chunks}
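Because the partial-success message has a fixed shape, an operator deciding whether to re-enqueue could read the numbers back out of failure_reason. The sketch below assumes failure_reason stores the RuntimeError text unchanged (the UPDATE parameters are not visible in this diff); `parse_partial_failure` is a hypothetical helper and is not part of the worker.

```python
import re

# Mirrors the f-string built in ingest_single_episode's except branch:
# "single_episode_partial: chunk N/T failed (succeeded: K); error: ..."
_PARTIAL_RE = re.compile(
    r"single_episode_partial: chunk (\d+)/(\d+) failed \(succeeded: (\d+)\)"
)


def parse_partial_failure(failure_reason):
    """Hypothetical helper: return (failed_chunk, total_chunks, succeeded),
    or None when the text is not a single-episode partial failure."""
    match = _PARTIAL_RE.search(failure_reason or "")
    if match is None:
        return None
    failed_chunk, total_chunks, succeeded = (int(g) for g in match.groups())
    return failed_chunk, total_chunks, succeeded


if __name__ == "__main__":
    reason = (
        "single_episode_partial: chunk 3/7 failed "
        "(succeeded: 2); error: chunk 3/7 Sidecar 500: example"
    )
    print(parse_partial_failure(reason))
```

Using `search` rather than `match` keeps the helper tolerant of any prefix a caller might add before storing the message.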
def should_route_single_episode(supersedes_prior_state, state_type_confidence):
    """Routing decision for Phase A.

    Single-episode pathway requires BOTH:
    - supersedes_prior_state is true (Mistral judged it temporally superseding)
    - confidence is medium or high (Mistral was confident enough to trust)

    Anything else routes to bulk: false supersedes, NULL fields (legacy rows
    pre-dating Stage 2 v2.2), low confidence even on supersedes=true. This
    is the safer-cheaper default — bulk skips temporal invalidation, which
    is the right behavior when we're not confident the content needs it.
    """
    if not supersedes_prior_state:
        return False
    if state_type_confidence not in HIGH_TRUST_CONFIDENCE:
        return False
    return True
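The routing rule is small enough to enumerate exhaustively. The sketch below restates the function in condensed form (same truth table as the two-branch version above) and prints the pathway for every combination of inputs; `routing_table` is a hypothetical helper added only for illustration, and the sketch assumes supersedes_prior_state arrives as a real boolean or NULL/None from Postgres.

```python
from itertools import product

HIGH_TRUST_CONFIDENCE = ("medium", "high")


def should_route_single_episode(supersedes_prior_state, state_type_confidence):
    """Condensed restatement of the worker's routing rule."""
    if not supersedes_prior_state:
        return False
    return state_type_confidence in HIGH_TRUST_CONFIDENCE


def routing_table():
    """Hypothetical helper: list (supersedes, confidence, pathway) for all inputs."""
    rows = []
    for supersedes, confidence in product(
        (True, False, None), ("high", "medium", "low", None)
    ):
        use_single = should_route_single_episode(supersedes, confidence)
        rows.append((supersedes, confidence, "single-episode" if use_single else "bulk"))
    return rows


if __name__ == "__main__":
    for supersedes, confidence, pathway in routing_table():
        print(f"supersedes={supersedes!s:<5} conf={confidence!s:<6} -> {pathway}")
```

Only two of the twelve combinations reach the single-episode pathway. One caveat of the truthiness check: a non-empty string such as "false" would count as true, so the rule depends on the column being read as a boolean.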
def process_one(row):
row_id, source, full_text, orientation = row
log.info(f"Ingesting to Graphiti: {source}")
(row_id, source, full_text, orientation,
state_type, state_type_confidence, supersedes_prior_state,
state_type_rationale) = row
# Route decision
use_single_episode = should_route_single_episode(
supersedes_prior_state, state_type_confidence
)
pathway = "single-episode" if use_single_episode else "bulk"
log.info(
f"Ingesting to Graphiti: {source} "
f"[pathway={pathway}, state_type={state_type}, "
f"conf={state_type_confidence}, supersedes={supersedes_prior_state}]"
)
if state_type_rationale:
log.info(f" rationale: {state_type_rationale[:200]}")
pg = get_pg()
cur = pg.cursor()
@@ -204,9 +478,16 @@ def process_one(row):
(row_id,)
)
pg.commit()
pg.close()
try:
result = ingest_to_graphiti(source, full_text, orientation)
if use_single_episode:
result = ingest_single_episode(row_id, source, full_text, orientation)
else:
result = ingest_bulk(source, full_text, orientation)
pg = get_pg()
cur = pg.cursor()
cur.execute("UPDATE stage_3_queue SET completed_at = NOW() WHERE id = %s", (row_id,))
pg.commit()
pg.close()
@@ -214,6 +495,8 @@ def process_one(row):
return True
except Exception as e:
log.error(f" Graphiti ingest failed for {source}: {e}")
pg = get_pg()
cur = pg.cursor()
cur.execute("""
UPDATE stage_3_queue
SET failed_at = NOW(), failure_reason = %s
@@ -235,7 +518,9 @@ def run():
pg = get_pg()
cur = pg.cursor()
cur.execute("""
SELECT id, source, full_text, orientation
SELECT id, source, full_text, orientation,
state_type, state_type_confidence, supersedes_prior_state,
state_type_rationale
FROM stage_3_queue
WHERE completed_at IS NULL
AND failed_at IS NULL
-123
@@ -1,123 +0,0 @@
"""One-off: remove embeddings rows that no longer correspond to a file on disk.
Two passes:
1. Modern rows (metadata.filepath set): check each filepath, delete if missing.
2. Legacy rows (metadata.filepath null): build a set of all basenames present
anywhere under NEXTCLOUD_PATH, then delete rows whose `source` basename
isn't in that set.
Default mode is a dry-run (counts + sample paths, no writes). Pass --apply to
actually delete.
"""
import os
import sys
from pathlib import Path
from collections import defaultdict
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
import psycopg2
NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
APPLY = "--apply" in sys.argv
def get_pg():
return psycopg2.connect(os.environ["PG_DSN"])
def scan_modern_orphans():
"""Rows with metadata.filepath whose file doesn't exist on disk."""
pg = get_pg()
cur = pg.cursor()
cur.execute(
"SELECT id, source, metadata->>'filepath' AS filepath "
"FROM embeddings WHERE metadata->>'filepath' IS NOT NULL"
)
orphans = []
by_source = defaultdict(int)
for row in cur.fetchall():
fp = row[2]
if fp and not Path(fp).exists():
orphans.append(row)
by_source[row[1]] += 1
pg.close()
return orphans, by_source
def scan_legacy_orphans():
"""Rows without metadata.filepath whose basename isn't anywhere under
NEXTCLOUD_PATH. Restricted to type='document' so conversations and memory
snapshots (which are synthetic sources, not files on disk) aren't flagged
as orphans. Walks the filesystem once to build the basename set."""
print(f" walking {NEXTCLOUD_PATH} to build basename index...")
on_disk = set()
for p in NEXTCLOUD_PATH.rglob("*"):
if p.is_file():
on_disk.add(p.name)
print(f" {len(on_disk):,} files on disk")
pg = get_pg()
cur = pg.cursor()
cur.execute(
"SELECT id, source FROM embeddings "
"WHERE metadata->>'filepath' IS NULL AND type = 'document'"
)
orphans = []
by_source = defaultdict(int)
for row in cur.fetchall():
if row[1] not in on_disk:
orphans.append(row)
by_source[row[1]] += 1
pg.close()
return orphans, by_source
def delete_rows(ids):
pg = get_pg()
cur = pg.cursor()
cur.execute("DELETE FROM embeddings WHERE id = ANY(%s)", (list(ids),))
deleted = cur.rowcount
pg.commit()
pg.close()
return deleted
def main():
print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
print(f"Target: {NEXTCLOUD_PATH}")
print()
print("Pass 1 — modern rows (metadata.filepath set):")
modern, modern_by_src = scan_modern_orphans()
print(f" {len(modern):,} orphan rows across {len(modern_by_src):,} files")
for src, n in sorted(modern_by_src.items(), key=lambda kv: -kv[1])[:10]:
print(f" {n:>4} chunks — {src}")
print()
print("Pass 2 — legacy rows (no metadata.filepath):")
legacy, legacy_by_src = scan_legacy_orphans()
print(f" {len(legacy):,} orphan rows across {len(legacy_by_src):,} files")
for src, n in sorted(legacy_by_src.items(), key=lambda kv: -kv[1])[:10]:
print(f" {n:>4} chunks — {src}")
print()
total = len(modern) + len(legacy)
if total == 0:
print("Nothing to delete.")
return
if not APPLY:
print(f"Dry-run only. Re-run with --apply to delete {total:,} rows.")
return
print(f"Deleting {total:,} orphan rows...")
n1 = delete_rows([r[0] for r in modern]) if modern else 0
n2 = delete_rows([r[0] for r in legacy]) if legacy else 0
print(f" modern: {n1:,} legacy: {n2:,} total: {n1 + n2:,}")
if __name__ == "__main__":
main()
-53
@@ -1,53 +0,0 @@
"""End-to-end test of retrieve_context with intent routing + reranking.
Avoids loading the full FastAPI app; replicates the chat-handler retrieval
call shape and prints classifier output + final ranked sources for each query.
"""
import os
import sys
from pathlib import Path
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
sys.path.insert(0, str(Path(__file__).parent))
# Stub anthropic so api.py import doesn't fail without the SDK loaded.
# We only need retrieve_context.
import types
sys.modules.setdefault("anthropic", types.ModuleType("anthropic"))
sys.modules["anthropic"].Anthropic = lambda **kw: None
# Same for whisper if present
if "faster_whisper" not in sys.modules:
sys.modules["faster_whisper"] = types.ModuleType("faster_whisper")
import importlib.util
spec = importlib.util.spec_from_file_location("api", Path(__file__).parent / "api.py")
api = importlib.util.module_from_spec(spec)
# Don't execute the whole module (it starts FastAPI). Instead, exec only definitions.
# Easier: just import the functions we need by exec'ing the file but catching errors.
try:
spec.loader.exec_module(api)
except Exception as e:
print(f"(continuing despite api.py side-effect error: {e})")
retrieve_context = api.retrieve_context
QUERIES = [
"write me a bio",
"my professional bio",
"Aaron Nelson CV consulting and design work",
"FWN3D consulting",
"syllabi I have taught",
"philosophy of teaching",
"Hudson Valley Additive Manufacturing Center",
"Aaron Nelson is an artist and educator working in additive manufacturing",
]
for q in QUERIES:
pieces, sources = retrieve_context(q)
print(f"\n=== {q!r} ===")
for i, src in enumerate(sources, 1):
print(f" {i}. {src}")
+91 -142
@@ -19,6 +19,7 @@ Architecture: Stage 1 (watcher) -> stage_2_queue -> Stage 2 (Mistral) -> stage_3
import os
import time
import json
import hashlib
import logging
import threading
from pathlib import Path
@@ -29,11 +30,9 @@ from sentence_transformers import SentenceTransformer
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
from failures import (
record_ingest_failure as _record_failure_sql,
resolve_ingest_failure as _resolve_failure_sql,
)
from docx import Document as DocxDocument
from pypdf import PdfReader
from pptx import Presentation
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
@@ -43,7 +42,10 @@ STATE_FILE = "/home/aaron/aaronai/watcher_state.json"
STATUS_FILE = "/home/aaron/aaronai/watcher_status.json"
HEARTBEAT_FILE = "/home/aaron/aaronai/watcher_heartbeat"
SUPPORTED = {".pdf", ".docx", ".pptx", ".txt", ".md"}
DEBOUNCE_SECONDS = 120
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBED_MODEL = "all-MiniLM-L6-v2"
PG_DSN = os.getenv("PG_DSN")
@@ -74,6 +76,49 @@ def get_pg():
return psycopg2.connect(PG_DSN)
def extract_text(path: Path) -> str:
    suffix = path.suffix.lower()
    try:
        if suffix == ".docx":
            doc = DocxDocument(path)
            return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
        elif suffix == ".pdf":
            reader = PdfReader(path)
            return "".join(
                page.extract_text() + "\n"
                for page in reader.pages if page.extract_text()
            )
        elif suffix == ".pptx":
            prs = Presentation(path)
            return "\n".join(
                shape.text for slide in prs.slides
                for shape in slide.shapes
                if hasattr(shape, "text") and shape.text.strip()
            )
        elif suffix in {".txt", ".md"}:
            return path.read_text(encoding="utf-8", errors="ignore")
    except Exception as e:
        log.warning(f"Text extraction failed for {path.name}: {e}")
        record_ingest_failure(path, f"Text extraction failed: {e}")
    return ""


def chunk_text(text: str) -> list:
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunk = " ".join(words[start:start + CHUNK_SIZE])
        if chunk.strip():
            chunks.append(chunk)
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks


def make_chunk_id(filepath: Path, chunk_index: int) -> str:
    return hashlib.md5(str(filepath).encode()).hexdigest()[:8] + f"_{chunk_index}"
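The sliding-window behaviour of chunk_text is easier to see on synthetic input. The sketch below copies the loop and adds `chunk_size` and `overlap` parameters for illustration only (the watcher's version takes just `text` and reads the module constants); the word lists are made-up test data.

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Same loop as the watcher, with the sizes exposed as parameters."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
        # Each window starts (chunk_size - overlap) words after the last one.
        start += chunk_size - overlap
    return chunks


if __name__ == "__main__":
    # 1000 distinct tokens so the overlap is visible by comparing words.
    text = " ".join(f"w{i}" for i in range(1000))
    chunks = chunk_text(text)
    for n, chunk in enumerate(chunks, 1):
        words = chunk.split()
        print(f"chunk {n}: {len(words)} words, {words[0]}..{words[-1]}")
```

With a 450-word stride, windows start at words 0, 450 and 900, and the last 50 words of one chunk repeat as the first 50 of the next. When the word count lands between a window's start and the previous window's end (950 words, for example), the final chunk holds only words that the previous chunk already contains.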
def enqueue_stage2(source: str, full_text: str):
if os.getenv("SKIP_STAGE2_ENQUEUE"):
return
@@ -98,14 +143,20 @@ def enqueue_stage2(source: str, full_text: str):
def record_ingest_failure(filepath: Path, error: str):
"""Write extraction or ingest failure to ingest_failures table for UI visibility.
Local wrapper around failures.record_ingest_failure — opens conn, delegates,
logs non-fatal errors so the caller never has to handle them."""
"""Write extraction or ingest failure to ingest_failures table for UI visibility."""
try:
pg = get_pg()
try:
_record_failure_sql(pg, filepath.name, filepath, error)
finally:
cur = pg.cursor()
cur.execute("""
INSERT INTO ingest_failures (source, filepath, error, retry_count, first_failed_at, last_failed_at)
VALUES (%s, %s, %s, 0, NOW(), NOW())
ON CONFLICT (source) DO UPDATE SET
error = EXCLUDED.error,
retry_count = ingest_failures.retry_count + 1,
last_failed_at = NOW(),
resolved = FALSE
""", (filepath.name, str(filepath), error[:1000]))
pg.commit()
pg.close()
except Exception as e:
log.warning(f"Could not record ingest failure (non-fatal): {e}")
@@ -115,104 +166,57 @@ def resolve_ingest_failure(source: str):
"""Mark a previously failed file as resolved after successful ingest."""
try:
pg = get_pg()
try:
_resolve_failure_sql(pg, source)
finally:
cur = pg.cursor()
cur.execute("UPDATE ingest_failures SET resolved = TRUE WHERE source = %s", (source,))
pg.commit()
pg.close()
except Exception as e:
log.warning(f"Could not resolve ingest failure record (non-fatal): {e}")
def delete_embeddings_for_path(filepath: Path):
"""Remove embeddings rows for a file that no longer exists. Matches by
metadata.filepath so multi-folder same-basename files don't collide.
Legacy rows without filepath metadata are left alone — they get cleaned
by sweep_orphans.py."""
try:
pg = get_pg()
try:
cur = pg.cursor()
cur.execute(
"DELETE FROM embeddings WHERE metadata->>'filepath' = %s",
(str(filepath),),
)
deleted = cur.rowcount
pg.commit()
if deleted:
log.info(f"Deleted {deleted} chunks for removed file: {filepath}")
finally:
pg.close()
except Exception as e:
log.warning(f"Could not delete embeddings for {filepath} (non-fatal): {e}")
def remove_from_state(filepath: Path):
"""Drop a deleted file from watcher_state.json so it isn't carried as
'known mtime' indefinitely."""
try:
state = load_state()
key = str(filepath)
if key in state:
del state[key]
save_state(state)
except Exception as e:
log.warning(f"Could not update state for deleted {filepath} (non-fatal): {e}")
IGNORED_TOP_FOLDERS = {"Drafts"}
def ingest_file(filepath: Path, embedder) -> int:
if filepath.name.startswith(("~$", "~", ".")):
if filepath.name.startswith(("~$", ".")):
return 0
if filepath.suffix.lower() not in SUPPORTED:
return 0
try:
rel = filepath.parent.relative_to(NEXTCLOUD_PATH)
if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
text = extract_text(filepath)
if not text.strip():
return 0
except ValueError:
pass
blocks = extract_blocks(filepath)
if not blocks or not any(
(b.get("text") or "").strip() or (b.get("heading") or "").strip()
for b in blocks
):
record_ingest_failure(filepath, "Text extraction failed or empty")
chunks = chunk_text(text)
if not chunks:
return 0
folder_rel = None
try:
folder_rel = str(filepath.parent.relative_to(NEXTCLOUD_PATH))
except ValueError:
pass
try:
rows = chunk_and_embed(blocks, filepath.name, embedder,
filepath=filepath, folder=folder_rel)
embeddings = embedder.encode(chunks).tolist()
except Exception as e:
log.error(f"Embedding failed for {filepath.name}: {e}")
record_ingest_failure(filepath, f"Embedding failed: {e}")
return 0
if not rows:
return 0
source = filepath.name
try:
pg = get_pg()
try:
write_embeddings_batch(pg, rows)
finally:
cur = pg.cursor()
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
chunk_id = make_chunk_id(filepath, i)
cur.execute("""
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
ON CONFLICT (id) DO UPDATE SET
document = EXCLUDED.document,
embedding = EXCLUDED.embedding,
source = EXCLUDED.source,
metadata = EXCLUDED.metadata
""", (chunk_id, chunk, embedding, source, "document",
json.dumps({"source": source, "filepath": str(filepath)})))
pg.commit()
pg.close()
except Exception as e:
log.error(f"pgvector write failed for {filepath.name}: {e}")
record_ingest_failure(filepath, f"pgvector write failed: {e}")
return 0
log.info(f"Indexed {len(rows)} chunks: {filepath.name}")
log.info(f"Indexed {len(chunks)} chunks: {filepath.name}")
resolve_ingest_failure(source)
full_text = "\n".join(
f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
for b in blocks
)
enqueue_stage2(source, full_text)
return len(rows)
enqueue_stage2(source, text)
return len(chunks)
def ingest_files(paths: list, embedder, state: dict) -> dict:
@@ -220,7 +224,6 @@ def ingest_files(paths: list, embedder, state: dict) -> dict:
for path in paths:
count = ingest_file(path, embedder)
total += count
if count > 0:
state[str(path)] = str(path.stat().st_mtime)
log.info(f"Ingestion complete. {total} chunks across {len(paths)} files.")
return state
@@ -249,24 +252,12 @@ def get_changed_files(state: dict) -> list:
continue
if path.suffix.lower() not in SUPPORTED:
continue
if path.name.startswith((".", "~$", "~")):
if path.name.startswith((".", "~$")):
continue
if "Admin/Backups" in str(path) or "Backups" in path.parts:
continue
if "Journal/Media" in str(path):
continue
if "Generative Design" in path.parts and "Processing" in path.parts:
continue
if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
continue
if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
and "Presentations" in path.parts:
continue
if path.name == "GH Slicer Notes [Autosaved].pptx" \
and "DDF555 3D Computational" in path.parts:
continue
if path.stat().st_size == 0:
continue
if state.get(str(path)) != str(path.stat().st_mtime):
changed.append(path)
return changed
@@ -345,22 +336,12 @@ class IngestHandler(FileSystemEventHandler):
self.last_event = 0
def _should_ignore(self, path: Path) -> bool:
if path.name.startswith((".", "~$", "~")):
if path.name.startswith((".", "~$")):
return True
if "Admin/Backups" in str(path) or "Backups" in path.parts:
return True
if "Journal/Media" in str(path):
return True
if "Generative Design" in path.parts and "Processing" in path.parts:
return True
if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
return True
if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
and "Presentations" in path.parts:
return True
if path.name == "GH Slicer Notes [Autosaved].pptx" \
and "DDF555 3D Computational" in path.parts:
return True
return False
def on_created(self, event):
@@ -386,47 +367,15 @@ class IngestHandler(FileSystemEventHandler):
def on_moved(self, event):
if event.is_directory:
return
src = Path(event.src_path)
dest = Path(event.dest_path)
# If destination is outside NEXTCLOUD_PATH (e.g., Nextcloud trashbin at
# /home/aaron/nextcloud/data/data/aaron/files_trashbin/), treat as a
# delete — the file is no longer in the watched corpus.
try:
dest.relative_to(NEXTCLOUD_PATH)
except ValueError:
if src.suffix.lower() in SUPPORTED:
log.info(f"Event: moved out of tree {src} -> {dest}")
threading.Thread(
target=lambda: (
delete_embeddings_for_path(src),
remove_from_state(src),
),
daemon=True,
).start()
return
# Nextcloud WebDAV writes .part temp files then renames to final path.
# src_path is the .part file; dest_path is the final filename.
dest = Path(event.dest_path)
if dest.suffix.lower() not in SUPPORTED or self._should_ignore(dest):
return
log.info(f"Event: moved -> {dest}")
self.pending = True
self.last_event = time.time()
def on_deleted(self, event):
if event.is_directory:
return
path = Path(event.src_path)
if path.suffix.lower() not in SUPPORTED:
return
log.info(f"Event: deleted {path}")
threading.Thread(
target=lambda: (
delete_embeddings_for_path(path),
remove_from_state(path),
),
daemon=True,
).start()
def on_closed(self, event):
# FileClosedEvent fires on the final file after Nextcloud completes write.
# Belt-and-suspenders catch for any write pattern not caught by on_moved.