docs/inventory: layer 2026-05-03 updates (resolutions, corrections, new findings)

Inventory dated 2026-05-02 is preserved as a point-in-time snapshot. Today's updates are layered on top in a dated addendum section after "Findings summary" and before "Phase 1 — Scripts" so the original snapshot reads as written and readers can see what changed and when.

Resolved:
- NREM-shape divergence #1 (`dream.py` cumulative cross-night exclusion 500-cap): replaced with session-scoped novelty.

Corrections to existing findings:
- `stage2_metadata` lives on `stage_3_queue`, not `stage_2_queue` (the 2026-05-02 entry implied otherwise). Verified by direct schema read.
- Stage 2 char_length gate runs *before* the Mistral call. For sub-2000-char docs, Mistral is never invoked: frames are not extracted then discarded, they are simply not extracted. Reframes the architecture's "Stage 2 produces orientation for everything" commitment.

New findings (from the 2026-05-03 frame analysis):
- `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Combined with the char-gate exclusion and Stage 2 failures, only 56% of corpus has any frame data.
- All 14 voice notes and all 39 dream outputs are in the 339-doc gap. Primary capture and self-reflection channels are silent to the frame system; dreamer cannot frame-condition on its own output.
- File-type × frame stratification provides discriminating signal that cross-links Improvement #3 to the existing `embeddings.type` NULL-rate finding.

Same NREM shape as the original cumulative-exclusion bug: the architecture's stated commitment and what the code actually does diverge silently. This is exactly what the inventory exists to surface.
2026-05-03 20:32:55 +00:00
parent ed2d090afc
commit 2df1a2fe01
1 changed files with 32 additions and 0 deletions
@@ -65,6 +65,38 @@ The watcher (`watcher.py` + `aaronai-watcher.service`) is a clean Stage 1 that m
 ---
 ## Updates — 2026-05-03 session
 *Layered updates from Track 1 improvement work on 2026-05-03. The 2026-05-02 inventory above is preserved as a point-in-time snapshot; corrections and resolutions are recorded here with provenance.*
 ### Resolved
 - **NREM-shape divergence #1 (cumulative cross-night exclusion 500-cap, `dream.py`): RESOLVED.** Replaced cumulative `retrieved_sources` with session-scoped novelty. Early REM now excludes only NREM high-scorers from the current session; Late REM excludes the current session's NREM ∪ Early REM. Legacy `retrieved_sources` key cleared from `dreamer_state.json`. Verification: post-fix dream-manifest source count rose to 24 (vs. 13 and 16 on the two prior comparable runs), so the previously hidden ~40% of the corpus is now reachable by Early/Late REM, as the architecture and reframe specify. The NREM exclusion fix from 2026-05-02 is preserved.
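
The session-scoped exclusion rule described above can be sketched as follows. This is a minimal illustration, not the code in `scripts/dream.py`: the `SessionNovelty` class, its field names, and the phase strings are hypothetical, and the sketch assumes "NREM" in the Late REM rule means the same NREM high-scorers that Early REM excludes. NREM's own exclusion logic is not modeled.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class SessionNovelty:
    """Tracks sources surfaced in ONE dream session (hypothetical helper).

    A fresh instance is created per session, so nothing accumulates
    across nights the way the old cumulative list did.
    """
    nrem_high_scorers: Set[str] = field(default_factory=set)
    early_rem: Set[str] = field(default_factory=set)

    def exclusions_for(self, phase: str) -> Set[str]:
        """Return the sources a given REM phase should skip."""
        if phase == "early_rem":
            # Early REM skips only this session's NREM high-scorers.
            return set(self.nrem_high_scorers)
        if phase == "late_rem":
            # Late REM skips this session's NREM ∪ Early REM.
            return self.nrem_high_scorers | self.early_rem
        raise ValueError(f"no exclusion rule sketched for phase {phase!r}")


# Illustrative usage with made-up source names.
session = SessionNovelty()
session.nrem_high_scorers.update({"a.md", "b.md"})
session.early_rem.add("c.md")
late_rem_skip = session.exclusions_for("late_rem")
```

Because the state lives on a per-session object rather than in `dreamer_state.json`, a new session starts with empty exclusion sets, which is the property the fix relies on.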
 ### Corrections to existing findings
 - **`stage2_metadata` location (Phase 1, `stage2_worker.py`):** the metadata column lives on `stage_3_queue.stage2_metadata` (jsonb), **not on `stage_2_queue`**. `stage_2_queue` has only basic queue fields (`id, source, full_text, char_length, timestamps, failure_reason, attempts`). The 2026-05-02 entry implied otherwise. Corrected via direct schema inspection on 2026-05-03.
 - **Stage 2 char_length gate (Phase 1, `stage2_worker.py`):** the `char_length < 2000` check at line 139 runs *before* the Mistral call at line 149. For sub-2000-char docs, Mistral is **never invoked** — the worker logs `Processing → Skipping Stage 3 → completed_at = NOW()` with no Mistral pass between them. The earlier framing of "documents under 2000 chars skip Stage 3" was correct as written, but the implied "Stage 2 produces orientation metadata for everything" architecture commitment is not what the code does. 339 of 1,041 completed Stage 2 docs (33%) have **no frame data extracted at all**, not "frame data extracted then discarded."
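
The ordering issue in the second correction can be shown with a small sketch. It does not reproduce `stage2_worker.py`: `process_doc`, `CHAR_GATE`, and the injected `extract_frames` callable (standing in for the Mistral call) are hypothetical names. The only behavior taken from the finding is that the length check runs first and short-circuits.

```python
from typing import Callable, Optional

CHAR_GATE = 2000  # threshold quoted in the finding; the constant name is assumed


def process_doc(full_text: str,
                extract_frames: Callable[[str], dict]) -> Optional[dict]:
    """Return frame metadata, or None when the length gate short-circuits."""
    if len(full_text) < CHAR_GATE:
        # Gate fires before extraction: the extractor is never called,
        # so there is no frame data to keep or discard.
        return None
    return extract_frames(full_text)


# Illustrative usage: count how often the stand-in extractor is invoked.
calls = []


def fake_extractor(text: str) -> dict:
    calls.append(len(text))
    return {"frames": ["placeholder"]}


short_result = process_doc("x" * 1999, fake_extractor)  # gate fires
long_result = process_doc("x" * 2000, fake_extractor)   # extractor runs
```

After these two calls, `calls` holds a single entry, which is the distinction between "never extracted" and "extracted then discarded" that the correction draws.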
 ### New findings from 2026-05-03 frame analysis (Improvement #3)
 - **`ingest_conversations.py` bypasses Stage 2 entirely.** 198 distinct conversation sources (`Claude:`, `ChatGPT:`, `Aaron AI:`, plus `type='aaronai_conversation'`) write directly to pgvector `embeddings` and never enter `stage_2_queue`. Conversations have **zero frame coverage by design**, not by accident. Combined with the 339-doc char-gate exclusion and 12 Stage 2 failures, **only 56% of the embeddings corpus has any frame data**. Same NREM shape: a routing decision the architecture never explicitly requested silently contradicts its "Stage 2 produces orientation for everything" commitment.
 - **Voice notes (14) and dream outputs (39) are systematically excluded from the frame system.** Within the 339-doc <2000-char gap: all 14 voice notes and all 39 dreamer-output files (NREM, Early REM, Late REM, synthesis markdown) are present. Voice is one of Aaron's primary capture channels. Dream outputs are the dreamer's own reflection. Both are silent to the frame system that orients downstream extraction — meaning the dreamer cannot frame-condition on its own output. Same NREM shape as the others.
 - **File-type × frame stratification signal exists and is currently unused** (cross-link to Phase 3 `embeddings.type` finding). The 2026-05-03 frame analysis (`docs/stage2-frame-analysis-2026-05-03.md` §5) shows that within frame-extracted docs, "Programming" pivots to pptx (n=15), "Application" pivots to pdf (n=13), and "Education" spreads across pdf+docx. File type therefore adds discriminating signal to frame routing. Currently `embeddings.type` is NULL for 71% of rows; backfilling it (Improvement #2, not yet applied) would make this stratification queryable at retrieval time rather than something to reverse-engineer from filenames.
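
While `embeddings.type` remains mostly NULL, a frame × file-type cross-tab of the kind described in the last finding has to be derived from filenames. The sketch below shows one way to do that. It is not the code in `scripts/experiments/frame_distribution_report.py`: the function names and the `(frame, source)` row shape are assumptions, and the input rows are toy values rather than corpus data.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Tuple


def file_type(source: str) -> str:
    """Derive a lowercase file type from a source name's extension."""
    suffix = PurePosixPath(source).suffix.lower().lstrip(".")
    return suffix or "unknown"


def stratify(rows: Iterable[Tuple[str, str]]) -> Counter:
    """Count (frame, file_type) pairs from (frame, source) rows."""
    return Counter((frame, file_type(source)) for frame, source in rows)


# Toy rows for illustration only; these are not real corpus entries.
toy_rows = [
    ("Programming", "slides/intro.pptx"),
    ("Programming", "deck.PPTX"),
    ("Application", "forms/letter.pdf"),
    ("Education", "notes"),
]
counts = stratify(toy_rows)
```

Once `embeddings.type` is backfilled, the same grouping could presumably be done in SQL at retrieval time instead of by parsing source names.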
 ### Artifacts produced 2026-05-03
 - **Code change:** `scripts/dream.py` (Improvement #1).
 - **New SQL view:** `stage2_frames_v` (over `stage_3_queue.stage2_metadata`; `CREATE OR REPLACE`, idempotent, drop with `DROP VIEW stage2_frames_v;`).
 - **New analysis script:** `scripts/experiments/frame_distribution_report.py` (read-only).
 - **JSON sidecar:** `experiments/frame_distribution_2026-05-03.json`.
 - **Report:** `docs/stage2-frame-analysis-2026-05-03.md`.
 ---
 ## Phase 1 — Scripts