# Stage 2 Frame Analysis — 2026-05-03

*Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).*

**Data source:** `stage_3_queue.stage2_metadata` (jsonb), exposed via the new SQL view `stage2_frames_v`. Analysis script: `scripts/experiments/frame_distribution_report.py`. Sidecar JSON: `experiments/frame_distribution_2026-05-03.json`. **Stage 3 service is currently stopped, so this is a stable snapshot.**

---

## Verdict

**Frames cluster meaningfully but coverage is partial.** Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate: the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. **However, only 56% of the embeddings corpus has any frame data at all.** The other 44% (conversations, short files, voice notes, dream outputs) has zero frame coverage by design, not by accident.

Frame-conditional routing is a viable γ component candidate **for the document side of the corpus**. It is not a viable router for the conversational or self-generated side without filling the coverage hole.

---

## 1. Corpus-wide frame coverage

| Class | Count | % of corpus | Frame coverage |
|---|---|---|---|
| Total distinct sources in `embeddings` | 1,255 | 100% | — |
| Files with frames (`stage_3_queue.stage2_metadata`) | 704 | 56.1% | yes |
| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | **none (bypass Stage 2 by design)** |
| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | **none (Mistral never invoked)** |
| Files that failed Stage 2 | 12 | 1.0% | none |

**56.1% frame coverage** is the headline. The architectural reason for the gap is twofold:

1. **`ingest_conversations.py` writes directly to `embeddings`** with `type='aaronai_conversation'` and never enqueues to `stage_2_queue`.
   Conversations have never been frame-extracted, full stop.
2. **`stage2_worker.py:139` gates Mistral on char_length.** Docs <2,000 chars are marked complete with `completed_at = NOW()` *before* Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory), but neither is any frame data produced.

## 2. Frame distribution (the docs that DO have frames)

**668 docs, 1,374 distinct frame labels. Top-20 by count:**

| Frame | Count | % of frame-extracted docs |
|---|---|---|
| Education | 238 | 35.6% |
| Course | 58 | 8.7% |
| Programming | 43 | 6.4% |
| Design | 32 | 4.8% |
| Professional Experience | 24 | 3.6% |
| Employment | 24 | 3.6% |
| Research | 23 | 3.4% |
| 3D Printing | 22 | 3.3% |
| Project, Grading, Art, Budget | 21 each | 3.1% |
| Academic Integrity | 20 | 3.0% |
| Teaching, Technology, Attendance, Application | 13–19 | — |
| Accommodation, Manufacturing, Coursework, Recommendation | 10–13 | — |

**Per-doc frame count:** median 3–4 frames per doc; 76% of docs have 3–5 frames; one outlier doc has 30 frames (Mistral over-segmented).

**Long tail is enormous.** 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.

**"Education" is the universal frame.** It dominates co-occurrence pairs (8 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."

## 3. Label hygiene

**54 normalized collisions** detected (case-insensitive, underscore-vs-space):

| Concept | Variant counts |
|---|---|
| Professional Experience | `Professional Experience`:24 + `Professional_Experience`:6 |
| 3D Printing | `3D Printing`:22 + `3D_Printing`:7 |
| Academic Integrity | `Academic Integrity`:20 + `Academic_Integrity`:2 |
| Course Design | `Course Design`:9 + `Course_Design`:1 |
| Project Management | `Project Management`:7 + `Project_Management`:1 |
| Computational Design | `Computational Design`:7 + `Computational_Design`:1 |
| (… 48 more) | |

Without normalization, ~30+ documents have their frames silently split across spelling variants for the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, single-space, hyphens preserved.

## 4. Worker version drift

| Worker version | Doc count | Notes |
|---|---|---|
| v2.1 | 665 | Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
| v2.0 | 3 | Same key shape as v2.1 baseline. |

Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. **For Track 2 substrate ingest, plan for `stage2_metadata` to occasionally include unexpected top-level keys.**

## 5. File-type signal

This is the most useful Track 2 finding from this report. `stage_3_queue.source` stores bare filenames, so I bin by file-type suffix.
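
A minimal sketch of suffix binning and a frame × file-type count, for illustration only and not taken from the report script: the `SUFFIX_BINS` mapping, the function names, and the `(source, frames)` row shape are all assumptions. Sources that a suffix alone cannot identify fall into `other` in this sketch.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical suffix-to-bin mapping; the real script may use different rules.
SUFFIX_BINS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".md": "markdown",
    ".txt": "txt",
}


def file_type_bin(source):
    """Map a bare filename to a file-type bin using its (case-insensitive) suffix."""
    suffix = PurePosixPath(source).suffix.lower()
    return SUFFIX_BINS.get(suffix, "other")


def cross_tab(rows):
    """Count documents per (frame, file-type bin).

    rows: iterable of (source, frames) pairs, where frames is a list of labels.
    A frame repeated within one document is counted once for that document.
    """
    counts = Counter()
    for source, frames in rows:
        bin_name = file_type_bin(source)
        for frame in set(frames):
            counts[(frame, bin_name)] += 1
    return counts
```

Counting each frame once per document keeps the cells comparable to per-doc counts rather than per-label counts.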
Frames stratify cleanly:

| Frame | pdf | docx | pptx | markdown | txt | dream |
|---|---|---|---|---|---|---|
| Education | 116 | 119 | 3 | — | — | — |
| Course | 29 | 29 | — | — | — | — |
| Programming | 12 | 10 | **15** | — | 6 | — |
| Application | **13** | 2 | — | — | — | — |
| 3D Printing | 11 | 3 | **8** | — | — | — |
| Manufacturing | 3 | 6 | 4 | — | — | — |
| Research | 9 | 13 | — | 1 | — | — |

**Concrete signal:** "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal (the watcher already knows it), and it disambiguates frames that the model treats as equivalent.

**`embeddings.type` is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.**

## 6. Systematic exclusions inside the 339-doc gap

Of the 339 short docs that bypass frame extraction, the breakdown by file type:

| Type | Count | What this is |
|---|---|---|
| pdf | 110 | Short PDFs (forms, single-page docs) |
| docx | 110 | Short Word docs |
| dream_output | 39 | **The dreamer's own NREM/Early-REM/Late-REM/synthesis files** |
| pptx | 31 | Short slide decks |
| txt | 28 | Plain-text files |
| voice_note | 14 | **Every voice note in the corpus** |
| markdown | 7 | Short markdown |

**Two specific systematic exclusions worth naming separately:**

- **All 14 voice notes have no frames.** Voice is one of Aaron's primary capture channels. The frame system is silent on it.
- **All 39 dream outputs have no frames.** The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.

These are NREM-shape findings: the architecture's frame extraction is *quietly* not running on whole categories of input that the architecture treats as first-class.
Recommended for the inventory.

---

## 7. Would frame-conditional routing be a viable γ component, and what would it condition on?

**Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification.** The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted.

Conditions for the framed-doc subset:

1. **Normalize labels before any routing decision.** 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
2. **Treat "Education" as a near-universal prior, not a frame.** It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the *base case* and condition on the secondary frame. (See the §8 follow-up: the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
3. **Combine frames with file type, not frames alone.** Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive (Programming → pptx, n=15; Application → pdf, n=13), but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.

**What it would condition on:** the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute `P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length)` rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.

**Defined scope (the coverage caveat):** The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:

- **(a) Backfill frames for short docs and conversations.** Run Mistral on the 339 short docs (cheap, since they're short) and on the 198 conversations.
  This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
- **(b) Use a degraded fallback for unframed docs.** File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
- **(c) Accept the gap as a scope limit.** The router only operates on long, non-conversation files. The 44% gap is unrouted (whatever the current default is).

(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). Mistral cost on the 537 sources involved (339 short docs plus 198 conversations) is small. **Recommend (a) before any router work begins.**

---

## 8. Recommended follow-ups (ordered by ROI)

1. **Backfill the 339 short docs.** Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities: they are the primary capture and primary self-reflection channels, and both are currently silent.
2. **Backfill conversations into frame extraction.** Either modify `ingest_conversations.py` to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
3. **Add a frame-label normalizer at the worker.** New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
4. **Decide whether to deprecate "Education" as a frame.** It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
5. **Per-frame retrieval-similarity follow-up (deferred from this report).** Now that we know frames cluster meaningfully, instrumenting `dream.py` to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.
6. **Diagnose the "Education" dominance: prompt artifact vs. corpus shape.** Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education" and classify each as *truly academic content* vs. *Education was a default Mistral reached for*. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. A 20-doc sample is small enough to do in one sitting and large enough to distinguish the hypotheses at >70/30 splits.

---

## 9. Inventory edits flagged for session-end batch

- **Correction:** `stage2_metadata` lives on `stage_3_queue.stage2_metadata` (jsonb), not on `stage_2_queue` as the inventory implied. The Phase 1 / `stage2_worker.py` entry should be corrected.
- **New finding:** the char_length gate runs *before* the Mistral call (`stage2_worker.py:139` precedes `:147`). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
- **New finding:** `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1: a routing decision the architecture didn't explicitly request.
- **New finding (cross-link to #2):** `embeddings.type` NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
- **New finding:** Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: **all 14 voice notes** and **all 39 dream outputs** are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection.
  Both are silent to the frame system that orients downstream extraction, which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others: a routing decision the architecture didn't explicitly request.

## 10. Reproduction

```bash
cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json:   experiments/frame_distribution_.json
# view:   stage2_frames_v (in pgvector DB)
```

The view is `CREATE OR REPLACE`, idempotent. Drop with `DROP VIEW stage2_frames_v;` if needed.
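
For reference, the canonical label form recommended in §3 (lowercase, single-space, hyphens preserved) could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the worker's or the report script's code: `normalize_label` and `collision_groups` are hypothetical names, and treating underscores as spaces is inferred from the collision table in §3.

```python
import re
from collections import defaultdict


def normalize_label(raw):
    """Return a canonical form: underscores become spaces, whitespace runs
    collapse to one space, text is lowercased. Hyphens are left untouched."""
    return re.sub(r"[\s_]+", " ", raw).strip().lower()


def collision_groups(label_counts):
    """Group raw labels that share a canonical form.

    label_counts: dict mapping raw label -> document count.
    Returns only the canonical forms that have more than one raw variant.
    """
    groups = defaultdict(dict)
    for raw, count in label_counts.items():
        groups[normalize_label(raw)][raw] = count
    return {canon: variants for canon, variants in groups.items() if len(variants) > 1}
```

Called with counts from the §3 table, `collision_groups({"3D Printing": 22, "3D_Printing": 7, "Programming": 43})` groups the two 3D Printing variants under `"3d printing"` and omits Programming, which has a single spelling.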