aaronAI/docs/stage2-frame-analysis-2026-05-03.md
aaron ed2d090afc experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3)
Read-only inspection of the frame data Mistral produces in Stage 2, in
service of Track 2 substrate design (Step 2.4 operation set spec).

Artifacts:
- New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata`
  (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured
  fields so worker-version drift is inspectable).
- Analysis script: frequency, label-hygiene collisions, per-doc count,
  co-occurrence (top-K), file-type × frame cross-tab, worker-version split,
  data-gap accounting, corpus-wide coverage.
- JSON sidecar for diff-across-runs reproducibility.
- Markdown report with explicit Track 2 viability section.

Headline findings:
- Frames cluster meaningfully on the framed-doc subset (subject to
  validation on larger samples for the file-type cross-tab).
- Only 56% of corpus has frame coverage. 198 conversation sources bypass
  Stage 2 by design (`ingest_conversations.py` writes directly to
  embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate;
  12 Stage 2 failures.
- All 14 voice notes and all 39 dream outputs are in the data gap.
  Primary capture and self-reflection channels are silent to the frame
  system. Dreamer cannot frame-condition on its own output.
- 54 normalized label collisions (`Professional Experience` vs
  `Professional_Experience`, etc.) — any router must normalize first.
- "Education" is a near-universal frame (36% of frame-extracted docs);
  cheap 20-doc hand-inspection diagnostic in report §8 to distinguish
  prompt artifact from corpus shape.
- File-type × frame stratification is concrete signal that ties to
  Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of
  rows.

No production code touched. View is droppable; script is read-only.
2026-05-03 20:32:37 +00:00


Stage 2 Frame Analysis — 2026-05-03

Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).

Data source: stage_3_queue.stage2_metadata (jsonb), exposed via the new SQL view stage2_frames_v. Analysis script: scripts/experiments/frame_distribution_report.py. Sidecar JSON: experiments/frame_distribution_2026-05-03.json. Stage 3 service is currently stopped, so this is a stable snapshot.


Verdict

Frames cluster meaningfully but coverage is partial. Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate — the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. However, only 56% of the embeddings corpus has any frame data at all. The other 44% — conversations, short files, voice notes, dream outputs — has zero frame coverage by design, not by accident.

Frame-conditional routing is a viable γ component candidate for the document side of the corpus. It is not a viable router for the conversational or self-generated side without filling the coverage hole.


1. Corpus-wide frame coverage

| Class | Count | % of corpus | Frame coverage |
|---|---|---|---|
| Total distinct sources in embeddings | 1,255 | 100% | |
| Files with frames (stage_3_queue.stage2_metadata) | 704 | 56.1% | yes |
| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | none (bypass Stage 2 by design) |
| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | none (Mistral never invoked) |
| Files that failed Stage 2 | 12 | 1.0% | none |

56.1% frame coverage is the headline. The architectural reason for the gap is twofold:

  1. ingest_conversations.py writes directly to embeddings with type='aaronai_conversation' and never enqueues to stage_2_queue. Conversations have never been frame-extracted, full stop.
  2. stage2_worker.py:139 gates Mistral on char_length. Docs <2,000 chars are marked complete with completed_at = NOW() before Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory) — but neither is any frame data produced.
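The coverage percentages in the table can be re-derived from the raw counts. The sketch below is a minimal check, not the analysis script itself; the dictionary keys are illustrative names, and only the counts come from the table. The four listed classes sum to 1,253, so the sketch also reports the residual against the 1,255 total.

```python
# Counts copied from the coverage table; key names are illustrative, not from the codebase.
TOTAL_SOURCES = 1255
CLASS_COUNTS = {
    "framed_files": 704,
    "conversations": 198,
    "short_files": 339,
    "stage2_failures": 12,
}


def coverage_table(total, counts):
    """Return {class: percent of corpus}, rounded to one decimal place."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


def unclassified(total, counts):
    """Number of sources not accounted for by any listed class."""
    return total - sum(counts.values())


shares = coverage_table(TOTAL_SOURCES, CLASS_COUNTS)
for name, pct in shares.items():
    print(f"{name:16s} {CLASS_COUNTS[name]:5d} {pct:5.1f}%")
print("unclassified:", unclassified(TOTAL_SOURCES, CLASS_COUNTS))
```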

2. Frame distribution (the docs that DO have frames)

668 docs, 1,374 distinct frame labels. Top-20 by count:

| Frame | Count | % of frame-extracted docs |
|---|---|---|
| Education | 238 | 35.6% |
| Course | 58 | 8.7% |
| Programming | 43 | 6.4% |
| Design | 32 | 4.8% |
| Professional Experience | 24 | 3.6% |
| Employment | 24 | 3.6% |
| Research | 23 | 3.4% |
| 3D Printing | 22 | 3.3% |
| Project, Grading, Art, Budget | 21 each | 3.1% |
| Academic Integrity | 20 | 3.0% |
| Teaching, Technology, Attendance, Application | 13-19 | |
| Accommodation, Manufacturing, Coursework, Recommendation | 10-13 | |

Per-doc frame count: median 3-4 frames per doc; 76% of docs have 3-5 frames; one outlier doc has 30 frames (Mistral over-segmented).

Long tail is enormous. 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.
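The per-frame document counts, the long-tail observation, and the co-occurrence pairs discussed in this section can all be computed in one pass. The following is a minimal sketch, assuming each doc is available as a list of frame labels keyed by source name; that input shape and the function name `frame_stats` are assumptions, not the interface of the analysis script. The sample data is a toy, not drawn from the corpus.

```python
from collections import Counter
from itertools import combinations


def frame_stats(docs, top_k=10):
    """docs: mapping of source name -> list of frame labels (assumed shape)."""
    doc_freq = Counter()   # number of docs each label appears in
    pair_freq = Counter()  # number of docs each unordered label pair appears in
    for frames in docs.values():
        unique = sorted(set(frames))  # count each label once per doc
        doc_freq.update(unique)
        pair_freq.update(combinations(unique, 2))  # sorted input gives stable pair keys
    n_docs = len(docs)
    share = {label: count / n_docs for label, count in doc_freq.items()}
    singletons = sum(1 for count in doc_freq.values() if count == 1)
    return doc_freq, share, pair_freq.most_common(top_k), singletons


# Toy data for illustration only.
docs = {
    "a.pdf": ["Education", "Course", "Grading"],
    "b.docx": ["Education", "Course"],
    "c.pptx": ["Programming", "Education"],
    "d.pdf": ["3D Printing", "Manufacturing"],
}
doc_freq, share, top_pairs, singletons = frame_stats(docs)
print(doc_freq.most_common(3), top_pairs[0], singletons)
```

The singleton count is the quickest way to quantify the long tail: a vocabulary where most labels appear in exactly one doc behaves like free text rather than a taxonomy.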

"Education" is the universal frame. It dominates co-occurrence pairs (8 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."

3. Label hygiene

54 normalized collisions detected (case-insensitive, underscore-vs-space):

| Concept | Variant counts |
|---|---|
| Professional Experience | Professional Experience: 24 + Professional_Experience: 6 |
| 3D Printing | 3D Printing: 22 + 3D_Printing: 7 |
| Academic Integrity | Academic Integrity: 20 + Academic_Integrity: 2 |
| Course Design | Course Design: 9 + Course_Design: 1 |
| Project Management | Project Management: 7 + Project_Management: 1 |
| Computational Design | Computational Design: 7 + Computational_Design: 1 |

(… 48 more)

Without normalization, an estimated 30 or more documents have their frames silently split across spelling variants of the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, single-space, hyphens preserved.
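The recommended canonical form can be expressed as a small function, with collision groups found by grouping raw labels on their normalized key. This is a sketch of one possible normalizer under the stated rule (lowercase, single-space, hyphens preserved); the function names are hypothetical, and treating underscores as spaces is inferred from the collision examples above. The sample counts are taken from the collision table.

```python
import re
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Lowercase, treat underscores as spaces, collapse whitespace, keep hyphens."""
    spaced = label.replace("_", " ")
    return re.sub(r"\s+", " ", spaced).strip().lower()


def collision_groups(label_counts):
    """Group raw labels by canonical form; keep only groups with more than one variant."""
    groups = defaultdict(dict)
    for raw, count in label_counts.items():
        groups[normalize_label(raw)][raw] = count
    return {canon: variants for canon, variants in groups.items() if len(variants) > 1}


counts = {
    "Professional Experience": 24,
    "Professional_Experience": 6,
    "3D Printing": 22,
    "3D_Printing": 7,
    "Course": 58,
}
groups = collision_groups(counts)
for canon, variants in groups.items():
    print(canon, sum(variants.values()), variants)
```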

4. Worker version drift

| Worker version | Doc count | Notes |
|---|---|---|
| v2.1 | 665 | Two ad-hoc-key intrusions: academic_details (1 doc), additional_information (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
| v2.0 | 3 | Same key shape as v2.1 baseline. |

Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. For Track 2 substrate ingest, plan for stage2_metadata to occasionally include unexpected top-level keys.
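One way a Track 2 consumer could plan for unexpected top-level keys is to split each payload into known and extra keys rather than failing on the extras. The sketch below assumes a placeholder schema: `EXPECTED_KEYS` is hypothetical, since this report does not list the actual prompt-schema keys, and `split_metadata` is an illustrative name.

```python
import json

# Placeholder schema: the real prompt-schema keys are not listed in this report.
EXPECTED_KEYS = frozenset({"frames", "summary"})


def split_metadata(raw):
    """Split a stage2_metadata payload into (known, unexpected) top-level keys.

    Accepts either a JSON string or an already-decoded mapping.
    """
    payload = json.loads(raw) if isinstance(raw, str) else dict(raw)
    known = {k: v for k, v in payload.items() if k in EXPECTED_KEYS}
    extra = {k: v for k, v in payload.items() if k not in EXPECTED_KEYS}
    return known, extra


known, extra = split_metadata(
    '{"frames": ["Education"], "academic_details": {"note": "ad-hoc key"}}'
)
print(known, sorted(extra))
```

Keeping the extras instead of discarding them preserves the raw record for later inspection, which matches the view's approach of exposing raw JSONB alongside structured fields.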

5. File-type signal

This is the most useful Track 2 finding from this report.

stage_3_queue.source stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:

Frame pdf docx pptx markdown txt dream
Education 116 119 3
Course 29 29
Programming 12 10 15 6
Application 13 2
3D Printing 11 3 8
Manufacturing 3 6 4
Research 9 13 1

Concrete signal: "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal — the watcher already knows it — and it disambiguates frames that the model treats as equivalent. embeddings.type is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.
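Binning by suffix and building the frame × file-type cross-tab takes only a few lines. The sketch below assumes a suffix-to-bin mapping; the report does not state its exact binning rules (for example, how dream outputs or voice notes are recognized from filenames), so `SUFFIX_BINS` and the "other" fallback are assumptions. The sample rows are toy data.

```python
from collections import Counter
from pathlib import PurePath

# Assumed mapping; the report's actual binning rules are not shown.
SUFFIX_BINS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".md": "markdown",
    ".txt": "txt",
}


def file_type(source: str) -> str:
    """Bin a bare filename by its suffix, case-insensitively."""
    return SUFFIX_BINS.get(PurePath(source).suffix.lower(), "other")


def cross_tab(docs):
    """docs: iterable of (source filename, list of frame labels)."""
    table = Counter()
    for source, frames in docs:
        for frame in set(frames):  # count each frame once per doc
            table[(frame, file_type(source))] += 1
    return table


# Toy rows for illustration only.
rows = [
    ("intro_slides.pptx", ["Programming", "Education"]),
    ("lab2.PPTX", ["Programming"]),
    ("syllabus.pdf", ["Education", "Course"]),
]
table = cross_tab(rows)
print(table[("Programming", "pptx")], table[("Education", "pdf")])
```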

6. Systematic exclusions inside the 339-doc gap

Of the 339 short docs that bypass frame extraction, the breakdown by file type:

| Type | Count | What this is |
|---|---|---|
| pdf | 110 | Short PDFs (forms, single-page docs) |
| docx | 110 | Short Word docs |
| dream_output | 39 | The dreamer's own NREM/Early-REM/Late-REM/synthesis files |
| pptx | 31 | Short slide decks |
| txt | 28 | Plain-text files |
| voice_note | 14 | Every voice note in the corpus |
| markdown | 7 | Short markdown |

Two specific systematic exclusions worth naming separately:

  • All 14 voice notes have no frames. Voice is one of Aaron's primary capture channels. The frame system is silent on it.
  • All 39 dream outputs have no frames. The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.

These are NREM-shape findings: the architecture's frame extraction is quietly not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.


7. Would frame-conditional routing be a viable γ component, and what would it condition on?

Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification. The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:

  1. Normalize labels before any routing decision. 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
  2. Treat "Education" as a near-universal prior, not a frame. It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the base case and condition on the secondary frame. (See §8 follow-up — the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
  3. Combine frames with file type, not frames alone. Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive — Programming → pptx (n=15), Application → pdf (n=13) — but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.

What it would condition on: the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length) rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.
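As an illustration of conditioning on the joint of frames, file type, and length, the sketch below estimates P(relevant | condition key) from labeled examples using bucketed counts with Laplace smoothing. This is one simple estimator chosen for illustration, not a proposed Track 2 design. The bucket edges (other than the 2,000-char gate), the overlap cap, and all names are assumptions, and frames are assumed to be normalized already.

```python
def length_bucket(n_chars):
    """Coarse length bucket; 2,000 mirrors the Stage 2 char gate, 20,000 is arbitrary."""
    if n_chars < 2000:
        return "short"
    return "medium" if n_chars < 20000 else "long"


def condition_key(frames, goal_frames, file_type, n_chars):
    """Joint key: (frame overlap with the goal capped at 2, file type, length bucket)."""
    overlap = len(set(frames) & set(goal_frames))
    return (min(overlap, 2), file_type, length_bucket(n_chars))


class BucketedRelevance:
    """Laplace-smoothed estimate of P(relevant | condition key)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.seen = {}
        self.hits = {}

    def observe(self, key, relevant):
        self.seen[key] = self.seen.get(key, 0) + 1
        self.hits[key] = self.hits.get(key, 0) + int(relevant)

    def probability(self, key):
        hits = self.hits.get(key, 0)
        seen = self.seen.get(key, 0)
        # With no observations this falls back to 0.5, the uninformed prior.
        return (hits + self.alpha) / (seen + 2 * self.alpha)


model = BucketedRelevance()
key = condition_key({"programming", "course"}, {"programming"}, "pptx", 8500)
for label in (True, True, True, False):
    model.observe(key, label)
print(key, round(model.probability(key), 3))
```

The smoothing matters here because cell counts in the cross-tab are small; an unsmoothed ratio would swing to 0 or 1 on a handful of examples.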

Defined scope (the coverage caveat):

The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:

  • (a) Backfill frames for short docs and conversations. Run Mistral on the 339 short docs (cheap — they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
  • (b) Use a degraded fallback for unframed docs. File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
  • (c) Accept the gap as a scope limit. The router only operates on long, non-conversation files. The 44% gap is unrouted (whatever the current default is).

(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). Mistral cost on the 537 sources in scope (339 short docs plus 198 conversations) is small. Recommend (a) before any router work begins.
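For comparison, option (b) amounts to picking the richest signal each doc actually has. The sketch below shows that dispatch under assumed inputs: the dict keys `frames`, `file_type`, and `type` are hypothetical field names, and only the `aaronai_conversation` type value comes from this report.

```python
def routing_signal(doc):
    """Return (signal name, signal value) for the richest signal available on a doc."""
    if doc.get("frames"):
        # Framed docs route on frames plus file type.
        return ("frame+type", (frozenset(doc["frames"]), doc.get("file_type")))
    if doc.get("type") == "aaronai_conversation":
        # Conversations bypass Stage 2, so only the conversation type is available.
        return ("conversation", doc["type"])
    if doc.get("file_type"):
        # Short files have no frames but still carry a file type.
        return ("file_type", doc["file_type"])
    return ("default", None)


print(routing_signal({"frames": ["programming"], "file_type": "pptx"}))
print(routing_signal({"type": "aaronai_conversation"}))
print(routing_signal({"file_type": "voice_note"}))
```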


8. Follow-ups

  1. Backfill the 339 short docs. Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities: they are the primary capture and self-reflection channels, and both are currently silent to the frame system.

  2. Backfill conversations into frame extraction. Either modify ingest_conversations.py to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.

  3. Add a frame-label normalizer at the worker. New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.

  4. Decide whether to deprecate "Education" as a frame. It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.

  5. Per-frame retrieval-similarity follow-up (deferred from this report). Now that we know frames cluster meaningfully, instrumenting dream.py to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.

  6. Diagnose the "Education" dominance: prompt artifact vs. corpus shape. Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education" and classify each as either truly academic content or a case where "Education" was a default tag Mistral reached for. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. A 20-doc sample is small enough to do in one sitting and large enough to distinguish the hypotheses at >70/30 splits.
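The 20-doc diagnostic in item 6 can be made reproducible with a seeded sample, and the ">70/30" claim can be sanity-checked with a binomial tail. The sketch below is illustrative: the function names, the seed, and the placeholder source names are assumptions. Reading "70/30" as 14 of 20 docs, the tail calculation shows that if the true split were 50/50, seeing 14 or more docs on one named side would happen roughly 6% of the time.

```python
import random
from math import comb


def sample_for_inspection(sources, k=20, seed=0):
    """Reproducible random sample of sources tagged with the frame under review."""
    pool = sorted(sources)  # sort first so the sample does not depend on input order
    return random.Random(seed).sample(pool, min(k, len(pool)))


def tail_probability(n, threshold, p=0.5):
    """P(X >= threshold) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(threshold, n + 1))


# Placeholder names standing in for the 238 docs tagged "Education".
education_docs = [f"doc_{i:03d}.pdf" for i in range(238)]
picked = sample_for_inspection(education_docs)
p_14_of_20 = tail_probability(20, 14)
print(len(picked), round(p_14_of_20, 4))
```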


9. Inventory edits flagged for session-end batch

  • Correction: stage2_metadata lives on stage_3_queue.stage2_metadata (jsonb), not on stage_2_queue as the inventory implied. The Phase 1 / stage2_worker.py entry should be corrected.
  • New finding: the char_length gate runs before the Mistral call (stage2_worker.py:139 precedes :147). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
  • New finding: ingest_conversations.py bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1 — a routing decision the architecture didn't explicitly request.
  • New finding (cross-link to #2): embeddings.type NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
  • New finding: Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: all 14 voice notes and all 39 dream outputs are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction — which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others — a routing decision the architecture didn't explicitly request.

10. Reproduction

cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json: experiments/frame_distribution_<date>.json
# view: stage2_frames_v (in pgvector DB)

The view is CREATE OR REPLACE, idempotent. Drop with DROP VIEW stage2_frames_v; if needed.