aaron ed2d090afc experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3)

Read-only inspection of the frame data Mistral produces in Stage 2, in
service of Track 2 substrate design (Step 2.4 operation set spec).

Artifacts:
- New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata`
  (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured
  fields so worker-version drift is inspectable).
- Analysis script: frequency, label-hygiene collisions, per-doc count,
  co-occurrence (top-K), file-type × frame cross-tab, worker-version split,
  data-gap accounting, corpus-wide coverage.
- JSON sidecar for diff-across-runs reproducibility.
- Markdown report with explicit Track 2 viability section.

Headline findings:
- Frames cluster meaningfully on the framed-doc subset (subject to
  validation on larger samples for the file-type cross-tab).
- Only 56% of corpus has frame coverage. 198 conversation sources bypass
  Stage 2 by design (`ingest_conversations.py` writes directly to
  embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate;
  12 Stage 2 failures.
- All 14 voice notes and all 39 dream outputs are in the data gap.
  Primary capture and self-reflection channels are silent to the frame
  system. Dreamer cannot frame-condition on its own output.
- 54 normalized label collisions (`Professional Experience` vs
  `Professional_Experience`, etc.) — any router must normalize first.
- "Education" is a near-universal frame (36% of frame-extracted docs);
  cheap 20-doc hand-inspection diagnostic in report §8 to distinguish
  prompt artifact from corpus shape.
- File-type × frame stratification is concrete signal that ties to
  Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of
  rows.

No production code touched. View is droppable; script is read-only.

2026-05-03 20:32:37 +00:00

13 KiB


Stage 2 Frame Analysis — 2026-05-03

Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).

Data source: stage_3_queue.stage2_metadata (jsonb), exposed via the new SQL view stage2_frames_v. Analysis script: scripts/experiments/frame_distribution_report.py. Sidecar JSON: experiments/frame_distribution_2026-05-03.json. Stage 3 service is currently stopped, so this is a stable snapshot.

Verdict

Frames cluster meaningfully but coverage is partial. Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate: the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. However, only 56% of the embeddings corpus has any frame data at all. The other 44% (conversations and short files, the latter including every voice note and dream output) has zero frame coverage, and apart from the 12 Stage 2 failures that gap is by design, not by accident.

Frame-conditional routing is a viable γ component candidate for the document side of the corpus. It is not a viable router for the conversational or self-generated side without filling the coverage hole.

1. Corpus-wide frame coverage

Class	Count	% of corpus	Frame coverage
Total distinct sources in `embeddings`	1,255	100%	—
Files with frames (`stage_3_queue.stage2_metadata`)	704	56.1%	yes
Conversations (Claude / ChatGPT / Aaron AI)	198	15.8%	none — bypass Stage 2 by design
Files <2,000 chars (Stage 2 char-gate skip)	339	27.0%	none — Mistral never invoked
Files that failed Stage 2	12	1.0%	none
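The percentages in the table can be re-derived from the raw counts. The sketch below is only arithmetic on the numbers printed above; the dictionary keys are my own labels, not column names from the view. Note that the four listed classes add up to 1,253, which leaves 2 of the 1,255 sources not attributed in the table; the sketch reports that remainder rather than explaining it.

```python
# Counts copied from the coverage table; key names are illustrative only.
TOTAL_SOURCES = 1255
classes = {
    "files_with_frames": 704,
    "conversations": 198,
    "short_files_char_gate": 339,
    "stage2_failures": 12,
}


def pct(count: int, total: int = TOTAL_SOURCES) -> float:
    """Share of the corpus as a percentage, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shares = {name: pct(count) for name, count in classes.items()}

# Sources in `embeddings` that none of the four classes accounts for.
unattributed = TOTAL_SOURCES - sum(classes.values())

for name, share in shares.items():
    print(f"{name:24s} {classes[name]:5d} {share:5.1f}%")
print(f"unattributed sources: {unattributed}")
```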

56.1% frame coverage is the headline. The architectural reason for the gap is twofold:

ingest_conversations.py writes directly to embeddings with type='aaronai_conversation' and never enqueues to stage_2_queue. Conversations have never been frame-extracted, full stop.
stage2_worker.py:139 gates Mistral on char_length. Docs <2,000 chars are marked complete with completed_at = NOW() before Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory) — but neither is any frame data produced.
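To make the ordering concrete, here is a minimal sketch of a length gate that runs before the model call. It is not the code in stage2_worker.py; `process_doc`, `CHAR_GATE`, and the fake model are hypothetical names, and only the 2,000-char threshold comes from the report.

```python
from typing import Callable, Optional

CHAR_GATE = 2000  # threshold from the report; the constant name is hypothetical


def process_doc(text: str, extract_frames: Callable[[str], list]) -> Optional[list]:
    """Return frames, or None when the length gate skips the model call."""
    if len(text) < CHAR_GATE:
        # Gate runs first: the doc counts as complete and the model is never invoked.
        return None
    return extract_frames(text)


calls = []


def fake_model(text: str) -> list:
    """Stand-in for the Mistral call; records that it was invoked."""
    calls.append(len(text))
    return ["Education"]


short_result = process_doc("x" * 1999, fake_model)  # below the gate
long_result = process_doc("x" * 2000, fake_model)   # at the gate, so it runs
print(short_result, long_result, calls)
```

Because the gate returns before the call, the short doc costs nothing and produces nothing, which is the trade described above.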

2. Frame distribution (the docs that DO have frames)

668 docs, 1,374 distinct frame labels. Top-20 by count:

Frame	Count	% of frame-extracted docs
Education	238	35.6%
Course	58	8.7%
Programming	43	6.4%
Design	32	4.8%
Professional Experience	24	3.6%
Employment	24	3.6%
Research	23	3.4%
3D Printing	22	3.3%
Project, Grading, Art, Budget	21 each	3.1%
Academic Integrity	20	3.0%
Teaching, Technology, Attendance, Application	13–19	—
Accommodation, Manufacturing, Coursework, Recommendation	10–13	—

Per-doc frame count: median 3–4 frames per doc; 76% of docs have 3–5 frames; one outlier doc has 30 frames (Mistral over-segmented).

Long tail is enormous. 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.
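A sketch of how label frequency, per-doc share, and the singleton tail can be measured. The four documents below are toy data, not rows from stage2_frames_v, and the variable names are mine.

```python
from collections import Counter

# Toy stand-in for per-doc frame lists; not the real corpus.
doc_frames = {
    "a.pdf": ["Education", "Course", "Grading"],
    "b.docx": ["Education", "Research"],
    "c.pptx": ["Programming", "3D Printing"],
    "d.pdf": ["Education", "Budget"],
}

# Count each label once per doc so the share reads as "% of docs".
label_counts = Counter(
    label for frames in doc_frames.values() for label in set(frames)
)
n_docs = len(doc_frames)
doc_share = {
    label: round(100.0 * count / n_docs, 1) for label, count in label_counts.items()
}

# Labels that appear in exactly one doc form the long tail.
singletons = [label for label, count in label_counts.items() if count == 1]
singleton_fraction = len(singletons) / len(label_counts)

print(label_counts.most_common(3))
print(f"singleton labels: {len(singletons)} of {len(label_counts)}")
```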

"Education" is the universal frame. It dominates co-occurrence pairs (8 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."
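Co-occurrence here means unordered pairs of frames that share a document. A minimal sketch on toy data follows; the real script may count differently, so treat this as one reasonable definition and not a description of frame_distribution_report.py.

```python
from collections import Counter
from itertools import combinations

# Toy per-doc frame lists; not the real corpus.
docs = [
    ["Education", "Course", "Grading"],
    ["Education", "Course"],
    ["Programming", "3D Printing"],
    ["Education", "Research"],
]

pair_counts = Counter()
for frames in docs:
    # Sort so that (A, B) and (B, A) collapse into the same key.
    for pair in combinations(sorted(set(frames)), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
pairs_with_education = sum(1 for pair in pair_counts if "Education" in pair)

print(top_pair, top_count)
print(f"{pairs_with_education} of {len(pair_counts)} distinct pairs include Education")
```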

3. Label hygiene

54 collision groups detected after normalization (case-insensitive, underscore treated as space):

Concept	Variant counts
Professional Experience	`Professional Experience`:24 + `Professional_Experience`:6
3D Printing	`3D Printing`:22 + `3D_Printing`:7
Academic Integrity	`Academic Integrity`:20 + `Academic_Integrity`:2
Course Design	`Course Design`:9 + `Course_Design`:1
Project Management	`Project Management`:7 + `Project_Management`:1
Computational Design	`Computational Design`:7 + `Computational_Design`:1
(… 48 more)

Without normalization, roughly 30 or more documents have their frames silently split across spelling variants of the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, underscores mapped to spaces, whitespace collapsed to a single space, hyphens preserved.
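A minimal sketch of that normalization and of how collision groups fall out of it. The function names are hypothetical and the counts are taken from the table above; mapping underscores to spaces is my reading of the recommended canonical form.

```python
import re
from collections import defaultdict


def normalize_label(raw: str) -> str:
    """Lowercase, underscores to spaces, collapse whitespace, keep hyphens."""
    label = raw.replace("_", " ")
    label = re.sub(r"\s+", " ", label).strip()
    return label.lower()


def collision_groups(counts: dict) -> dict:
    """Group raw labels by canonical form; keep only groups with 2+ variants."""
    groups = defaultdict(dict)
    for raw, n in counts.items():
        groups[normalize_label(raw)][raw] = n
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


raw_counts = {
    "Professional Experience": 24,
    "Professional_Experience": 6,
    "3D Printing": 22,
    "3D_Printing": 7,
    "Programming": 43,
}
groups = collision_groups(raw_counts)
merged = {key: sum(variants.values()) for key, variants in groups.items()}
print(merged)
```

Merging moves "Professional Experience" from 24 to 30 and "3D Printing" from 22 to 29, which is the kind of shift a router would otherwise miss.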

4. Worker version drift

Worker version	Doc count	Notes
v2.1	665	Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema.
v2.0	3	Same key shape as v2.1 baseline.

Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. For Track 2 substrate ingest, plan for stage2_metadata to occasionally include unexpected top-level keys.
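One way a consumer could tolerate unexpected top-level keys is to split each metadata object into known and extra parts instead of failing. The expected key set below is hypothetical (the prompt schema is not listed in this report); only `academic_details` is a key actually observed.

```python
# Hypothetical expected schema; the real prompt schema is not shown in this report.
EXPECTED_KEYS = {"frames", "worker_version"}


def split_metadata(metadata: dict) -> tuple:
    """Separate keys the consumer understands from ad-hoc keys the model added."""
    known = {k: v for k, v in metadata.items() if k in EXPECTED_KEYS}
    extra = {k: v for k, v in metadata.items() if k not in EXPECTED_KEYS}
    return known, extra


row = {
    "frames": ["Education", "Course"],
    "worker_version": "v2.1",
    "academic_details": {"note": "example ad-hoc payload"},
}
known, extra = split_metadata(row)
print(sorted(known), sorted(extra))
```

Keeping `extra` around instead of dropping it preserves the raw signal for later inspection, in the same spirit as exposing raw JSONB in the view.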

5. File-type signal

This is the most useful Track 2 finding from this report.

stage_3_queue.source stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:

Frame	pdf	docx	pptx	markdown	txt	dream
Education	116	119	3	—	—	—
Course	29	29	—	—	—	—
Programming	12	10	15	—	6	—
Application	13	2	—	—	—	—
3D Printing	11	3	8	—	—	—
Manufacturing	3	6	4	—	—	—
Research	9	13	—	1	—	—

Concrete signal: "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal — the watcher already knows it — and it disambiguates frames that the model treats as equivalent. embeddings.type is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.
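A sketch of suffix binning and the frame × file-type cross-tab on toy rows. The suffix map is an assumption: it covers the suffix-based bins only, and how the dream and voice-note bins are derived is not stated here, so they are left out.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed suffix-to-bin map; dream and voice-note bins are not suffix-based here.
SUFFIX_TO_BIN = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".md": "markdown",
    ".txt": "txt",
}


def file_type_bin(source: str) -> str:
    """Bin a bare filename by its suffix, case-insensitively."""
    return SUFFIX_TO_BIN.get(PurePosixPath(source).suffix.lower(), "other")


# Toy (source, frames) rows; not real corpus data.
rows = [
    ("syllabus.pdf", ["Education", "Course"]),
    ("lecture01.pptx", ["Programming"]),
    ("notes.TXT", ["Programming"]),
    ("dossier.docx", ["Education"]),
]

crosstab = Counter()
for source, frames in rows:
    for frame in set(frames):
        crosstab[(frame, file_type_bin(source))] += 1

for (frame, ftype), n in sorted(crosstab.items()):
    print(f"{frame:12s} {ftype:8s} {n}")
```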

6. Systematic exclusions inside the 339-doc gap

Of the 339 short docs that bypass frame extraction, the breakdown by file type:

Type	Count	What this is
pdf	110	Short PDFs (forms, single-page docs)
docx	110	Short Word docs
dream_output	39	The dreamer's own NREM/Early-REM/Late-REM/synthesis files
pptx	31	Short slide decks
txt	28	Plain-text files
voice_note	14	Every voice note in the corpus
markdown	7	Short markdown

Two specific systematic exclusions worth naming separately:

All 14 voice notes have no frames. Voice is one of Aaron's primary capture channels. The frame system is silent on it.
All 39 dream outputs have no frames. The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.

These are NREM-shape findings: the architecture's frame extraction is quietly not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.
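As a consistency check, the per-type counts in the table above do sum to the 339-doc gap, and the two named exclusions make up a measurable slice of it. This is arithmetic on the reported numbers only.

```python
# Counts copied from the gap breakdown table.
gap_by_type = {
    "pdf": 110,
    "docx": 110,
    "dream_output": 39,
    "pptx": 31,
    "txt": 28,
    "voice_note": 14,
    "markdown": 7,
}

gap_total = sum(gap_by_type.values())

# Voice notes and dream outputs: the two categorical exclusions named above.
first_class_unframed = gap_by_type["voice_note"] + gap_by_type["dream_output"]
first_class_share = round(100.0 * first_class_unframed / gap_total, 1)

print(gap_total, first_class_unframed, first_class_share)
```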

7. Would frame-conditional routing be a viable γ component, and what would it condition on?

Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification. The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:

Normalize labels before any routing decision. 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
Treat "Education" as a near-universal prior, not a frame. It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the base case and condition on the secondary frame. (See §8 follow-up — the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
Combine frames with file type, not frames alone. Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive — Programming → pptx (n=15), Application → pdf (n=13) — but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.

What it would condition on: the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length) rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.
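A hedged sketch of the conditioning inputs, not of the router itself. Jaccard overlap is my stand-in for "frames ∩ goal_frames", the length bucket edges are invented apart from the 2,000-char gate, and every name below is hypothetical. A real router would feed a tuple like this into whatever model estimates relevance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocSignal:
    frames: frozenset   # normalized frame labels
    file_type: str      # e.g. "pdf", "pptx"
    char_length: int


def length_bucket(char_length: int) -> str:
    """Bucket edges are hypothetical; only the 2,000-char gate is from the report."""
    if char_length < 2000:
        return "short"
    if char_length < 20000:
        return "medium"
    return "long"


def routing_features(doc: DocSignal, goal_frames: frozenset) -> tuple:
    """Topic axis (frame overlap), genre axis (file type), granularity axis (length)."""
    overlap = doc.frames & goal_frames
    union = doc.frames | goal_frames
    jaccard = len(overlap) / len(union) if union else 0.0
    return (jaccard, doc.file_type, length_bucket(doc.char_length))


doc = DocSignal(frozenset({"programming", "3d printing"}), "pptx", 5000)
features = routing_features(doc, frozenset({"programming"}))
print(features)
```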

Defined scope (the coverage caveat):

The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:

(a) Backfill frames for short docs and conversations. Run Mistral on the 339 short docs (cheap — they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
(b) Use a degraded fallback for unframed docs. File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
(c) Accept the gap as a scope limit. The router only operates on long, non-conversation files. The 44% gap is unrouted (whatever the current default is).

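(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). The Mistral cost for the 537 unframed sources (339 short docs plus 198 conversations) should be small, although the conversations span multiple chunks each (see §8). Recommend (a) before any router work begins.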
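For comparison, option (b) amounts to picking the best signal each source actually has. A minimal sketch, with hypothetical names and return shape:

```python
from typing import Optional


def routing_signal(
    frames: Optional[list], file_type: Optional[str], is_conversation: bool
) -> tuple:
    """Pick the richest available signal: frames+type, then a degraded fallback."""
    if frames:
        return ("frame+type", tuple(sorted(frames)), file_type)
    if is_conversation:
        # Conversations bypass Stage 2, so only the conversation type is known.
        return ("conversation", (), None)
    if file_type:
        # Short files have no frames, but the watcher still knows the file type.
        return ("type-only", (), file_type)
    return ("default", (), None)


framed = routing_signal(["programming", "3d printing"], "pptx", False)
short_file = routing_signal(None, "pdf", False)
conversation = routing_signal(None, None, True)
print(framed, short_file, conversation)
```

The fallback branches are exactly the cases that option (a) would eliminate by backfilling frames.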

8. Recommended follow-ups (ordered by ROI)

Backfill the 339 short docs. Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities — primary capture and primary self-reflection channels currently silent.
Backfill conversations into frame extraction. Either modify ingest_conversations.py to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
Add a frame-label normalizer at the worker. New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
Decide whether to deprecate "Education" as a frame. It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
Per-frame retrieval-similarity follow-up (deferred from this report). Now that we know frames cluster meaningfully, instrumenting dream.py to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.
Diagnose the "Education" dominance: prompt artifact vs. corpus shape. Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education" and classify each as either truly academic content or a case where Education was a default tag Mistral reached for. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. A 20-doc sample is small enough to do in one sitting and large enough to distinguish the hypotheses at splits of 70/30 or sharper.
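A sketch of the 20-doc draw, seeded so the same sample can be re-inspected later, plus one hedged way to read the 70/30 claim: if the true split were an even 50/50, seeing 14 or more of 20 docs land in one class has a one-sided binomial probability of roughly 5.8%. The doc ids are placeholders and the function names are mine.

```python
import math
import random


def sample_for_inspection(doc_ids: list, k: int = 20, seed: int = 0) -> list:
    """Draw k distinct docs reproducibly; the seed value is arbitrary."""
    rng = random.Random(seed)
    return sorted(rng.sample(doc_ids, k))


def binom_tail(n: int, k_min: int, p: float = 0.5) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


# Placeholder ids standing in for the 238 docs tagged "Education".
education_docs = [f"doc_{i:03d}" for i in range(238)]
picked = sample_for_inspection(education_docs)

# Chance of a 14+/20 split in one direction if the truth were 50/50.
p_14_or_more = binom_tail(20, 14)
print(len(picked), round(p_14_or_more, 4))
```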

9. Inventory edits flagged for session-end batch

Correction: stage2_metadata lives on stage_3_queue.stage2_metadata (jsonb), not on stage_2_queue as the inventory implied. The Phase 1 / stage2_worker.py entry should be corrected.
New finding: the char_length gate runs before the Mistral call (stage2_worker.py:139 precedes :147). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
New finding: ingest_conversations.py bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1 — a routing decision the architecture didn't explicitly request.
New finding (cross-link to #2): embeddings.type NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
New finding: Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: all 14 voice notes and all 39 dream outputs are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction — which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others — a routing decision the architecture didn't explicitly request.

10. Reproduction

cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json: experiments/frame_distribution_<date>.json
# view: stage2_frames_v (in pgvector DB)

The view is CREATE OR REPLACE, idempotent. Drop with DROP VIEW stage2_frames_v; if needed.
