ed2d090afc7a26b32c34b097fdf5210fca326085
Read-only inspection of the frame data Mistral produces in Stage 2, in service of Track 2 substrate design (Step 2.4 operation set spec). Artifacts: - New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata` (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured fields so worker-version drift is inspectable). - Analysis script: frequency, label-hygiene collisions, per-doc count, co-occurrence (top-K), file-type \u00d7 frame cross-tab, worker-version split, data-gap accounting, corpus-wide coverage. - JSON sidecar for diff-across-runs reproducibility. - Markdown report with explicit Track 2 viability section. Headline findings: - Frames cluster meaningfully on the framed-doc subset (subject to validation on larger samples for the file-type cross-tab). - Only 56% of corpus has frame coverage. 198 conversation sources bypass Stage 2 by design (`ingest_conversations.py` writes directly to embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate; 12 Stage 2 failures. - All 14 voice notes and all 39 dream outputs are in the data gap. Primary capture and self-reflection channels are silent to the frame system. Dreamer cannot frame-condition on its own output. - 54 normalized label collisions (`Professional Experience` vs `Professional_Experience`, etc.) — any router must normalize first. - "Education" is a near-universal frame (36% of frame-extracted docs); cheap 20-doc hand-inspection diagnostic in report \u00a78 to distinguish prompt artifact from corpus shape. - File-type \u00d7 frame stratification is concrete signal that ties to Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of rows. No production code touched. View is droppable; script is read-only.
Description
No description provided
Languages
Python
95.9%
HTML
3.7%
Shell
0.4%