From ed2d090afc7a26b32c34b097fdf5210fca326085 Mon Sep 17 00:00:00 2001
From: Aaron Nelson
Date: Sun, 3 May 2026 20:32:37 +0000
Subject: [PATCH] experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Read-only inspection of the frame data Mistral produces in Stage 2, in
service of Track 2 substrate design (Step 2.4 operation set spec).

Artifacts:
- New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata`
  (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured
  fields so worker-version drift is inspectable).
- Analysis script: frequency, label-hygiene collisions, per-doc count,
  co-occurrence (top-K), file-type × frame cross-tab, worker-version
  split, data-gap accounting, corpus-wide coverage.
- JSON sidecar for diff-across-runs reproducibility.
- Markdown report with explicit Track 2 viability section.

Headline findings:
- Frames cluster meaningfully on the framed-doc subset (subject to
  validation on larger samples for the file-type cross-tab).
- Only 56% of corpus has frame coverage. 198 conversation sources bypass
  Stage 2 by design (`ingest_conversations.py` writes directly to
  embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate;
  12 Stage 2 failures.
- All 14 voice notes and all 39 dream outputs are in the data gap.
  Primary capture and self-reflection channels are silent to the frame
  system. Dreamer cannot frame-condition on its own output.
- 54 normalized label collisions (`Professional Experience` vs
  `Professional_Experience`, etc.); any router must normalize first.
- "Education" is a near-universal frame (36% of frame-extracted docs);
  cheap 20-doc hand-inspection diagnostic in report §8 to distinguish
  prompt artifact from corpus shape.
- File-type × frame stratification is concrete signal that ties to
  Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of
  rows.

No production code touched. View is droppable; script is read-only.
---
 docs/stage2-frame-analysis-2026-05-03.md      | 175 ++++
 .../frame_distribution_2026-05-03.json        | 987 ++++++++++++++++++
 .../experiments/frame_distribution_report.py  | 296 ++++++
 3 files changed, 1458 insertions(+)
 create mode 100644 docs/stage2-frame-analysis-2026-05-03.md
 create mode 100644 experiments/frame_distribution_2026-05-03.json
 create mode 100644 scripts/experiments/frame_distribution_report.py

diff --git a/docs/stage2-frame-analysis-2026-05-03.md b/docs/stage2-frame-analysis-2026-05-03.md
new file mode 100644
index 0000000..4e1c576
--- /dev/null
+++ b/docs/stage2-frame-analysis-2026-05-03.md
@@ -0,0 +1,175 @@
+# Stage 2 Frame Analysis: 2026-05-03
+
+*Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).*
+
+**Data source:** `stage_3_queue.stage2_metadata` (jsonb), exposed via the new SQL view `stage2_frames_v`. Analysis script: `scripts/experiments/frame_distribution_report.py`. Sidecar JSON: `experiments/frame_distribution_2026-05-03.json`. **Stage 3 service is currently stopped, so this is a stable snapshot.**
+
+---
+
+## Verdict
+
+**Frames cluster meaningfully but coverage is partial.** Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate: the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. **However, only 56% of the embeddings corpus has any frame data at all.** The other 44% (conversations, short files, voice notes, dream outputs) has zero frame coverage by design, not by accident.
+
+Frame-conditional routing is a viable γ component candidate **for the document side of the corpus**. It is not a viable router for the conversational or self-generated side without filling the coverage hole.
+
+---
+
+## 1. Corpus-wide frame coverage
+
+| Class | Count | % of corpus | Frame coverage |
+|---|---|---|---|
+| Total distinct sources in `embeddings` | 1,255 | 100% | - |
+| Files with frames (`stage_3_queue.stage2_metadata`) | 704 | 56.1% | yes |
+| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | **none: bypass Stage 2 by design** |
+| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | **none: Mistral never invoked** |
+| Files that failed Stage 2 | 12 | 1.0% | none |
+
+**56.1% frame coverage** is the headline. The architectural reason for the gap is twofold:
+
+1. **`ingest_conversations.py` writes directly to `embeddings`** with `type='aaronai_conversation'` and never enqueues to `stage_2_queue`. Conversations have never been frame-extracted, full stop.
+2. **`stage2_worker.py:139` gates Mistral on char_length.** Docs <2,000 chars are marked complete with `completed_at = NOW()` *before* Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory), but neither is any frame data produced.
+
+## 2. Frame distribution (the docs that DO have frames)
+
+**668 docs, 1,374 distinct frame labels. Top-20 by count:**
+
+| Frame | Count | % of frame-extracted docs |
+|---|---|---|
+| Education | 238 | 35.6% |
+| Course | 58 | 8.7% |
+| Programming | 43 | 6.4% |
+| Design | 32 | 4.8% |
+| Professional Experience | 24 | 3.6% |
+| Employment | 24 | 3.6% |
+| Research | 23 | 3.4% |
+| 3D Printing | 22 | 3.3% |
+| Project, Grading, Art, Budget | 21 each | 3.1% |
+| Academic Integrity | 20 | 3.0% |
+| Teaching, Technology, Attendance, Application | 15–19 | - |
+| Accommodation, Manufacturing, Coursework, Recommendation | 10–13 | - |
+
+**Per-doc frame count:** median 3 frames per doc; 81% of docs have 3–5 frames; one outlier doc has 30 frames (Mistral over-segmented).
+
+**Long tail is enormous.** 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.
+
+**"Education" is the universal frame.** It dominates co-occurrence pairs (9 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."
+
+## 3. Label hygiene
+
+**54 normalized collisions** detected (case-insensitive, underscore-vs-space):
+
+| Concept | Variant counts |
+|---|---|
+| Professional Experience | `Professional Experience`:24 + `Professional_Experience`:6 |
+| 3D Printing | `3D Printing`:22 + `3D_Printing`:7 |
+| Academic Integrity | `Academic Integrity`:20 + `Academic_Integrity`:2 |
+| Course Design | `Course Design`:9 + `Course_Design`:1 |
+| Project Management | `Project Management`:7 + `Project_Management`:1 |
+| Computational Design | `Computational Design`:7 + `Computational_Design`:1 |
+| (… 48 more) | |
+
+Without normalization, ~30+ documents have their frames silently split across spelling variants for the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, single-space, hyphens preserved.
+
+## 4. Worker version drift
+
+| Worker version | Doc count | Notes |
+|---|---|---|
+| v2.1 | 665 | Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
+| v2.0 | 3 | Same key shape as v2.1 baseline. |
+
+Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. **For Track 2 substrate ingest, plan for `stage2_metadata` to occasionally include unexpected top-level keys.**
+
+## 5. File-type signal
+
+This is the most useful Track 2 finding from this report.
+
+`stage_3_queue.source` stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:
+
+| Frame | pdf | docx | pptx | markdown | txt | dream |
+|---|---|---|---|---|---|---|
+| Education | 116 | 119 | 3 | - | - | - |
+| Course | 29 | 29 | - | - | - | - |
+| Programming | 12 | 10 | **15** | - | 6 | - |
+| Application | **13** | 2 | - | - | - | - |
+| 3D Printing | 11 | 3 | **8** | - | - | - |
+| Manufacturing | 3 | 6 | 4 | - | - | - |
+| Research | 9 | 13 | - | 1 | - | - |
+
+**Concrete signal:** "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal (the watcher already knows it) and it disambiguates frames that the model treats as equivalent. **`embeddings.type` is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.**
+
+## 6. Systematic exclusions inside the 339-doc gap
+
+Of the 339 short docs that bypass frame extraction, the breakdown by file type:
+
+| Type | Count | What this is |
+|---|---|---|
+| pdf | 110 | Short PDFs (forms, single-page docs) |
+| docx | 110 | Short Word docs |
+| dream_output | 39 | **The dreamer's own NREM/Early-REM/Late-REM/synthesis files** |
+| pptx | 31 | Short slide decks |
+| txt | 28 | Plain-text files |
+| voice_note | 14 | **Every voice note in the corpus** |
+| markdown | 7 | Short markdown |
+
+**Two specific systematic exclusions worth naming separately:**
+
+- **All 14 voice notes have no frames.** Voice is one of Aaron's primary capture channels. The frame system is silent on it.
+- **All 39 dream outputs have no frames.** The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.
+
+These are NREM-shape findings: the architecture's frame extraction is *quietly* not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.
+
+---
+
+## 7. Would frame-conditional routing be a viable γ component, and what would it condition on?
+
+**Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification.** The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:
+
+1. **Normalize labels before any routing decision.** 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
+2. **Treat "Education" as a near-universal prior, not a frame.** It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the *base case* and condition on the secondary frame. (See the §8 follow-up: the dominance may be a Mistral prompt artifact rather than a corpus shape, and a cheap diagnostic is available.)
+3. **Combine frames with file type, not frames alone.** Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive (Programming → pptx, n=15; Application → pdf, n=13), but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.
+
+**What it would condition on:** the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute `P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length)` rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.
+
+**Defined scope (the coverage caveat):**
+
+The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:
+
+- **(a) Backfill frames for short docs and conversations.** Run Mistral on the 339 short docs (cheap, since they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
+- **(b) Use a degraded fallback for unframed docs.** File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
+- **(c) Accept the gap as a scope limit.** The router only operates on long, non-conversation files. The 44% gap is unrouted (whatever the current default is).
+
+(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). Mistral cost on these 537 sources (339 short docs plus 198 conversations) is small. **Recommend (a) before any router work begins.**
+
+---
+
+## 8. Recommended follow-ups (ordered by ROI)
+
+1. **Backfill the 339 short docs.** Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities: primary capture and primary self-reflection channels currently silent.
+2. **Backfill conversations into frame extraction.** Either modify `ingest_conversations.py` to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
+3. **Add a frame-label normalizer at the worker.** New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
+4. **Decide whether to deprecate "Education" as a frame.** It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
+5. **Per-frame retrieval-similarity follow-up (deferred from this report).** Now that we know frames cluster meaningfully, instrumenting `dream.py` to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.
+
+6. **Diagnose the "Education" dominance: prompt artifact vs. corpus shape.** Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education", classify as *truly academic content* vs. *Education was a default Mistral reached for*. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. 20-doc sample is small enough to do in one sitting, large enough to distinguish the hypotheses at >70/30 splits.
+
+---
+
+## 9. Inventory edits flagged for session-end batch
+
+- **Correction:** `stage2_metadata` lives on `stage_3_queue.stage2_metadata` (jsonb), not on `stage_2_queue` as the inventory implied. The Phase 1 / `stage2_worker.py` entry should be corrected.
+- **New finding:** the char_length gate runs *before* the Mistral call (`stage2_worker.py:139` precedes `:147`). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
+- **New finding:** `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1: a routing decision the architecture didn't explicitly request.
+- **New finding (cross-link to #2):** `embeddings.type` NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
+- **New finding:** Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: **all 14 voice notes** and **all 39 dream outputs** are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction, which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others: a routing decision the architecture didn't explicitly request.
+
+## 10. Reproduction
+
+```bash
+cd ~/aaronai
+venv/bin/python3 scripts/experiments/frame_distribution_report.py
+# stdout: human-readable report
+# json: experiments/frame_distribution_<YYYY-MM-DD>.json
+# view: stage2_frames_v (in pgvector DB)
+```
+
+The view is `CREATE OR REPLACE`, idempotent. Drop with `DROP VIEW stage2_frames_v;` if needed.
diff --git a/experiments/frame_distribution_2026-05-03.json b/experiments/frame_distribution_2026-05-03.json
new file mode 100644
index 0000000..430c807
--- /dev/null
+++ b/experiments/frame_distribution_2026-05-03.json
@@ -0,0 +1,987 @@
+{ + "generated_at": "2026-05-03T20:21:33.558462", + "n_docs_with_frames": 668, + "n_distinct_labels": 1374, + "top_30_frames": [ + [ + "Education", + 238 + ], + [ + "Course", + 58 + ], + [ + "Programming", + 43 + ], + [ + "Design", + 32 + ], + [ + "Professional Experience", + 24 + ], + [ + "Employment", + 24 + ], + [ + "Research", + 23 + ], + [ + "3D Printing", + 22 + ], + [ + "Project", + 21 + ], + [ + "Grading", + 21 + ], + [ + "Art", + 21 + ], + [ + "Budget", + 21 + ], + [ + "Academic Integrity", + 20 + ], + [ + "Teaching", + 19 + ], + [ + "Technology", + 18 + ], + [ + "Attendance", + 17 + ], + [ + "Application", + 15 + ], + [ + "Accommodation", + 13 + ], + [ + "Manufacturing", + 13 + ], + [ + "Coursework", + 11 + ], + [ + "Recommendation", + 10 + ], + [ + "Manufacturing Process", + 10 + ], + [ + "Additive Manufacturing", + 10 + ], + [ + "Job Application", + 10 + ], + [ + 
"Exhibitions", + 10 + ], + [ + "Academic Administration", + 9 + ], + [ + "Communication", + 9 + ], + [ + "Course Design", + 9 + ], + [ + "Veteran and Military Services", + 9 + ], + [ + "Career", + 9 + ] + ], + "label_collisions": { + "conversational": [ + [ + "Conversational", + 1 + ], + [ + "conversational", + 1 + ] + ], + "content": [ + [ + "Content", + 1 + ], + [ + "content", + 1 + ] + ], + "cascade": [ + [ + "Cascade", + 1 + ], + [ + "cascade", + 1 + ] + ], + "education": [ + [ + "Education", + 238 + ], + [ + "education", + 1 + ] + ], + "academic record": [ + [ + "Academic_Record", + 1 + ], + [ + "Academic Record", + 1 + ] + ], + "independent study": [ + [ + "Independent Study", + 5 + ], + [ + "Independent_Study", + 2 + ] + ], + "project management": [ + [ + "Project Management", + 7 + ], + [ + "Project_Management", + 1 + ] + ], + "digital fabrication": [ + [ + "Digital Fabrication", + 6 + ], + [ + "digital_fabrication", + 1 + ], + [ + "digital fabrication", + 1 + ] + ], + "project proposal": [ + [ + "Project_Proposal", + 2 + ], + [ + "Project Proposal", + 2 + ] + ], + "academic integrity": [ + [ + "Academic Integrity", + 20 + ], + [ + "Academic_Integrity", + 2 + ] + ], + "3d printing": [ + [ + "3D Printing", + 22 + ], + [ + "3D_Printing", + 7 + ] + ], + "technical skills": [ + [ + "Technical Skills", + 2 + ], + [ + "Technical_Skills", + 1 + ] + ], + "course structure": [ + [ + "Course Structure", + 7 + ], + [ + "Course_Structure", + 1 + ] + ], + "course design": [ + [ + "Course Design", + 9 + ], + [ + "Course_Design", + 1 + ] + ], + "product design": [ + [ + "Product Design", + 6 + ], + [ + "Product_Design", + 1 + ] + ], + "professional experience": [ + [ + "Professional Experience", + 24 + ], + [ + "Professional_Experience", + 6 + ] + ], + "disability accommodations": [ + [ + "Disability Accommodations", + 4 + ], + [ + "Disability_Accommodations", + 1 + ] + ], + "material science": [ + [ + "Material_Science", + 2 + ], + [ + "Material Science", + 4 + ] + ], + 
"computational design": [ + [ + "Computational Design", + 7 + ], + [ + "Computational_Design", + 1 + ] + ], + "computer services policy": [ + [ + "Computer Services Policy", + 6 + ], + [ + "Computer_Services_Policy", + 1 + ] + ], + "work experience": [ + [ + "Work_Experience", + 1 + ], + [ + "Work Experience", + 3 + ] + ], + "academic program": [ + [ + "Academic Program", + 7 + ], + [ + "Academic_Program", + 1 + ] + ], + "project-based learning": [ + [ + "Project-Based Learning", + 5 + ], + [ + "Project-Based_Learning", + 1 + ], + [ + "Project-based Learning", + 2 + ] + ], + "art and design": [ + [ + "Art and Design", + 6 + ], + [ + "Art_and_Design", + 1 + ] + ], + "fdm technology": [ + [ + "FDM_Technology", + 2 + ], + [ + "FDM Technology", + 1 + ] + ], + "material selection": [ + [ + "Material_Selection", + 1 + ], + [ + "Material Selection", + 1 + ] + ], + "product development": [ + [ + "Product Development", + 6 + ], + [ + "Product_Development", + 2 + ] + ], + "market research": [ + [ + "Market_Research", + 1 + ], + [ + "Market Research", + 2 + ] + ], + "computer services": [ + [ + "Computer Services", + 2 + ], + [ + "Computer_Services", + 1 + ] + ], + "student evaluation of instruction": [ + [ + "Student Evaluation of Instruction", + 1 + ], + [ + "Student_Evaluation_of_Instruction", + 1 + ] + ], + "course management": [ + [ + "Course_Management", + 1 + ], + [ + "Course Management", + 1 + ] + ], + "grade policy": [ + [ + "Grade_Policy", + 1 + ], + [ + "Grade Policy", + 1 + ] + ], + "academic transcript": [ + [ + "Academic_Transcript", + 1 + ], + [ + "Academic Transcript", + 1 + ] + ], + "evaluation criteria": [ + [ + "Evaluation Criteria", + 1 + ], + [ + "Evaluation_Criteria", + 1 + ] + ], + "computer science": [ + [ + "Computer Science", + 2 + ], + [ + "Computer_Science", + 1 + ] + ], + "electrical circuit": [ + [ + "Electrical Circuit", + 2 + ], + [ + "Electrical_Circuit", + 1 + ] + ], + "digital logic": [ + [ + "Digital Logic", + 1 + ], + [ + "Digital_Logic", 
+ 1 + ] + ], + "course description": [ + [ + "Course Description", + 3 + ], + [ + "Course_Description", + 1 + ] + ], + "organizational structure": [ + [ + "Organizational_Structure", + 1 + ], + [ + "Organizational Structure", + 1 + ] + ], + "digital design": [ + [ + "Digital_Design", + 1 + ], + [ + "Digital Design", + 4 + ] + ], + "contact information": [ + [ + "Contact Information", + 2 + ], + [ + "Contact_Information", + 1 + ] + ], + "professional career": [ + [ + "Professional_Career", + 2 + ], + [ + "Professional Career", + 1 + ] + ], + "personal projects": [ + [ + "Personal_Projects", + 1 + ], + [ + "Personal Projects", + 2 + ] + ], + "ai development": [ + [ + "AI_Development", + 1 + ], + [ + "AI Development", + 1 + ] + ], + "university service": [ + [ + "University Service", + 2 + ], + [ + "University_Service", + 1 + ] + ], + "professional exhibitions and publications": [ + [ + "Professional Exhibitions and Publications", + 1 + ], + [ + "Professional_Exhibitions_and_Publications", + 1 + ] + ], + "selected external consulting and design work": [ + [ + "Selected External Consulting and Design Work", + 1 + ], + [ + "Selected_External_Consulting_and_Design_Work", + 2 + ] + ], + "academic career": [ + [ + "Academic_Career", + 1 + ], + [ + "Academic Career", + 2 + ] + ], + "technology integration": [ + [ + "Technology Integration", + 2 + ], + [ + "Technology_Integration", + 1 + ] + ], + "artistic practice": [ + [ + "Artistic_Practice", + 1 + ], + [ + "Artistic Practice", + 1 + ] + ], + "multi-material 3d printing": [ + [ + "Multi-Material 3D Printing", + 1 + ], + [ + "Multi-material 3D Printing", + 1 + ] + ], + "community engagement": [ + [ + "Community Engagement", + 3 + ], + [ + "Community_Engagement", + 1 + ] + ], + "digitaldesignandfabrication": [ + [ + "DigitalDesignAndFabrication", + 1 + ], + [ + "DigitalDesignandFabrication", + 1 + ] + ], + "professional background": [ + [ + "Professional Background", + 3 + ], + [ + "Professional_Background", + 1 + ] + ] + 
}, + "per_doc_frame_count": { + "3": 282, + "5": 67, + "4": 195, + "2": 57, + "7": 13, + "11": 5, + "13": 2, + "15": 1, + "12": 4, + "6": 21, + "8": 8, + "10": 4, + "9": 6, + "30": 1, + "14": 1, + "18": 1 + }, + "top_30_pairs": [ + { + "a": "Course", + "b": "Education", + "count": 46 + }, + { + "a": "Education", + "b": "Project", + "count": 20 + }, + { + "a": "Design", + "b": "Education", + "count": 20 + }, + { + "a": "Education", + "b": "Professional Experience", + "count": 20 + }, + { + "a": "Education", + "b": "Employment", + "count": 20 + }, + { + "a": "Education", + "b": "Technology", + "count": 18 + }, + { + "a": "Education", + "b": "Grading", + "count": 17 + }, + { + "a": "Education", + "b": "Research", + "count": 15 + }, + { + "a": "Art", + "b": "Education", + "count": 15 + }, + { + "a": "Attendance", + "b": "Grading", + "count": 14 + }, + { + "a": "Course", + "b": "Grading", + "count": 13 + }, + { + "a": "Academic Integrity", + "b": "Education", + "count": 11 + }, + { + "a": "Attendance", + "b": "Education", + "count": 11 + }, + { + "a": "Attendance", + "b": "Course", + "count": 11 + }, + { + "a": "Application", + "b": "Employment", + "count": 11 + }, + { + "a": "Coursework", + "b": "Education", + "count": 10 + }, + { + "a": "Course", + "b": "Design", + "count": 10 + }, + { + "a": "Course", + "b": "Programming", + "count": 10 + }, + { + "a": "Application", + "b": "Education", + "count": 10 + }, + { + "a": "Budget", + "b": "Education", + "count": 10 + }, + { + "a": "Academic Integrity", + "b": "Accommodation", + "count": 9 + }, + { + "a": "Education", + "b": "Teaching", + "count": 9 + }, + { + "a": "Education", + "b": "Programming", + "count": 9 + }, + { + "a": "Academic Integrity", + "b": "Attendance", + "count": 9 + }, + { + "a": "Course", + "b": "Project", + "count": 8 + }, + { + "a": "Research", + "b": "Teaching", + "count": 8 + }, + { + "a": "Grading", + "b": "Project", + "count": 7 + }, + { + "a": "Art", + "b": "Technology", + "count": 7 + }, + { + 
"a": "Academic Integrity", + "b": "Course", + "count": 7 + }, + { + "a": "Accommodation", + "b": "Course", + "count": 7 + } + ], + "folder_crosstab": { + "Education": { + "pdf": 116, + "docx": 119, + "pptx": 3 + }, + "Course": { + "pdf": 29, + "docx": 29 + }, + "Programming": { + "pptx": 15, + "docx": 10, + "pdf": 12, + "txt": 6 + }, + "Design": { + "pdf": 13, + "docx": 16, + "pptx": 3 + }, + "Professional Experience": { + "docx": 13, + "pdf": 11 + }, + "Employment": { + "pdf": 15, + "docx": 9 + }, + "Research": { + "pdf": 9, + "docx": 13, + "markdown": 1 + }, + "3D Printing": { + "docx": 3, + "pdf": 11, + "pptx": 8 + }, + "Project": { + "pdf": 8, + "docx": 12, + "markdown": 1 + }, + "Grading": { + "pdf": 10, + "docx": 11 + }, + "Art": { + "docx": 11, + "pdf": 9, + "pptx": 1 + }, + "Budget": { + "docx": 6, + "pdf": 15 + }, + "Academic Integrity": { + "docx": 17, + "pdf": 3 + }, + "Teaching": { + "pdf": 9, + "docx": 10 + }, + "Technology": { + "docx": 15, + "pdf": 3 + }, + "Attendance": { + "docx": 11, + "pdf": 6 + }, + "Application": { + "pdf": 13, + "docx": 2 + }, + "Accommodation": { + "docx": 11, + "pdf": 2 + }, + "Manufacturing": { + "docx": 6, + "pptx": 4, + "pdf": 3 + }, + "Coursework": { + "pdf": 8, + "docx": 3 + } + }, + "bin_totals": { + "markdown": 64, + "pdf": 286, + "pptx": 70, + "txt": 28, + "docx": 217, + "dream_output": 3 + }, + "worker_versions": { + "2.0": 3, + "2.1": 665 + }, + "data_gap": { + "count": 339, + "by_type_bin": { + "pdf": 110, + "voice_note": 14, + "docx": 110, + "dream_output": 39, + "pptx": 31, + "txt": 28, + "markdown": 7 + }, + "char_length": { + "min": 6, + "max": 1998, + "median": 1077 + }, + "sample_sources": [ + "Thesis Paper Guidlines.pdf", + "2026-04-30-17-06-voice.md", + "2026-04-30-15-59-voice.md", + "2026-04-30-16-53-voice.md", + "2026-04-30-16-23-voice.md", + "2026-04-29-17-52-voice.md", + "2026-04-30-16-59-voice.md", + "Outline for 3D Printed Materials for Foundry Casting.docx", + "2026-04-26-22-52-voice.md", + 
"2026-04-30-synthesis.md" + ] + }, + "corpus_coverage": { + "total_distinct_sources_in_embeddings": 1255, + "conversations_no_frames_by_design": 198, + "files_with_frames": 704, + "files_short_no_frames": 339, + "files_stage2_failed": 12, + "frame_coverage_pct": 56.1 + } +}
\ No newline at end of file
diff --git a/scripts/experiments/frame_distribution_report.py b/scripts/experiments/frame_distribution_report.py
new file mode 100644
index 0000000..435576d
--- /dev/null
+++ b/scripts/experiments/frame_distribution_report.py
@@ -0,0 +1,296 @@
+"""Read-only analysis of Stage 2 frame data via stage2_frames_v.
+
+Produces eight sections (frequency, hygiene, per-doc count, co-occurrence,
+folder cross-tab, worker-version split, data-gap accounting, corpus coverage)
+and writes a JSON sidecar for diffing across runs.
+
+Usage: venv/bin/python3 scripts/experiments/frame_distribution_report.py
+"""
+import os
+import json
+import re
+import sys
+from collections import Counter, defaultdict
+from datetime import datetime
+from pathlib import Path
+
+import psycopg2
+from dotenv import load_dotenv
+
+load_dotenv()
+
+OUT_PATH = Path.home() / "aaronai" / "experiments" / f"frame_distribution_{datetime.now().strftime('%Y-%m-%d')}.json"
+TOP_K = 20  # for co-occurrence; revisit after seeing the long tail
+
+
+def normalize(label):
+    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))
+
+
+def folder_bin(source):
+    """Classify source by type. stage_3_queue stores bare filenames, so we
+    bin by what kind of file it is, not where it lives in the tree."""
+    if not source:
+        return "unknown"
+    if re.match(r"^(Claude|ChatGPT|Aaron AI):", source):
+        return "conversation"  # bypasses Stage 2/3, will not appear here
+    s = source.lower()
+    if re.search(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-voice\.md$", s):
+        return "voice_note"
+    if re.search(r"\d{4}-\d{2}-\d{2}-(nrem|early-rem|late-rem|synthesis|lucid)", s):
+        return "dream_output"
+    if s.endswith(".md"):
+        return "markdown"
+    if s.endswith(".pdf"):
+        return "pdf"
+    if s.endswith(".docx") or s.endswith(".doc"):
+        return "docx"
+    if s.endswith(".pptx") or s.endswith(".ppt"):
+        return "pptx"
+    if s.endswith(".txt"):
+        return "txt"
+    return "other"
+
+
+def fetch_rows(cur):
+    cur.execute("""
+        SELECT source, char_length, active_frames, worker_version, raw_metadata
+        FROM stage2_frames_v
+    """)
+    rows = []
+    for source, char_length, frames, worker_version, raw in cur.fetchall():
+        if not isinstance(frames, list):
+            continue
+        rows.append({
+            "source": source,
+            "char_length": char_length,
+            "frames": [str(f) for f in frames if f],
+            "worker_version": worker_version,
+            "raw_keys": sorted(raw.keys()) if isinstance(raw, dict) else [],
+        })
+    return rows
+
+
+def section_frequency(rows):
+    counter = Counter()
+    for r in rows:
+        for f in r["frames"]:
+            counter[f] += 1
+    return counter
+
+
+def section_hygiene(frequency):
+    """Group raw labels by normalized form; flag collisions."""
+    groups = defaultdict(list)
+    for raw, count in frequency.items():
+        groups[normalize(raw)].append((raw, count))
+    collisions = {k: v for k, v in groups.items() if len(v) > 1}
+    return collisions
+
+
+def section_per_doc_count(rows):
+    counts = Counter(len(r["frames"]) for r in rows)
+    return counts
+
+
+def section_cooccurrence(rows, top_frames):
+    top_set = set(top_frames)
+    pair_counts = Counter()
+    for r in rows:
+        present = [f for f in r["frames"] if f in top_set]
+        for i in range(len(present)):
+            for j in range(i + 1, len(present)):
+                a, b = sorted([present[i], present[j]])
+                pair_counts[(a, b)] += 1
+    return pair_counts
+
+
+def section_folder_crosstab(rows, top_frames):
+    top_set = set(top_frames)
+    table = defaultdict(Counter)  # frame -> bin -> count
+    bin_totals = Counter()
+    for r in rows:
+        b = folder_bin(r["source"])
+        bin_totals[b] += 1
+        for f in r["frames"]:
+            if f in top_set:
+                table[f][b] += 1
+    return table, bin_totals
+
+
+def section_worker_versions(rows):
+    counter = Counter(r["worker_version"] or "unknown" for r in rows)
+    raw_keys_by_version = defaultdict(Counter)
+    for r in rows:
+        v = r["worker_version"] or "unknown"
+        raw_keys_by_version[v][tuple(r["raw_keys"])] += 1
+    return counter, raw_keys_by_version
+
+
+def section_data_gap(cur):
+    """Docs that completed Stage 2 but never had frames extracted (<2000 chars)."""
+    cur.execute("""
+        SELECT source, char_length
+        FROM stage_2_queue
+        WHERE completed_at IS NOT NULL AND char_length < 2000
+    """)
+    missing = cur.fetchall()
+    by_bin = Counter(folder_bin(s) for s, _ in missing)
+    char_lengths = [c for _, c in missing]
+    return {
+        "count": len(missing),
+        "by_type_bin": dict(by_bin),
+        "char_length": {
+            "min": min(char_lengths) if char_lengths else None,
+            "max": max(char_lengths) if char_lengths else None,
+            "median": sorted(char_lengths)[len(char_lengths) // 2] if char_lengths else None,
+        },
+        "sample_sources": [s for s, _ in missing[:10]],
+    }
+
+
+def section_corpus_coverage(cur):
+    """How much of the embeddings corpus has frame coverage?"""
+    cur.execute("SELECT count(DISTINCT source) FROM embeddings")
+    total = cur.fetchone()[0]
+    cur.execute("""
+        SELECT count(DISTINCT source) FROM embeddings
+        WHERE source LIKE 'Claude:%' OR source LIKE 'ChatGPT:%'
+           OR source LIKE 'Aaron AI:%' OR type='aaronai_conversation'
+    """)
+    conversations = cur.fetchone()[0]
+    cur.execute("SELECT count(DISTINCT source) FROM stage_3_queue WHERE stage2_metadata IS NOT NULL")
+    with_frames = cur.fetchone()[0]
+    cur.execute("""
+        SELECT count(DISTINCT source) FROM stage_2_queue
+        WHERE completed_at IS NOT NULL AND char_length < 2000
+    """)
+    short_no_frames = cur.fetchone()[0]
+    cur.execute("""
+        SELECT count(DISTINCT source) FROM stage_2_queue
+        WHERE failed_at IS NOT NULL
+    """)
+    failed = cur.fetchone()[0]
+    return {
+        "total_distinct_sources_in_embeddings": total,
+        "conversations_no_frames_by_design": conversations,
+        "files_with_frames": with_frames,
+        "files_short_no_frames": short_no_frames,
+        "files_stage2_failed": failed,
+        "frame_coverage_pct": round(100.0 * with_frames / max(total, 1), 1),
+    }
+
+
+def main():
+    conn = psycopg2.connect(os.environ["PG_DSN"])
+    cur = conn.cursor()
+
+    rows = fetch_rows(cur)
+    n_docs = len(rows)
+    print(f"=== Stage 2 frame distribution report ({n_docs} docs) ===\n")
+
+    # 1. Frequency
+    freq = section_frequency(rows)
+    print(f"--- 1. Frame frequency ({len(freq)} distinct labels) ---")
+    for label, count in freq.most_common(30):
+        print(f" {count:5d} {label}")
+    print()
+
+    # 2. Hygiene
+    collisions = section_hygiene(freq)
+    print(f"--- 2. Label hygiene (normalized collisions: {len(collisions)}) ---")
+    for norm, variants in sorted(collisions.items(), key=lambda kv: -sum(c for _, c in kv[1])):
+        variant_str = ", ".join(f"{r!r}:{c}" for r, c in sorted(variants, key=lambda x: -x[1]))
+        print(f" '{norm}': {variant_str}")
+    print()
+
+    # 3. Per-doc frame count
+    per_doc = section_per_doc_count(rows)
+    print("--- 3. Per-doc frame count ---")
+    for n in sorted(per_doc):
+        print(f" {n} frames: {per_doc[n]} docs")
+    print()
+
+    # 4. Co-occurrence (top-K)
+    top_frames = [f for f, _ in freq.most_common(TOP_K)]
+    pairs = section_cooccurrence(rows, top_frames)
+    print(f"--- 4. Co-occurrence (top-{TOP_K} frames, top-30 pairs) ---")
+    for (a, b), count in pairs.most_common(30):
+        print(f" {count:4d} {a} × {b}")
+    print()
+
+    # 5. Folder cross-tab
+    crosstab, bin_totals = section_folder_crosstab(rows, top_frames)
+    print(f"--- 5. Frame × folder cross-tab (top-{TOP_K} frames) ---")
+    bins_sorted = [b for b, _ in bin_totals.most_common()]
+    print(f" bins (with totals): " + ", ".join(f"{b}({n})" for b, n in bin_totals.most_common(10)))
+    for f in top_frames:
+        row_data = crosstab[f]
+        if not row_data:
+            continue
+        cells = ", ".join(f"{b}={c}" for b, c in row_data.most_common(5))
+        print(f" {f}: {cells}")
+    print()
+
+    # 6. Worker versions
+    versions, keys_by_version = section_worker_versions(rows)
+    print("--- 6. Worker version split ---")
+    for v, count in versions.most_common():
+        print(f" v{v}: {count} docs")
+        top_shapes = keys_by_version[v].most_common(3)
+        for keys, kcount in top_shapes:
+            print(f" {kcount} docs with keys={list(keys)}")
+    print()
+
+    # 7. Data gap
+    gap = section_data_gap(cur)
+    print("--- 7. Data-gap accounting (Stage 2 docs <2000 chars; never frame-extracted) ---")
+    print(f" count: {gap['count']}")
+    print(f" char_length: min={gap['char_length']['min']}, median={gap['char_length']['median']}, max={gap['char_length']['max']}")
+    print(f" by type bin: {gap['by_type_bin']}")
+    print(f" sample sources: {gap['sample_sources']}")
+    print()
+
+    # 8. Corpus coverage
+    coverage = section_corpus_coverage(cur)
+    print("--- 8. 
Corpus-wide frame coverage ---") + print(f" total distinct sources in embeddings: {coverage['total_distinct_sources_in_embeddings']}") + print(f" conversations (no frames by design): {coverage['conversations_no_frames_by_design']}") + print(f" files with frames: {coverage['files_with_frames']}") + print(f" files short, no frames: {coverage['files_short_no_frames']}") + print(f" files Stage 2 failed: {coverage['files_stage2_failed']}") + print(f" frame coverage: {coverage['frame_coverage_pct']}% of corpus") + print() + + # JSON sidecar + OUT_PATH.parent.mkdir(parents=True, exist_ok=True) + sidecar = { + "generated_at": datetime.now().isoformat(), + "n_docs_with_frames": n_docs, + "n_distinct_labels": len(freq), + "top_30_frames": freq.most_common(30), + "label_collisions": { + k: [(r, c) for r, c in v] for k, v in collisions.items() + }, + "per_doc_frame_count": dict(per_doc), + "top_30_pairs": [ + {"a": a, "b": b, "count": c} + for (a, b), c in pairs.most_common(30) + ], + "folder_crosstab": { + f: dict(crosstab[f]) for f in top_frames if crosstab[f] + }, + "bin_totals": dict(bin_totals), + "worker_versions": dict(versions), + "data_gap": gap, + "corpus_coverage": coverage, + } + OUT_PATH.write_text(json.dumps(sidecar, indent=2, default=str)) + print(f"JSON sidecar written: {OUT_PATH}") + + cur.close() + conn.close() + + +if __name__ == "__main__": + main()