experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3)
Read-only inspection of the frame data Mistral produces in Stage 2, in service of Track 2 substrate design (Step 2.4 operation set spec).

Artifacts:
- New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata` (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured fields so worker-version drift is inspectable).
- Analysis script: frequency, label-hygiene collisions, per-doc count, co-occurrence (top-K), file-type × frame cross-tab, worker-version split, data-gap accounting, corpus-wide coverage.
- JSON sidecar for diff-across-runs reproducibility.
- Markdown report with explicit Track 2 viability section.

Headline findings:
- Frames cluster meaningfully on the framed-doc subset (subject to validation on larger samples for the file-type cross-tab).
- Only 56% of corpus has frame coverage. 198 conversation sources bypass Stage 2 by design (`ingest_conversations.py` writes directly to embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate; 12 Stage 2 failures.
- All 14 voice notes and all 39 dream outputs are in the data gap. Primary capture and self-reflection channels are silent to the frame system. Dreamer cannot frame-condition on its own output.
- 54 normalized label collisions (`Professional Experience` vs `Professional_Experience`, etc.); any router must normalize first.
- "Education" is a near-universal frame (36% of frame-extracted docs); cheap 20-doc hand-inspection diagnostic in report §8 to distinguish prompt artifact from corpus shape.
- File-type × frame stratification is concrete signal that ties to Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of rows.

No production code touched. View is droppable; script is read-only.
@@ -0,0 +1,175 @@
# Stage 2 Frame Analysis: 2026-05-03

*Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).*

**Data source:** `stage_3_queue.stage2_metadata` (jsonb), exposed via the new SQL view `stage2_frames_v`. Analysis script: `scripts/experiments/frame_distribution_report.py`. Sidecar JSON: `experiments/frame_distribution_2026-05-03.json`. **Stage 3 service is currently stopped, so this is a stable snapshot.**

---
## Verdict

**Frames cluster meaningfully but coverage is partial.** Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate: the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. **However, only 56% of the embeddings corpus has any frame data at all.** The other 44% (conversations, short files including every voice note and dream output, and a handful of Stage 2 failures) has zero frame coverage, almost all of it by design rather than by accident.

Frame-conditional routing is a viable γ component candidate **for the document side of the corpus**. It is not a viable router for the conversational or self-generated side without filling the coverage hole.

---
## 1. Corpus-wide frame coverage

| Class | Count | % of corpus | Frame coverage |
|---|---|---|---|
| Total distinct sources in `embeddings` | 1,255 | 100% | n/a |
| Files with frames (`stage_3_queue.stage2_metadata`) | 704 | 56.1% | yes |
| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | **none: bypass Stage 2 by design** |
| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | **none: Mistral never invoked** |
| Files that failed Stage 2 | 12 | 1.0% | none |

**56.1% frame coverage** is the headline. The architectural reason for the gap is twofold:

1. **`ingest_conversations.py` writes directly to `embeddings`** with `type='aaronai_conversation'` and never enqueues to `stage_2_queue`. Conversations have never been frame-extracted, full stop.
2. **`stage2_worker.py:139` gates Mistral on char_length.** Docs <2,000 chars are marked complete with `completed_at = NOW()` *before* Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory), but neither is any frame data produced.

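The percentages in the coverage table can be re-derived from the raw counts. The sketch below does that and also shows that the four listed classes sum to 1,253, two short of the 1,255 total; the report does not say where those two sources fall. The variable and function names are illustrative and are not taken from `frame_distribution_report.py`.

```python
# Counts copied from the coverage table in section 1.
TOTAL_SOURCES = 1255
classes = {
    "files_with_frames": 704,
    "conversations": 198,
    "short_files": 339,
    "stage2_failures": 12,
}


def coverage_shares(class_counts, total):
    """Return each class's share of the corpus as a percentage (one decimal)."""
    return {name: round(100 * count / total, 1) for name, count in class_counts.items()}


shares = coverage_shares(classes, TOTAL_SOURCES)
accounted = sum(classes.values())
unaccounted = TOTAL_SOURCES - accounted

for name, pct in shares.items():
    print(f"{name:20s} {classes[name]:5d} {pct:5.1f}%")
print(f"accounted: {accounted}, unaccounted: {unaccounted}")
```

Keeping the residual as an explicit number makes it easy to notice when a future run's classes stop partitioning the corpus.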
## 2. Frame distribution (the docs that DO have frames)

**668 docs, 1,374 distinct frame labels. Top 21 by count:**

| Frame | Count | % of frame-extracted docs |
|---|---|---|
| Education | 238 | 35.6% |
| Course | 58 | 8.7% |
| Programming | 43 | 6.4% |
| Design | 32 | 4.8% |
| Professional Experience | 24 | 3.6% |
| Employment | 24 | 3.6% |
| Research | 23 | 3.4% |
| 3D Printing | 22 | 3.3% |
| Project, Grading, Art, Budget | 21 each | 3.1% each |
| Academic Integrity | 20 | 3.0% |
| Teaching, Technology, Attendance, Application | 15–19 | 2.2–2.8% |
| Accommodation, Manufacturing, Coursework, Recommendation | 10–13 | 1.5–1.9% |

**Per-doc frame count:** median 3 frames per doc; 81% of docs (544 of 668, per the sidecar histogram) have 3–5 frames; one outlier doc has 30 frames (Mistral over-segmented).

**Long tail is enormous.** 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.

**"Education" is the universal frame.** It dominates co-occurrence pairs (9 of the top-10 pairs in the sidecar include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."

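The per-doc summary statistics can be recomputed directly from the `per_doc_frame_count` histogram in the JSON sidecar. This is a minimal sketch using only the standard library; the helper name `expand` is mine, and the histogram values are copied from the sidecar in this commit.

```python
from statistics import median

# "per_doc_frame_count" from the JSON sidecar:
# key = number of frames on a doc, value = number of docs with that count.
per_doc_frame_count = {
    "3": 282, "5": 67, "4": 195, "2": 57, "7": 13, "11": 5, "13": 2, "15": 1,
    "12": 4, "6": 21, "8": 8, "10": 4, "9": 6, "30": 1, "14": 1, "18": 1,
}


def expand(hist):
    """Expand a {frame_count: n_docs} histogram into one value per doc."""
    values = []
    for frames, n_docs in hist.items():
        values.extend([int(frames)] * n_docs)
    return values


values = expand(per_doc_frame_count)
n_docs = len(values)
median_frames = median(values)
share_3_to_5 = sum(1 for v in values if 3 <= v <= 5) / n_docs

print(n_docs, median_frames, f"{share_3_to_5:.1%}", max(values))
```

Recomputing from the sidecar rather than from the prose keeps the report and the data in step across runs.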
## 3. Label hygiene

**54 normalized collisions** detected (case-insensitive, underscore-vs-space):

| Concept | Variant counts |
|---|---|
| Professional Experience | `Professional Experience`:24 + `Professional_Experience`:6 |
| 3D Printing | `3D Printing`:22 + `3D_Printing`:7 |
| Academic Integrity | `Academic Integrity`:20 + `Academic_Integrity`:2 |
| Course Design | `Course Design`:9 + `Course_Design`:1 |
| Project Management | `Project Management`:7 + `Project_Management`:1 |
| Computational Design | `Computational Design`:7 + `Computational_Design`:1 |
| (… 48 more) | |

Without normalization, ~30+ documents have their frames silently split across spelling variants for the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, single-space, hyphens preserved.

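The recommended canonical form (lowercase, single-space, hyphens preserved) can be sketched as a small normalizer plus a grouping pass that surfaces collisions. The function names are hypothetical and this is not the code in `frame_distribution_report.py`; the sample counts come from the collision table and the sidecar.

```python
import re
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Canonical form: lowercase, underscores to spaces, single-spaced, hyphens kept."""
    text = label.replace("_", " ").lower()
    return re.sub(r"\s+", " ", text).strip()


def find_collisions(label_counts):
    """Group raw labels by canonical form; keep only groups with 2+ raw spellings."""
    groups = defaultdict(list)
    for raw, count in label_counts.items():
        groups[normalize_label(raw)].append((raw, count))
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Raw counts from the report, plus one label ("Employment") with no variant.
raw_counts = {
    "Professional Experience": 24,
    "Professional_Experience": 6,
    "3D Printing": 22,
    "3D_Printing": 7,
    "Project-Based Learning": 5,
    "Project-Based_Learning": 1,
    "Project-based Learning": 2,
    "Employment": 24,
}
collisions = find_collisions(raw_counts)
for canonical, variants in collisions.items():
    print(canonical, variants)
```

This form does not split CamelCase, which matches the sidecar, where `DigitalDesignAndFabrication` collapses only to `digitaldesignandfabrication`.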
## 4. Worker version drift

| Worker version | Doc count | Notes |
|---|---|---|
| v2.1 | 665 | Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
| v2.0 | 3 | Same key shape as v2.1 baseline. |

Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. **For Track 2 substrate ingest, plan for `stage2_metadata` to occasionally include unexpected top-level keys.**

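One way to plan for unexpected top-level keys is to split each metadata payload into known and extra keys instead of rejecting it. The sketch below assumes a placeholder schema: `EXPECTED_KEYS` and the key names `frames` and `worker_version` are hypothetical, since this report does not list the real prompt-schema keys. Only `academic_details` is taken from the table above.

```python
import json

# Hypothetical schema: the real prompt-schema key names are not listed in this report.
EXPECTED_KEYS = {"frames", "worker_version"}


def split_metadata(raw):
    """Separate expected top-level keys from unexpected ones instead of failing."""
    data = json.loads(raw) if isinstance(raw, str) else dict(raw)
    known = {k: v for k, v in data.items() if k in EXPECTED_KEYS}
    extras = {k: v for k, v in data.items() if k not in EXPECTED_KEYS}
    return known, extras


# Made-up payload for illustration only.
sample = '{"frames": ["Education", "Course"], "worker_version": "v2.1", "academic_details": {}}'
known, extras = split_metadata(sample)
print(sorted(known), sorted(extras))
```

Logging `extras` rather than dropping it would keep the ad-hoc keys inspectable, in the same spirit as the view exposing raw JSONB.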
## 5. File-type signal

This is the most useful Track 2 finding from this report.

`stage_3_queue.source` stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:

| Frame | pdf | docx | pptx | markdown | txt | dream |
|---|---|---|---|---|---|---|
| Education | 116 | 119 | 3 | - | - | - |
| Course | 29 | 29 | - | - | - | - |
| Programming | 12 | 10 | **15** | - | 6 | - |
| Application | **13** | 2 | - | - | - | - |
| 3D Printing | 11 | 3 | **8** | - | - | - |
| Manufacturing | 3 | 6 | 4 | - | - | - |
| Research | 9 | 13 | - | 1 | - | - |

**Concrete signal:** "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal (the watcher already knows it), and it disambiguates frames that the model treats as equivalent. **`embeddings.type` is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.**

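Binning by suffix and building the frame × file-type cross-tab takes only a few lines. The sketch below is an assumption about how such binning could work, not the script's actual code: the `SUFFIX_BINS` mapping, the `"other"` fallback and the sample filenames are all made up for illustration.

```python
from collections import Counter, defaultdict
from pathlib import PurePath

# Assumed suffix-to-bin mapping; the bins used by the real script are not shown in the report.
SUFFIX_BINS = {".pdf": "pdf", ".docx": "docx", ".pptx": "pptx", ".md": "markdown", ".txt": "txt"}


def file_type(source: str) -> str:
    """Bin a bare filename by its case-insensitive suffix."""
    return SUFFIX_BINS.get(PurePath(source).suffix.lower(), "other")


def crosstab(docs):
    """docs: iterable of (source_filename, [frame labels]) -> {frame: Counter(file_type)}."""
    table = defaultdict(Counter)
    for source, frames in docs:
        bin_name = file_type(source)
        for frame in set(frames):  # count each frame at most once per doc
            table[frame][bin_name] += 1
    return table


# Made-up rows for illustration only.
docs = [
    ("syllabus_fall.PDF", ["Education", "Course"]),
    ("intro_to_python.pptx", ["Programming", "Education"]),
    ("cover_letter.pdf", ["Application"]),
    ("notes.txt", ["Programming"]),
]
table = crosstab(docs)
for frame, counts in sorted(table.items()):
    print(frame, dict(counts))
```

Once `embeddings.type` is backfilled, `file_type` could read that column instead of parsing filenames.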
## 6. Systematic exclusions inside the 339-doc gap

Of the 339 short docs that bypass frame extraction, the breakdown by file type:

| Type | Count | What this is |
|---|---|---|
| pdf | 110 | Short PDFs (forms, single-page docs) |
| docx | 110 | Short Word docs |
| dream_output | 39 | **The dreamer's own NREM/Early-REM/Late-REM/synthesis files** |
| pptx | 31 | Short slide decks |
| txt | 28 | Plain-text files |
| voice_note | 14 | **Every voice note in the corpus** |
| markdown | 7 | Short markdown |

**Two specific systematic exclusions worth naming separately:**

- **All 14 voice notes have no frames.** Voice is one of Aaron's primary capture channels. The frame system is silent on it.
- **All 39 dream outputs have no frames.** The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.

These are NREM-shape findings: the architecture's frame extraction is *quietly* not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.

---
## 7. Would frame-conditional routing be a viable γ component, and what would it condition on?

**Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification.** The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:

1. **Normalize labels before any routing decision.** 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
2. **Treat "Education" as a near-universal prior, not a frame.** It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the *base case* and condition on the secondary frame. (See the §8 follow-up: the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
3. **Combine frames with file type, not frames alone.** Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive (Programming → pptx, n=15; Application → pdf, n=13), but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.

**What it would condition on:** the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute `P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length)` rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.

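As a rough illustration of the conditioning tuple, the sketch below builds the three features a router could condition on. It is not a router and not the report's design: Jaccard overlap is my stand-in for `frames ∩ goal_frames`, the 20,000-character cut-off is an arbitrary placeholder, and every name here (`RoutingFeatures`, `routing_features`, `length_bucket`) is hypothetical. Only the 2,000-character boundary mirrors the Stage 2 char gate described in §1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingFeatures:
    frame_overlap: float  # topic axis: Jaccard overlap of doc frames and goal frames
    file_type: str        # genre axis, e.g. "pdf" or "pptx"
    length_bucket: str    # granularity axis


def normalize(label: str) -> str:
    """Lowercase, underscores to spaces, single-spaced, hyphens kept."""
    return " ".join(label.replace("_", " ").lower().split())


def length_bucket(n_chars: int) -> str:
    # 2,000 mirrors the Stage 2 char gate; 20,000 is an arbitrary placeholder.
    if n_chars < 2000:
        return "short"
    return "medium" if n_chars < 20000 else "long"


def routing_features(doc_frames, goal_frames, file_type, n_chars,
                     ignore=frozenset({"education"})):
    """Build the conditioning tuple; near-universal frames in `ignore` are dropped first."""
    doc = {normalize(f) for f in doc_frames} - ignore
    goal = {normalize(f) for f in goal_frames} - ignore
    union = doc | goal
    overlap = len(doc & goal) / len(union) if union else 0.0
    return RoutingFeatures(overlap, file_type, length_bucket(n_chars))


features = routing_features(
    doc_frames=["Education", "3D_Printing", "Manufacturing"],
    goal_frames=["3D Printing", "Design"],
    file_type="pptx",
    n_chars=8500,
)
print(features)
```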
**Defined scope (the coverage caveat):**

The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:

- **(a) Backfill frames for short docs and conversations.** Run Mistral on the 339 short docs (cheap, because they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
- **(b) Use a degraded fallback for unframed docs.** File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
- **(c) Accept the gap as a scope limit.** The router only operates on long, non-conversation files. The 44% gap is unrouted (whatever the current default is).

(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). Mistral cost on the 537 unframed sources (339 short docs plus 198 conversations) is small. **Recommend (a) before any router work begins.**

---
## 8. Recommended follow-ups (ordered by ROI)

1. **Backfill the 339 short docs.** Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities: the primary capture and primary self-reflection channels are currently silent.
2. **Backfill conversations into frame extraction.** Either modify `ingest_conversations.py` to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
3. **Add a frame-label normalizer at the worker.** New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
4. **Decide whether to deprecate "Education" as a frame.** It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
5. **Per-frame retrieval-similarity follow-up (deferred from this report).** Now that we know frames cluster meaningfully, instrumenting `dream.py` to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.
6. **Diagnose the "Education" dominance: prompt artifact vs. corpus shape.** Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education" and classify each as *truly academic content* vs. *Education was a default Mistral reached for*. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. A 20-doc sample is small enough to do in one sitting and large enough to distinguish the hypotheses at >70/30 splits.

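The claim that 20 docs suffice at splits beyond 70/30 can be sanity-checked with a binomial calculation. The sketch below assumes each hand label is an independent draw and that "distinguish" means the sample majority points the same way as the true majority; both are my reading, not something the report states. Under those assumptions a true 70/30 split yields the correct sample majority about 95% of the time with n=20.

```python
from math import comb


def majority_correct_probability(n: int, p: float) -> float:
    """P(strict majority of n labels matches the true majority class), X ~ Binomial(n, p)."""
    threshold = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(threshold, n + 1))


# Compare a few hypothetical true splits for the 20-doc sample.
for p in (0.6, 0.7, 0.8):
    prob = majority_correct_probability(20, p)
    print(f"true split {p:.0%}: sample majority correct with probability {prob:.3f}")
```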
---
## 9. Inventory edits flagged for session-end batch

- **Correction:** `stage2_metadata` lives on `stage_3_queue.stage2_metadata` (jsonb), not on `stage_2_queue` as the inventory implied. The Phase 1 / `stage2_worker.py` entry should be corrected.
- **New finding:** the char_length gate runs *before* the Mistral call (`stage2_worker.py:139` precedes `:147`). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
- **New finding:** `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1: a routing decision the architecture didn't explicitly request.
- **New finding (cross-link to #2):** `embeddings.type` NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
- **New finding:** Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: **all 14 voice notes** and **all 39 dream outputs** are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction, which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others: a routing decision the architecture didn't explicitly request.

## 10. Reproduction

```bash
cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json: experiments/frame_distribution_<date>.json
# view: stage2_frames_v (in pgvector DB)
```

The view is `CREATE OR REPLACE`, idempotent. Drop with `DROP VIEW stage2_frames_v;` if needed.
@@ -0,0 +1,987 @@
{
  "generated_at": "2026-05-03T20:21:33.558462",
  "n_docs_with_frames": 668,
  "n_distinct_labels": 1374,
  "top_30_frames": [
    [
      "Education",
      238
    ],
    [
      "Course",
      58
    ],
    [
      "Programming",
      43
    ],
    [
      "Design",
      32
    ],
    [
      "Professional Experience",
      24
    ],
    [
      "Employment",
      24
    ],
    [
      "Research",
      23
    ],
    [
      "3D Printing",
      22
    ],
    [
      "Project",
      21
    ],
    [
      "Grading",
      21
    ],
    [
      "Art",
      21
    ],
    [
      "Budget",
      21
    ],
    [
      "Academic Integrity",
      20
    ],
    [
      "Teaching",
      19
    ],
    [
      "Technology",
      18
    ],
    [
      "Attendance",
      17
    ],
    [
      "Application",
      15
    ],
    [
      "Accommodation",
      13
    ],
    [
      "Manufacturing",
      13
    ],
    [
      "Coursework",
      11
    ],
    [
      "Recommendation",
      10
    ],
    [
      "Manufacturing Process",
      10
    ],
    [
      "Additive Manufacturing",
      10
    ],
    [
      "Job Application",
      10
    ],
    [
      "Exhibitions",
      10
    ],
    [
      "Academic Administration",
      9
    ],
    [
      "Communication",
      9
    ],
    [
      "Course Design",
      9
    ],
    [
      "Veteran and Military Services",
      9
    ],
    [
      "Career",
      9
    ]
  ],
  "label_collisions": {
    "conversational": [
      [
        "Conversational",
        1
      ],
      [
        "conversational",
        1
      ]
    ],
    "content": [
      [
        "Content",
        1
      ],
      [
        "content",
        1
      ]
    ],
    "cascade": [
      [
        "Cascade",
        1
      ],
      [
        "cascade",
        1
      ]
    ],
    "education": [
      [
        "Education",
        238
      ],
      [
        "education",
        1
      ]
    ],
    "academic record": [
      [
        "Academic_Record",
        1
      ],
      [
        "Academic Record",
        1
      ]
    ],
    "independent study": [
      [
        "Independent Study",
        5
      ],
      [
        "Independent_Study",
        2
      ]
    ],
    "project management": [
      [
        "Project Management",
        7
      ],
      [
        "Project_Management",
        1
      ]
    ],
    "digital fabrication": [
      [
        "Digital Fabrication",
        6
      ],
      [
        "digital_fabrication",
        1
      ],
      [
        "digital fabrication",
        1
      ]
    ],
    "project proposal": [
      [
        "Project_Proposal",
        2
      ],
      [
        "Project Proposal",
        2
      ]
    ],
    "academic integrity": [
      [
        "Academic Integrity",
        20
      ],
      [
        "Academic_Integrity",
        2
      ]
    ],
    "3d printing": [
      [
        "3D Printing",
        22
      ],
      [
        "3D_Printing",
        7
      ]
    ],
    "technical skills": [
      [
        "Technical Skills",
        2
      ],
      [
        "Technical_Skills",
        1
      ]
    ],
    "course structure": [
      [
        "Course Structure",
        7
      ],
      [
        "Course_Structure",
        1
      ]
    ],
    "course design": [
      [
        "Course Design",
        9
      ],
      [
        "Course_Design",
        1
      ]
    ],
    "product design": [
      [
        "Product Design",
        6
      ],
      [
        "Product_Design",
        1
      ]
    ],
    "professional experience": [
      [
        "Professional Experience",
        24
      ],
      [
        "Professional_Experience",
        6
      ]
    ],
    "disability accommodations": [
      [
        "Disability Accommodations",
        4
      ],
      [
        "Disability_Accommodations",
        1
      ]
    ],
    "material science": [
      [
        "Material_Science",
        2
      ],
      [
        "Material Science",
        4
      ]
    ],
    "computational design": [
      [
        "Computational Design",
        7
      ],
      [
        "Computational_Design",
        1
      ]
    ],
    "computer services policy": [
      [
        "Computer Services Policy",
        6
      ],
      [
        "Computer_Services_Policy",
        1
      ]
    ],
    "work experience": [
      [
        "Work_Experience",
        1
      ],
      [
        "Work Experience",
        3
      ]
    ],
    "academic program": [
      [
        "Academic Program",
        7
      ],
      [
        "Academic_Program",
        1
      ]
    ],
    "project-based learning": [
      [
        "Project-Based Learning",
        5
      ],
      [
        "Project-Based_Learning",
        1
      ],
      [
        "Project-based Learning",
        2
      ]
    ],
    "art and design": [
      [
        "Art and Design",
        6
      ],
      [
        "Art_and_Design",
        1
      ]
    ],
    "fdm technology": [
      [
        "FDM_Technology",
        2
      ],
      [
        "FDM Technology",
        1
      ]
    ],
    "material selection": [
      [
        "Material_Selection",
        1
      ],
      [
        "Material Selection",
        1
      ]
    ],
    "product development": [
      [
        "Product Development",
        6
      ],
      [
        "Product_Development",
        2
      ]
    ],
    "market research": [
      [
        "Market_Research",
        1
      ],
      [
        "Market Research",
        2
      ]
    ],
    "computer services": [
      [
        "Computer Services",
        2
      ],
      [
        "Computer_Services",
        1
      ]
    ],
    "student evaluation of instruction": [
      [
        "Student Evaluation of Instruction",
        1
      ],
      [
        "Student_Evaluation_of_Instruction",
        1
      ]
    ],
    "course management": [
      [
        "Course_Management",
        1
      ],
      [
        "Course Management",
        1
      ]
    ],
    "grade policy": [
      [
        "Grade_Policy",
        1
      ],
      [
        "Grade Policy",
        1
      ]
    ],
    "academic transcript": [
      [
        "Academic_Transcript",
        1
      ],
      [
        "Academic Transcript",
        1
      ]
    ],
    "evaluation criteria": [
      [
        "Evaluation Criteria",
        1
      ],
      [
        "Evaluation_Criteria",
        1
      ]
    ],
    "computer science": [
      [
        "Computer Science",
        2
      ],
      [
        "Computer_Science",
        1
      ]
    ],
    "electrical circuit": [
      [
        "Electrical Circuit",
        2
      ],
      [
        "Electrical_Circuit",
        1
      ]
    ],
    "digital logic": [
      [
        "Digital Logic",
        1
      ],
      [
        "Digital_Logic",
        1
      ]
    ],
    "course description": [
      [
        "Course Description",
        3
      ],
      [
        "Course_Description",
        1
      ]
    ],
    "organizational structure": [
      [
        "Organizational_Structure",
        1
      ],
      [
        "Organizational Structure",
        1
      ]
    ],
    "digital design": [
      [
        "Digital_Design",
        1
      ],
      [
        "Digital Design",
        4
      ]
    ],
    "contact information": [
      [
        "Contact Information",
        2
      ],
      [
        "Contact_Information",
        1
      ]
    ],
    "professional career": [
      [
        "Professional_Career",
        2
      ],
      [
        "Professional Career",
        1
      ]
    ],
    "personal projects": [
      [
        "Personal_Projects",
        1
      ],
      [
        "Personal Projects",
        2
      ]
    ],
    "ai development": [
      [
        "AI_Development",
        1
      ],
      [
        "AI Development",
        1
      ]
    ],
    "university service": [
      [
        "University Service",
        2
      ],
      [
        "University_Service",
        1
      ]
    ],
    "professional exhibitions and publications": [
      [
        "Professional Exhibitions and Publications",
        1
      ],
      [
        "Professional_Exhibitions_and_Publications",
        1
      ]
    ],
    "selected external consulting and design work": [
      [
        "Selected External Consulting and Design Work",
        1
      ],
      [
        "Selected_External_Consulting_and_Design_Work",
        2
      ]
    ],
    "academic career": [
      [
        "Academic_Career",
        1
      ],
      [
        "Academic Career",
        2
      ]
    ],
    "technology integration": [
      [
        "Technology Integration",
        2
      ],
      [
        "Technology_Integration",
        1
      ]
    ],
    "artistic practice": [
      [
        "Artistic_Practice",
        1
      ],
      [
        "Artistic Practice",
        1
      ]
    ],
    "multi-material 3d printing": [
      [
        "Multi-Material 3D Printing",
        1
      ],
      [
        "Multi-material 3D Printing",
        1
      ]
    ],
    "community engagement": [
      [
        "Community Engagement",
        3
      ],
      [
        "Community_Engagement",
        1
      ]
    ],
    "digitaldesignandfabrication": [
      [
        "DigitalDesignAndFabrication",
        1
      ],
      [
        "DigitalDesignandFabrication",
        1
      ]
    ],
    "professional background": [
      [
        "Professional Background",
        3
      ],
      [
        "Professional_Background",
        1
      ]
    ]
  },
  "per_doc_frame_count": {
    "3": 282,
    "5": 67,
    "4": 195,
    "2": 57,
    "7": 13,
    "11": 5,
    "13": 2,
    "15": 1,
    "12": 4,
    "6": 21,
    "8": 8,
    "10": 4,
    "9": 6,
    "30": 1,
    "14": 1,
    "18": 1
  },
  "top_30_pairs": [
    {
      "a": "Course",
      "b": "Education",
      "count": 46
    },
    {
      "a": "Education",
      "b": "Project",
      "count": 20
    },
    {
      "a": "Design",
      "b": "Education",
      "count": 20
    },
    {
      "a": "Education",
      "b": "Professional Experience",
      "count": 20
    },
    {
      "a": "Education",
      "b": "Employment",
      "count": 20
    },
    {
      "a": "Education",
      "b": "Technology",
      "count": 18
    },
    {
      "a": "Education",
      "b": "Grading",
      "count": 17
    },
    {
      "a": "Education",
      "b": "Research",
      "count": 15
    },
    {
      "a": "Art",
      "b": "Education",
      "count": 15
    },
    {
      "a": "Attendance",
      "b": "Grading",
      "count": 14
    },
    {
      "a": "Course",
      "b": "Grading",
      "count": 13
    },
    {
      "a": "Academic Integrity",
      "b": "Education",
      "count": 11
    },
    {
      "a": "Attendance",
      "b": "Education",
      "count": 11
    },
    {
      "a": "Attendance",
      "b": "Course",
      "count": 11
    },
    {
      "a": "Application",
      "b": "Employment",
      "count": 11
    },
    {
      "a": "Coursework",
      "b": "Education",
      "count": 10
    },
    {
      "a": "Course",
      "b": "Design",
      "count": 10
    },
    {
      "a": "Course",
      "b": "Programming",
      "count": 10
    },
    {
      "a": "Application",
      "b": "Education",
      "count": 10
    },
    {
      "a": "Budget",
      "b": "Education",
      "count": 10
    },
    {
      "a": "Academic Integrity",
      "b": "Accommodation",
      "count": 9
    },
    {
      "a": "Education",
      "b": "Teaching",
      "count": 9
    },
    {
      "a": "Education",
      "b": "Programming",
      "count": 9
    },
    {
      "a": "Academic Integrity",
      "b": "Attendance",
      "count": 9
    },
    {
      "a": "Course",
      "b": "Project",
      "count": 8
    },
    {
      "a": "Research",
      "b": "Teaching",
      "count": 8
    },
    {
      "a": "Grading",
      "b": "Project",
      "count": 7
    },
    {
      "a": "Art",
      "b": "Technology",
      "count": 7
    },
    {
      "a": "Academic Integrity",
      "b": "Course",
      "count": 7
    },
    {
      "a": "Accommodation",
      "b": "Course",
      "count": 7
    }
  ],
|
  "folder_crosstab": {
    "Education": {
      "pdf": 116,
      "docx": 119,
      "pptx": 3
    },
    "Course": {
      "pdf": 29,
      "docx": 29
    },
    "Programming": {
      "pptx": 15,
      "docx": 10,
      "pdf": 12,
      "txt": 6
    },
    "Design": {
      "pdf": 13,
      "docx": 16,
      "pptx": 3
    },
    "Professional Experience": {
      "docx": 13,
      "pdf": 11
    },
    "Employment": {
      "pdf": 15,
      "docx": 9
    },
    "Research": {
      "pdf": 9,
      "docx": 13,
      "markdown": 1
    },
    "3D Printing": {
      "docx": 3,
      "pdf": 11,
      "pptx": 8
    },
    "Project": {
      "pdf": 8,
      "docx": 12,
      "markdown": 1
    },
    "Grading": {
      "pdf": 10,
      "docx": 11
    },
    "Art": {
      "docx": 11,
      "pdf": 9,
      "pptx": 1
    },
    "Budget": {
      "docx": 6,
      "pdf": 15
    },
    "Academic Integrity": {
      "docx": 17,
      "pdf": 3
    },
    "Teaching": {
      "pdf": 9,
      "docx": 10
    },
    "Technology": {
      "docx": 15,
      "pdf": 3
    },
    "Attendance": {
      "docx": 11,
      "pdf": 6
    },
    "Application": {
      "pdf": 13,
      "docx": 2
    },
    "Accommodation": {
      "docx": 11,
      "pdf": 2
    },
    "Manufacturing": {
      "docx": 6,
      "pptx": 4,
      "pdf": 3
    },
    "Coursework": {
      "pdf": 8,
      "docx": 3
    }
  },
  "bin_totals": {
    "markdown": 64,
    "pdf": 286,
    "pptx": 70,
    "txt": 28,
    "docx": 217,
    "dream_output": 3
  },
  "worker_versions": {
    "2.0": 3,
    "2.1": 665
  },
  "data_gap": {
    "count": 339,
    "by_type_bin": {
      "pdf": 110,
      "voice_note": 14,
      "docx": 110,
      "dream_output": 39,
      "pptx": 31,
      "txt": 28,
      "markdown": 7
    },
    "char_length": {
      "min": 6,
      "max": 1998,
      "median": 1077
    },
    "sample_sources": [
      "Thesis Paper Guidlines.pdf",
      "2026-04-30-17-06-voice.md",
      "2026-04-30-15-59-voice.md",
      "2026-04-30-16-53-voice.md",
      "2026-04-30-16-23-voice.md",
      "2026-04-29-17-52-voice.md",
      "2026-04-30-16-59-voice.md",
      "Outline for 3D Printed Materials for Foundry Casting.docx",
      "2026-04-26-22-52-voice.md",
      "2026-04-30-synthesis.md"
    ]
  },
  "corpus_coverage": {
    "total_distinct_sources_in_embeddings": 1255,
    "conversations_no_frames_by_design": 198,
    "files_with_frames": 704,
    "files_short_no_frames": 339,
    "files_stage2_failed": 12,
    "frame_coverage_pct": 56.1
  }
}
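The sidecar above exists so that two runs can be compared. The following is a minimal sketch of such a comparison, assuming the count-style sections (for example `bin_totals` or `per_doc_frame_count`) have already been loaded into plain dicts. The helper name `diff_counts` and the `previous` values are hypothetical and not part of this commit; only the `current` values are taken from the sidecar.

```python
def diff_counts(old, new):
    """Return {key: (old_value, new_value)} for every key whose count changed.

    Keys missing on one side are treated as a count of 0.
    """
    keys = set(old) | set(new)
    return {
        k: (old.get(k, 0), new.get(k, 0))
        for k in sorted(keys)
        if old.get(k, 0) != new.get(k, 0)
    }


# Hypothetical earlier run (made-up numbers) vs. a subset of this run's bin_totals.
previous = {"markdown": 60, "pdf": 286, "txt": 28}
current = {"markdown": 64, "pdf": 286, "txt": 28, "dream_output": 3}

changes = diff_counts(previous, current)
print(changes)
```

Unchanged bins are omitted from the result, so an empty dict means the two runs agree on that section.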
@@ -0,0 +1,296 @@
"""Read-only analysis of Stage 2 frame data via stage2_frames_v.

Produces eight sections (frequency, hygiene, per-doc count, co-occurrence,
folder cross-tab, worker-version split, data-gap accounting, corpus coverage)
and writes a JSON sidecar for diffing across runs.

Usage: venv/bin/python3 scripts/experiments/frame_distribution_report.py
"""
import os
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path

import psycopg2
from dotenv import load_dotenv

load_dotenv()

OUT_PATH = Path.home() / "aaronai" / "experiments" / f"frame_distribution_{datetime.now().strftime('%Y-%m-%d')}.json"
TOP_K = 20  # for co-occurrence; revisit after seeing the long tail


def normalize(label):
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


def folder_bin(source):
    """Classify source by type. stage_3_queue stores bare filenames, so we
    bin by what kind of file it is, not where it lives in the tree."""
    if not source:
        return "unknown"
    if re.match(r"^(Claude|ChatGPT|Aaron AI):", source):
        return "conversation"  # bypasses Stage 2/3, will not appear here
    s = source.lower()
    if re.search(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-voice\.md$", s):
        return "voice_note"
    if re.search(r"\d{4}-\d{2}-\d{2}-(nrem|early-rem|late-rem|synthesis|lucid)", s):
        return "dream_output"
    if s.endswith(".md"):
        return "markdown"
    if s.endswith(".pdf"):
        return "pdf"
    if s.endswith(".docx") or s.endswith(".doc"):
        return "docx"
    if s.endswith(".pptx") or s.endswith(".ppt"):
        return "pptx"
    if s.endswith(".txt"):
        return "txt"
    return "other"


def fetch_rows(cur):
    cur.execute("""
        SELECT source, char_length, active_frames, worker_version, raw_metadata
        FROM stage2_frames_v
    """)
    rows = []
    for source, char_length, frames, worker_version, raw in cur.fetchall():
        if not isinstance(frames, list):
            continue
        rows.append({
            "source": source,
            "char_length": char_length,
            "frames": [str(f) for f in frames if f],
            "worker_version": worker_version,
            "raw_keys": sorted(raw.keys()) if isinstance(raw, dict) else [],
        })
    return rows


def section_frequency(rows):
    counter = Counter()
    for r in rows:
        for f in r["frames"]:
            counter[f] += 1
    return counter


def section_hygiene(frequency):
    """Group raw labels by normalized form; flag collisions."""
    groups = defaultdict(list)
    for raw, count in frequency.items():
        groups[normalize(raw)].append((raw, count))
    collisions = {k: v for k, v in groups.items() if len(v) > 1}
    return collisions


def section_per_doc_count(rows):
    counts = Counter(len(r["frames"]) for r in rows)
    return counts


def section_cooccurrence(rows, top_frames):
    top_set = set(top_frames)
    pair_counts = Counter()
    for r in rows:
        present = [f for f in r["frames"] if f in top_set]
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                a, b = sorted([present[i], present[j]])
                pair_counts[(a, b)] += 1
    return pair_counts


def section_folder_crosstab(rows, top_frames):
    top_set = set(top_frames)
    table = defaultdict(Counter)  # frame -> bin -> count
    bin_totals = Counter()
    for r in rows:
        b = folder_bin(r["source"])
        bin_totals[b] += 1
        for f in r["frames"]:
            if f in top_set:
                table[f][b] += 1
    return table, bin_totals


def section_worker_versions(rows):
    counter = Counter(r["worker_version"] or "unknown" for r in rows)
    raw_keys_by_version = defaultdict(Counter)
    for r in rows:
        v = r["worker_version"] or "unknown"
        raw_keys_by_version[v][tuple(r["raw_keys"])] += 1
    return counter, raw_keys_by_version


def section_data_gap(cur):
    """Docs that completed Stage 2 but never had frames extracted (<2000 chars)."""
    cur.execute("""
        SELECT source, char_length
        FROM stage_2_queue
        WHERE completed_at IS NOT NULL AND char_length < 2000
    """)
    missing = cur.fetchall()
    by_bin = Counter(folder_bin(s) for s, _ in missing)
    char_lengths = [c for _, c in missing]
    return {
        "count": len(missing),
        "by_type_bin": dict(by_bin),
        "char_length": {
            "min": min(char_lengths) if char_lengths else None,
            "max": max(char_lengths) if char_lengths else None,
            "median": sorted(char_lengths)[len(char_lengths) // 2] if char_lengths else None,
        },
        "sample_sources": [s for s, _ in missing[:10]],
    }


def section_corpus_coverage(cur):
    """How much of the embeddings corpus has frame coverage?"""
    cur.execute("SELECT count(DISTINCT source) FROM embeddings")
    total = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM embeddings
        WHERE source LIKE 'Claude:%' OR source LIKE 'ChatGPT:%'
        OR source LIKE 'Aaron AI:%' OR type='aaronai_conversation'
    """)
    conversations = cur.fetchone()[0]
    cur.execute("SELECT count(DISTINCT source) FROM stage_3_queue WHERE stage2_metadata IS NOT NULL")
    with_frames = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM stage_2_queue
        WHERE completed_at IS NOT NULL AND char_length < 2000
    """)
    short_no_frames = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM stage_2_queue
        WHERE failed_at IS NOT NULL
    """)
    failed = cur.fetchone()[0]
    return {
        "total_distinct_sources_in_embeddings": total,
        "conversations_no_frames_by_design": conversations,
        "files_with_frames": with_frames,
        "files_short_no_frames": short_no_frames,
        "files_stage2_failed": failed,
        "frame_coverage_pct": round(100.0 * with_frames / max(total, 1), 1),
    }


def main():
    conn = psycopg2.connect(os.environ["PG_DSN"])
    cur = conn.cursor()

    rows = fetch_rows(cur)
    n_docs = len(rows)
    print(f"=== Stage 2 frame distribution report ({n_docs} docs) ===\n")

    # 1. Frequency
    freq = section_frequency(rows)
    print(f"--- 1. Frame frequency ({len(freq)} distinct labels) ---")
    for label, count in freq.most_common(30):
        print(f" {count:5d} {label}")
    print()

    # 2. Hygiene
    collisions = section_hygiene(freq)
    print(f"--- 2. Label hygiene (normalized collisions: {len(collisions)}) ---")
    for norm, variants in sorted(collisions.items(), key=lambda kv: -sum(c for _, c in kv[1])):
        variant_str = ", ".join(f"{r!r}:{c}" for r, c in sorted(variants, key=lambda x: -x[1]))
        print(f" '{norm}': {variant_str}")
    print()

    # 3. Per-doc frame count
    per_doc = section_per_doc_count(rows)
    print("--- 3. Per-doc frame count ---")
    for n in sorted(per_doc):
        print(f" {n} frames: {per_doc[n]} docs")
    print()

    # 4. Co-occurrence (top-K)
    top_frames = [f for f, _ in freq.most_common(TOP_K)]
    pairs = section_cooccurrence(rows, top_frames)
    print(f"--- 4. Co-occurrence (top-{TOP_K} frames, top-30 pairs) ---")
    for (a, b), count in pairs.most_common(30):
        print(f" {count:4d} {a} × {b}")
    print()

    # 5. Folder cross-tab
    crosstab, bin_totals = section_folder_crosstab(rows, top_frames)
    print(f"--- 5. Frame × folder cross-tab (top-{TOP_K} frames) ---")
    bins_sorted = [b for b, _ in bin_totals.most_common()]
    print(f" bins (with totals): " + ", ".join(f"{b}({n})" for b, n in bin_totals.most_common(10)))
    for f in top_frames:
        row_data = crosstab[f]
        if not row_data:
            continue
        cells = ", ".join(f"{b}={c}" for b, c in row_data.most_common(5))
        print(f" {f}: {cells}")
    print()

    # 6. Worker versions
    versions, keys_by_version = section_worker_versions(rows)
    print("--- 6. Worker version split ---")
    for v, count in versions.most_common():
        print(f" v{v}: {count} docs")
        top_shapes = keys_by_version[v].most_common(3)
        for keys, kcount in top_shapes:
            print(f" {kcount} docs with keys={list(keys)}")
    print()

    # 7. Data gap
    gap = section_data_gap(cur)
    print("--- 7. Data-gap accounting (Stage 2 docs <2000 chars; never frame-extracted) ---")
    print(f" count: {gap['count']}")
    print(f" char_length: min={gap['char_length']['min']}, median={gap['char_length']['median']}, max={gap['char_length']['max']}")
    print(f" by type bin: {gap['by_type_bin']}")
    print(f" sample sources: {gap['sample_sources']}")
    print()

    # 8. Corpus coverage
    coverage = section_corpus_coverage(cur)
    print("--- 8. Corpus-wide frame coverage ---")
    print(f" total distinct sources in embeddings: {coverage['total_distinct_sources_in_embeddings']}")
    print(f" conversations (no frames by design): {coverage['conversations_no_frames_by_design']}")
    print(f" files with frames: {coverage['files_with_frames']}")
    print(f" files short, no frames: {coverage['files_short_no_frames']}")
    print(f" files Stage 2 failed: {coverage['files_stage2_failed']}")
    print(f" frame coverage: {coverage['frame_coverage_pct']}% of corpus")
    print()

    # JSON sidecar
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    sidecar = {
        "generated_at": datetime.now().isoformat(),
        "n_docs_with_frames": n_docs,
        "n_distinct_labels": len(freq),
        "top_30_frames": freq.most_common(30),
        "label_collisions": {
            k: [(r, c) for r, c in v] for k, v in collisions.items()
        },
        "per_doc_frame_count": dict(per_doc),
        "top_30_pairs": [
            {"a": a, "b": b, "count": c}
            for (a, b), c in pairs.most_common(30)
        ],
        "folder_crosstab": {
            f: dict(crosstab[f]) for f in top_frames if crosstab[f]
        },
        "bin_totals": dict(bin_totals),
        "worker_versions": dict(versions),
        "data_gap": gap,
        "corpus_coverage": coverage,
    }
    OUT_PATH.write_text(json.dumps(sidecar, indent=2, default=str))
    print(f"JSON sidecar written: {OUT_PATH}")

    cur.close()
    conn.close()


if __name__ == "__main__":
    main()
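The hygiene section only reports collisions; it does not merge them. As an illustration of the "normalize first" step a downstream router would need, here is a minimal sketch that folds raw label counts into one entry per normalized label. `normalize` repeats the rule from the script; `merge_variants` and the choice of the most frequent spelling as the canonical one are assumptions for illustration, not something this commit implements.

```python
import re
from collections import Counter, defaultdict


def normalize(label):
    # Same rule as the script: trim, lowercase, underscores to spaces,
    # then collapse runs of whitespace to a single space.
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


def merge_variants(frequency):
    """Fold raw label counts into one entry per normalized label.

    Returns (merged, canonical): merged maps normalized label -> total count,
    canonical maps normalized label -> its most frequent raw spelling
    (hypothetical tie-break policy: first seen wins on equal counts).
    """
    merged = Counter()
    variants = defaultdict(Counter)
    for raw, count in frequency.items():
        key = normalize(raw)
        merged[key] += count
        variants[key][raw] += count
    canonical = {k: v.most_common(1)[0][0] for k, v in variants.items()}
    return merged, canonical


# Counts taken from the "professional background" collision in the sidecar.
freq = Counter({"Professional Background": 3, "Professional_Background": 1})
merged, canonical = merge_variants(freq)
print(merged["professional background"], canonical["professional background"])
```

Keeping the raw variants alongside the merged totals means the original spellings stay inspectable, in the same spirit as the view exposing raw JSONB next to the structured fields.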