Compare commits

5 Commits

Author SHA1 Message Date
aaron 7c7b649775 embeddings: enforce type/created_at on writers; manifests carry type_distribution (Improvement #2 part B+C)
Writers now enforce type and created_at:
  - encoding.py: ValueError raised at write_embeddings_batch if row dict lacks
    'type'. created_at remains SQL-supplied (NOW() server-side). ON CONFLICT
    DO UPDATE now also rewrites type=EXCLUDED.type and preserves the original
    created_at via COALESCE(embeddings.created_at, EXCLUDED.created_at) — a
    re-ingest re-classifies type but does not overwrite a backfilled mtime.
  - ingest_conversations.py: same assertion. ON CONFLICT intentionally keeps
    EXCLUDED.created_at semantics (Aaron-AI conversation created_at tracks
    convo.updated_at; re-runs should refresh).
  - Column-level NOT NULL is not added; application-layer raise gives a
    faster, more debuggable failure than a Postgres constraint error.
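
A minimal sketch of the application-layer check described above, not the actual encoding.py code. `validate_rows` and `UPSERT_SQL` are hypothetical names, and the column list in the SQL is an assumption (the real statement also writes content and vectors). Only the ValueError on a missing 'type' and the COALESCE expression come from this message. Treating an empty or None type as missing is an additional assumption.

```python
# Hypothetical sketch; the real write_embeddings_batch is not shown in this message.
UPSERT_SQL = """
INSERT INTO embeddings (id, type, created_at)
VALUES (%(id)s, %(type)s, NOW())
ON CONFLICT (id) DO UPDATE
SET type = EXCLUDED.type,
    created_at = COALESCE(embeddings.created_at, EXCLUDED.created_at)
"""


def validate_rows(rows):
    """Raise before any SQL runs if a row lacks a usable 'type'."""
    for i, row in enumerate(rows):
        # Assumption: an empty or None type counts as missing.
        if not row.get("type"):
            raise ValueError(f"row {i} (id={row.get('id')!r}) is missing 'type'")
    return rows
```

Calling `validate_rows` before the first `execute` means a bad row aborts the batch without touching the table, which is the "no row leaked" property B2 checks.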

Retrieval propagates type into chunks:
  - retrieve() SELECT now includes type; chunk dicts carry "type": etype.
  - WHERE clause built dynamically from excluded_sources and the new
    --type-filter CLI arg (experimental, default None, pgvector retrieval
    only — Graphiti chunks have no embeddings.type to filter on).
  - retrieve_graphiti unchanged; its chunks lack the type field.
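
One way the dynamic WHERE clause could be assembled, as a sketch under assumptions: `build_where` is a hypothetical helper, the `%s` placeholders assume a psycopg-style driver, and `source <> ALL(%s)` is one possible spelling of the exclusion. The real retrieve() query is not reproduced here.

```python
def build_where(excluded_sources=None, type_filter=None):
    """Return (sql_fragment, params) for the optional retrieval filters."""
    clauses, params = [], []
    if excluded_sources:
        # The list is passed as a single array parameter.
        clauses.append("source <> ALL(%s)")
        params.append(list(excluded_sources))
    if type_filter is not None:
        clauses.append("type = %s")
        params.append(type_filter)
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    return where, params
```

With neither argument set the fragment is empty, so the default query shape is unchanged.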

Manifests carry type_distribution per stage:
  - dream_pipeline writes stage_data[<stage>]["type_distribution"] for nrem,
    early_rem, late_rem — a Counter over chunk types, filtering None so
    Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute the
    distribution. Pgvector chunks always carry type post-backfill; if None
    appears, the backfill or writer enforcement has regressed.
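
The per-stage distribution can be sketched as below. `type_distribution` is a hypothetical function name; the Counter over chunk types and the None filter follow the description above.

```python
from collections import Counter


def type_distribution(chunks):
    """Count chunk types, skipping chunks without a type (e.g. Graphiti chunks)."""
    return dict(Counter(c["type"] for c in chunks if c.get("type") is not None))
```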

Verification:
  B1 force re-ingest of "Finite and infinite games -- James Carse.pdf":
       all 84 chunks preserved created_at=2026-04-27T06:11:55Z
  B2 missing-type assertion raises ValueError, no row leaked to embeddings
  B3 ast.parse(*) clean; EXPLAIN renders for {no excl/no filter,
       type_filter only, excl 2 elems, excl 1 elem edge case, both};
       all five plans use HNSW index scan with correct Filter clauses
  C1 retrieve("nrem") returns 8 chunks each carrying "type" key
  C2 type_distribution = {'document': 5, 'chatgpt_conversation': 3} —
       2 distinct types, 62.5/37.5 split (looser bar: >=2 types,
       no single type >=90%)
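
The C2 bar can be written as a small predicate. This is a sketch: `passes_diversity_bar` is a hypothetical name, and the thresholds are the ones quoted in C2.

```python
def passes_diversity_bar(dist, min_types=2, max_share=0.90):
    """True when dist has at least min_types types and no type reaches max_share."""
    total = sum(dist.values())
    if total == 0:
        return False
    return len(dist) >= min_types and max(dist.values()) / total < max_share
```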

The type and created_at fields are now load-bearing: every dream manifest
emits type_distribution per stage. Reverting the backfill makes the
distribution show NULLs at every dream run.
2026-05-04 00:15:43 +00:00
aaron 3c7c228db0 embeddings: backfill type and created_at (Improvement #2 part A)
Backfills 9,815 type-NULL rows to 'document' (extension classifier, 100% hit)
and 12,109 created_at-NULL rows via five batches:

  C1 filepath_stat:        9,649  filesystem mtime via metadata.filepath
  C2 watcher_state_unique:   676  unique source-name lookup in watcher_state
  C3 watcher_state_collision_pick_latest_of_N:
                             234  collision; most-recent watcher mtime
  C4 chatgpt_export:       1,548  convo create_time from export JSONs
                                  (168/168 distinct convo_ids resolved)
  C5 sentinel:                 2  2026-04-26T00:00:00Z (pgvector migration date)
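
A quick reconciliation of the five batch counts against the 12,109 created_at-NULL rows; the dict simply restates the numbers listed above.

```python
# Batch sizes copied from the C1-C5 list in this message.
batches = {
    "filepath_stat": 9649,
    "watcher_state_unique": 676,
    "watcher_state_collision_pick_latest_of_N": 234,
    "chatgpt_export": 1548,
    "sentinel": 2,
}
total_backfilled = sum(batches.values())
```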

Provenance written to metadata.type_source and metadata.created_at_source
on every row changed by this run. type_source is empty on rows where the
type field was already populated pre-run; in those cases the snapshot
table is the source of truth for what changed.

Snapshot: embeddings_backup_2026_05_03 (CREATE TABLE AS SELECT id, type,
created_at, metadata FROM embeddings; 14,069 rows; revertable via id-join).

Verification:
  V1 live counts:      type_null=0  ca_null=0
  V2 spot-check 11 rows across cohorts: provenance correct
  V3 snapshot intact: 14,069 rows, pre-backfill NULL counts preserved
  V4 cross-check vs snapshot: reconciles per-provenance to dry-run

Read-side use (B + C: writer enforcement + minimal retrieval read) deferred
to a separate session. The backfill is complete and verified, but the type
and created_at fields are not yet load-bearing — every current reader still
ignores them. Without B+C this lands as data prep, not behavior change.
2026-05-03 23:58:53 +00:00
aaron 2df1a2fe01 docs/inventory: layer 2026-05-03 updates (resolutions, corrections, new findings)
Inventory dated 2026-05-02 is preserved as a point-in-time snapshot. Today's
updates are layered on top in a dated addendum section after "Findings
summary" and before "Phase 1 — Scripts" so the original snapshot reads as
written and readers can see what changed and when.

Resolved:
- NREM-shape divergence #1 (`dream.py` cumulative cross-night exclusion
  500-cap) — replaced with session-scoped novelty.

Corrections to existing findings:
- `stage2_metadata` lives on `stage_3_queue`, not `stage_2_queue` (the
  2026-05-02 entry implied otherwise). Verified by direct schema read.
- Stage 2 char_length gate runs *before* the Mistral call. For sub-2000-char
  docs, Mistral is never invoked — frames are not extracted then discarded,
  they are simply not extracted. Reframes the architecture's "Stage 2
  produces orientation for everything" commitment.

New findings (from the 2026-05-03 frame analysis):
- `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation
  sources have zero frame coverage by design. Combined with the char-gate
  exclusion and Stage 2 failures, only 56% of corpus has any frame data.
- All 14 voice notes and all 39 dream outputs are in the 339-doc gap.
  Primary capture and self-reflection channels are silent to the frame
  system; dreamer cannot frame-condition on its own output.
- File-type × frame stratification provides discriminating signal that
  cross-links Improvement #3 to the existing `embeddings.type` NULL-rate
  finding.

Same NREM shape as the original cumulative-exclusion bug — the architecture's
stated commitment and what the code actually does diverge silently. This is
exactly what the inventory exists to surface.
2026-05-03 20:32:55 +00:00
aaron ed2d090afc experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3)
Read-only inspection of the frame data Mistral produces in Stage 2, in
service of Track 2 substrate design (Step 2.4 operation set spec).

Artifacts:
- New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata`
  (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured
  fields so worker-version drift is inspectable).
- Analysis script: frequency, label-hygiene collisions, per-doc count,
  co-occurrence (top-K), file-type × frame cross-tab, worker-version split,
  data-gap accounting, corpus-wide coverage.
- JSON sidecar for diff-across-runs reproducibility.
- Markdown report with explicit Track 2 viability section.

Headline findings:
- Frames cluster meaningfully on the framed-doc subset (subject to
  validation on larger samples for the file-type cross-tab).
- Only 56% of corpus has frame coverage. 198 conversation sources bypass
  Stage 2 by design (`ingest_conversations.py` writes directly to
  embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate;
  12 Stage 2 failures.
- All 14 voice notes and all 39 dream outputs are in the data gap.
  Primary capture and self-reflection channels are silent to the frame
  system. Dreamer cannot frame-condition on its own output.
- 54 normalized label collisions (`Professional Experience` vs
  `Professional_Experience`, etc.) — any router must normalize first.
- "Education" is a near-universal frame (36% of frame-extracted docs);
  cheap 20-doc hand-inspection diagnostic in report §8 to distinguish
  prompt artifact from corpus shape.
- File-type × frame stratification is concrete signal that ties to
  Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of
  rows.

No production code touched. View is droppable; script is read-only.
2026-05-03 20:32:37 +00:00
aaron e5898f3019 dream.py: replace cumulative cross-night exclusion with session-scoped novelty (Track 1 Finding 1)
The cumulative `retrieved_sources` list (capped at 500, trimmed to 400 on
overflow) was hiding ~40% of the corpus from Early REM and Late REM after the
cap filled. The architecture and reframe both specify session-scoped novelty,
not corpus-lifetime exclusion. Same NREM-shape divergence as the 2026-05-02
NREM exclusion fix.

Changes:
- Drop `previously_retrieved` load; pop the legacy `retrieved_sources` key
  from `dreamer_state.json` at pipeline start.
- Early REM excludes only the current session's NREM high-scorers.
- Late REM excludes only the current session's NREM ∪ Early REM.
- Remove the across-night accumulation block at the end of the pipeline; reuse
  the in-scope state object for the post-pipeline metadata write (eliminates a
  redundant disk re-read that was reintroducing the legacy key).

NREM exclusion fix from 2026-05-02 preserved (`nrem_chunks = retrieve("nrem",
excluded_sources=None)`).
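
The session-scoped rule can be sketched with plain sets. `session_exclusions` is a hypothetical helper for illustration; dream.py's actual variable names are not shown in this message.

```python
def session_exclusions(nrem_high_scorers, early_rem_sources):
    """Per-stage exclusion sets built only from the current session."""
    nrem = set(nrem_high_scorers)
    early = set(early_rem_sources)
    return {
        "nrem": set(),             # NREM retrieves with excluded_sources=None
        "early_rem": nrem,         # only this session's NREM high-scorers
        "late_rem": nrem | early,  # NREM union Early REM
    }
```

Nothing here is read from or written to `dreamer_state.json`, so the exclusion sets start empty on every run.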

Verification: post-fix dream-manifest source count rose to 24 (NREM 8 + Early
REM 8 + Late REM 8) vs. 13 / 16 on the two prior comparable runs. Legacy key
absent from `dreamer_state.json` post-run.
2026-05-03 20:32:15 +00:00
10 changed files with 3280 additions and 38 deletions
@@ -65,6 +65,38 @@ The watcher (`watcher.py` + `aaronai-watcher.service`) is a clean Stage 1 that m
---
## Updates — 2026-05-03 session
*Layered updates from Track 1 improvement work on 2026-05-03. The 2026-05-02 inventory above is preserved as a point-in-time snapshot; corrections and resolutions are recorded here with provenance.*
### Resolved
- **NREM-shape divergence #1 (cumulative cross-night exclusion 500-cap, `dream.py`): RESOLVED.** Replaced cumulative `retrieved_sources` with session-scoped novelty. Early REM now excludes only NREM high-scorers from the current session; Late REM excludes the current session's NREM ∪ Early REM. Legacy `retrieved_sources` key cleared from `dreamer_state.json`. Verification: post-fix dream-manifest source count rose to 24 (vs. 13 / 16 on the two prior comparable runs), so the previously-hidden ~40% of corpus is now reachable to Early/Late REM as the architecture and reframe specify. NREM exclusion fix from 2026-05-02 preserved.
### Corrections to existing findings
- **`stage2_metadata` location (Phase 1, `stage2_worker.py`):** the metadata column lives on `stage_3_queue.stage2_metadata` (jsonb), **not on `stage_2_queue`**. `stage_2_queue` has only basic queue fields (`id, source, full_text, char_length, timestamps, failure_reason, attempts`). The 2026-05-02 entry implied otherwise. Corrected via direct schema inspection on 2026-05-03.
- **Stage 2 char_length gate (Phase 1, `stage2_worker.py`):** the `char_length < 2000` check at line 139 runs *before* the Mistral call at line 149. For sub-2000-char docs, Mistral is **never invoked** — the worker logs `Processing → Skipping Stage 3 → completed_at = NOW()` with no Mistral pass between them. The earlier framing of "documents under 2000 chars skip Stage 3" was correct as written, but the implied "Stage 2 produces orientation metadata for everything" architecture commitment is not what the code does. 339 of 1,041 completed Stage 2 docs (33%) have **no frame data extracted at all**, not "frame data extracted then discarded."
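
A sketch of the control flow this correction describes, under stated assumptions: `process_doc`, the returned dict shape, and the status strings are hypothetical. Only the ordering (gate first, Mistral second) and the 2,000-char threshold come from the finding.

```python
CHAR_GATE = 2000  # threshold quoted in the finding


def process_doc(doc, extract_frames):
    """Gate on char_length before the model call, as the worker is described to do."""
    if doc["char_length"] < CHAR_GATE:
        # Short doc: marked complete without calling the model, so no frames exist.
        return {"status": "completed", "stage2_metadata": None, "mistral_called": False}
    metadata = extract_frames(doc["full_text"])
    return {"status": "queued_for_stage_3", "stage2_metadata": metadata, "mistral_called": True}
```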
### New findings from 2026-05-03 frame analysis (Improvement #3)
- **`ingest_conversations.py` bypasses Stage 2 entirely.** 198 distinct conversation sources (`Claude:`, `ChatGPT:`, `Aaron AI:`, plus `type='aaronai_conversation'`) write directly to pgvector `embeddings` and never enter `stage_2_queue`. Conversations have **zero frame coverage by design**, not by accident. Combined with the 339-doc char-gate exclusion and 12 Stage 2 failures, **only 56% of the embeddings corpus has any frame data**. Same NREM shape — a routing decision the architecture didn't explicitly request, doing something silently that the architecture's "Stage 2 produces orientation for everything" commitment denies.
- **Voice notes (14) and dream outputs (39) are systematically excluded from the frame system.** Within the 339-doc <2000-char gap: all 14 voice notes and all 39 dreamer-output files (NREM, Early REM, Late REM, synthesis markdown) are present. Voice is one of Aaron's primary capture channels. Dream outputs are the dreamer's own reflection. Both are silent to the frame system that orients downstream extraction — meaning the dreamer cannot frame-condition on its own output. Same NREM shape as the others.
- **File-type × frame stratification signal exists and is currently unused** (cross-link to Phase 3 `embeddings.type` finding). The 2026-05-03 frame analysis (`docs/stage2-frame-analysis-2026-05-03.md` §5) shows that within frame-extracted docs, "Programming" pivots to pptx (n=15), "Application" pivots to pdf (n=13), Education spreads across pdf+docx — file type adds discriminating signal to frame routing. Currently `embeddings.type` is NULL for 71% of rows; backfilling it (Improvement #2, not yet applied) would make this stratification queryable at retrieval time instead of reverse-engineerable from filenames.
### Artifacts produced 2026-05-03
- **Code change:** `scripts/dream.py` (Improvement #1).
- **New SQL view:** `stage2_frames_v` (over `stage_3_queue.stage2_metadata`; `CREATE OR REPLACE`, idempotent, drop with `DROP VIEW stage2_frames_v;`).
- **New analysis script:** `scripts/experiments/frame_distribution_report.py` (read-only).
- **JSON sidecar:** `experiments/frame_distribution_2026-05-03.json`.
- **Report:** `docs/stage2-frame-analysis-2026-05-03.md`.
---
## Phase 1 — Scripts
@@ -0,0 +1,175 @@
# Stage 2 Frame Analysis — 2026-05-03
*Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).*
**Data source:** `stage_3_queue.stage2_metadata` (jsonb), exposed via the new SQL view `stage2_frames_v`. Analysis script: `scripts/experiments/frame_distribution_report.py`. Sidecar JSON: `experiments/frame_distribution_2026-05-03.json`. **Stage 3 service is currently stopped, so this is a stable snapshot.**
---
## Verdict
**Frames cluster meaningfully but coverage is partial.** Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate — the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. **However, only 56% of the embeddings corpus has any frame data at all.** The other 44% — conversations, short files, voice notes, dream outputs — has zero frame coverage by design, not by accident.
Frame-conditional routing is a viable γ component candidate **for the document side of the corpus**. It is not a viable router for the conversational or self-generated side without filling the coverage hole.
---
## 1. Corpus-wide frame coverage
| Class | Count | % of corpus | Frame coverage |
|---|---|---|---|
| Total distinct sources in `embeddings` | 1,255 | 100% | — |
| Files with frames (`stage_3_queue.stage2_metadata`) | 704 | 56.1% | yes |
| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | **none — bypass Stage 2 by design** |
| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | **none — Mistral never invoked** |
| Files that failed Stage 2 | 12 | 1.0% | none |
**56.1% frame coverage** is the headline. The architectural reason for the gap is twofold:
1. **`ingest_conversations.py` writes directly to `embeddings`** with `type='aaronai_conversation'` and never enqueues to `stage_2_queue`. Conversations have never been frame-extracted, full stop.
2. **`stage2_worker.py:139` gates Mistral on char_length.** Docs <2,000 chars are marked complete with `completed_at = NOW()` *before* Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory) — but neither is any frame data produced.
## 2. Frame distribution (the docs that DO have frames)
**668 docs, 1,374 distinct frame labels. Top-20 by count:**
| Frame | Count | % of frame-extracted docs |
|---|---|---|
| Education | 238 | 35.6% |
| Course | 58 | 8.7% |
| Programming | 43 | 6.4% |
| Design | 32 | 4.8% |
| Professional Experience | 24 | 3.6% |
| Employment | 24 | 3.6% |
| Research | 23 | 3.4% |
| 3D Printing | 22 | 3.3% |
| Project, Grading, Art, Budget | 21 each | 3.1% |
| Academic Integrity | 20 | 3.0% |
| Teaching, Technology, Attendance, Application | 13-19 | — |
| Accommodation, Manufacturing, Coursework, Recommendation | 10-13 | — |
**Per-doc frame count:** median 3-4 frames per doc; 76% of docs have 3-5 frames; one outlier doc has 30 frames (Mistral over-segmented).
**Long tail is enormous.** 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.
**"Education" is the universal frame.** It dominates co-occurrence pairs (8 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."
## 3. Label hygiene
**54 normalized collisions** detected (case-insensitive, underscore-vs-space):
| Concept | Variant counts |
|---|---|
| Professional Experience | `Professional Experience`:24 + `Professional_Experience`:6 |
| 3D Printing | `3D Printing`:22 + `3D_Printing`:7 |
| Academic Integrity | `Academic Integrity`:20 + `Academic_Integrity`:2 |
| Course Design | `Course Design`:9 + `Course_Design`:1 |
| Project Management | `Project Management`:7 + `Project_Management`:1 |
| Computational Design | `Computational Design`:7 + `Computational_Design`:1 |
| (… 48 more) | |
Without normalization, ~30+ documents have their frames silently split across spelling variants for the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, single-space, hyphens preserved.
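
A minimal sketch of the recommended canonical form (lowercase, single-space, hyphens preserved). `normalize_label` and `merge_counts` are hypothetical names, not functions in the analysis script.

```python
import re
from collections import Counter


def normalize_label(label):
    """Lowercase, turn underscores into spaces, collapse whitespace, keep hyphens."""
    s = label.replace("_", " ").strip().lower()
    return re.sub(r"\s+", " ", s)


def merge_counts(raw_counts):
    """Sum counts of raw labels that share a canonical form."""
    merged = Counter()
    for label, n in raw_counts.items():
        merged[normalize_label(label)] += n
    return dict(merged)
```

Applied to the first row of the table, `Professional Experience`:24 and `Professional_Experience`:6 merge into one label with count 30.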
## 4. Worker version drift
| Worker version | Doc count | Notes |
|---|---|---|
| v2.1 | 665 | Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
| v2.0 | 3 | Same key shape as v2.1 baseline. |
Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. **For Track 2 substrate ingest, plan for `stage2_metadata` to occasionally include unexpected top-level keys.**
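
One defensive pattern for ingest, sketched under assumptions: `split_metadata` is a hypothetical helper, and the set of expected keys is passed in by the caller because the prompt schema is not reproduced in this report.

```python
def split_metadata(meta, expected_keys):
    """Separate schema keys from ad-hoc keys so extras can be logged, not dropped."""
    known = {k: v for k, v in meta.items() if k in expected_keys}
    extras = {k: v for k, v in meta.items() if k not in expected_keys}
    return known, extras
```

Keeping the extras alongside the known fields mirrors what the view already does by exposing the raw JSONB.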
## 5. File-type signal
This is the most useful Track 2 finding from this report.
`stage_3_queue.source` stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:
| Frame | pdf | docx | pptx | markdown | txt | dream |
|---|---|---|---|---|---|---|
| Education | 116 | 119 | 3 | — | — | — |
| Course | 29 | 29 | — | — | — | — |
| Programming | 12 | 10 | **15** | — | 6 | — |
| Application | **13** | 2 | — | — | — | — |
| 3D Printing | 11 | 3 | **8** | — | — | — |
| Manufacturing | 3 | 6 | 4 | — | — | — |
| Research | 9 | 13 | — | 1 | — | — |
**Concrete signal:** "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal — the watcher already knows it — and it disambiguates frames that the model treats as equivalent. **`embeddings.type` is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.**
## 6. Systematic exclusions inside the 339-doc gap
Of the 339 short docs that bypass frame extraction, the breakdown by file type:
| Type | Count | What this is |
|---|---|---|
| pdf | 110 | Short PDFs (forms, single-page docs) |
| docx | 110 | Short Word docs |
| dream_output | 39 | **The dreamer's own NREM/Early-REM/Late-REM/synthesis files** |
| pptx | 31 | Short slide decks |
| txt | 28 | Plain-text files |
| voice_note | 14 | **Every voice note in the corpus** |
| markdown | 7 | Short markdown |
**Two specific systematic exclusions worth naming separately:**
- **All 14 voice notes have no frames.** Voice is one of Aaron's primary capture channels. The frame system is silent on it.
- **All 39 dream outputs have no frames.** The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.
These are NREM-shape findings: the architecture's frame extraction is *quietly* not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.
---
## 7. Would frame-conditional routing be a viable γ component, and what would it condition on?
**Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification.** The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:
1. **Normalize labels before any routing decision.** 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
2. **Treat "Education" as a near-universal prior, not a frame.** It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the *base case* and condition on the secondary frame. (See §8 follow-up — the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
3. **Combine frames with file type, not frames alone.** Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive — Programming → pptx (n=15), Application → pdf (n=13) — but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.
**What it would condition on:** the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute `P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length)` rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.
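
To make the conditioning variables concrete, here is a sketch of the feature tuple such a router might compute. Everything named here is hypothetical: `routing_features`, the two-bucket length split (which reuses the 2,000-char gate as the boundary), and the assumption that labels were normalized upstream. It computes no probability; it only shows the three axes.

```python
def routing_features(doc_frames, goal_frames, file_type, char_length):
    """Return (frame overlap, file type, length bucket) for one document."""
    overlap = frozenset(doc_frames) & frozenset(goal_frames)
    # Assumption: two buckets split at the Stage 2 char gate.
    bucket = "short" if char_length < 2000 else "long"
    return overlap, file_type, bucket
```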
**Defined scope (the coverage caveat):**
The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:
- **(a) Backfill frames for short docs and conversations.** Run Mistral on the 339 short docs (cheap — they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
- **(b) Use a degraded fallback for unframed docs.** File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
- **(c) Accept the gap as a scope limit.** The router only operates on long, non-conversation files. The 44% gap is unrouted (whatever the current default is).
(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). Mistral cost on the 537 unframed sources (339 short docs plus 198 conversations) is small. **Recommend (a) before any router work begins.**
---
## 8. Recommended follow-ups (ordered by ROI)
1. **Backfill the 339 short docs.** Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities — primary capture and primary self-reflection channels currently silent.
2. **Backfill conversations into frame extraction.** Either modify `ingest_conversations.py` to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
3. **Add a frame-label normalizer at the worker.** New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
4. **Decide whether to deprecate "Education" as a frame.** It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
5. **Per-frame retrieval-similarity follow-up (deferred from this report).** Now that we know frames cluster meaningfully, instrumenting `dream.py` to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.
6. **Diagnose the "Education" dominance: prompt artifact vs. corpus shape.** Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education", classify as *truly academic content* vs. *Education was a default Mistral reached for*. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. 20-doc sample is small enough to do in one sitting, large enough to distinguish the hypotheses at >70/30 splits.
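
For follow-up 6, the 20-doc sample can be drawn reproducibly. This is a sketch: `sample_for_inspection` is a hypothetical helper, and the fixed seed is an assumption so the same docs can be re-inspected later.

```python
import random


def sample_for_inspection(doc_ids, k=20, seed=0):
    """Pick up to k distinct doc ids, deterministically for a given seed."""
    rng = random.Random(seed)
    ids = sorted(doc_ids)  # stable order so the seed fully determines the draw
    return rng.sample(ids, min(k, len(ids)))
```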
---
## 9. Inventory edits flagged for session-end batch
- **Correction:** `stage2_metadata` lives on `stage_3_queue.stage2_metadata` (jsonb), not on `stage_2_queue` as the inventory implied. The Phase 1 / `stage2_worker.py` entry should be corrected.
- **New finding:** the char_length gate runs *before* the Mistral call (`stage2_worker.py:139` precedes `:147`). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
- **New finding:** `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1 — a routing decision the architecture didn't explicitly request.
- **New finding (cross-link to #2):** `embeddings.type` NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
- **New finding:** Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: **all 14 voice notes** and **all 39 dream outputs** are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction — which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others — a routing decision the architecture didn't explicitly request.
## 10. Reproduction
```bash
cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json: experiments/frame_distribution_<date>.json
# view: stage2_frames_v (in pgvector DB)
```
The view is `CREATE OR REPLACE`, idempotent. Drop with `DROP VIEW stage2_frames_v;` if needed.
@@ -0,0 +1,857 @@
{
"generated_at": "2026-05-03T23:47:54.802182+00:00",
"section_1": {
"overall": {
"total": 14069,
"type_null": 9815,
"ca_null": 12109,
"both_null": 9815,
"both_set": 1960
},
"cohorts": [
{
"type": "aaronai_conversation",
"ca_null": false,
"n": 71
},
{
"type": "chatgpt_conversation",
"ca_null": true,
"n": 1548
},
{
"type": "claude_conversation",
"ca_null": false,
"n": 1074
},
{
"type": "claude_memory",
"ca_null": true,
"n": 1
},
{
"type": "document",
"ca_null": false,
"n": 815
},
{
"type": "document",
"ca_null": true,
"n": 745
},
{
"type": null,
"ca_null": true,
"n": 9815
}
]
},
"section_2": {
"by_ext": [
{
"ext": ".pdf",
"rows": 6886
},
{
"ext": ".txt",
"rows": 1501
},
{
"ext": ".docx",
"rows": 1048
},
{
"ext": ".pptx",
"rows": 353
},
{
"ext": ".md",
"rows": 27
}
],
"classified": 9815,
"unclassifiable": 0
},
"section_3": {
"watcher_state_paths": 1462,
"watcher_state_basenames": 1183,
"watcher_state_collisions": 109,
"rows_with_filepath": {
"total": 9816,
"exists": 9649,
"missing": 167,
"outside_root": 0,
"sample": [
{
"id": "f317f238_0",
"source": "NO thesis proposal.docx",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF790 Thesis/Nic OConnor/NO thesis proposal.docx",
"mtime": "2024-01-26T15:06:09Z"
},
{
"id": "81047646_0",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
},
{
"id": "81047646_1",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
},
{
"id": "4e49d3b4_4",
"source": "Circuit Intro.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF310 Mechatronics/Week 1/Circuit Intro.pdf",
"mtime": "2022-01-31T23:28:56Z"
},
{
"id": "81047646_2",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
}
]
},
"rows_without_filepath": {
"total": 744,
"distinct_basenames": 228,
"unique_hit": 211,
"collision_hit": 16,
"unfound": 1
},
"collision_shapes": {
"total": 109,
"shape_counts": {
"multi-live": 95,
"live+archive": 14
},
"rows_affected_by_shape": {
"multi-live": 85,
"live+archive": 0
},
"samples": {
"multi-live": [
{
"name": "README.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/README.md",
"mtime": "2026-04-25T17:08:01Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Processing/Nature of Code/The-Nature-of-Code-Examples/The-Nature-of-Code-Examples-master/README.md",
"mtime": "2017-03-09T23:32:59Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/samples/hal/README.md",
"mtime": "2016-12-21T10:37:05Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/platforms/maven/README.md",
"mtime": "2016-12-21T10:37:05Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/hal/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/carotene/README.md",
"mtime": "2016-12-21T10:37:02Z"
}
]
},
{
"name": "3DPrinting_v2.pptx",
"rows_no_fp_using_this_name": 4,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Innovation Center/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:49Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Cuba/Assets/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:18Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Conference/3D Printing/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:15Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Workshops/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:30:14Z"
}
]
},
{
"name": "Print in Place.docx",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF205 CAD1/Print in Place.docx",
"mtime": "2017-08-24T03:50:36Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/ARS393 CVS1/Print in Place.docx",
"mtime": "2015-10-28T20:36:52Z"
}
]
}
],
"live+archive": [
{
"name": "dreamer-design-spec.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/dreamer-design-spec.md",
"mtime": "2026-04-25T22:55:11Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/dreamer-design-spec.md",
"mtime": "2026-04-25T22:55:11Z"
}
]
},
{
"name": "BirdAI-Ingest-Architecture.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/BirdAI-Ingest-Architecture.md",
"mtime": "2026-04-28T00:08:38Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/BirdAI-Ingest-Architecture.md",
"mtime": "2026-04-28T00:08:38Z"
}
]
},
{
"name": "graphiti-migration-plan.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/graphiti-migration-plan.md",
"mtime": "2026-04-27T17:54:40Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/Migration Plans/graphiti-migration-plan.md",
"mtime": "2026-04-27T17:54:40Z"
}
]
}
]
}
}
},
"section_4": {
"export_dir_exists": true,
"files": [
{
"name": "conversations-000.json",
"size": 19050556,
"mtime": "2026-04-24T19:55:44Z"
},
{
"name": "conversations-001.json",
"size": 29057594,
"mtime": "2026-04-24T19:55:44Z"
}
],
"convo_index_size": 169,
"sample_results": [
{
"id": "chatgpt_87cc0c47-aaf9-42da-8169-3b8922f3afba_0",
"source": "ChatGPT: Dog named Bird",
"convo_id": "87cc0c47-aaf9-42da-8169-3b8922f3afba",
"create_time": 1708835138.51948,
"create_time_iso": "2024-02-25T04:25:38.519480Z",
"resolved": true
},
{
"id": "chatgpt_689fab3e-d79c-8333-aeb5-7da4e9ca160d_0",
"source": "ChatGPT: Video understanding limitations",
"convo_id": "689fab3e-d79c-8333-aeb5-7da4e9ca160d",
"create_time": 1755294541.894811,
"create_time_iso": "2025-08-15T21:49:01.894811Z",
"resolved": true
},
{
"id": "chatgpt_611ff391-7fc0-42ea-bfd9-18dbe1739f19_7",
"source": "ChatGPT: Calculating Truncated Cone Angle",
"convo_id": "611ff391-7fc0-42ea-bfd9-18dbe1739f19",
"create_time": 1724020869.471264,
"create_time_iso": "2024-08-18T22:41:09.471264Z",
"resolved": true
},
{
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_50",
"source": "ChatGPT: Soul music playlist ideas",
"convo_id": "68ce1921-084c-8330-877c-78df1e03e54c",
"create_time": 1758337313.438344,
"create_time_iso": "2025-09-20T03:01:53.438344Z",
"resolved": true
},
{
"id": "chatgpt_c02e94f0-17db-4fd9-be04-13aaa1b728cb_1",
"source": "ChatGPT: Create Rhino plugin in Python",
"convo_id": "c02e94f0-17db-4fd9-be04-13aaa1b728cb",
"create_time": 1682716259.557353,
"create_time_iso": "2023-04-28T21:10:59.557353Z",
"resolved": true
}
],
"sample_resolved": 5,
"full_cohort": {
"distinct_convo_ids": 168,
"resolvable_from_export": 168,
"unresolvable": 0
}
},
"section_5": {
"earliest_per_type": [
{
"type": "aaronai_conversation",
"earliest": "2026-04-26T17:43:28.056503",
"latest": "2026-05-03T01:45:21.469613",
"rows": 71
},
{
"type": "claude_conversation",
"earliest": "2026-02-28T20:33:36.146998Z",
"latest": "2026-04-23T04:26:00.015419Z",
"rows": 1074
},
{
"type": "document",
"earliest": "2026-04-30 16:42:55.360736+00",
"latest": "2026-05-03 20:14:33.13663+00",
"rows": 815
}
],
"git_findings": [
"037d7475738352dd13620486b5154d58fa6c037b 2026-04-28 00:15:46 +0000 chore: archive deprecated chromadb and migration scripts",
"67766371789276ec4bcb8bac271b6eb9ddafa888 2026-04-27 05:16:37 +0000 Remove hardcoded PG password fallbacks \u2014 require PG_DSN env var in all scripts",
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
"8c8fba11b8d1b359b9b7722fc19b6ef562b812d8 2026-04-26 21:28:40 +0000 Add nightly conversation indexing \u2014 Aaron AI conversations into pgvector at 2:30AM",
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
"d2eed9890665a78a37fb5d336e8af75e7f2acb42 2026-04-26 20:19:49 +0000 Pre-pgvector migration checkpoint \u2014 upsert, allow_replace_deleted, maintenance timer"
],
"chromadb_candidates": [],
"proposed_sentinel": "2026-04-26T00:00:00Z",
"reasoning": "git f78b830 'Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL created_at all predate F11 and most predate the pgvector cutover itself. 2026-04-26 is the date the ChromaDB->pgvector migration script was committed, so any row currently in the embeddings table with NULL created_at must have been ingested on or after that date (when the table came into existence in current form). It is the tightest defensible upper bound on 'the row entered pgvector before timestamps were tracked', so it is the right sentinel."
},
"section_6": [
{
"cohort": "A (type NULL, ca NULL)",
"id": "f66c7390_6",
"source": "Design Guide - FDM for Composite Tooling 2.0.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2023-08-24T18:17:01Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "9cf798f8_151",
"source": "Shop Class as Soulcraft An inquiry into the value of the -- Crawford, Matthew.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-30T21:17:40.708026Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "fc378df0_329",
"source": "ulysses.txt",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2017-10-12T14:20:59Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "812bd5c6_0",
"source": "Bennington College Cover Letter.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2013-03-29T20:32:23Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "91ccefdd_185",
"source": "Cognition in the Wild (A Bradford Book) -- Hutchins, Edwin.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-25T17:21:35Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "48fa3d53_2",
"source": "CMakeLists.txt",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2016-12-21T10:37:05Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "49e3545d_9",
"source": "RH50-TM-L1-EN-20140902.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2014-09-02T18:44:08Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "a8366d89_144",
"source": "Hackers and Painters_ Big Ideas from the Computer Age -- Graham, Paul.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-24T22:25:03Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "3e3097f8_46",
"source": "The Nature and Art of Workmanship -- David Pye.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-24T22:24:03Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "87f9a5cf_269",
"source": "Supersizing the Mind_ Embodiment, Action, and Cognitive -- Andy Clark.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-25T17:14:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cd3d1914_61",
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T16:04:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "592a1366_0",
"source": "2026-04-29-synthesis.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T08:00:57.634567Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cfb0a691_3",
"source": "Consolidator-0.1-Specification.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T03:34:31Z",
"inferred_ca_source": "watcher_state_unique"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cd3d1914_57",
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T16:04:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "e65ef61c_8",
"source": "BirdAI-Research-Context.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T15:57:07Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "4dce2922_3",
"source": "cascade-optimization-protocol.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-28T05:46:24Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "077cc52d_1",
"source": "graphiti-migration-plan.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T17:54:40Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "db356b14_70",
"source": "Finite and infinite games -- James Carse.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T06:11:55Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "1f15bccf_38",
"source": "BirdAI-Experiments-Log.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-05-01T16:40:02Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "db356b14_13",
"source": "Finite and infinite games -- James Carse.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T06:11:55Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_30",
"source": "ChatGPT: External review for tenure",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_7",
"source": "ChatGPT: Website styling changes",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_67fc4254-ef50-8009-9e0f-81864cca7cec_1",
"source": "ChatGPT: Job Application Review",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68f3d936-d74c-8329-91df-fe838e292170_5",
"source": "ChatGPT: SEC coaches with OSU ties",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d1b5b-bb4c-832b-8d2e-11a86a569fcc_4",
"source": "ChatGPT: Hosting app platforms",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_bfa1cd2f-b8ab-4b11-b844-c47b2fa70612_1",
"source": "ChatGPT: New chat",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_37",
"source": "ChatGPT: Soul music playlist ideas",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_10",
"source": "ChatGPT: External review for tenure",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_10",
"source": "ChatGPT: Website styling changes",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_690286bd-0758-8332-8491-5d00c77f4696_1",
"source": "ChatGPT: Airbrushing and finishing setup",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_0",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_208",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "ead32317_93",
"source": "Richard Sennett - The Craftsman.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:23:34.012202+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:23:34.012202+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_4",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_175",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_101",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_268",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_5",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "ead32317_132",
"source": "Richard Sennett - The Craftsman.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:23:34.012202+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:23:34.012202+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_86",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_dacf89e3-1ee7-400d-8461-ef5920c82fe3_96",
"source": "Claude: University of Utah interview teaching example",
"existing_type": "claude_conversation",
"existing_ca": "2026-03-11T18:05:57.594832Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-03-11T18:05:57.594832Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_c0baf4b0-a7bb-4664-ac7b-98d7b02f56a6_26",
"source": "Claude: Weighing Utah versus Oklahoma",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-01T19:08:26.722197Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-01T19:08:26.722197Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_92",
"source": "Claude: Setting up a custom OpenClaw instance",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-23T04:26:00.015419Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-23T04:26:00.015419Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_42dbddc5-12ba-4de7-a685-043473189da9_6",
"source": "Claude: I filling out my annual report...",
"existing_type": "claude_conversation",
"existing_ca": "2026-03-24T14:34:47.870625Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-03-24T14:34:47.870625Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_1344",
"source": "Claude: Setting up a custom OpenClaw instance",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-23T04:26:00.015419Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-23T04:26:00.015419Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_28ee8a447d3fc922_6",
"source": "Aaron AI: I'm working on you",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-26T17:43:28.056503",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-26T17:43:28.056503",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_7deef2e8001f0e45_20",
"source": "Aaron AI: Who's covering for me on sabbatical?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-29T22:19:45.312349",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-29T22:19:45.312349",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_21cabf771708df70_42",
"source": "Aaron AI: What should I be the most excited about right now?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-27T07:06:03.996026",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-27T07:06:03.996026",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_7deef2e8001f0e45_12",
"source": "Aaron AI: Who's covering for me on sabbatical?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-29T22:19:45.312349",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-29T22:19:45.312349",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_ed40b4278a9c8110_4",
"source": "Aaron AI: Let's say you're building an analog of the human brain, and ...",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-05-03T01:45:21.469613",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-05-03T01:45:21.469613",
"inferred_ca_source": "preserved"
}
]
}
@@ -0,0 +1,987 @@
{
"generated_at": "2026-05-03T20:21:33.558462",
"n_docs_with_frames": 668,
"n_distinct_labels": 1374,
"top_30_frames": [
[
"Education",
238
],
[
"Course",
58
],
[
"Programming",
43
],
[
"Design",
32
],
[
"Professional Experience",
24
],
[
"Employment",
24
],
[
"Research",
23
],
[
"3D Printing",
22
],
[
"Project",
21
],
[
"Grading",
21
],
[
"Art",
21
],
[
"Budget",
21
],
[
"Academic Integrity",
20
],
[
"Teaching",
19
],
[
"Technology",
18
],
[
"Attendance",
17
],
[
"Application",
15
],
[
"Accommodation",
13
],
[
"Manufacturing",
13
],
[
"Coursework",
11
],
[
"Recommendation",
10
],
[
"Manufacturing Process",
10
],
[
"Additive Manufacturing",
10
],
[
"Job Application",
10
],
[
"Exhibitions",
10
],
[
"Academic Administration",
9
],
[
"Communication",
9
],
[
"Course Design",
9
],
[
"Veteran and Military Services",
9
],
[
"Career",
9
]
],
"label_collisions": {
"conversational": [
[
"Conversational",
1
],
[
"conversational",
1
]
],
"content": [
[
"Content",
1
],
[
"content",
1
]
],
"cascade": [
[
"Cascade",
1
],
[
"cascade",
1
]
],
"education": [
[
"Education",
238
],
[
"education",
1
]
],
"academic record": [
[
"Academic_Record",
1
],
[
"Academic Record",
1
]
],
"independent study": [
[
"Independent Study",
5
],
[
"Independent_Study",
2
]
],
"project management": [
[
"Project Management",
7
],
[
"Project_Management",
1
]
],
"digital fabrication": [
[
"Digital Fabrication",
6
],
[
"digital_fabrication",
1
],
[
"digital fabrication",
1
]
],
"project proposal": [
[
"Project_Proposal",
2
],
[
"Project Proposal",
2
]
],
"academic integrity": [
[
"Academic Integrity",
20
],
[
"Academic_Integrity",
2
]
],
"3d printing": [
[
"3D Printing",
22
],
[
"3D_Printing",
7
]
],
"technical skills": [
[
"Technical Skills",
2
],
[
"Technical_Skills",
1
]
],
"course structure": [
[
"Course Structure",
7
],
[
"Course_Structure",
1
]
],
"course design": [
[
"Course Design",
9
],
[
"Course_Design",
1
]
],
"product design": [
[
"Product Design",
6
],
[
"Product_Design",
1
]
],
"professional experience": [
[
"Professional Experience",
24
],
[
"Professional_Experience",
6
]
],
"disability accommodations": [
[
"Disability Accommodations",
4
],
[
"Disability_Accommodations",
1
]
],
"material science": [
[
"Material_Science",
2
],
[
"Material Science",
4
]
],
"computational design": [
[
"Computational Design",
7
],
[
"Computational_Design",
1
]
],
"computer services policy": [
[
"Computer Services Policy",
6
],
[
"Computer_Services_Policy",
1
]
],
"work experience": [
[
"Work_Experience",
1
],
[
"Work Experience",
3
]
],
"academic program": [
[
"Academic Program",
7
],
[
"Academic_Program",
1
]
],
"project-based learning": [
[
"Project-Based Learning",
5
],
[
"Project-Based_Learning",
1
],
[
"Project-based Learning",
2
]
],
"art and design": [
[
"Art and Design",
6
],
[
"Art_and_Design",
1
]
],
"fdm technology": [
[
"FDM_Technology",
2
],
[
"FDM Technology",
1
]
],
"material selection": [
[
"Material_Selection",
1
],
[
"Material Selection",
1
]
],
"product development": [
[
"Product Development",
6
],
[
"Product_Development",
2
]
],
"market research": [
[
"Market_Research",
1
],
[
"Market Research",
2
]
],
"computer services": [
[
"Computer Services",
2
],
[
"Computer_Services",
1
]
],
"student evaluation of instruction": [
[
"Student Evaluation of Instruction",
1
],
[
"Student_Evaluation_of_Instruction",
1
]
],
"course management": [
[
"Course_Management",
1
],
[
"Course Management",
1
]
],
"grade policy": [
[
"Grade_Policy",
1
],
[
"Grade Policy",
1
]
],
"academic transcript": [
[
"Academic_Transcript",
1
],
[
"Academic Transcript",
1
]
],
"evaluation criteria": [
[
"Evaluation Criteria",
1
],
[
"Evaluation_Criteria",
1
]
],
"computer science": [
[
"Computer Science",
2
],
[
"Computer_Science",
1
]
],
"electrical circuit": [
[
"Electrical Circuit",
2
],
[
"Electrical_Circuit",
1
]
],
"digital logic": [
[
"Digital Logic",
1
],
[
"Digital_Logic",
1
]
],
"course description": [
[
"Course Description",
3
],
[
"Course_Description",
1
]
],
"organizational structure": [
[
"Organizational_Structure",
1
],
[
"Organizational Structure",
1
]
],
"digital design": [
[
"Digital_Design",
1
],
[
"Digital Design",
4
]
],
"contact information": [
[
"Contact Information",
2
],
[
"Contact_Information",
1
]
],
"professional career": [
[
"Professional_Career",
2
],
[
"Professional Career",
1
]
],
"personal projects": [
[
"Personal_Projects",
1
],
[
"Personal Projects",
2
]
],
"ai development": [
[
"AI_Development",
1
],
[
"AI Development",
1
]
],
"university service": [
[
"University Service",
2
],
[
"University_Service",
1
]
],
"professional exhibitions and publications": [
[
"Professional Exhibitions and Publications",
1
],
[
"Professional_Exhibitions_and_Publications",
1
]
],
"selected external consulting and design work": [
[
"Selected External Consulting and Design Work",
1
],
[
"Selected_External_Consulting_and_Design_Work",
2
]
],
"academic career": [
[
"Academic_Career",
1
],
[
"Academic Career",
2
]
],
"technology integration": [
[
"Technology Integration",
2
],
[
"Technology_Integration",
1
]
],
"artistic practice": [
[
"Artistic_Practice",
1
],
[
"Artistic Practice",
1
]
],
"multi-material 3d printing": [
[
"Multi-Material 3D Printing",
1
],
[
"Multi-material 3D Printing",
1
]
],
"community engagement": [
[
"Community Engagement",
3
],
[
"Community_Engagement",
1
]
],
"digitaldesignandfabrication": [
[
"DigitalDesignAndFabrication",
1
],
[
"DigitalDesignandFabrication",
1
]
],
"professional background": [
[
"Professional Background",
3
],
[
"Professional_Background",
1
]
]
},
"per_doc_frame_count": {
"3": 282,
"5": 67,
"4": 195,
"2": 57,
"7": 13,
"11": 5,
"13": 2,
"15": 1,
"12": 4,
"6": 21,
"8": 8,
"10": 4,
"9": 6,
"30": 1,
"14": 1,
"18": 1
},
"top_30_pairs": [
{
"a": "Course",
"b": "Education",
"count": 46
},
{
"a": "Education",
"b": "Project",
"count": 20
},
{
"a": "Design",
"b": "Education",
"count": 20
},
{
"a": "Education",
"b": "Professional Experience",
"count": 20
},
{
"a": "Education",
"b": "Employment",
"count": 20
},
{
"a": "Education",
"b": "Technology",
"count": 18
},
{
"a": "Education",
"b": "Grading",
"count": 17
},
{
"a": "Education",
"b": "Research",
"count": 15
},
{
"a": "Art",
"b": "Education",
"count": 15
},
{
"a": "Attendance",
"b": "Grading",
"count": 14
},
{
"a": "Course",
"b": "Grading",
"count": 13
},
{
"a": "Academic Integrity",
"b": "Education",
"count": 11
},
{
"a": "Attendance",
"b": "Education",
"count": 11
},
{
"a": "Attendance",
"b": "Course",
"count": 11
},
{
"a": "Application",
"b": "Employment",
"count": 11
},
{
"a": "Coursework",
"b": "Education",
"count": 10
},
{
"a": "Course",
"b": "Design",
"count": 10
},
{
"a": "Course",
"b": "Programming",
"count": 10
},
{
"a": "Application",
"b": "Education",
"count": 10
},
{
"a": "Budget",
"b": "Education",
"count": 10
},
{
"a": "Academic Integrity",
"b": "Accommodation",
"count": 9
},
{
"a": "Education",
"b": "Teaching",
"count": 9
},
{
"a": "Education",
"b": "Programming",
"count": 9
},
{
"a": "Academic Integrity",
"b": "Attendance",
"count": 9
},
{
"a": "Course",
"b": "Project",
"count": 8
},
{
"a": "Research",
"b": "Teaching",
"count": 8
},
{
"a": "Grading",
"b": "Project",
"count": 7
},
{
"a": "Art",
"b": "Technology",
"count": 7
},
{
"a": "Academic Integrity",
"b": "Course",
"count": 7
},
{
"a": "Accommodation",
"b": "Course",
"count": 7
}
],
"folder_crosstab": {
"Education": {
"pdf": 116,
"docx": 119,
"pptx": 3
},
"Course": {
"pdf": 29,
"docx": 29
},
"Programming": {
"pptx": 15,
"docx": 10,
"pdf": 12,
"txt": 6
},
"Design": {
"pdf": 13,
"docx": 16,
"pptx": 3
},
"Professional Experience": {
"docx": 13,
"pdf": 11
},
"Employment": {
"pdf": 15,
"docx": 9
},
"Research": {
"pdf": 9,
"docx": 13,
"markdown": 1
},
"3D Printing": {
"docx": 3,
"pdf": 11,
"pptx": 8
},
"Project": {
"pdf": 8,
"docx": 12,
"markdown": 1
},
"Grading": {
"pdf": 10,
"docx": 11
},
"Art": {
"docx": 11,
"pdf": 9,
"pptx": 1
},
"Budget": {
"docx": 6,
"pdf": 15
},
"Academic Integrity": {
"docx": 17,
"pdf": 3
},
"Teaching": {
"pdf": 9,
"docx": 10
},
"Technology": {
"docx": 15,
"pdf": 3
},
"Attendance": {
"docx": 11,
"pdf": 6
},
"Application": {
"pdf": 13,
"docx": 2
},
"Accommodation": {
"docx": 11,
"pdf": 2
},
"Manufacturing": {
"docx": 6,
"pptx": 4,
"pdf": 3
},
"Coursework": {
"pdf": 8,
"docx": 3
}
},
"bin_totals": {
"markdown": 64,
"pdf": 286,
"pptx": 70,
"txt": 28,
"docx": 217,
"dream_output": 3
},
"worker_versions": {
"2.0": 3,
"2.1": 665
},
"data_gap": {
"count": 339,
"by_type_bin": {
"pdf": 110,
"voice_note": 14,
"docx": 110,
"dream_output": 39,
"pptx": 31,
"txt": 28,
"markdown": 7
},
"char_length": {
"min": 6,
"max": 1998,
"median": 1077
},
"sample_sources": [
"Thesis Paper Guidlines.pdf",
"2026-04-30-17-06-voice.md",
"2026-04-30-15-59-voice.md",
"2026-04-30-16-53-voice.md",
"2026-04-30-16-23-voice.md",
"2026-04-29-17-52-voice.md",
"2026-04-30-16-59-voice.md",
"Outline for 3D Printed Materials for Foundry Casting.docx",
"2026-04-26-22-52-voice.md",
"2026-04-30-synthesis.md"
]
},
"corpus_coverage": {
"total_distinct_sources_in_embeddings": 1255,
"conversations_no_frames_by_design": 198,
"files_with_frames": 704,
"files_short_no_frames": 339,
"files_stage2_failed": 12,
"frame_coverage_pct": 56.1
}
}
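The `label_collisions` keys in the report above look like the raw frame labels lowercased with underscores treated as spaces (for example `3D Printing` and `3D_Printing` both land under `3d printing`). A minimal sketch of how such a grouping could be computed, assuming that normalization rule; `normalize_label` and `find_label_collisions` are hypothetical names, not functions from this repository:

```python
from collections import Counter, defaultdict


def normalize_label(label: str) -> str:
    """Assumed rule: lowercase the label and treat underscores as spaces."""
    return label.replace("_", " ").lower()


def find_label_collisions(label_counts: Counter) -> dict:
    """Group raw labels by normalized key; keep keys with more than one raw spelling."""
    groups = defaultdict(list)
    for raw, count in label_counts.items():
        groups[normalize_label(raw)].append([raw, count])
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Counts copied from the report above. "Course" has a single spelling,
# so it should not be reported as a collision.
sample = Counter({
    "3D Printing": 22,
    "3D_Printing": 7,
    "Digital Fabrication": 6,
    "digital_fabrication": 1,
    "digital fabrication": 1,
    "Course": 58,
})
collisions = find_label_collisions(sample)
```

Applying the same `normalize_label` at write time, rather than only in the report, would be one way to keep variants such as `Professional_Experience` from accumulating separately; whether the worker should do that is a design decision this sketch does not settle.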
+47 -37
@@ -16,6 +16,7 @@ import os
import json
import sqlite3
import argparse
+from collections import Counter
from pathlib import Path
from datetime import datetime, timedelta
from dotenv import load_dotenv
@@ -282,9 +283,11 @@ def retrieve_graphiti(mode, task=None, n_results=8, excluded_sources=None):
print(f"[Graphiti retrieval error: {e}] — falling back to empty.")
return []
-def retrieve(mode, task=None, n_results=8, excluded_sources=None):
+def retrieve(mode, task=None, n_results=8, excluded_sources=None, type_filter=None):
# E3 experiment: DREAMER_SUBSTRATE=graphiti routes retrieval to Graphiti /search
# Default behavior: pgvector similarity search (unchanged)
+# type_filter is experimental and applies to pgvector retrieval only — Graphiti
+# facts are not embeddings rows and have no embeddings.type to filter on.
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
if substrate == "graphiti":
return retrieve_graphiti(mode, task=task, n_results=n_results, excluded_sources=excluded_sources)
@@ -311,23 +314,23 @@ def retrieve(mode, task=None, n_results=8, excluded_sources=None):
pg = get_pg()
cur = pg.cursor()
excluded_sources = excluded_sources or set()
+where, params = [], []
if excluded_sources:
-cur.execute("""
-SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
-FROM embeddings
-WHERE source NOT IN %s
-ORDER BY embedding <=> %s::vector
-LIMIT %s
-""", (embedding, tuple(excluded_sources), embedding, n_results * 3))
-else:
-cur.execute("""
-SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
-FROM embeddings
-ORDER BY embedding <=> %s::vector
-LIMIT %s
-""", (embedding, embedding, n_results * 3))
+where.append("source NOT IN %s")
+params.append(tuple(excluded_sources))
+if type_filter:
+where.append("type = ANY(%s)")
+params.append(list(type_filter))
+where_clause = ("WHERE " + " AND ".join(where)) if where else ""
+cur.execute(f"""
+SELECT document, source, type, 1 - (embedding <=> %s::vector) as similarity
+FROM embeddings
+{where_clause}
+ORDER BY embedding <=> %s::vector
+LIMIT %s
+""", [embedding, *params, embedding, n_results * 3])
-for doc, source, similarity in cur.fetchall():
+for doc, source, etype, similarity in cur.fetchall():
if not (low <= similarity <= high):
continue
if source in seen_sources:
@@ -337,6 +340,7 @@ def retrieve(mode, task=None, n_results=8, excluded_sources=None):
"content": doc,
"relevance": similarity,
"similarity": similarity,
"type": etype,
})
seen_sources.add(source)
if len(chunks) >= n_results:
@@ -482,7 +486,7 @@ def write_manifest(date_str, stage_data, corpus_data):
print(f"Manifest write failed (non-critical): {e}")
def dream_pipeline():
def dream_pipeline(type_filter=None):
"""
Full nightly pipeline — interdependent stages.
NREM output feeds Early REM. Both feed Late REM. All three feed Synthesis.
@@ -490,18 +494,18 @@ def dream_pipeline():
print(f"Dreamer pipeline starting — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
state = load_dreamer_state()
previously_retrieved = set(state.get("retrieved_sources", []))
state.pop("retrieved_sources", None) # legacy key; session-scoped novelty now
session_retrieved = set()
delta = observe_corpus()
print(f"Corpus: {delta['new_chunks']} new chunks, {delta['days_since_dream']:.1f} days since last dream")
print(f"Excluding {len(previously_retrieved)} previously retrieved sources")
print("Novelty: session-scoped (no across-night exclusion)")
# ── Stage 1: NREM ──────────────────────────────────────────────────────
print("\n[NREM] Retrieving...")
# NREM is replay-and-consolidation — does not exclude prior traces.
# Late REM and Early REM exclude prior content for novelty; NREM does not.
nrem_chunks = retrieve("nrem", excluded_sources=None)
nrem_chunks = retrieve("nrem", excluded_sources=None, type_filter=type_filter)
session_retrieved.update(c["source"] for c in nrem_chunks)
# Track sources that scored above Early REM ceiling — these are the only ones Early REM should exclude
nrem_high_sources = {c["source"] for c in nrem_chunks if c["similarity"] > 0.55}
@@ -523,6 +527,10 @@ def dream_pipeline():
"sources": nrem_sources,
"distinct_folders": nrem_folders,
"folder_count": len(nrem_folders),
# Counter filters None: Graphiti chunks lack `type` (facts, not embeddings rows).
# Pgvector chunks always carry type post-Improvement-#2 backfill. If type
# ever appears as None here, the backfill or writer enforcement has regressed.
"type_distribution": dict(Counter(c.get("type") for c in nrem_chunks if c.get("type"))),
"status": "ok",
}
}
@@ -532,7 +540,7 @@ def dream_pipeline():
print("\n[Early REM] Retrieving...")
# Early REM excludes NREM high-scorers only (not the full session_retrieved set)
# Sources that scored in Early REM band during NREM remain available
early_chunks = retrieve("early-rem", excluded_sources=previously_retrieved | nrem_high_sources)
early_chunks = retrieve("early-rem", excluded_sources=nrem_high_sources, type_filter=type_filter)
session_retrieved.update(c["source"] for c in early_chunks)
if not early_chunks:
print("[Early REM] No suitable chunks — skipping")
@@ -551,13 +559,14 @@ def dream_pipeline():
"sources": early_sources,
"distinct_folders": early_folders,
"folder_count": len(early_folders),
"type_distribution": dict(Counter(c.get("type") for c in early_chunks if c.get("type"))),
"status": "ok",
}
print(f"[Early REM] Done.\n{early_rem_output[:200]}...")
# ── Stage 3: Late REM — informed by NREM + Early REM ──────────────────
print("\n[Late REM] Retrieving...")
late_chunks = retrieve("late-rem", excluded_sources=previously_retrieved | session_retrieved)
late_chunks = retrieve("late-rem", excluded_sources=session_retrieved, type_filter=type_filter)
session_retrieved.update(c["source"] for c in late_chunks)
if not late_chunks:
print("[Late REM] No suitable chunks — skipping")
@@ -582,6 +591,7 @@ def dream_pipeline():
"distinct_folders": list(set(late_folders)),
"folder_count": len(set(late_folders)),
"cross_domain_pairs": cross_domain_pairs,
"type_distribution": dict(Counter(c.get("type") for c in late_chunks if c.get("type"))),
"status": "ok",
}
print(f"[Late REM] Done.\n{late_rem_output[:200]}...")
@@ -616,18 +626,11 @@ def dream_pipeline():
}
write_manifest(datetime.now().strftime("%Y-%m-%d"), stage_data, corpus_data)
# Update state and notify
state = load_dreamer_state()
# Update state and notify (reuse state from start of pipeline; legacy key already popped)
state["last_dream_timestamp"] = datetime.now().timestamp()
state["last_dream_mode"] = "pipeline"
state["last_dream_file"] = synthesis_file
# Accumulate retrieved sources across nights. Cap at 500, trim to 400 on overflow.
all_retrieved = list(previously_retrieved | session_retrieved)
if len(all_retrieved) > 500:
all_retrieved = all_retrieved[-400:]
state["retrieved_sources"] = all_retrieved
save_dreamer_state(state)
notify_sse("synthesis", synthesis_file.split("/")[-1])
@@ -635,10 +638,10 @@ def dream_pipeline():
return synthesis_file
def dream_lucid(task):
def dream_lucid(task, type_filter=None):
"""On-demand lucid dream — single mode, used by Dream Now in settings."""
print(f"Lucid dream starting — task: {task[:80] if task else 'none'}")
chunks = retrieve("lucid", task=task)
chunks = retrieve("lucid", task=task, type_filter=type_filter)
if not chunks:
print("No suitable chunks — aborting")
return None
@@ -660,13 +663,13 @@ def dream_lucid(task):
return filepath
def dream_single(mode, task=None):
def dream_single(mode, task=None, type_filter=None):
"""
Single mode — used by Dream Now for non-lucid modes.
Runs one stage independently (for testing/tuning individual stages).
"""
print(f"Single mode dream: {mode}")
chunks = retrieve(mode, task=task)
chunks = retrieve(mode, task=task, type_filter=type_filter)
if not chunks:
print("No suitable chunks — aborting")
return None
@@ -703,12 +706,19 @@ if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Aaron AI Dreamer")
parser.add_argument("--mode", choices=["nrem", "early-rem", "late-rem", "lucid", "pipeline"])
parser.add_argument("--task", type=str)
parser.add_argument(
"--type-filter", type=str, default=None,
help="Comma-separated embeddings.type allowlist (e.g. 'document,aaronai_conversation'). "
"Applies to pgvector retrieval only; Graphiti chunks are not filtered. "
"Experimental — default is no filter, no behavior change.",
)
args = parser.parse_args()
type_filter = [t.strip() for t in args.type_filter.split(",")] if args.type_filter else None
if args.mode == "lucid":
dream_lucid(args.task or "What should I be thinking about that I am not?")
dream_lucid(args.task or "What should I be thinking about that I am not?", type_filter=type_filter)
elif args.mode and args.mode != "pipeline":
dream_single(args.mode, args.task)
dream_single(args.mode, args.task, type_filter=type_filter)
else:
# Default: full pipeline
dream_pipeline()
dream_pipeline(type_filter=type_filter)
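The dynamic WHERE clause in `retrieve()` only works if the order of `%s` placeholders matches the order of the parameter list. The following is a minimal sketch of that construction, pulled into a standalone function so the ordering can be checked without a database. The helper name `build_retrieval_query` is hypothetical and does not appear in the commit; the SQL text mirrors the diff above.

```python
def build_retrieval_query(embedding, n_results, excluded_sources=None, type_filter=None):
    """Hypothetical helper (not in the commit): return (sql, params) for retrieve()."""
    where, params = [], []
    if excluded_sources:
        # psycopg2 adapts a Python tuple to a parenthesised SQL list for NOT IN.
        where.append("source NOT IN %s")
        params.append(tuple(excluded_sources))
    if type_filter:
        # psycopg2 adapts a Python list to a SQL array for = ANY(...).
        where.append("type = ANY(%s)")
        params.append(list(type_filter))
    where_clause = ("WHERE " + " AND ".join(where)) if where else ""
    sql = f"""
        SELECT document, source, type, 1 - (embedding <=> %s::vector) AS similarity
        FROM embeddings
        {where_clause}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # Order must follow the placeholders: SELECT, WHERE..., ORDER BY, LIMIT.
    return sql, [embedding, *params, embedding, n_results * 3]


if __name__ == "__main__":
    sql, params = build_retrieval_query("[0.1, 0.2]", 8, {"a.md"}, ["document"])
    print(sql.count("%s"), len(params))
```

Keeping the filter parameters between the two `embedding` arguments is what lets either filter be absent without reshuffling the rest of the list.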
+16 -1
@@ -101,11 +101,24 @@ def chunk_and_embed(text: str,
def write_embeddings_batch(conn, batch: list[dict]) -> int:
"""Single canonical INSERT. Sets created_at = NOW() server-side. Commits."""
"""Single canonical INSERT. Sets created_at = NOW() server-side. Commits.
Every row dict must supply 'type'. created_at is SQL-supplied (NOW()), so
callers do not need to provide it. The application-layer assertion is the
primary enforcement point for type — the column lacks NOT NULL because
historical NULLs were resolved by the Improvement #2 backfill, and a
Python-level raise gives a faster, more debuggable failure than a
Postgres constraint error.
"""
if not batch:
return 0
cur = conn.cursor()
for row in batch:
if not row.get("type"):
raise ValueError(
f"row {row.get('id')!r} missing 'type'; writers must supply it "
f"(see Improvement #2 in docs/birdai-component-inventory)"
)
cur.execute("""
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
@@ -113,6 +126,8 @@ def write_embeddings_batch(conn, batch: list[dict]) -> int:
document = EXCLUDED.document,
embedding = EXCLUDED.embedding,
source = EXCLUDED.source,
type = EXCLUDED.type,
created_at = COALESCE(embeddings.created_at, EXCLUDED.created_at),
metadata = EXCLUDED.metadata
""", (row["id"], row["document"], row["embedding"],
row["source"], row["type"], json.dumps(row["metadata"])))
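The two rules in this hunk (reject rows without `type`, and keep an existing `created_at` on conflict) can be modelled in plain Python. This is a sketch under stated assumptions, not the committed code: `validate_batch` and `merged_created_at` are hypothetical names, and `validate_batch` checks the whole batch up front, whereas the committed loop checks each row just before its INSERT.

```python
def validate_batch(batch):
    """Hypothetical pre-flight check: raise before any INSERT is issued."""
    for row in batch:
        if not row.get("type"):
            raise ValueError(
                f"row {row.get('id')!r} missing 'type'; writers must supply it"
            )
    return len(batch)


def merged_created_at(existing, incoming):
    """Python model of COALESCE(embeddings.created_at, EXCLUDED.created_at).

    An existing (for example backfilled) timestamp wins; the incoming NOW()
    value is used only when the stored value is NULL.
    """
    return existing if existing is not None else incoming


if __name__ == "__main__":
    validate_batch([{"id": "a_0", "type": "document"}])
    print(merged_created_at("2026-04-27T06:11:55Z", "2026-05-03T12:00:00Z"))
    print(merged_created_at(None, "2026-05-03T12:00:00Z"))
```

Under this model a re-ingest can change `type` while a previously backfilled `created_at` is left as it was.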
@@ -0,0 +1,304 @@
"""Backfill embeddings.type and embeddings.created_at (Improvement #2 / A.3).
Idempotent on cohort predicates (every WHERE clause includes IS NULL on the
target column). Writes provenance to metadata.type_source and metadata.created_at_source
so each row is auditable and revertable per-source. Dry-run by default; pass --apply to commit.
Order of batches:
T1. type backfill: WHERE type IS NULL -> 'document' (extension-classified, all hit).
C1. created_at: WHERE ca IS NULL AND metadata.filepath stat-resolves -> filesystem mtime.
C2. created_at: WHERE ca IS NULL AND source has unique watcher_state path -> watcher mtime.
C3. created_at: WHERE ca IS NULL AND source has watcher_state collision -> most-recent mtime.
C4. created_at: WHERE type='chatgpt_conversation' AND ca IS NULL -> export-resolved create_time.
C5. created_at: WHERE ca IS NULL (residual) -> sentinel.
Snapshot table embeddings_backup_2026_05_03 must exist before --apply.
Usage:
venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py # dry-run
venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py --apply # write
Exits non-zero if snapshot is missing on --apply.
"""
import argparse
import json
import os
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path
import psycopg2
from psycopg2.extras import RealDictCursor, Json
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env")
PG_DSN = os.getenv("PG_DSN")
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
SNAPSHOT_TABLE = "embeddings_backup_2026_05_03"
SENTINEL_ISO = "2026-04-26T00:00:00Z"
# ─── Helpers ────────────────────────────────────────────────────────────────
def get_pg():
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
def header(t):
bar = "=" * 70
print(f"\n{bar}\n{t}\n{bar}")
def fmt_ts_unix(ts):
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
def fmt_ts_mtime(p):
try:
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
def load_watcher_state():
state = json.loads(WATCHER_STATE.read_text())
by_name = defaultdict(list)
for path, mtime in state.items():
by_name[Path(path).name].append((path, mtime))
return by_name
def load_chatgpt_index():
if not CHATGPT_EXPORT_DIR.exists():
return {}
index = {}
for f in sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json")):
try:
data = json.loads(f.read_text(encoding="utf-8"))
except Exception:
continue
for convo in data:
cid = convo.get("id") or convo.get("conversation_id")
ct = convo.get("create_time")
if cid and ct is not None:
index[cid] = ct
return index
def assert_snapshot(cur):
cur.execute("SELECT to_regclass(%s) AS t;", (SNAPSHOT_TABLE,))
if cur.fetchone()["t"] is None:
print(f"ERROR: snapshot table '{SNAPSHOT_TABLE}' not found. Run A.2 first.")
sys.exit(2)
cur.execute(f"SELECT COUNT(*) AS n FROM {SNAPSHOT_TABLE};")
snap = cur.fetchone()["n"]
cur.execute("SELECT COUNT(*) AS n FROM embeddings;")
live = cur.fetchone()["n"]
print(f"snapshot {SNAPSHOT_TABLE}: {snap} rows; live embeddings: {live} rows")
if snap != live:
print(f"ERROR: snapshot row count != live ({snap} vs {live}). Refresh snapshot before --apply.")
sys.exit(2)
# ─── Batch primitive ────────────────────────────────────────────────────────
def run_batch(cur, label, candidates, apply_mode):
"""candidates: list of (id, set_type, set_ca, type_source, ca_source).
set_type / set_ca may be None to leave that column alone.
In dry-run we still execute UPDATEs inside an outer transaction (rolled back
at the end) so subsequent batches' SELECTs see the correct intermediate state."""
n = len(candidates)
print(f" {label}: {n} rows queued")
if n == 0:
return 0
for c in candidates[:3]:
print(f" sample: id={c[0]} type={c[1]!r} ca={c[2]!r} type_src={c[3]} ca_src={c[4]}")
n_written = 0
for row_id, set_type, set_ca, type_src, ca_src in candidates:
meta_patch = {}
if type_src:
meta_patch["type_source"] = type_src
if ca_src:
meta_patch["created_at_source"] = ca_src
# Build set list dynamically.
sets, params = [], []
if set_type is not None:
sets.append("type = %s")
params.append(set_type)
if set_ca is not None:
sets.append("created_at = %s")
params.append(set_ca)
if meta_patch:
sets.append("metadata = COALESCE(metadata, '{}'::jsonb) || %s::jsonb")
params.append(json.dumps(meta_patch))
params.append(row_id)
cur.execute(f"UPDATE embeddings SET {', '.join(sets)} WHERE id = %s;", params)
n_written += cur.rowcount
print(f" {n_written} rows updated{' (will rollback)' if not apply_mode else ''}")
return n_written
# ─── Batches ────────────────────────────────────────────────────────────────
def batch_T1_type(cur, apply_mode):
"""type IS NULL -> 'document'. All cohort A rows have a SUPPORTED extension."""
cur.execute("""
SELECT id, source FROM embeddings WHERE type IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands = [(r["id"], "document", None, "inferred_extension", None) for r in rows]
return run_batch(cur, "T1 type IS NULL -> 'document'", cands, apply_mode)
def batch_C1_filepath_stat(cur, apply_mode):
"""ca IS NULL AND metadata.filepath stat-resolves -> mtime."""
cur.execute("""
SELECT id, source, metadata->>'filepath' AS fp
FROM embeddings
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL
ORDER BY id;
""")
rows = cur.fetchall()
cands, n_skipped_missing = [], 0
for r in rows:
p = Path(r["fp"])
if p.exists():
mt = fmt_ts_mtime(p)
if mt:
cands.append((r["id"], None, mt, None, "filepath_stat"))
continue
n_skipped_missing += 1
print(f" C1 candidates: {len(cands)} (skipped {n_skipped_missing} where filepath gone or unstattable)")
return run_batch(cur, "C1 ca IS NULL AND filepath stat-resolves -> mtime", cands, apply_mode)
def batch_C2_C3_watcher_state(cur, apply_mode):
"""ca IS NULL AND filepath unresolvable -> watcher_state by source basename.
C2 = unique hit, C3 = collision pick-latest."""
by_name = load_watcher_state()
cur.execute("""
SELECT id, source, metadata->>'filepath' AS fp
FROM embeddings
WHERE created_at IS NULL
ORDER BY id;
""")
rows = cur.fetchall()
c2, c3 = [], []
skipped_no_match = 0
for r in rows:
# skip rows already targeted by C1 path
if r["fp"] and Path(r["fp"]).exists():
continue
src = r["source"]
if not src or src not in by_name:
skipped_no_match += 1
continue
candidates = by_name[src]
if len(candidates) == 1:
mt = fmt_ts_unix(candidates[0][1])
c2.append((r["id"], None, mt, None, "watcher_state_unique"))
else:
latest = max(candidates, key=lambda x: float(x[1]))
mt = fmt_ts_unix(latest[1])
c3.append((r["id"], None, mt, None, f"watcher_state_collision_pick_latest_of_{len(candidates)}"))
print(f" C2/C3 source-basename fallback: {len(c2)} unique, {len(c3)} collision, "
f"{skipped_no_match} unmatched (will fall to C4/C5)")
n2 = run_batch(cur, "C2 ca IS NULL AND watcher_state unique -> mtime", c2, apply_mode)
n3 = run_batch(cur, "C3 ca IS NULL AND watcher_state collision -> latest mtime", c3, apply_mode)
return n2 + n3
def batch_C4_chatgpt_export(cur, apply_mode):
index = load_chatgpt_index()
cur.execute("""
SELECT id, source FROM embeddings
WHERE type='chatgpt_conversation' AND created_at IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands, unresolved = [], 0
for r in rows:
m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
cid = m.group(1) if m else None
ct = index.get(cid)
if ct is None:
unresolved += 1
continue
ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
cands.append((r["id"], None, ct_iso, None, "chatgpt_export"))
print(f" C4 chatgpt export resolution: {len(cands)} resolved, {unresolved} unresolved (fall to C5)")
return run_batch(cur, "C4 type='chatgpt_conversation' AND ca IS NULL -> export create_time", cands, apply_mode)
def batch_C5_sentinel(cur, apply_mode):
cur.execute("""
SELECT id, type, source FROM embeddings WHERE created_at IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands = [(r["id"], None, SENTINEL_ISO, None, "sentinel") for r in rows]
if cands:
sample_types = Counter(r["type"] for r in rows)
print(f" C5 residual sentinel rows by type: {dict(sample_types)}")
return run_batch(cur, f"C5 ca IS NULL residual -> sentinel {SENTINEL_ISO}", cands, apply_mode)
# ─── Pre/post counts ────────────────────────────────────────────────────────
def print_counts(cur, label):
cur.execute("""
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null
FROM embeddings;
""")
r = cur.fetchone()
print(f" [{label}] total={r['total']} type_null={r['type_null']} ca_null={r['ca_null']}")
# ─── Driver ─────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--apply", action="store_true", help="default false (dry-run)")
args = ap.parse_args()
apply_mode = args.apply
pg = get_pg()
cur = pg.cursor()
print(f"Mode: {'APPLY (writes will commit)' if apply_mode else 'DRY-RUN (no writes)'}")
print(f"Sentinel: {SENTINEL_ISO}")
if apply_mode:
assert_snapshot(cur)
header("PRE-COUNTS")
print_counts(cur, "before")
header("BATCHES")
n_t1 = batch_T1_type(cur, apply_mode)
n_c1 = batch_C1_filepath_stat(cur, apply_mode)
n_c2c3 = batch_C2_C3_watcher_state(cur, apply_mode)
n_c4 = batch_C4_chatgpt_export(cur, apply_mode)
n_c5 = batch_C5_sentinel(cur, apply_mode)
header("POST-COUNTS")
print_counts(cur, "after" if apply_mode else "after (in-transaction, will rollback)")
if apply_mode:
pg.commit()
print("\nCOMMITTED.")
else:
pg.rollback()
print("\nROLLED BACK (dry-run).")
print(f"\nSummary: T1={n_t1} C1={n_c1} C2+C3={n_c2c3} C4={n_c4} C5={n_c5}")
pg.close()
if __name__ == "__main__":
main()
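The C2/C3 fallback above resolves a row's `created_at` from `watcher_state.json` by source basename: a single match is used directly, and a collision picks the most recent mtime. A minimal sketch of that resolution, assuming the state file maps full paths to unix timestamps stored as strings; the function names `index_by_basename` and `resolve_from_watcher_state` are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path


def index_by_basename(state):
    """Group watcher_state entries (path -> mtime string) by file basename."""
    by_name = defaultdict(list)
    for path, mtime in state.items():
        by_name[Path(path).name].append((path, mtime))
    return by_name


def resolve_from_watcher_state(source, by_name):
    """Return (iso_timestamp, provenance), or None when the basename is unknown."""
    candidates = by_name.get(source)
    if not candidates:
        return None
    if len(candidates) == 1:
        ts = candidates[0][1]
        provenance = "watcher_state_unique"
    else:
        # Collision: several tracked paths share this basename; take the latest mtime.
        ts = max(candidates, key=lambda x: float(x[1]))[1]
        provenance = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
    iso = datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
    return iso, provenance


if __name__ == "__main__":
    state = {"/live/notes.md": "200", "/Archive/notes.md": "100", "/live/x.pdf": "0"}
    by_name = index_by_basename(state)
    print(resolve_from_watcher_state("notes.md", by_name))
    print(resolve_from_watcher_state("missing.md", by_name))
```

Recording the provenance string alongside the timestamp is what makes a collision-derived value distinguishable from a unique hit when auditing later.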
@@ -0,0 +1,557 @@
"""Read-only inspection for the embeddings.type / embeddings.created_at backfill (Improvement #2 / A.1).
Produces a survey of every backfill source-of-truth question without writing
to the database. Output is a human-readable report on stdout plus a JSON
sidecar at experiments/embeddings_backfill_inspection_<date>.json.
Sections:
1. Cohort recap (counts; should match prior investigation).
2. Cohort A type inference: extension classifier coverage.
3. created_at inference for cohort A + B-doc-old:
- rows with metadata.filepath: stat the file, check existence.
- rows without filepath: lookup source against watcher_state.json.
- filename-collision shape audit (live+backup, live+archive, ambiguous).
4. ChatGPT export resolution (Plan A.1 addition #1):
- existence of /home/aaron/nextcloud/.../ChatGPT Export/.
- sample 5 B-chatgpt rows; resolve convo_id -> create_time.
5. Sentinel date discovery (Plan A.1 addition #3):
- earliest non-NULL created_at per type (already-populated rows are the
lower bound for when the substrate started carrying timestamps).
- git log for the pgvector migration commit.
- any ChromaDB sqlite still on disk.
- propose a sentinel with reasoning, or flag as arbitrary.
6. 50-row stratified sample: derived (type, created_at, source) per row.
Usage: venv/bin/python3 scripts/experiments/embeddings_backfill_inspection.py
Read-only. No DB writes. No filesystem writes outside experiments/.
"""
import json
import os
import random
import re
import subprocess
import sys
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path
import psycopg2
from psycopg2.extras import RealDictCursor
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env")
PG_DSN = os.getenv("PG_DSN")
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
NEXTCLOUD_ROOT = Path("/home/aaron/nextcloud/data/data/aaron/files")
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"embeddings_backfill_inspection_{datetime.now().strftime('%Y-%m-%d')}.json"
SUPPORTED_EXT = {".pdf", ".docx", ".pptx", ".txt", ".md"}
random.seed(20260503)
# ─── Helpers ────────────────────────────────────────────────────────────────
def get_pg():
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
def header(title):
bar = "=" * 70
print(f"\n{bar}\n{title}\n{bar}")
def sub(title):
print(f"\n--- {title} ---")
def fmt_ts_from_unix(ts):
"""Watcher state stores unix timestamps as strings."""
try:
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
def fmt_ts_from_st_mtime(p):
try:
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
def load_watcher_state():
"""Returns (path -> mtime_str), and (basename -> [(path, mtime_str), ...])."""
state = json.loads(WATCHER_STATE.read_text())
by_path = state
by_name = defaultdict(list)
for path, mtime in state.items():
by_name[Path(path).name].append((path, mtime))
return by_path, by_name
def classify_collision_shape(paths):
"""Categorize a filename-collision group:
- 'live+backup' : exactly one live path; all others carry backup/.bak markers
- 'live+archive' : exactly one live path; all others are inside Archive/
- 'live+mixed-old' : exactly one live path; others mix backup and archive markers
- 'multi-live' : >=2 paths look like live (no backup/archive markers)
- 'all-archive-or-backup' : no live path; every path is inside Archive/ or backup-like
- 'other' : anything else (e.g. a single live path with no siblings)
"""
def is_backup(p):
s = p.lower()
return ".bak" in s or "/backup" in s or "backups/" in s
def is_archive(p):
s = p.lower()
return "/archive/" in s
backups = [p for p in paths if is_backup(p)]
archives = [p for p in paths if is_archive(p)]
live = [p for p in paths if not is_backup(p) and not is_archive(p)]
if len(live) == 1 and len(backups) >= 1 and len(archives) == 0:
return "live+backup"
if len(live) == 1 and len(archives) >= 1 and len(backups) == 0:
return "live+archive"
if len(live) == 1 and (len(backups) + len(archives)) >= 1:
return "live+mixed-old"
if len(live) >= 2:
return "multi-live"
if len(live) == 0:
return "all-archive-or-backup"
return "other"
# ─── Section 1: Cohort recap ────────────────────────────────────────────────
def section_1_cohort_recap(cur):
header("1. COHORT RECAP")
cur.execute("""
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null,
COUNT(*) FILTER (WHERE type IS NULL AND created_at IS NULL) AS both_null,
COUNT(*) FILTER (WHERE type IS NOT NULL AND created_at IS NOT NULL) AS both_set
FROM embeddings;
""")
overall = cur.fetchone()
print(f"Total: {overall['total']} type_null: {overall['type_null']} "
f"ca_null: {overall['ca_null']} both_null: {overall['both_null']} "
f"both_set: {overall['both_set']}")
cur.execute("""
SELECT type, created_at IS NULL AS ca_null, COUNT(*) AS n
FROM embeddings GROUP BY type, ca_null ORDER BY type NULLS LAST, ca_null;
""")
cohorts = cur.fetchall()
sub("Per-(type, ca_null) cohorts")
for r in cohorts:
print(f" type={r['type'] or 'NULL':<22} ca_null={r['ca_null']!s:<5} n={r['n']}")
return {"overall": overall, "cohorts": cohorts}
# ─── Section 2: Cohort A type inference ─────────────────────────────────────
def section_2_type_inference(cur):
header("2. COHORT A TYPE INFERENCE (extension classifier)")
cur.execute("""
SELECT LOWER(SUBSTRING(source FROM '\\.[^.]+$')) AS ext, COUNT(*) AS rows
FROM embeddings WHERE type IS NULL
GROUP BY ext ORDER BY rows DESC;
""")
by_ext = cur.fetchall()
classified = sum(r["rows"] for r in by_ext if r["ext"] in SUPPORTED_EXT)
unknown = sum(r["rows"] for r in by_ext if r["ext"] not in SUPPORTED_EXT)
print(f"NULL-type rows by extension:")
for r in by_ext:
flag = "OK" if r["ext"] in SUPPORTED_EXT else "??"
print(f" {flag} {r['ext'] or '(none)':<8} rows={r['rows']}")
print(f"\nClassified as 'document' via extension: {classified}")
print(f"Unclassifiable (no SUPPORTED extension): {unknown}")
return {"by_ext": by_ext, "classified": classified, "unclassifiable": unknown}
# ─── Section 3: created_at inference ────────────────────────────────────────
def section_3_created_at_inference(cur):
header("3. CREATED_AT INFERENCE — file-derived rows")
by_path, by_name = load_watcher_state()
print(f"watcher_state.json: {len(by_path)} tracked paths, "
f"{len(by_name)} distinct filenames, "
f"{sum(1 for v in by_name.values() if len(v) > 1)} filename collisions")
# 3a. Rows with metadata.filepath: probe stat()
sub("3a. Rows with metadata.filepath — stat probe")
cur.execute("""
SELECT id, source, metadata->>'filepath' AS filepath
FROM embeddings
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL;
""")
rows_with_fp = cur.fetchall()
fp_exists = 0
fp_missing = 0
fp_outside_root = 0
sample_resolved = []
for r in rows_with_fp:
p = Path(r["filepath"])
if not str(p).startswith(str(NEXTCLOUD_ROOT)):
fp_outside_root += 1
if p.exists():
fp_exists += 1
if len(sample_resolved) < 5:
sample_resolved.append({
"id": r["id"], "source": r["source"],
"filepath": str(p), "mtime": fmt_ts_from_st_mtime(p),
})
else:
fp_missing += 1
print(f" rows with metadata.filepath: {len(rows_with_fp)}")
print(f" exists on disk: {fp_exists}")
print(f" missing on disk: {fp_missing}")
print(f" outside Nextcloud root: {fp_outside_root}")
print(f" Sample of 5 resolved mtimes:")
for s in sample_resolved:
print(f" {s['id']:<15} {s['source'][:60]:<60} mtime={s['mtime']}")
# 3b. Rows without metadata.filepath: watcher_state lookup
sub("3b. Rows without metadata.filepath — watcher_state lookup")
cur.execute("""
SELECT id, source FROM embeddings
WHERE created_at IS NULL
AND metadata->>'filepath' IS NULL
AND (type IS NULL OR type = 'document');
""")
rows_no_fp = cur.fetchall()
# Distinct source basenames to look up
basenames_to_resolve = sorted({r["source"] for r in rows_no_fp if r["source"]})
n_resolved_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) == 1)
n_collision_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) > 1)
n_unfound = sum(1 for n in basenames_to_resolve if n not in by_name)
print(f" rows without filepath: {len(rows_no_fp)}")
print(f" distinct source basenames to resolve: {len(basenames_to_resolve)}")
print(f" unique watcher_state hit (no collision): {n_resolved_unique}")
print(f" collision in watcher_state (>1 path): {n_collision_unique}")
print(f" not in watcher_state at all: {n_unfound}")
# 3c. Collision-shape audit
sub("3c. Collision-shape audit — all collisions in watcher_state")
collisions = {n: [(p, m) for p, m in by_name[n]] for n in by_name if len(by_name[n]) > 1}
shape_counts = Counter()
rows_affected_by_shape = Counter()
# Map from basename to count of NULL-ca rows that need it (rows_no_fp)
rows_no_fp_by_name = Counter(r["source"] for r in rows_no_fp)
sample_per_shape = defaultdict(list)
for name, paths_mtimes in collisions.items():
paths = [p for p, _ in paths_mtimes]
shape = classify_collision_shape(paths)
shape_counts[shape] += 1
rows_affected_by_shape[shape] += rows_no_fp_by_name.get(name, 0)
if len(sample_per_shape[shape]) < 3:
entry = {
"name": name,
"rows_no_fp_using_this_name": rows_no_fp_by_name.get(name, 0),
"candidates": [
{"path": p, "mtime": fmt_ts_from_unix(m)}
for p, m in sorted(paths_mtimes, key=lambda x: -float(x[1]))
],
}
sample_per_shape[shape].append(entry)
print(f" collisions in watcher_state: {len(collisions)}")
print(f" shape breakdown:")
for shape, n in shape_counts.most_common():
print(f" {shape:<22} collisions={n:<4} rows_affected={rows_affected_by_shape[shape]}")
print(f"\n Up-to-3 sample collisions per shape (sorted by mtime desc):")
for shape, samples in sample_per_shape.items():
print(f" [{shape}]")
for s in samples:
print(f" {s['name']} (rows_no_fp using this name: {s['rows_no_fp_using_this_name']})")
for c in s["candidates"]:
print(f" {c['mtime']} {c['path']}")
return {
"watcher_state_paths": len(by_path),
"watcher_state_basenames": len(by_name),
"watcher_state_collisions": len(collisions),
"rows_with_filepath": {
"total": len(rows_with_fp),
"exists": fp_exists, "missing": fp_missing,
"outside_root": fp_outside_root,
"sample": sample_resolved,
},
"rows_without_filepath": {
"total": len(rows_no_fp),
"distinct_basenames": len(basenames_to_resolve),
"unique_hit": n_resolved_unique,
"collision_hit": n_collision_unique,
"unfound": n_unfound,
},
"collision_shapes": {
"total": len(collisions),
"shape_counts": dict(shape_counts),
"rows_affected_by_shape": dict(rows_affected_by_shape),
"samples": {k: v for k, v in sample_per_shape.items()},
},
}
# ─── Section 4: ChatGPT export resolution ───────────────────────────────────
def section_4_chatgpt_export(cur):
header("4. CHATGPT EXPORT RESOLUTION (Plan addition #1)")
print(f"Probing: {CHATGPT_EXPORT_DIR}")
if not CHATGPT_EXPORT_DIR.exists():
print(" NOT FOUND — plan on sentinel for entire B-chatgpt cohort.")
return {"export_dir_exists": False, "files": []}
files = sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json"))
print(f" found {len(files)} export file(s):")
for f in files:
print(f" {f.name} size={f.stat().st_size:,} mtime={fmt_ts_from_st_mtime(f)}")
# Build convo_id -> create_time index from all export files.
print("\nLoading export(s) to build convo_id -> create_time index...")
convo_index = {}
for f in files:
try:
data = json.loads(f.read_text(encoding="utf-8"))
except Exception as e:
            print(f" failed to parse {f.name}: {e}")
            continue
        for convo in data:
            cid = convo.get("id") or convo.get("conversation_id")
            ct = convo.get("create_time")
            if cid and ct is not None:
                convo_index[cid] = ct
    print(f" indexed {len(convo_index)} conversations across {len(files)} export files")
    # Sample 5 chatgpt_conversation rows; resolve.
    cur.execute("""
        SELECT id, source FROM embeddings
        WHERE type='chatgpt_conversation' AND created_at IS NULL
        ORDER BY random() LIMIT 5;
    """)
    sample = cur.fetchall()
    sub("Sample of 5 B-chatgpt rows: convo lookup")
    resolved = 0
    sample_results = []
    for r in sample:
        # IDs look like chatgpt_<uuid>_<idx>; uuid extends until last underscore.
        m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
        cid = m.group(1) if m else None
        ct = convo_index.get(cid)
        ct_iso = None
        if ct is not None:
            try:
                ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
            except Exception:
                ct_iso = None
        if ct_iso:
            resolved += 1
        sample_results.append({
            "id": r["id"], "source": r["source"], "convo_id": cid,
            "create_time": ct, "create_time_iso": ct_iso,
            "resolved": ct_iso is not None,
        })
        print(f" {r['id']} cid={cid}")
        print(f" -> create_time={ct} iso={ct_iso}")
    # Compare against the rows actually sampled: the cohort may hold fewer than 5.
    all_resolved = bool(sample) and resolved == len(sample)
    print(f"\nResolved {resolved}/{len(sample)}. "
          f"{'PROCEED with re-derive for full cohort.' if all_resolved else 'PARTIAL — plan re-derive + sentinel for unresolved.'}")
    # Estimate full-cohort coverage by counting how many B-chatgpt convo_ids appear in the index.
    cur.execute("""
        SELECT DISTINCT regexp_replace(id, '^chatgpt_(.+)_\\d+$', '\\1') AS cid
        FROM embeddings WHERE type='chatgpt_conversation' AND created_at IS NULL;
    """)
    distinct_cids = [r["cid"] for r in cur.fetchall()]
    in_index = sum(1 for c in distinct_cids if c in convo_index)
    print(f"Full-cohort coverage estimate: {in_index} / {len(distinct_cids)} distinct convo_ids "
          f"resolvable from export.")
    return {
        "export_dir_exists": True,
        "files": [{"name": f.name, "size": f.stat().st_size, "mtime": fmt_ts_from_st_mtime(f)} for f in files],
        "convo_index_size": len(convo_index),
        "sample_results": sample_results,
        "sample_resolved": resolved,
        "full_cohort": {
            "distinct_convo_ids": len(distinct_cids),
            "resolvable_from_export": in_index,
            "unresolvable": len(distinct_cids) - in_index,
        },
    }
# ─── Section 5: Sentinel date discovery ─────────────────────────────────────
def section_5_sentinel(cur):
    header("5. SENTINEL DATE DISCOVERY (Plan addition #3)")
    # 5a. Earliest non-NULL created_at per type: lower bound on substrate age.
    sub("5a. Earliest non-NULL created_at per type")
    cur.execute("""
        SELECT type, MIN(created_at) AS earliest, MAX(created_at) AS latest, COUNT(*) AS rows
        FROM embeddings WHERE created_at IS NOT NULL GROUP BY type ORDER BY type;
    """)
    rows = cur.fetchall()
    for r in rows:
        # !s converts to str first: a datetime would read '<32' as a strftime
        # pattern, and a NULL type (None) rejects a width spec outright.
        print(f" {r['type']!s:<22} earliest={r['earliest']!s:<32} latest={r['latest']}")
    # 5b. git log for the pgvector-migration commit.
    sub("5b. Git log — pgvector migration commits")
    git_findings = []
    try:
        out = subprocess.run(
            ["git", "log", "--all", "--format=%H %ci %s",
             "--", "deprecated/migrate_to_pgvector.py", "scripts/migrate_to_pgvector.py"],
            cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
        )
        for line in out.stdout.strip().splitlines():
            print(f" {line}")
            git_findings.append(line)
    except Exception as e:
        print(f" git log failed: {e}")
    # Also: when did the api/ingest scripts cut over to pgvector?
    try:
        out = subprocess.run(
            ["git", "log", "--all", "--format=%H %ci %s", "--grep=pgvector", "-i"],
            cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
        )
        print("\n Commits mentioning pgvector:")
        for line in out.stdout.strip().splitlines()[:10]:
            print(f" {line}")
            git_findings.append(line)
    except Exception as e:
        print(f" git log (pgvector grep) failed: {e}")
    # 5c. ChromaDB sqlite still on disk?
    sub("5c. ChromaDB dump on disk?")
    candidates = []
    for root in [Path.home() / "aaronai", Path.home() / "aaronai" / "db"]:
        if root.exists():
            for p in root.rglob("chroma*.sqlite*"):
                candidates.append({"path": str(p), "mtime": fmt_ts_from_st_mtime(p)})
    if candidates:
        for c in candidates:
            print(f" found: {c['path']} mtime={c['mtime']}")
    else:
        print(" no ChromaDB sqlite found under ~/aaronai")
    # 5d. Propose sentinel.
    sub("5d. Sentinel proposal")
    # Earliest doc cutover: per query, document=2026-04-30. Migration commit f78b830 was
    # 2026-04-26. Most defensible sentinel for "rows that entered pgvector before NOW()
    # writes were canonical" = the migration commit date.
    proposed = "2026-04-26T00:00:00Z"
    reasoning = (
        "git f78b830 'Migrate to pgvector — remove ChromaDB from api.py, ingest scripts, "
        "dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL "
        "created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL "
        "created_at all predate F11 and most predate the pgvector cutover itself. "
        "2026-04-26 is the date the ChromaDB->pgvector migration script was committed, "
        "so any row currently in the embeddings table with NULL created_at must have been "
        "ingested on or after that date (when the table came into existence in current form). "
        "It is the tightest defensible lower bound on when a row entered pgvector before "
        "timestamps were tracked, so it is the right sentinel."
    )
    print(f" Proposed sentinel: {proposed}")
    print(f" Reasoning: {reasoning}")
    return {
        "earliest_per_type": rows,
        "git_findings": git_findings,
        "chromadb_candidates": candidates,
        "proposed_sentinel": proposed,
        "reasoning": reasoning,
    }
# ─── Section 6: 50-row stratified sample ────────────────────────────────────
def section_6_stratified_sample(cur, sentinel_iso):
    header("6. 50-ROW STRATIFIED SAMPLE — derived (type, created_at, source)")
    by_path, by_name = load_watcher_state()
    cohorts = [
        ("A (type NULL, ca NULL)", "type IS NULL AND created_at IS NULL", 10),
        ("B-doc-old (type='document', ca NULL)", "type='document' AND created_at IS NULL", 10),
        ("B-chatgpt (type='chatgpt_conversation', ca NULL)", "type='chatgpt_conversation' AND created_at IS NULL", 10),
        ("C-doc-new (type='document', ca set)", "type='document' AND created_at IS NOT NULL", 10),
        ("C-claude (type='claude_conversation', ca set)", "type='claude_conversation' AND created_at IS NOT NULL", 5),
        ("C-aaronai (type='aaronai_conversation', ca set)", "type='aaronai_conversation' AND created_at IS NOT NULL", 5),
    ]
    samples = []
    for label, predicate, n in cohorts:
        sub(f"{label} (sample size: {n})")
        cur.execute(f"""
            SELECT id, source, type, created_at, metadata
            FROM embeddings WHERE {predicate}
            ORDER BY random() LIMIT %s;
        """, (n,))
        rows = cur.fetchall()
        for r in rows:
            row_meta = r["metadata"] or {}
            fp = row_meta.get("filepath")
            inferred_type = r["type"] or ("document" if (r["source"] or "").lower().endswith(tuple(SUPPORTED_EXT)) else "?")
            inferred_ca = r["created_at"]
            inferred_ca_source = "preserved" if inferred_ca else None
            if not inferred_ca:
                if fp and Path(fp).exists():
                    inferred_ca = fmt_ts_from_st_mtime(Path(fp))
                    inferred_ca_source = "filepath_stat"
                elif r["source"] and r["source"] in by_name:
                    candidates = by_name[r["source"]]
                    if len(candidates) == 1:
                        inferred_ca = fmt_ts_from_unix(candidates[0][1])
                        inferred_ca_source = "watcher_state_unique"
                    else:
                        # take most recent
                        latest = max(candidates, key=lambda x: float(x[1]))
                        inferred_ca = fmt_ts_from_unix(latest[1])
                        inferred_ca_source = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
                else:
                    inferred_ca = sentinel_iso
                    inferred_ca_source = "sentinel"
            print(f" id={r['id']:<22} src={(r['source'] or '')[:38]:<38}")
            print(f" existing: type={r['type']!r:<22} ca={r['created_at']!r}")
            print(f" inferred: type={inferred_type!r:<22} ca={inferred_ca!r} ({inferred_ca_source})")
            samples.append({
                "cohort": label, "id": r["id"], "source": r["source"],
                "existing_type": r["type"], "existing_ca": r["created_at"],
                "inferred_type": inferred_type, "inferred_ca": inferred_ca,
                "inferred_ca_source": inferred_ca_source,
            })
    return samples
# ─── Driver ─────────────────────────────────────────────────────────────────
def main():
    pg = get_pg()
    cur = pg.cursor()
    out = {"generated_at": datetime.now(timezone.utc).isoformat()}
    out["section_1"] = section_1_cohort_recap(cur)
    out["section_2"] = section_2_type_inference(cur)
    out["section_3"] = section_3_created_at_inference(cur)
    out["section_4"] = section_4_chatgpt_export(cur)
    out["section_5"] = section_5_sentinel(cur)
    sentinel_iso = out["section_5"]["proposed_sentinel"]
    out["section_6"] = section_6_stratified_sample(cur, sentinel_iso)
    pg.close()

    # JSON sidecar: datetimes become ISO strings, anything else falls back to str().
    def _serialize(o):
        if isinstance(o, datetime):
            return o.isoformat()
        return str(o)

    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text(json.dumps(out, indent=2, default=_serialize))
    print(f"\nJSON sidecar written: {OUT_PATH}")


if __name__ == "__main__":
    main()
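The created_at fallback order used in section 6 (existing value, then file mtime, then watcher state, then sentinel) can be isolated as a small pure function, which makes the priority order easy to unit-test. This is a minimal sketch, not the script's actual helper: `infer_created_at`, `iso_z_from_unix`, and their parameters are hypothetical names, and the file mtime is passed in as a number rather than read from disk.

```python
from datetime import datetime, timezone

SENTINEL_ISO = "2026-04-26T00:00:00Z"  # the sentinel proposed in section 5d


def iso_z_from_unix(ts):
    """Render a Unix timestamp as an ISO-8601 UTC string with a 'Z' suffix."""
    return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")


def infer_created_at(existing, file_mtime, watcher_mtimes, sentinel=SENTINEL_ISO):
    """Return (created_at, provenance), trying sources in priority order.

    Hypothetical signature: file_mtime is the Unix mtime of the file on disk
    (None if the file is gone); watcher_mtimes lists the mtimes that the
    watcher state recorded for this filename (empty if none).
    """
    if existing:
        return existing, "preserved"
    if file_mtime is not None:
        return iso_z_from_unix(file_mtime), "filepath_stat"
    if len(watcher_mtimes) == 1:
        return iso_z_from_unix(watcher_mtimes[0]), "watcher_state_unique"
    if watcher_mtimes:
        # Several files share the name: pick the most recent, and say so.
        latest = max(watcher_mtimes, key=float)
        return iso_z_from_unix(latest), f"watcher_state_collision_pick_latest_of_{len(watcher_mtimes)}"
    return sentinel, "sentinel"
```

Returning the provenance label alongside the value keeps every backfilled timestamp auditable: a later pass can select rows by label (for example, all sentinel rows) without re-deriving anything.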
@@ -0,0 +1,296 @@
"""Read-only analysis of Stage 2 frame data via stage2_frames_v.
Produces eight sections (frequency, hygiene, per-doc count, co-occurrence,
folder cross-tab, worker-version split, data-gap accounting, corpus coverage)
and writes a JSON sidecar for diffing across runs.
Usage: venv/bin/python3 scripts/experiments/frame_distribution_report.py
"""
import os
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
import psycopg2
from dotenv import load_dotenv
load_dotenv()
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"frame_distribution_{datetime.now().strftime('%Y-%m-%d')}.json"
TOP_K = 20 # for co-occurrence; revisit after seeing the long tail
def normalize(label):
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


def folder_bin(source):
    """Classify source by type. stage_3_queue stores bare filenames, so we
    bin by what kind of file it is, not where it lives in the tree."""
    if not source:
        return "unknown"
    if re.match(r"^(Claude|ChatGPT|Aaron AI):", source):
        return "conversation"  # bypasses Stage 2/3, will not appear here
    s = source.lower()
    if re.search(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-voice\.md$", s):
        return "voice_note"
    if re.search(r"\d{4}-\d{2}-\d{2}-(nrem|early-rem|late-rem|synthesis|lucid)", s):
        return "dream_output"
    if s.endswith(".md"):
        return "markdown"
    if s.endswith(".pdf"):
        return "pdf"
    if s.endswith(".docx") or s.endswith(".doc"):
        return "docx"
    if s.endswith(".pptx") or s.endswith(".ppt"):
        return "pptx"
    if s.endswith(".txt"):
        return "txt"
    return "other"
def fetch_rows(cur):
    cur.execute("""
        SELECT source, char_length, active_frames, worker_version, raw_metadata
        FROM stage2_frames_v
    """)
    rows = []
    for source, char_length, frames, worker_version, raw in cur.fetchall():
        if not isinstance(frames, list):
            continue
        rows.append({
            "source": source,
            "char_length": char_length,
            "frames": [str(f) for f in frames if f],
            "worker_version": worker_version,
            "raw_keys": sorted(raw.keys()) if isinstance(raw, dict) else [],
        })
    return rows
def section_frequency(rows):
    counter = Counter()
    for r in rows:
        for f in r["frames"]:
            counter[f] += 1
    return counter


def section_hygiene(frequency):
    """Group raw labels by normalized form; flag collisions."""
    groups = defaultdict(list)
    for raw, count in frequency.items():
        groups[normalize(raw)].append((raw, count))
    collisions = {k: v for k, v in groups.items() if len(v) > 1}
    return collisions


def section_per_doc_count(rows):
    counts = Counter(len(r["frames"]) for r in rows)
    return counts
def section_cooccurrence(rows, top_frames):
    top_set = set(top_frames)
    pair_counts = Counter()
    for r in rows:
        # De-duplicate within a doc so a repeated label cannot pair with itself
        # or inflate a pair's count; sorting gives each pair one canonical order.
        present = sorted({f for f in r["frames"] if f in top_set})
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pair_counts[(present[i], present[j])] += 1
    return pair_counts
def section_folder_crosstab(rows, top_frames):
    top_set = set(top_frames)
    table = defaultdict(Counter)  # frame -> bin -> count
    bin_totals = Counter()
    for r in rows:
        b = folder_bin(r["source"])
        bin_totals[b] += 1
        for f in r["frames"]:
            if f in top_set:
                table[f][b] += 1
    return table, bin_totals


def section_worker_versions(rows):
    counter = Counter(r["worker_version"] or "unknown" for r in rows)
    raw_keys_by_version = defaultdict(Counter)
    for r in rows:
        v = r["worker_version"] or "unknown"
        raw_keys_by_version[v][tuple(r["raw_keys"])] += 1
    return counter, raw_keys_by_version
def section_data_gap(cur):
    """Docs that completed Stage 2 but never had frames extracted (<2000 chars)."""
    cur.execute("""
        SELECT source, char_length
        FROM stage_2_queue
        WHERE completed_at IS NOT NULL AND char_length < 2000
    """)
    missing = cur.fetchall()
    by_bin = Counter(folder_bin(s) for s, _ in missing)
    char_lengths = [c for _, c in missing]
    return {
        "count": len(missing),
        "by_type_bin": dict(by_bin),
        "char_length": {
            "min": min(char_lengths) if char_lengths else None,
            "max": max(char_lengths) if char_lengths else None,
            "median": sorted(char_lengths)[len(char_lengths) // 2] if char_lengths else None,
        },
        "sample_sources": [s for s, _ in missing[:10]],
    }
def section_corpus_coverage(cur):
    """How much of the embeddings corpus has frame coverage?"""
    cur.execute("SELECT count(DISTINCT source) FROM embeddings")
    total = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM embeddings
        WHERE source LIKE 'Claude:%' OR source LIKE 'ChatGPT:%'
           OR source LIKE 'Aaron AI:%' OR type='aaronai_conversation'
    """)
    conversations = cur.fetchone()[0]
    cur.execute("SELECT count(DISTINCT source) FROM stage_3_queue WHERE stage2_metadata IS NOT NULL")
    with_frames = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM stage_2_queue
        WHERE completed_at IS NOT NULL AND char_length < 2000
    """)
    short_no_frames = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM stage_2_queue
        WHERE failed_at IS NOT NULL
    """)
    failed = cur.fetchone()[0]
    return {
        "total_distinct_sources_in_embeddings": total,
        "conversations_no_frames_by_design": conversations,
        "files_with_frames": with_frames,
        "files_short_no_frames": short_no_frames,
        "files_stage2_failed": failed,
        "frame_coverage_pct": round(100.0 * with_frames / max(total, 1), 1),
    }
def main():
    conn = psycopg2.connect(os.environ["PG_DSN"])
    cur = conn.cursor()
    rows = fetch_rows(cur)
    n_docs = len(rows)
    print(f"=== Stage 2 frame distribution report ({n_docs} docs) ===\n")
    # 1. Frequency
    freq = section_frequency(rows)
    print(f"--- 1. Frame frequency ({len(freq)} distinct labels) ---")
    for label, count in freq.most_common(30):
        print(f" {count:5d} {label}")
    print()
    # 2. Hygiene
    collisions = section_hygiene(freq)
    print(f"--- 2. Label hygiene (normalized collisions: {len(collisions)}) ---")
    for norm, variants in sorted(collisions.items(), key=lambda kv: -sum(c for _, c in kv[1])):
        variant_str = ", ".join(f"{r!r}:{c}" for r, c in sorted(variants, key=lambda x: -x[1]))
        print(f" '{norm}': {variant_str}")
    print()
    # 3. Per-doc frame count
    per_doc = section_per_doc_count(rows)
    print("--- 3. Per-doc frame count ---")
    for n in sorted(per_doc):
        print(f" {n} frames: {per_doc[n]} docs")
    print()
    # 4. Co-occurrence (top-K)
    top_frames = [f for f, _ in freq.most_common(TOP_K)]
    pairs = section_cooccurrence(rows, top_frames)
    print(f"--- 4. Co-occurrence (top-{TOP_K} frames, top-30 pairs) ---")
    for (a, b), count in pairs.most_common(30):
        print(f" {count:4d} {a} × {b}")
    print()
    # 5. Folder cross-tab
    crosstab, bin_totals = section_folder_crosstab(rows, top_frames)
    print(f"--- 5. Frame × folder cross-tab (top-{TOP_K} frames) ---")
    bins_sorted = [b for b, _ in bin_totals.most_common()]
    print(f" bins (with totals): " + ", ".join(f"{b}({n})" for b, n in bin_totals.most_common(10)))
    for f in top_frames:
        row_data = crosstab[f]
        if not row_data:
            continue
        cells = ", ".join(f"{b}={c}" for b, c in row_data.most_common(5))
        print(f" {f}: {cells}")
    print()
    # 6. Worker versions
    versions, keys_by_version = section_worker_versions(rows)
    print("--- 6. Worker version split ---")
    for v, count in versions.most_common():
        print(f" v{v}: {count} docs")
        top_shapes = keys_by_version[v].most_common(3)
        for keys, kcount in top_shapes:
            print(f" {kcount} docs with keys={list(keys)}")
    print()
    # 7. Data gap
    gap = section_data_gap(cur)
    print("--- 7. Data-gap accounting (Stage 2 docs <2000 chars; never frame-extracted) ---")
    print(f" count: {gap['count']}")
    print(f" char_length: min={gap['char_length']['min']}, median={gap['char_length']['median']}, max={gap['char_length']['max']}")
    print(f" by type bin: {gap['by_type_bin']}")
    print(f" sample sources: {gap['sample_sources']}")
    print()
    # 8. Corpus coverage
    coverage = section_corpus_coverage(cur)
    print("--- 8. Corpus-wide frame coverage ---")
    print(f" total distinct sources in embeddings: {coverage['total_distinct_sources_in_embeddings']}")
    print(f" conversations (no frames by design): {coverage['conversations_no_frames_by_design']}")
    print(f" files with frames: {coverage['files_with_frames']}")
    print(f" files short, no frames: {coverage['files_short_no_frames']}")
    print(f" files Stage 2 failed: {coverage['files_stage2_failed']}")
    print(f" frame coverage: {coverage['frame_coverage_pct']}% of corpus")
    print()
    # JSON sidecar
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    sidecar = {
        "generated_at": datetime.now().isoformat(),
        "n_docs_with_frames": n_docs,
        "n_distinct_labels": len(freq),
        "top_30_frames": freq.most_common(30),
        "label_collisions": {
            k: [(r, c) for r, c in v] for k, v in collisions.items()
        },
        "per_doc_frame_count": dict(per_doc),
        "top_30_pairs": [
            {"a": a, "b": b, "count": c}
            for (a, b), c in pairs.most_common(30)
        ],
        "folder_crosstab": {
            f: dict(crosstab[f]) for f in top_frames if crosstab[f]
        },
        "bin_totals": dict(bin_totals),
        "worker_versions": dict(versions),
        "data_gap": gap,
        "corpus_coverage": coverage,
    }
    OUT_PATH.write_text(json.dumps(sidecar, indent=2, default=str))
    print(f"JSON sidecar written: {OUT_PATH}")
    cur.close()
    conn.close()


if __name__ == "__main__":
    main()
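The hygiene and co-occurrence sections of the report can be exercised without a database by feeding them a few in-memory label lists. The sketch below is illustrative only: the labels are made up, `find_collisions` and `cooccurrence` are hypothetical stand-ins for the report's section functions, and the pair counter de-duplicates labels within a document, which is an assumption about the intended semantics.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations


def normalize(label):
    """Lowercase, turn underscores into spaces, collapse runs of whitespace."""
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


def find_collisions(frequency):
    """Group raw labels by normalized form; keep groups with more than one variant."""
    groups = defaultdict(list)
    for raw, count in frequency.items():
        groups[normalize(raw)].append((raw, count))
    return {k: v for k, v in groups.items() if len(v) > 1}


def cooccurrence(docs):
    """Count unordered label pairs per document, ignoring within-doc repeats."""
    pairs = Counter()
    for frames in docs:
        for a, b in combinations(sorted(set(frames)), 2):
            pairs[(a, b)] += 1
    return pairs


# Made-up frame labels for three documents.
docs = [
    ["Systems_Thinking", "play"],
    ["systems thinking", "play", "play"],
    ["play"],
]
freq = Counter(f for frames in docs for f in frames)
collisions = find_collisions(freq)
pairs = cooccurrence(docs)
```

Here `collisions` flags `Systems_Thinking` and `systems thinking` as variants of one normalized label, while `pairs` still counts them as different frames. Label collisions therefore split co-occurrence counts until the raw labels are canonicalized upstream.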
+9
@@ -126,6 +126,15 @@ def run():
    embeddings = embedder.encode(texts, show_progress_bar=False).tolist()
    for (chunk_id, chunk_text, meta), embedding in zip(new_chunks, embeddings):
        if not meta.get("type"):
            raise ValueError(
                f"chunk {chunk_id!r} missing 'type'; writers must supply it "
                f"(see Improvement #2 in docs/birdai-component-inventory)"
            )
        # ON CONFLICT below intentionally overwrites created_at (unlike encoding.py's
        # COALESCE): an Aaron-AI conversation's created_at tracks convo.updated_at,
        # which advances on activity. Re-running this script on an active conv
        # should refresh the timestamp, not preserve the first-seen one.
        cur.execute("""
            INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
            VALUES (%s, %s, %s::vector, %s, %s, %s, %s)