aaronAI

Author	SHA1	Message	Date
aaron	3c7c228db0	embeddings: backfill type and created_at (Improvement #2 part A) Backfills 9,815 type-NULL rows to 'document' (extension classifier, 100% hit) and 12,109 created_at-NULL rows via five batches: C1 filepath_stat: 9,649 filesystem mtime via metadata.filepath C2 watcher_state_unique: 676 unique source-name lookup in watcher_state C3 watcher_state_collision_pick_latest_of_N: 234 collision; most-recent watcher mtime C4 chatgpt_export: 1,548 convo create_time from export JSONs (168/168 distinct convo_ids resolved) C5 sentinel: 2 2026-04-26T00:00:00Z (pgvector migration date) Provenance written to metadata.type_source and metadata.created_at_source on every row changed by this run. type_source is empty on rows where the type field was already populated pre-run; in those cases the snapshot table is the source of truth for what changed. Snapshot: embeddings_backup_2026_05_03 (CREATE TABLE AS SELECT id, type, created_at, metadata FROM embeddings; 14,069 rows; revertable via id-join). Verification: V1 live counts: type_null=0 ca_null=0 V2 spot-check 11 rows across cohorts: provenance correct V3 snapshot intact: 14,069 rows, pre-backfill NULL counts preserved V4 cross-check vs snapshot: reconciles per-provenance to dry-run Read-side use (B + C: writer enforcement + minimal retrieval read) deferred to a separate session. The backfill is complete and verified, but the type and created_at fields are not yet load-bearing — every current reader still ignores them. Without B+C this lands as data prep, not behavior change.	2026-05-03 23:58:53 +00:00
aaron	ed2d090afc	experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3) Read-only inspection of the frame data Mistral produces in Stage 2, in service of Track 2 substrate design (Step 2.4 operation set spec). Artifacts: - New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata` (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured fields so worker-version drift is inspectable). - Analysis script: frequency, label-hygiene collisions, per-doc count, co-occurrence (top-K), file-type × frame cross-tab, worker-version split, data-gap accounting, corpus-wide coverage. - JSON sidecar for diff-across-runs reproducibility. - Markdown report with explicit Track 2 viability section. Headline findings: - Frames cluster meaningfully on the framed-doc subset (subject to validation on larger samples for the file-type cross-tab). - Only 56% of corpus has frame coverage. 198 conversation sources bypass Stage 2 by design (`ingest_conversations.py` writes directly to embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate; 12 Stage 2 failures. - All 14 voice notes and all 39 dream outputs are in the data gap. Primary capture and self-reflection channels are silent to the frame system. Dreamer cannot frame-condition on its own output. - 54 normalized label collisions (`Professional Experience` vs `Professional_Experience`, etc.); any router must normalize first. - "Education" is a near-universal frame (36% of frame-extracted docs); cheap 20-doc hand-inspection diagnostic in report §8 to distinguish prompt artifact from corpus shape. - File-type × frame stratification is concrete signal that ties to Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of rows. No production code touched. View is droppable; script is read-only.	2026-05-03 20:32:37 +00:00
aaron	3f7fba7e0e	scripts/: separate production from experimental and deprecated Moves 28 experiment scripts to scripts/experiments/ (E1, E1.4, E1.6, E2, base_class, cascade, cost_test, briefing, consistency, token series). Moves 2 dissolved-layer scripts to scripts/deprecated/ (consolidator_v0_1.py, tier1_migration.py — under the bespoke decision both target retired substrate work). Removes 19 .bak* files from disk (gitignored, never tracked; git history is the durable record of every prior version). The 11 production scripts remain in scripts/. All systemd ExecStart paths, api.py subprocess calls, and cron jobs continue to resolve correctly — verified by grep against /etc/systemd/system/aaronai-*.service, scripts/ references in api.py, and the user crontab. Track 1 inventory cross-cutting finding: scripts/ mixed 11 production files with 32 experimental scripts and ~20 .bak files. After this commit a clean-room reader can identify the live workers from a directory listing alone. Found by Track 1 inventory 2026-05-02. See ~/aaronai/docs/scripts-reorg-plan-2026-05-02.md for full reasoning. After commit, run: 1. git log --oneline -3 — show the new commit on top 2. git status — confirm clean working tree (modulo the docs/ untracked files which are intentional)	2026-05-02 23:28:24 +00:00
aaron	f11cacd9c9	add experiment scripts and results; watcher.py latest changes	2026-04-30 18:06:03 +00:00
aaron	91166367fa	E3: add Graphiti retrieval branch to dream.py, E3 experiment script with blinding	2026-04-30 17:17:28 +00:00
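
Note on ed2d090afc: the label-hygiene finding (`Professional Experience` vs `Professional_Experience`) implies a normalization pass before any routing. The sketch below is one possible way to detect such collisions, not the analysis script from the commit. The function names `normalize_label` and `find_collisions` are hypothetical, and so is the choice to treat underscores, hyphens, whitespace, and case as equivalent.

```python
import re
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Collapse underscores, hyphens, and whitespace runs to one space, then casefold.

    Assumption: these are the only differences that should count as "the same label".
    """
    collapsed = re.sub(r"[\s_\-]+", " ", label.strip())
    return collapsed.casefold()


def find_collisions(labels):
    """Group raw labels by normalized form; return groups with more than one raw spelling."""
    groups = defaultdict(set)
    for raw in labels:
        groups[normalize_label(raw)].add(raw)
    return {key: sorted(raws) for key, raws in groups.items() if len(raws) > 1}


if __name__ == "__main__":
    sample = ["Professional Experience", "Professional_Experience", "Education"]
    print(find_collisions(sample))
```

A router could call `normalize_label` on every frame label at read time, so the stored labels stay untouched and the raw spellings remain inspectable.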

5 Commits
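
Note on 3c7c228db0: the commit describes a snapshot taken with CREATE TABLE AS SELECT and a revert path "via id-join". The sketch below illustrates that pattern under loud assumptions: it uses an in-memory SQLite database as a stand-in for the Postgres/pgvector store, the table name `embeddings_backup` and all function names are hypothetical, the backfill is reduced to two constant updates, and the provenance writes to `metadata` are omitted.

```python
import sqlite3


def make_demo_db():
    """Build a tiny stand-in for the embeddings table (assumed columns only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE embeddings "
        "(id INTEGER PRIMARY KEY, type TEXT, created_at TEXT, metadata TEXT)"
    )
    conn.executemany(
        "INSERT INTO embeddings VALUES (?, ?, ?, ?)",
        [
            (1, None, None, "{}"),
            (2, "conversation", "2026-01-01T00:00:00Z", "{}"),
        ],
    )
    return conn


def snapshot(conn):
    """Copy the columns the backfill may touch, keyed by id."""
    conn.execute(
        "CREATE TABLE embeddings_backup AS "
        "SELECT id, type, created_at, metadata FROM embeddings"
    )


def backfill(conn):
    """Simplified backfill: fill NULLs with fixed values (illustrative only)."""
    conn.execute("UPDATE embeddings SET type = 'document' WHERE type IS NULL")
    conn.execute(
        "UPDATE embeddings SET created_at = '2026-04-26T00:00:00Z' "
        "WHERE created_at IS NULL"
    )


def revert(conn):
    """Restore the snapshotted columns by joining on id."""
    conn.execute(
        """
        UPDATE embeddings
        SET type = (SELECT b.type FROM embeddings_backup AS b WHERE b.id = embeddings.id),
            created_at = (SELECT b.created_at FROM embeddings_backup AS b WHERE b.id = embeddings.id),
            metadata = (SELECT b.metadata FROM embeddings_backup AS b WHERE b.id = embeddings.id)
        WHERE id IN (SELECT id FROM embeddings_backup)
        """
    )


def null_counts(conn):
    """Return (type_null, created_at_null), mirroring the V1 live-count check."""
    return conn.execute(
        "SELECT COUNT(*) - COUNT(type), COUNT(*) - COUNT(created_at) FROM embeddings"
    ).fetchone()


if __name__ == "__main__":
    conn = make_demo_db()
    snapshot(conn)
    backfill(conn)
    print("after backfill:", null_counts(conn))
    revert(conn)
    print("after revert:", null_counts(conn))
```

Rows inserted after the snapshot have no match in the backup table, so the `WHERE id IN (...)` guard leaves them alone instead of nulling them out.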