Commit Graph

Author SHA1 Message Date
aaron 3c7c228db0 embeddings: backfill type and created_at (Improvement #2 part A)
Backfills 9,815 type-NULL rows to 'document' (extension classifier, 100% hit)
and 12,109 created_at-NULL rows via five batches:

  C1 filepath_stat:        9,649  filesystem mtime via metadata.filepath
  C2 watcher_state_unique:   676  unique source-name lookup in watcher_state
  C3 watcher_state_collision_pick_latest_of_N:
                             234  collision; most-recent watcher mtime
  C4 chatgpt_export:       1,548  convo create_time from export JSONs
                                  (168/168 distinct convo_ids resolved)
  C5 sentinel:                 2  2026-04-26T00:00:00Z (pgvector migration date)
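
The five batches read naturally as a fallback chain, where each row takes
its created_at from the first source that yields a value. The sketch below
is a minimal illustration under assumptions: the message does not state
the precedence, so the listed order is assumed, and the resolver callables
are hypothetical. Only the batch labels, the per-batch counts, and the
sentinel value come from the message.

```python
from typing import Callable, Optional

# Per-batch row counts as reported in the commit message.
BATCH_COUNTS = {
    "filepath_stat": 9_649,
    "watcher_state_unique": 676,
    "watcher_state_collision_pick_latest_of_N": 234,
    "chatgpt_export": 1_548,
    "sentinel": 2,
}

# Sentinel timestamp from the commit message (pgvector migration date).
SENTINEL = "2026-04-26T00:00:00Z"

# Hypothetical resolver type: takes a row, returns a timestamp or None.
Resolver = Callable[[dict], Optional[str]]


def resolve_created_at(
    row: dict, resolvers: list[tuple[str, Resolver]]
) -> tuple[str, str]:
    """Return (created_at, source) from the first resolver that yields a value."""
    for source, resolver in resolvers:
        value = resolver(row)
        if value is not None:
            return value, source
    # Nothing matched: fall back to the sentinel cohort.
    return SENTINEL, "sentinel"


# The batch counts add up to the 12,109 created_at-NULL rows in the summary.
total = sum(BATCH_COUNTS.values())
assert total == 12_109
```

Returning the source label alongside the value keeps the provenance
attached to the decision that produced it, which is what makes a later
per-provenance reconciliation possible.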

Provenance written to metadata.type_source and metadata.created_at_source
on every row changed by this run. type_source is empty on rows where the
type field was already populated pre-run; in those cases the snapshot
table is the source of truth for what changed.
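
One way to get the provenance behavior described above is to write a
source key only for a field the run actually fills. This is a hedged
sketch, not the script used for this commit: apply_backfill and the
"extension_classifier" label are hypothetical names, and rows are modeled
as plain dicts.

```python
import copy
from typing import Optional


def apply_backfill(
    row: dict,
    new_type: Optional[str] = None,
    type_source: Optional[str] = None,
    new_created_at: Optional[str] = None,
    created_at_source: Optional[str] = None,
) -> dict:
    """Fill only NULL fields and record where each filled value came from."""
    out = copy.deepcopy(row)
    meta = dict(out.get("metadata") or {})
    out["metadata"] = meta

    # type_source is written only when this run populates type.
    if out.get("type") is None and new_type is not None:
        out["type"] = new_type
        meta["type_source"] = type_source

    # Same rule for created_at and created_at_source.
    if out.get("created_at") is None and new_created_at is not None:
        out["created_at"] = new_created_at
        meta["created_at_source"] = created_at_source

    return out


# A row whose type was already populated keeps it and gets no type_source.
row = {"id": 1, "type": "email", "created_at": None, "metadata": {}}
filled = apply_backfill(
    row,
    new_type="document",
    type_source="extension_classifier",  # hypothetical label
    new_created_at="2026-04-26T00:00:00Z",
    created_at_source="sentinel",
)
```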

Snapshot: embeddings_backup_2026_05_03 (CREATE TABLE AS SELECT id, type,
created_at, metadata FROM embeddings; 14,069 rows; revertible via id-join).
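
The snapshot-and-revert pattern can be sketched as follows. This is an
illustration only: it uses Python's sqlite3 with an in-memory database as
a stand-in for the real Postgres table, the two sample rows are made up,
and the revert statement is one possible form of an id-join, not
necessarily the one intended here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE embeddings (
        id INTEGER PRIMARY KEY, type TEXT, created_at TEXT, metadata TEXT
    );
    INSERT INTO embeddings VALUES
        (1, NULL, NULL, '{}'),
        (2, 'email', '2025-01-01T00:00:00Z', '{}');

    -- Snapshot of the four columns the backfill touches.
    CREATE TABLE embeddings_backup_2026_05_03 AS
        SELECT id, type, created_at, metadata FROM embeddings;

    -- Stand-in for the backfill itself (illustrative values).
    UPDATE embeddings
       SET type = 'document',
           created_at = '2026-04-26T00:00:00Z',
           metadata = '{"type_source": "example"}'
     WHERE type IS NULL;
""")
after_backfill = conn.execute(
    "SELECT type FROM embeddings WHERE id = 1"
).fetchone()[0]

# Revert: copy the snapshot columns back, matching rows on id.
# In Postgres the same idea is usually written with UPDATE ... FROM:
#   UPDATE embeddings e SET type = b.type, created_at = b.created_at,
#          metadata = b.metadata
#     FROM embeddings_backup_2026_05_03 b WHERE e.id = b.id;
conn.execute("""
    UPDATE embeddings SET
        type = (SELECT b.type FROM embeddings_backup_2026_05_03 b
                 WHERE b.id = embeddings.id),
        created_at = (SELECT b.created_at FROM embeddings_backup_2026_05_03 b
                       WHERE b.id = embeddings.id),
        metadata = (SELECT b.metadata FROM embeddings_backup_2026_05_03 b
                     WHERE b.id = embeddings.id)
    WHERE id IN (SELECT id FROM embeddings_backup_2026_05_03)
""")
reverted = conn.execute(
    "SELECT type, created_at, metadata FROM embeddings WHERE id = 1"
).fetchone()
untouched = conn.execute(
    "SELECT type, created_at, metadata FROM embeddings WHERE id = 2"
).fetchone()
```

Because the snapshot carries metadata as well as type and created_at, a
revert of this shape also removes the provenance keys the run wrote.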

Verification:
  V1 live counts:      type_null=0  ca_null=0
  V2 spot-check 11 rows across cohorts: provenance correct
  V3 snapshot intact: 14,069 rows, pre-backfill NULL counts preserved
  V4 cross-check vs snapshot: reconciles per-provenance to dry-run
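
Checks in the style of V1 and V4 can be sketched as below. How the checks
were actually run is not recorded in this message, so treat this as one
possible approach: the helper names are hypothetical, rows are modeled as
dicts with metadata already parsed, and the sample data is made up.

```python
from collections import Counter


def null_counts(rows: list[dict]) -> dict:
    """V1-style check: count rows where type or created_at is still NULL."""
    return {
        "type_null": sum(r["type"] is None for r in rows),
        "ca_null": sum(r["created_at"] is None for r in rows),
    }


def provenance_counts(live: list[dict], snapshot: list[dict]) -> Counter:
    """V4-style check: rows whose created_at went NULL -> value, by source."""
    before = {r["id"]: r for r in snapshot}
    counts: Counter = Counter()
    for r in live:
        old = before.get(r["id"])
        if old is None:
            continue  # row did not exist at snapshot time
        if old["created_at"] is None and r["created_at"] is not None:
            counts[r["metadata"].get("created_at_source")] += 1
    return counts


# Made-up sample: two rows backfilled, one already populated before the run.
snapshot = [
    {"id": 1, "type": None, "created_at": None, "metadata": {}},
    {"id": 2, "type": None, "created_at": None, "metadata": {}},
    {"id": 3, "type": "email", "created_at": "2025-01-01T00:00:00Z", "metadata": {}},
]
live = [
    {"id": 1, "type": "document", "created_at": "2025-02-01T00:00:00Z",
     "metadata": {"created_at_source": "filepath_stat"}},
    {"id": 2, "type": "document", "created_at": "2026-04-26T00:00:00Z",
     "metadata": {"created_at_source": "sentinel"}},
    {"id": 3, "type": "email", "created_at": "2025-01-01T00:00:00Z", "metadata": {}},
]

live_nulls = null_counts(live)
snapshot_nulls = null_counts(snapshot)
by_source = provenance_counts(live, snapshot)
```

Comparing by_source against the per-batch counts from a dry run is the
reconciliation step; any row that changed without a source label would
show up under the None key.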

Use of these fields (B + C: writer enforcement and a minimal retrieval
read) is deferred to a separate session. The backfill is complete and
verified, but the type and created_at fields are not yet load-bearing:
every current reader still ignores them. Without B+C this lands as data
prep, not a behavior change.
2026-05-03 23:58:53 +00:00