embeddings: enforce type/created_at on writers; manifests carry type_distribution (Improvement #2 part B+C)

Writers now enforce type and created_at:
- encoding.py: ValueError raised at write_embeddings_batch if row dict
  lacks 'type'. created_at remains SQL-supplied (NOW() server-side).
  ON CONFLICT DO UPDATE now also rewrites type=EXCLUDED.type and
  preserves the original created_at via
  COALESCE(embeddings.created_at, EXCLUDED.created_at): a re-ingest
  re-classifies type but does not overwrite a backfilled mtime.
- ingest_conversations.py: same assertion. ON CONFLICT intentionally
  keeps EXCLUDED.created_at semantics (Aaron-AI conversation created_at
  tracks convo.updated_at; re-runs should refresh).
- Column-level NOT NULL is not added; application-layer raise gives a
  faster, more debuggable failure than a Postgres constraint error.

Retrieval propagates type into chunks:
- retrieve() SELECT now includes type; chunk dicts carry "type": etype.
- WHERE clause built dynamically from excluded_sources and the new
  --type-filter CLI arg (experimental, default None, pgvector retrieval
  only; Graphiti chunks have no embeddings.type to filter on).
- retrieve_graphiti unchanged; its chunks lack the type field.

Manifests carry type_distribution per stage:
- dream_pipeline writes stage_data[<stage>]["type_distribution"] for
  nrem, early_rem, late_rem: a Counter over chunk types, filtering None
  so Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute
  the distribution. Pgvector chunks always carry type post-backfill; if
  None appears, the backfill or writer enforcement has regressed.
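
For reference, the dynamic WHERE clause described under retrieval can be sketched roughly as below. This is an illustration only, not the code in this commit: build_where is a hypothetical helper name, and the sketch assumes the embeddings table exposes source and type columns and that the driver adapts a Python list to a Postgres array (as psycopg2 does). Using "<> ALL(%s)" with a list avoids special-casing a single excluded source.

```python
from typing import List, Optional, Sequence, Tuple


def build_where(
    excluded_sources: Sequence[str] = (),
    type_filter: Optional[str] = None,
) -> Tuple[str, List[object]]:
    """Return a WHERE fragment and its parameters (hypothetical helper)."""
    clauses: List[str] = []
    params: List[object] = []

    if excluded_sources:
        # Assumption: the driver adapts a Python list to a Postgres array,
        # so one placeholder covers one or many excluded sources.
        clauses.append("source <> ALL(%s)")
        params.append(list(excluded_sources))

    if type_filter is not None:
        # Only applies to pgvector retrieval; Graphiti chunks have no type.
        clauses.append("type = %s")
        params.append(type_filter)

    where_sql = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    return where_sql, params
```

With no exclusions and no filter the fragment is empty, so the same SELECT template serves all five combinations listed under B3.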

Verification:
  B1  force re-ingest of "Finite and infinite games -- James Carse.pdf":
      all 84 chunks preserved created_at=2026-04-27T06:11:55Z
  B2  missing-type assertion raises ValueError, no row leaked to
      embeddings
  B3  ast.parse(*) clean; EXPLAIN renders for {no excl/no filter,
      type_filter only, excl 2 elems, excl 1 elem edge case, both};
      all five plans use HNSW index scan with correct Filter clauses
  C1  retrieve("nrem") returns 8 chunks each carrying "type" key
  C2  type_distribution = {'document': 5, 'chatgpt_conversation': 3}:
      2 distinct types, 62.5/37.5 split (looser bar: >=2 types, no
      single type >=90%)

The type and created_at fields are now load-bearing: every dream
manifest emits type_distribution per stage. Reverting the backfill makes
the distribution show NULLs at every dream run.
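
The C2 check can be reproduced with a small sketch like the one below. It is not the dream_pipeline code: type_distribution and passes_looser_bar are hypothetical names, and the chunk dicts are assumed to carry an optional "type" key as described above. The thresholds follow the looser bar stated in C2 (at least 2 types, no single type at 90% or more).

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Optional


def type_distribution(chunks: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, int]:
    """Count chunk types, skipping chunks whose type is missing or None."""
    return dict(
        Counter(c["type"] for c in chunks if c.get("type") is not None)
    )


def passes_looser_bar(dist: Mapping[str, int], max_share: float = 0.9) -> bool:
    """True if there are >=2 types and no single type reaches max_share."""
    total = sum(dist.values())
    if total == 0:
        return False
    return len(dist) >= 2 and max(dist.values()) / total < max_share


# Illustrative data matching the C2 counts, plus one untyped chunk
# (standing in for a Graphiti chunk) that must not be counted.
chunks = (
    [{"type": "document"}] * 5
    + [{"type": "chatgpt_conversation"}] * 3
    + [{"type": None}]
)
dist = type_distribution(chunks)
ok = passes_looser_bar(dist)
```

Here dist holds 5 document and 3 chatgpt_conversation chunks, the largest share is 5/8 = 62.5%, and the bar passes.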
2026-05-04 00:15:43 +00:00
parent 3c7c228db0
commit 7c7b649775
3 changed files with 69 additions and 28 deletions
@@ -126,6 +126,15 @@ def run():
        embeddings = embedder.encode(texts, show_progress_bar=False).tolist()
        
        for (chunk_id, chunk_text, meta), embedding in zip(new_chunks, embeddings):
+            if not meta.get("type"):
+                raise ValueError(
+                    f"chunk {chunk_id!r} missing 'type'; writers must supply it "
+                    f"(see Improvement #2 in docs/birdai-component-inventory)"
+                )
+            # ON CONFLICT below intentionally overwrites created_at (unlike encoding.py's
+            # COALESCE): an Aaron-AI conversation's created_at tracks convo.updated_at,
+            # which advances on activity. Re-running this script on an active conv
+            # should refresh the timestamp, not preserve the first-seen one.
            cur.execute("""
                INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
                VALUES (%s, %s, %s::vector, %s, %s, %s, %s)