embeddings: enforce type/created_at on writers; manifests carry type_distribution (Improvement #2 part B+C)

Writers now enforce type and created_at:
  - encoding.py: write_embeddings_batch raises ValueError if a row dict has
    a missing or empty 'type'. created_at remains SQL-supplied (NOW()
    server-side). ON CONFLICT DO UPDATE now also rewrites type=EXCLUDED.type
    and preserves the original created_at via
    COALESCE(embeddings.created_at, EXCLUDED.created_at), which takes the new
    value only when the stored one is NULL. A re-ingest therefore
    re-classifies type but does not overwrite a backfilled mtime.
  - ingest_conversations.py: same ValueError check on 'type'. ON CONFLICT
    intentionally keeps EXCLUDED.created_at semantics (Aaron-AI conversation
    created_at tracks convo.updated_at, so re-runs should refresh it).
  - Column-level NOT NULL is not added; application-layer raise gives a
    faster, more debuggable failure than a Postgres constraint error.
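
A minimal sketch of the writer behavior described above, using an in-memory
SQLite table as a stand-in for the Postgres embeddings table. The helper
name write_row, the three-column schema, the explicit timestamp argument
(in place of NOW()), and the sample values are assumptions for illustration,
not the code in encoding.py.

```python
import sqlite3

# Assumption: SQLite stand-in for Postgres; '?' placeholders instead of '%s'.
UPSERT = """
INSERT INTO embeddings (id, type, created_at) VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
    type = excluded.type,
    created_at = COALESCE(embeddings.created_at, excluded.created_at)
"""


def write_row(conn, row, now):
    """Reject rows without a type before any SQL runs, then upsert."""
    if not row.get("type"):
        raise ValueError(f"row {row.get('id')!r} missing 'type'")
    conn.execute(UPSERT, (row["id"], row["type"], now))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (id TEXT PRIMARY KEY, type TEXT, created_at TEXT)"
)

# First ingest, then a re-ingest with a new type and a later timestamp.
write_row(conn, {"id": "c1", "type": "note"}, "2026-04-27T06:11:55Z")
write_row(conn, {"id": "c1", "type": "document"}, "2026-05-04T00:00:00Z")
result = conn.execute(
    "SELECT type, created_at FROM embeddings WHERE id = 'c1'"
).fetchone()

# A row without 'type' raises before the INSERT, so nothing is written.
try:
    write_row(conn, {"id": "c2"}, "2026-05-04T00:00:00Z")
    raised = False
except ValueError:
    raised = True
leaked = conn.execute(
    "SELECT COUNT(*) FROM embeddings WHERE id = 'c2'"
).fetchone()[0]
```

Here result holds the re-classified type alongside the first created_at,
and leaked stays at 0 because the check runs ahead of the INSERT.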

Retrieval propagates type into chunks:
  - retrieve() SELECT now includes type; chunk dicts carry "type": etype.
  - WHERE clause built dynamically from excluded_sources and the new
    --type-filter CLI arg (experimental, default None, pgvector retrieval
    only — Graphiti chunks have no embeddings.type to filter on).
  - retrieve_graphiti unchanged; its chunks lack the type field.
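
One way the dynamic WHERE clause could be assembled is sketched below. The
function name build_where, the `source <> ALL(%s)` form, and the
psycopg-style %s placeholders are assumptions; the commit does not show the
actual query text in retrieve().

```python
def build_where(excluded_sources=None, type_filter=None):
    """Return (where_sql, params) for the optional retrieval filters.

    Hypothetical helper: each filter contributes a clause only when it is
    set, so the no-filter case yields an empty string and no parameters.
    """
    clauses = []
    params = []
    if excluded_sources:
        # A list parameter covers the one-element and many-element cases
        # with the same SQL text.
        clauses.append("source <> ALL(%s)")
        params.append(list(excluded_sources))
    if type_filter is not None:
        clauses.append("type = %s")
        params.append(type_filter)
    where_sql = "WHERE " + " AND ".join(clauses) if clauses else ""
    return where_sql, params


no_filters = build_where()
both = build_where(["a.pdf"], "document")
```

Passing values as parameters rather than interpolating them keeps the
clause safe for arbitrary source names.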

Manifests carry type_distribution per stage:
  - dream_pipeline writes stage_data[<stage>]["type_distribution"] for nrem,
    early_rem, late_rem — a Counter over chunk types, filtering None so
    Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute the
    distribution. Pgvector chunks always carry type post-backfill; if None
    appears, the backfill or writer enforcement has regressed.
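
The per-stage count described above can be sketched as follows. The function
name type_distribution and the sample chunks are assumptions for
illustration; only the Counter-with-None-filter idea comes from this commit.

```python
from collections import Counter


def type_distribution(chunks):
    """Count chunk types, skipping chunks whose type is missing or None."""
    counts = Counter(
        chunk.get("type") for chunk in chunks if chunk.get("type") is not None
    )
    return dict(counts)


# Hypothetical stage output: typed pgvector chunks plus untyped Graphiti ones.
sample_chunks = (
    [{"type": "document"}] * 5
    + [{"type": "chatgpt_conversation"}] * 3
    + [{"text": "graphiti chunk without a type"}] * 2
)
dist = type_distribution(sample_chunks)
```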

Verification:
  B1 force re-ingest of "Finite and infinite games -- James Carse.pdf":
       all 84 chunks preserved created_at=2026-04-27T06:11:55Z
  B2 missing-type assertion raises ValueError, no row leaked to embeddings
  B3 ast.parse(*) clean; EXPLAIN renders for {no excl/no filter,
       type_filter only, excl 2 elems, excl 1 elem edge case, both};
       all five plans use HNSW index scan with correct Filter clauses
  C1 retrieve("nrem") returns 8 chunks each carrying "type" key
  C2 type_distribution = {'document': 5, 'chatgpt_conversation': 3} —
       2 distinct types, 62.5/37.5 split (looser bar: >=2 types,
       no single type >=90%)
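
The C2 bar can be checked mechanically. A small sketch, assuming a
hypothetical helper passes_diversity_bar whose thresholds mirror the
"looser bar" wording above:

```python
def passes_diversity_bar(dist, min_types=2, max_share=0.90):
    """True when dist has >= min_types types and no type reaches max_share."""
    total = sum(dist.values())
    if total == 0:
        return False
    top_share = max(dist.values()) / total
    return len(dist) >= min_types and top_share < max_share


c2 = {"document": 5, "chatgpt_conversation": 3}
shares = {name: 100 * count / sum(c2.values()) for name, count in c2.items()}
ok = passes_diversity_bar(c2)
```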

The type and created_at fields are now load-bearing: every dream manifest
emits type_distribution per stage. Reverting the backfill makes the
distribution show NULLs at every dream run.
2026-05-04 00:15:43 +00:00
parent 3c7c228db0
commit 7c7b649775
3 changed files with 69 additions and 28 deletions
+16 -1
@@ -101,11 +101,24 @@ def chunk_and_embed(text: str,
 def write_embeddings_batch(conn, batch: list[dict]) -> int:
-    """Single canonical INSERT. Sets created_at = NOW() server-side. Commits."""
+    """Single canonical INSERT. Sets created_at = NOW() server-side. Commits.
+    Every row dict must supply 'type'. created_at is SQL-supplied (NOW()), so
+    callers do not need to provide it. The application-layer assertion is the
+    primary enforcement point for type — the column lacks NOT NULL because
+    historical NULLs were resolved by the Improvement #2 backfill, and a
+    Python-level raise gives a faster, more debuggable failure than a
+    Postgres constraint error.
+    """
     if not batch:
         return 0
     cur = conn.cursor()
     for row in batch:
+        if not row.get("type"):
+            raise ValueError(
+                f"row {row.get('id')!r} missing 'type'; writers must supply it "
+                f"(see Improvement #2 in docs/birdai-component-inventory)"
+            )
         cur.execute("""
             INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
             VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
@@ -113,6 +126,8 @@ def write_embeddings_batch(conn, batch: list[dict]) -> int:
             document = EXCLUDED.document,
             embedding = EXCLUDED.embedding,
             source = EXCLUDED.source,
+            type = EXCLUDED.type,
+            created_at = COALESCE(embeddings.created_at, EXCLUDED.created_at),
             metadata = EXCLUDED.metadata
         """, (row["id"], row["document"], row["embedding"],
               row["source"], row["type"], json.dumps(row["metadata"])))