stage3_worker: v2.3 — bulk-vs-single-episode routing on Stage 2 state-type

Reads new routing columns from stage_3_queue (state_type, state_type_confidence, supersedes_prior_state, state_type_rationale) and dispatches each row to one of two ingest pathways: - BULK pathway (existing, renamed from ingest_to_graphiti to ingest_bulk): safer-cheaper default. Used when supersedes=false OR confidence=low OR routing fields are NULL (legacy rows). Skips edge invalidation per graphiti-core's bulk semantics. - SINGLE-EPISODE pathway (new, ingest_single_episode): used only when supersedes_prior_state=true AND confidence in {medium, high}. Per-chunk POST to /episodes (singular endpoint) with shared saga tag. Each call independent — own timeout, own retry envelope. Routing decision isolated in should_route_single_episode() with unit-tested truth table covering all eight (supersedes × confidence) combinations. Per-chunk heartbeat (heartbeat_row): single-episode pathway updates stage_3_queue.started_at after each successful chunk POST so a long-running document doesn't cross the 10-minute stale threshold mid-process and get re-dequeued. started_at semantics now: 'last activity timestamp' rather than 'began at'. Best-effort; failures logged not raised. Partial-success on chunk failure: previously-committed chunks stay in the graph; the function raises with detail (single_episode_partial: chunk N/M failed, succeeded K). The row is marked failed_at with that detail. Re- ingestion would re-POST chunks 1..N-1 against the graph; graphiti's dedup handles them as no-ops. DB connection scoping: process_one no longer holds one Postgres connection across the whole ingest call (which can run an hour for long single-episode documents). Each DB write gets a short-lived connection. Phase A item 3 of three. Closes the mechanical-patches block. Item 4 (custom_extraction_instructions text design) is the remaining intellectual work; sidecar and worker plumbing is now ready for it.
2026-05-01 19:07:41 +00:00
parent 70e87e3ab5
commit e7de7fb64b
1 changed files with 202 additions and 29 deletions
@@ -1,22 +1,45 @@
 #!/usr/bin/env python3
 """
-Stage 3 Worker — Graphiti Ingest with Taxonomy-Free Orientation
-Polls stage_3_queue, chunks documents, ingests as episodic saga to Graphiti.
+Stage 3 Worker — Graphiti Ingest with Bulk-vs-Single-Episode Routing

-Chunking rationale: Large documents sent as single episodes cause FalkorDB
-write lock contention during entity deduplication. Chunking at ~500 words
-(matching Stage 1) produces smaller deduplication passes that don't block.
-Each document's chunks are linked via Graphiti's saga mechanism, preserving
-document structure in the graph.
+Polls stage_3_queue, routes each row to one of two ingest pathways based on
+state-type classification produced by Stage 2:

-Saga-size limit (MAX_CHUNKS_PER_SAGA): 2026-05-01 incident showed sagas of
-17 and 19 chunks deadlock the sidecar's Python-side coordination. Documents
-producing more than MAX_CHUNKS_PER_SAGA chunks are split into multiple bulk
-commits, each tagged with the same saga value so Graphiti still links them.
+  - BULK pathway (existing): supersedes_prior_state=false OR confidence=low
+    OR routing fields missing. Fast, no temporal invalidation.

-Wedge detection: 2026-05-01 incident also surfaced the asymmetry with Stage 2 —
-Stage 3 had no recovery path when the sidecar deadlocked. Now mirrors Stage 2's
-consecutive_failures pattern with sidecar restart on threshold.
+  - SINGLE-EPISODE pathway (new): supersedes_prior_state=true AND
+    confidence in {medium, high}. Per-chunk POST to /episodes with shared
+    saga tag, full edge invalidation, per-chunk timeout/retry independence.
+
+Routing rationale: the single-episode pathway is the correct API per
+graphiti-core's docs for content that supersedes prior facts (it does
+edge invalidation that bulk skips). It costs more per chunk because of
+the resolve_edge LLM call; the routing rule keeps that cost bounded to
+content that actually needs it.
+
+Chunking rationale (preserved from prior versions): Large documents sent
+as single episodes cause FalkorDB write lock contention during entity
+deduplication. Chunking at ~500 words (matching Stage 1) produces smaller
+deduplication passes that don't block. Each document's chunks are linked
+via Graphiti's saga mechanism, preserving document structure in the graph.
+
+Per-chunk heartbeat: single-episode pathway updates stage_3_queue.started_at
+after each successful chunk POST so a long-running document doesn't cross
+the 10-minute stale threshold mid-process and get re-dequeued by another
+worker (or the same worker on next loop iteration). started_at thus means
+"last activity timestamp" rather than "began at" — semantics that match
+the dequeue query's intent (catch dead workers, not slow ones).
+
+Saga-size limit (MAX_CHUNKS_PER_SAGA): 2026-05-01 incident showed bulk
+sagas of 17 and 19 chunks deadlock the sidecar's Python-side coordination.
+Documents producing more than MAX_CHUNKS_PER_SAGA chunks on the bulk
+pathway are split into multiple bulk commits, each tagged with the same
+saga value so Graphiti still links them. The single-episode pathway
+doesn't need this split since each chunk is its own POST.
+
+Wedge detection: mirrors Stage 2's consecutive_failures pattern with
+sidecar restart on threshold.

 Runs as systemd service: aaronai-stage3.service
 """
@@ -44,17 +67,23 @@ HEARTBEAT_FILE = Path("/var/log/aaronai/stage3-heartbeat")
 RETRY_ATTEMPTS = 2
 POLL_INTERVAL = 5
 INGEST_TIMEOUT = 600
-WORKER_VERSION = "2.2"
+WORKER_VERSION = "2.3"

 # Match Stage 1 chunking parameters
 CHUNK_SIZE_WORDS = 500
 CHUNK_OVERLAP_WORDS = 50
 # Documents under this threshold ingested as single episode (no chunking overhead)
 SINGLE_EPISODE_THRESHOLD = 1500
-# Sagas larger than this many chunks split into multiple commits
+# Bulk-pathway sagas larger than this many chunks split into multiple commits
 # (2026-05-01 incident: 17 and 19 chunk sagas deadlocked sidecar)
 MAX_CHUNKS_PER_SAGA = 10

+# Routing rule: single-episode pathway requires both signals positive.
+# Anything else (false, NULL, low confidence) routes to bulk — the
+# safer-cheaper default. Mistral parse drift can't accidentally trigger
+# the expensive pathway.
+HIGH_TRUST_CONFIDENCE = ("medium", "high")
+

 def get_pg():
    return psycopg2.connect(PG_DSN)
@@ -109,6 +138,22 @@ def chunk_text(text, chunk_size=CHUNK_SIZE_WORDS, overlap=CHUNK_OVERLAP_WORDS):
    return chunks


+def heartbeat_row(row_id):
+    """Refresh stage_3_queue.started_at to NOW() so a long-running single-episode
+    ingest doesn't cross the 10-minute stale threshold mid-process. Called
+    after each successful chunk POST. Best-effort: failures are logged but
+    don't fail the chunk — the worst case is a stale-threshold re-dequeue,
+    which graphiti's dedup will handle as a no-op."""
+    try:
+        pg = get_pg()
+        cur = pg.cursor()
+        cur.execute("UPDATE stage_3_queue SET started_at = NOW() WHERE id = %s", (row_id,))
+        pg.commit()
+        pg.close()
+    except Exception as e:
+        log.warning(f"  Heartbeat update failed (continuing): {e}")
+
+
 def post_bulk(payload, batch_label=""):
    """Single POST to /episodes/bulk with consistent error handling."""
    resp = requests.post(
@@ -122,16 +167,32 @@ def post_bulk(payload, batch_label=""):
    return resp.json()


-def ingest_to_graphiti(source, full_text, orientation):
-    """
-    Ingest document to Graphiti as chunked episodes linked by saga.
+def post_episode(payload, episode_label=""):
+    """Single POST to /episodes (singular) with consistent error handling.
+    Used by the single-episode pathway, one call per chunk."""
+    resp = requests.post(
+        f"{GRAPHITI_URL}/episodes",
+        json=payload,
+        timeout=INGEST_TIMEOUT
+    )
+    if not resp.ok:
+        prefix = f"{episode_label} " if episode_label else ""
+        raise RuntimeError(f"{prefix}Sidecar {resp.status_code}: {resp.text[:500]}")
+    return resp.json()
+
+
+def ingest_bulk(source, full_text, orientation):
+    """
+    Bulk-pathway ingest: documents that don't supersede prior state.
+    Skips edge invalidation. Cheap. Three sub-paths by document size:

-    Three paths:
    - Short documents (<SINGLE_EPISODE_THRESHOLD): single episode, no saga
+      [note: 'single episode' here means one bulk call with one item, NOT
+       the single-episode-pathway; naming overlap is unfortunate but local]
    - Medium documents (chunks <= MAX_CHUNKS_PER_SAGA): one bulk commit, saga-linked
    - Large documents (chunks > MAX_CHUNKS_PER_SAGA): split into batches of
-      MAX_CHUNKS_PER_SAGA, each its own bulk commit, all sharing the same saga tag
-      so Graphiti links them as one document unit
+      MAX_CHUNKS_PER_SAGA, each its own bulk commit, all sharing the same saga
+      tag so Graphiti links them as one document unit
    """
    char_length = len(full_text)

@@ -142,7 +203,7 @@ def ingest_to_graphiti(source, full_text, orientation):
            "source_description": orientation,
            "timestamp": datetime.now().isoformat(),
        }]
-        log.info(f"  Single episode ({char_length} chars)")
+        log.info(f"  [bulk] Single episode ({char_length} chars)")
        return post_bulk({"episodes": episodes, "group_id": "aaron"})

    chunks = chunk_text(full_text)
@@ -158,7 +219,7 @@ def ingest_to_graphiti(source, full_text, orientation):
            }
            for i, chunk in enumerate(chunks)
        ]
-        log.info(f"  Chunked into {total_chunks} episodes ({char_length} chars)")
+        log.info(f"  [bulk] Chunked into {total_chunks} episodes ({char_length} chars)")
        return post_bulk(
            {"episodes": episodes, "group_id": "aaron", "saga": source}
        )
@@ -166,7 +227,7 @@ def ingest_to_graphiti(source, full_text, orientation):
    # Large document: split into batches sharing the same saga tag
    batch_count = (total_chunks + MAX_CHUNKS_PER_SAGA - 1) // MAX_CHUNKS_PER_SAGA
    log.info(
-        f"  Chunked into {total_chunks} episodes ({char_length} chars); "
+        f"  [bulk] Chunked into {total_chunks} episodes ({char_length} chars); "
        f"splitting into {batch_count} batches of up to {MAX_CHUNKS_PER_SAGA}"
    )
    last_result = None
@@ -193,9 +254,110 @@ def ingest_to_graphiti(source, full_text, orientation):
    return last_result


+def ingest_single_episode(row_id, source, full_text, orientation):
+    """
+    Single-episode pathway: documents that supersede prior state with
+    medium-or-high confidence. Each chunk is its own POST to /episodes
+    with shared saga tag. Each call independent: own timeout, own retry
+    envelope, own failure semantics.
+
+    Partial-success behavior: if chunk N of total fails, chunks 1..N-1
+    stay committed (graphiti has already accepted them) and the function
+    raises with detail about which chunk failed and how many succeeded.
+    The caller marks the row failed_at with that detail; the operator
+    decides whether to re-enqueue. Re-ingestion will re-POST chunks 1..N-1
+    against the graph; graphiti's dedup will handle them as no-ops.
+
+    Heartbeats stage_3_queue.started_at after each successful chunk so the
+    row doesn't cross the 10-minute stale threshold while actively progressing.
+    """
+    char_length = len(full_text)
+
+    # Short documents: one POST, no chunking, no saga
+    if char_length < SINGLE_EPISODE_THRESHOLD:
+        payload = {
+            "name": source,
+            "content": full_text,
+            "source_description": orientation,
+            "group_id": "aaron",
+            "timestamp": datetime.now().isoformat(),
+        }
+        log.info(f"  [single-ep] Single episode, no chunking ({char_length} chars)")
+        return post_episode(payload, episode_label="single-ep")
+
+    chunks = chunk_text(full_text)
+    total_chunks = len(chunks)
+    log.info(
+        f"  [single-ep] Chunked into {total_chunks} episodes ({char_length} chars); "
+        f"per-chunk POSTs with shared saga"
+    )
+
+    succeeded = 0
+    for i, chunk in enumerate(chunks):
+        chunk_num = i + 1
+        payload = {
+            "name": f"{source} [{chunk_num}/{total_chunks}]",
+            "content": chunk,
+            "source_description": orientation,
+            "group_id": "aaron",
+            "saga": source,
+            "timestamp": datetime.now().isoformat(),
+        }
+        try:
+            post_episode(payload, episode_label=f"chunk {chunk_num}/{total_chunks}")
+            succeeded += 1
+            log.info(f"    chunk {chunk_num}/{total_chunks} committed")
+            heartbeat_row(row_id)
+        except Exception as e:
+            # Annotate the exception with partial-success detail so the
+            # caller can write a clean failure_reason. Re-raise to abort
+            # the document; previously-committed chunks stay in the graph.
+            raise RuntimeError(
+                f"single_episode_partial: chunk {chunk_num}/{total_chunks} failed "
+                f"(succeeded: {succeeded}); error: {str(e)[:300]}"
+            ) from e
+
+    log.info(f"  [single-ep] All {total_chunks} chunks committed")
+    return {"ok": True, "chunks_committed": total_chunks}
+
+
+def should_route_single_episode(supersedes_prior_state, state_type_confidence):
+    """Routing decision for Phase A.
+
+    Single-episode pathway requires BOTH:
+      - supersedes_prior_state is true (Mistral judged it temporally superseding)
+      - confidence is medium or high (Mistral was confident enough to trust)
+
+    Anything else routes to bulk: false supersedes, NULL fields (legacy rows
+    pre-dating Stage 2 v2.2), low confidence even on supersedes=true. This
+    is the safer-cheaper default — bulk skips temporal invalidation, which
+    is the right behavior when we're not confident the content needs it.
+    """
+    if not supersedes_prior_state:
+        return False
+    if state_type_confidence not in HIGH_TRUST_CONFIDENCE:
+        return False
+    return True
+
+
 def process_one(row):
-    row_id, source, full_text, orientation = row
-    log.info(f"Ingesting to Graphiti: {source}")
+    (row_id, source, full_text, orientation,
+     state_type, state_type_confidence, supersedes_prior_state,
+     state_type_rationale) = row
+
+    # Route decision
+    use_single_episode = should_route_single_episode(
+        supersedes_prior_state, state_type_confidence
+    )
+    pathway = "single-episode" if use_single_episode else "bulk"
+
+    log.info(
+        f"Ingesting to Graphiti: {source} "
+        f"[pathway={pathway}, state_type={state_type}, "
+        f"conf={state_type_confidence}, supersedes={supersedes_prior_state}]"
+    )
+    if state_type_rationale:
+        log.info(f"  rationale: {state_type_rationale[:200]}")

    pg = get_pg()
    cur = pg.cursor()
@@ -204,9 +366,16 @@ def process_one(row):
        (row_id,)
    )
    pg.commit()
+    pg.close()

    try:
-        result = ingest_to_graphiti(source, full_text, orientation)
+        if use_single_episode:
+            result = ingest_single_episode(row_id, source, full_text, orientation)
+        else:
+            result = ingest_bulk(source, full_text, orientation)
+
+        pg = get_pg()
+        cur = pg.cursor()
        cur.execute("UPDATE stage_3_queue SET completed_at = NOW() WHERE id = %s", (row_id,))
        pg.commit()
        pg.close()
@@ -214,6 +383,8 @@ def process_one(row):
        return True
    except Exception as e:
        log.error(f"  Graphiti ingest failed for {source}: {e}")
+        pg = get_pg()
+        cur = pg.cursor()
        cur.execute("""
            UPDATE stage_3_queue
            SET failed_at = NOW(), failure_reason = %s
@@ -235,7 +406,9 @@ def run():
            pg = get_pg()
            cur = pg.cursor()
            cur.execute("""
-                SELECT id, source, full_text, orientation
+                SELECT id, source, full_text, orientation,
+                       state_type, state_type_confidence, supersedes_prior_state,
+                       state_type_rationale
                FROM stage_3_queue
                WHERE completed_at IS NULL
                  AND failed_at IS NULL