stage3_worker: v2.4 — encoder extraction instructions v1.0

Adds the EXTRACTION_INSTRUCTIONS_V1 constant, passed to the sidecar via custom_extraction_instructions on both the bulk and single-episode pathways. graphiti-core inserts the text into entity and edge extraction prompts only; it does NOT enter dedup prompts (that is the encoder-stays-naive commitment).

Architectural posture: the encoder is content-naive. It does not draw on prior knowledge of the user, the substrate, or the cycle's accumulated work. Schema and personality live in the cycle's consolidated substrate, where the dream phase shapes them. The encoder produces source-grounded ground truth for the cycle to work from.

Empirical validation in tonight's smoke test: 30+ verb-shaped predicates from 3 chunks of real content, including IS_AUTOBIOGRAPHICAL_TO, INFORMED_DESIGN_OF, EVALUATED_DOMAIN_PURITY, and DISCONFIRMED_HYPOTHESIS_ABOUT. Compare that to default extraction's 4 predicate types across 22,289 edges. RELATES_TO appears once, as an appropriate fallback, rather than absorbing every relationship into a generic predicate.

Bumps WORKER_VERSION to 2.4.
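
The predicate comparison in the commit message can be reproduced with a simple tally. This is a minimal sketch, assuming edge names are available as a plain list of strings. The helper `predicate_profile`, its `generic` default, and the sample list are hypothetical; they are not part of this commit or of graphiti-core, and the sample is not the smoke-test output.

```python
from collections import Counter


def predicate_profile(edge_names, generic=("RELATES_TO", "MENTIONS")):
    """Summarize predicate diversity for a list of edge names (hypothetical helper)."""
    # Normalize so "relates_to" and "RELATES_TO" count as one predicate.
    counts = Counter(name.strip().upper() for name in edge_names)
    total = sum(counts.values())
    # Counter returns 0 for predicates that never occur.
    generic_total = sum(counts[g] for g in generic)
    return {
        "total_edges": total,
        "distinct_predicates": len(counts),
        "generic_share": generic_total / total if total else 0.0,
    }


# Illustrative data only.
sample = [
    "INFORMED_DESIGN_OF",
    "IS_AUTOBIOGRAPHICAL_TO",
    "RELATES_TO",
    "INFORMED_DESIGN_OF",
]
profile = predicate_profile(sample)
print(profile)
```

A high `distinct_predicates` count together with a low `generic_share` is the pattern the commit message reports for the new instructions.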
2026-05-02 05:15:17 +00:00
parent a0bf280075
commit d7b2a850c4
1 changed file with 118 additions and 6 deletions
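
The diff adds the same custom_extraction_instructions key to each payload in ingest_bulk and ingest_single_episode. As a hedged illustration of that pattern only, the sketch below uses a hypothetical helper, `with_extraction_instructions`, that is not part of this commit. The placeholder string stands in for the real constant, and the payload values are made up.

```python
# Placeholder standing in for the real constant defined in the worker.
EXTRACTION_INSTRUCTIONS_V1 = "placeholder guidance text"


def with_extraction_instructions(payload, instructions=EXTRACTION_INSTRUCTIONS_V1):
    """Return a copy of payload with the instructions key attached (hypothetical helper)."""
    # Build a new dict so the caller's payload is left unmodified.
    return {**payload, "custom_extraction_instructions": instructions}


# Bulk-style payload: the key sits beside "episodes" at the top level.
base_bulk = {"episodes": [], "group_id": "aaron", "saga": "example-source"}
bulk_payload = with_extraction_instructions(base_bulk)

# Single-episode-style payload: the key sits beside the episode fields.
base_single = {"source_description": "example orientation", "group_id": "aaron"}
single_payload = with_extraction_instructions(base_single)
```

The commit itself inlines the key at each call site, which keeps every payload literal readable on its own; a helper like this would only trade that for a single point of change.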
@@ -1,6 +1,7 @@
 #!/usr/bin/env python3
 """
 Stage 3 Worker — Graphiti Ingest with Bulk-vs-Single-Episode Routing
+                 + Encoder Instructions (v1.0)
 Polls stage_3_queue, routes each row to one of two ingest pathways based on
 state-type classification produced by Stage 2:
@@ -12,6 +13,18 @@ state-type classification produced by Stage 2:
    confidence in {medium, high}. Per-chunk POST to /episodes with shared
    saga tag, full edge invalidation, per-chunk timeout/retry independence.
+Both pathways pass EXTRACTION_INSTRUCTIONS_V1 to the sidecar via
+custom_extraction_instructions, which graphiti-core inserts into entity
+and edge extraction prompts (NOT dedup prompts — that's intentional under
+the encoder-stays-naive commitment).
+Architectural posture: the encoder is content-naïve. It does not draw on
+prior knowledge of the user, the substrate, or the cycle's accumulated
+work. Schema and personality live in the cycle's consolidated substrate,
+where the dream phase shapes them. The encoder produces source-grounded
+ground truth for the cycle to work from. See EXTRACTION_INSTRUCTIONS_V1
+below for the extraction guidance text.
 Routing rationale: the single-episode pathway is the correct API per
 graphiti-core's docs for content that supersedes prior facts (it does
 edge invalidation that bulk skips). It costs more per chunk because of
@@ -67,7 +80,7 @@ HEARTBEAT_FILE = Path("/var/log/aaronai/stage3-heartbeat")
 RETRY_ATTEMPTS = 2
 POLL_INTERVAL = 5
 INGEST_TIMEOUT = 600
-WORKER_VERSION = "2.3"
+WORKER_VERSION = "2.4"
 # Match Stage 1 chunking parameters
 CHUNK_SIZE_WORDS = 500
@@ -84,6 +97,87 @@ MAX_CHUNKS_PER_SAGA = 10
 # the expensive pathway.
 HIGH_TRUST_CONFIDENCE = ("medium", "high")
+# Encoder extraction guidance v1.0 — see module docstring for posture rationale.
+# Passed to graphiti-core via custom_extraction_instructions on both ingest
+# pathways. Inserted into entity-extraction and edge-extraction prompts only;
+# does NOT enter dedup prompts. Encoder-stays-naïve commitment is structural,
+# not versioned: this text gets refined over time but the encoder does not
+# acquire substrate context as the cycle matures.
+EXTRACTION_INSTRUCTIONS_V1 = """\
+EXTRACTION GUIDANCE — BirdAI cascade
+The encoder's job is faithful capture from this chunk's text. It does
+not draw on prior knowledge of the user, the substrate, or the cycle's
+accumulated work. Schema, personality, and inferred context live in
+the cycle's consolidated substrate, where the dream phase shapes them
+through prediction-error replay and speculation. The encoder stays
+content-naïve so the cycle has source-grounded ground truth to work
+from.
+The orientation produced by an upstream pass describes content shape,
+not content interpretation. Use it as forward-facing guidance for what
+to attend to in this document. Do not let it bound or limit what you
+extract.
+PREDICATE NAMING
+Produce semantic predicates that describe the actual relationship the
+text states. Use verbs or verb phrases — "wrote", "advised", "founded",
+"works at", "led to", "contradicts", "is autobiographical to" — not
+generic placeholders. Reserve generic forms (for example, "relates to"
+or "mentions") for cases where the text genuinely does not specify a
+more particular relationship. The verb is the load-bearing part of
+the fact; preserving it is what makes the relationship queryable later.
+EXTRACTION POSTURE
+Extract from this chunk's text as if each entity is encountered fresh.
+Do not try to reconcile entities you find here with entities that
+might already exist elsewhere in the graph. Redundant entity instances
+are acceptable. Cross-document entity resolution is downstream cycle
+work, not extraction work.
+When the same entity appears multiple times within this chunk with
+slightly different spellings — a common artifact of voice transcription —
+prefer the more frequent or more canonical-looking form. Do not invent
+canonical forms; choose among the variants the text actually contains.
+EXTRACT FROM THE SOURCE
+Extract relationships the text states or strongly implies through
+direct linguistic markers ("X led to Y", "X works for Y", "X met Y at
+Z"). Do not extend extraction to relationships the text neither states
+nor directly implies. Inferred relationships are produced by the
+cycle's dream phase as speculative edges with explicit low-confidence
+tagging, where they can be evaluated and either ratified or pruned by
+subsequent cycle work. Encoding-time inference, mixed in with source-
+grounded extraction, would lose the speculation/source distinction the
+cycle's consolidation work relies on.
+DO NOT PRE-EMPT CYCLE WORK
+Do not omit relationships because they seem redundant with prior
+extractions or with the existing graph. Cross-document entity
+resolution and edge consolidation are downstream cycle operations;
+redundant extraction at this stage is intentional. Extracting the
+same fact from multiple sources gives the cycle's consolidation work
+the recurrence signal it relies on.
+EXTRACTION DEPTH
+Use the orientation's frame_relationships and extraction_orientation
+fields to inform what to attend to. If the orientation describes
+cross-domain relational content, look for relationships that bridge
+those domains explicitly, with named predicates for the bridging.
+If the orientation describes single-domain technical content, look
+for the structural relationships internal to that domain.
+Extract every entity and every relationship the text states. Do not
+summarize, do not filter, do not omit content because it seems
+incidental. The orientation tells you what to look for; the source
+text tells you what is there.
+"""
 def get_pg():
    return psycopg2.connect(PG_DSN)
@@ -193,6 +287,8 @@ def ingest_bulk(source, full_text, orientation):
    - Large documents (chunks > MAX_CHUNKS_PER_SAGA): split into batches of
      MAX_CHUNKS_PER_SAGA, each its own bulk commit, all sharing the same saga
      tag so Graphiti links them as one document unit
+    All three sub-paths pass EXTRACTION_INSTRUCTIONS_V1 to the sidecar.
    """
    char_length = len(full_text)
@@ -204,7 +300,11 @@ def ingest_bulk(source, full_text, orientation):
            "timestamp": datetime.now().isoformat(),
        }]
        log.info(f"  [bulk] Single episode ({char_length} chars)")
-        return post_bulk({"episodes": episodes, "group_id": "aaron"})
+        return post_bulk({
+            "episodes": episodes,
+            "group_id": "aaron",
+            "custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
+        })
    chunks = chunk_text(full_text)
    total_chunks = len(chunks)
@@ -220,9 +320,12 @@ def ingest_bulk(source, full_text, orientation):
            for i, chunk in enumerate(chunks)
        ]
        log.info(f"  [bulk] Chunked into {total_chunks} episodes ({char_length} chars)")
-        return post_bulk(
-            {"episodes": episodes, "group_id": "aaron", "saga": source}
-        )
+        return post_bulk({
+            "episodes": episodes,
+            "group_id": "aaron",
+            "saga": source,
+            "custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
+        })
    # Large document: split into batches sharing the same saga tag
    batch_count = (total_chunks + MAX_CHUNKS_PER_SAGA - 1) // MAX_CHUNKS_PER_SAGA
@@ -247,7 +350,12 @@ def ingest_bulk(source, full_text, orientation):
        batch_label = f"batch {batch_idx + 1}/{batch_count} (chunks {start + 1}-{end})"
        log.info(f"    {batch_label} starting")
        last_result = post_bulk(
-            {"episodes": episodes, "group_id": "aaron", "saga": source},
+            {
+                "episodes": episodes,
+                "group_id": "aaron",
+                "saga": source,
+                "custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
+            },
            batch_label=batch_label,
        )
        log.info(f"    {batch_label} committed")
@@ -261,6 +369,8 @@ def ingest_single_episode(row_id, source, full_text, orientation):
    with shared saga tag. Each call independent: own timeout, own retry
    envelope, own failure semantics.
+    Each chunk POST passes EXTRACTION_INSTRUCTIONS_V1 to the sidecar.
    Partial-success behavior: if chunk N of total fails, chunks 1..N-1
    stay committed (graphiti has already accepted them) and the function
    raises with detail about which chunk failed and how many succeeded.
@@ -281,6 +391,7 @@ def ingest_single_episode(row_id, source, full_text, orientation):
            "source_description": orientation,
            "group_id": "aaron",
            "timestamp": datetime.now().isoformat(),
+            "custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
        }
        log.info(f"  [single-ep] Single episode, no chunking ({char_length} chars)")
        return post_episode(payload, episode_label="single-ep")
@@ -302,6 +413,7 @@ def ingest_single_episode(row_id, source, full_text, orientation):
            "group_id": "aaron",
            "saga": source,
            "timestamp": datetime.now().isoformat(),
+            "custom_extraction_instructions": EXTRACTION_INSTRUCTIONS_V1,
        }
        try:
            post_episode(payload, episode_label=f"chunk {chunk_num}/{total_chunks}")