f645b74b1c
Major rewrite of the Graphiti sidecar. Two architectural changes:
PATTERN 1: ASYNC JOB MODEL
Submission and completion are decoupled. POST /episodes and
POST /episodes/bulk return job_id immediately; the actual graphiti-core
work happens in a background asyncio task. Submitters poll
GET /jobs/{job_id} until terminal status (committed | failed).
Why: tonight's smoke test confirmed that bulk ingest against the
4,222-entity graph was committing successfully even when the worker's
HTTP read-timeout fired. The synchronous interface was reporting
spurious failures: the work succeeded, but the worker had stopped
listening at the 10-minute read-timeout. Three days of 'saga deadlock'
failures are therefore better explained as a scaling pathology of
unindexed similarity search, not as substrate deadlocks. Pattern 1
separates submission from completion observation so the worker can no
longer misreport a slow success as a failure.
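A minimal sketch of the submitter side of Pattern 1, for illustration
only. It assumes GET /jobs/{job_id} returns a JSON object with a
"status" field; that response shape, the helper name wait_for_job, the
2-second default interval, and the max_wait_s cap are assumptions, not
the sidecar's actual contract. The HTTP call is injected as fetch_status
so the polling logic can be read (and tested) on its own.

```python
import time
from typing import Callable, Mapping

# Terminal statuses named in this commit message.
TERMINAL = {"committed", "failed"}


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], Mapping[str, object]],
    interval_s: float = 2.0,
    max_wait_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, object]:
    """Poll fetch_status(job_id) until the job is terminal.

    fetch_status is a hypothetical callable that wraps
    GET /jobs/{job_id} and returns the decoded JSON body.
    """
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL:
            return job
        if waited >= max_wait_s:
            # Giving up here means "outcome unknown", not "job failed".
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {waited:.0f}s"
            )
        sleep(interval_s)
        waited += interval_s
```

The design point is that a client-side timeout no longer implies
failure: the job row remains the source of truth and can be polled
again later.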
Architectural commitments:
- One in-flight job per sidecar (per graph). Concurrent jobs against
the same graph would race on graphiti-core's bulk-resolve path (no
transaction boundary). Concurrent multi-tenancy is 'run multiple
sidecars,' not 'make one sidecar concurrency-safe across graphs.'
- Postgres-backed job state. Survives sidecar restart. On startup the
sidecar resets any 'running' rows to 'queued' (their previous run
died); the background worker picks them up naturally.
- Both endpoints async-shaped for parity. Bulk pathway preserved —
load-bearing for first-run corpus migration. Single-episode
preserved — load-bearing for state-superseding content per the
Stage 2/3 routing rule. graphiti-core's add_episode and
add_episode_bulk are unchanged underneath; the async wrapper sits
between the HTTP layer and the library call.
- Polling cadence: 2s flat at the worker. Job claims use FOR UPDATE
SKIP LOCKED, so the design is safe for future multi-sidecar deployment
without changes.
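The claim and restart-recovery steps in the commitments above could
look roughly like the following. This is a sketch under assumptions:
the table name jobs and the columns id, status, and created_at are
hypothetical, and conn is any DB-API connection whose cursors work as
context managers (psycopg-style). The real _job_claim_next and
_startup_recovery helpers may differ.

```python
# Hypothetical schema: jobs(id, status, created_at).
CLAIM_NEXT_SQL = """
UPDATE jobs
   SET status = 'running'
 WHERE id = (
        SELECT id
          FROM jobs
         WHERE status = 'queued'
         ORDER BY created_at
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id
"""

# Rows left in 'running' belong to a run that died with the old process.
RECOVER_SQL = "UPDATE jobs SET status = 'queued' WHERE status = 'running'"


def claim_next(conn):
    """Atomically claim the oldest queued job; return its id or None."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_NEXT_SQL)
        row = cur.fetchone()
    conn.commit()
    return row[0] if row else None


def startup_recovery(conn):
    """Requeue orphaned 'running' rows; return how many were reset."""
    with conn.cursor() as cur:
        cur.execute(RECOVER_SQL)
        reset = cur.rowcount
    conn.commit()
    return reset
```

SKIP LOCKED makes a second claimer skip a row that another transaction
has already locked instead of blocking on it, which is why the same
query would stay safe if more than one process polled the table.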
Postgres helpers (_pg, _job_insert, _job_get, _job_claim_next,
_job_complete, _job_fail, _startup_recovery) replace the synchronous
graphiti.add_episode call with persistent job state. Background worker
loop catches everything, logs everything, never dies from an unexpected
error.
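One way to get the "catches everything, logs everything, never dies"
behavior is sketched below. The function name worker_loop and the
injected callables claim_next, run_job, complete, and fail are
hypothetical stand-ins for the Postgres helpers and the graphiti-core
call; the stop event and the poll_s idle delay are also assumptions.

```python
import asyncio
import logging

log = logging.getLogger("sidecar.worker")


async def worker_loop(claim_next, run_job, complete, fail,
                      stop: asyncio.Event, poll_s: float = 2.0) -> None:
    """Process one job at a time until stop is set.

    All four callables are hypothetical async functions supplied by the
    caller; none of them is a graphiti-core API.
    """
    while not stop.is_set():
        job_id = None
        try:
            job_id = await claim_next()
            if job_id is None:
                # Nothing queued: idle for poll_s, waking early on stop.
                try:
                    await asyncio.wait_for(stop.wait(), timeout=poll_s)
                except asyncio.TimeoutError:
                    pass
                continue
            await run_job(job_id)
            await complete(job_id)
        except asyncio.CancelledError:
            # Let task cancellation (shutdown) propagate.
            raise
        except Exception as exc:
            log.exception("job %s failed", job_id)
            if job_id is not None:
                try:
                    await fail(job_id, repr(exc))
                except Exception:
                    log.exception("could not record failure for %s", job_id)
            else:
                # The claim itself failed; back off before retrying.
                await asyncio.sleep(poll_s)
```

Processing strictly one job per iteration also gives the
one-in-flight-job-per-sidecar property without any extra locking inside
the process.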
SEARCH_INTERFACE BRIDGE
graphiti-core 0.29.0 builds FalkorSearchOperations as
driver._search_ops in FalkorDriver.__init__ but never assigns it to
driver.search_interface. search_utils.py:edge_similarity_search and
node_similarity_search check 'if driver.search_interface:' and
delegate when present, falling through to interpreted-Cypher cosine
math when not. The naming mismatch between the two halves of
graphiti-core means the per-driver implementation never gets used.
Bridge after Graphiti instance construction:
    driver.search_interface = driver._search_ops
This activates the per-driver path, which (with our vendored patches)
uses db.idx.vector.queryNodes for FalkorDB's native vector index.
Empirical result: single-episode add_episode against a 4,277-entity
graph went from indefinite hang to 8.2 seconds.
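A slightly more defensive form of the bridge, sketched as a helper. The
function name bridge_search_interface is hypothetical and not part of
graphiti-core; it relies only on the two attribute names described
above. The guards make it a no-op if a future release starts assigning
search_interface itself, or if _search_ops is absent. The
SimpleNamespace object at the end is a stand-in for a driver, used only
to show the call.

```python
from types import SimpleNamespace


def bridge_search_interface(driver) -> bool:
    """Point driver.search_interface at driver._search_ops if unset.

    Returns True when the bridge was applied, False when it was not
    needed or not possible.
    """
    ops = getattr(driver, "_search_ops", None)
    if ops is None:
        return False  # nothing to bridge to
    if getattr(driver, "search_interface", None) is not None:
        return False  # already wired; leave it alone
    driver.search_interface = ops
    return True


# Stand-in driver object (not a real FalkorDriver).
fake_driver = SimpleNamespace(_search_ops=object(), search_interface=None)
applied = bridge_search_interface(fake_driver)
```

Returning a bool makes it easy to log at startup whether the workaround
is still doing anything.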
The bridge is also a candidate for an upstream PR — pick one name and
stick to it across the codebase. Tonight it's local.