Adds native FalkorDB vector index support to graphiti-core's FalkorDB
driver. Three patched files (graph_queries.py, falkordb_driver.py,
falkordb/operations/search_ops.py) plus apply.sh that backs up venv
files and copies patches over.
Why this exists: graphiti-core 0.29.0 builds similarity queries using
interpreted Cypher cosine math (vec.cosineDistance) which produces a
full-table scan over Entity/RELATES_TO/Community nodes for every search.
At ~4,000+ entities, single-episode add_episode took 8+ minutes for the
resolve-against-existing-graph step and bulk ingest hung indefinitely.
FalkorDB itself supports db.idx.vector.queryNodes and queryRelationships
procedures backed by HNSW indexes; the driver just doesn't use them.
Patches:
1. graph_queries.py — adds get_vector_indices() returning CREATE VECTOR
INDEX statements for FalkorDB (Entity.name_embedding,
RELATES_TO.fact_embedding, Community.name_embedding). HNSW with
cosine similarity. Adds VECTOR_INDEX_CANDIDATE_MULTIPLIER for
over-fetch when WHERE filters reject some top-k results. Original
get_vector_cosine_func_query preserved for fallback.
2. falkordb_driver.py — extends build_indices_and_constraints() to call
get_vector_indices() alongside range and fulltext. Adds cache
invalidation hook so the search_ops dispatcher re-probes for indexes
after they're built.
3. falkordb/operations/search_ops.py — adds vector-index dispatcher
helpers (_falkordb_vector_index_exists with module-level cache,
_falkordb_vector_node_search_cypher, _falkordb_vector_edge_search_cypher).
Rewrites the three vector-similarity call sites (Entity.name_embedding,
RELATES_TO.fact_embedding, Community.name_embedding) to use
db.idx.vector.queryNodes / queryRelationships when available, fall
back to interpreted-Cypher cosine math when not. Index existence
probed once per (label, attribute, entity_type) and cached.
Empirical result: single-episode add_episode against a 4,277-entity
graph went from indefinite hang to 8.2 seconds. Bulk re-ingest of
already-known content (worst case for entity dedup) committed in 60ms.
Activation requires bridging driver._search_ops to driver.search_interface
in the sidecar (see graphiti_service.py). graphiti-core declares
search_interface as the dispatcher attribute but never assigns the
per-driver implementation to it — naming mismatch in their internal
refactor. The bridge is one line in our sidecar's lifespan.
Upstream candidate: this is a known gap (referenced indirectly in
upstream issue #1263 RFC for external vector store overlay). Maintainers'
attention is on Milvus/Qdrant/Pinecone overlay; this is the FalkorDB-
native alternative for users who don't want to run a separate vector DB.
PR after empirical validation in production. Apache-2.0 graphiti-core
source is NOT vendored — backups/ is gitignored to keep the upstream
source out of this repo.
Adds EXTRACTION_INSTRUCTIONS_V1 constant passed to the sidecar via
custom_extraction_instructions on both bulk and single-episode pathways.
graphiti-core inserts the text into entity and edge extraction prompts
only; it does NOT enter dedup prompts (that's the encoder-stays-naive
commitment).
Architectural posture: the encoder is content-naive. It does not draw on
prior knowledge of the user, the substrate, or the cycle's accumulated
work. Schema and personality live in the cycle's consolidated substrate
where the dream phase shapes them. The encoder produces source-grounded
ground truth for the cycle to work from.
Empirical validation in tonight's smoke test: 30+ verb-shaped predicates
from 3 chunks of real content, including IS_AUTOBIOGRAPHICAL_TO,
INFORMED_DESIGN_OF, EVALUATED_DOMAIN_PURITY, DISCONFIRMED_HYPOTHESIS_ABOUT.
Compare to default extraction's 4 predicate types across 22,289 edges.
RELATES_TO appears once as appropriate fallback rather than collapsing
everything generic.
Bumps WORKER_VERSION to 2.4.
Adds graphiti_jobs table for sidecar's async ingest queue and
external_job_id column on stage_3_queue for worker's polling reference.
Tonight's smoke test diagnosed that bulk ingest against the 4,222-entity
graph commits successfully but the worker's 600s HTTP read-timeout fires
before the sidecar's response returns. Three days of 'saga deadlock'
failures were false negatives — the work succeeded; the worker just
stopped listening. Pattern 1 separates submission from completion
observation so the worker can't false-negative this way.
Migration only — sidecar and worker code changes follow in subsequent
commits.
Adds migrations/ directory with README documenting the convention
(timestamped filenames, idempotent SQL, forward-only, single change per file).
First migration is the Stage 3 queue routing columns added live during
Phase A patches today: state_type, state_type_confidence,
supersedes_prior_state, state_type_rationale, plus index on supersedes.
Required by stage2_worker.py >= 2.2 and stage3_worker.py >= 2.3.
Idempotent (IF NOT EXISTS), safe to re-apply. Verified by re-applying
against the live DB — no changes, no errors.
Closes a reproducibility gap: a fresh DB provisioned from git would crash
on first Stage 2 enqueue without these columns. Now the SQL travels with
the code.
Reads new routing columns from stage_3_queue (state_type, state_type_confidence,
supersedes_prior_state, state_type_rationale) and dispatches each row to one of
two ingest pathways:
- BULK pathway (existing, renamed from ingest_to_graphiti to ingest_bulk):
safer-cheaper default. Used when supersedes=false OR confidence=low OR
routing fields are NULL (legacy rows). Skips edge invalidation per
graphiti-core's bulk semantics.
- SINGLE-EPISODE pathway (new, ingest_single_episode): used only when
supersedes_prior_state=true AND confidence in {medium, high}. Per-chunk
POST to /episodes (singular endpoint) with shared saga tag. Each call
independent — own timeout, own retry envelope.
Routing decision isolated in should_route_single_episode() with unit-tested
truth table covering all eight (supersedes × confidence) combinations.
Per-chunk heartbeat (heartbeat_row): single-episode pathway updates
stage_3_queue.started_at after each successful chunk POST so a long-running
document doesn't cross the 10-minute stale threshold mid-process and get
re-dequeued. started_at semantics now: 'last activity timestamp' rather
than 'began at'. Best-effort; failures logged not raised.
Partial-success on chunk failure: previously-committed chunks stay in the
graph; the function raises with detail (single_episode_partial: chunk N/M
failed, succeeded K). The row is marked failed_at with that detail. Re-
ingestion would re-POST chunks 1..N-1 against the graph; graphiti's dedup
handles them as no-ops.
DB connection scoping: process_one no longer holds one Postgres connection
across the whole ingest call (which can run an hour for long single-episode
documents). Each DB write gets a short-lived connection.
Phase A item 3 of three. Closes the mechanical-patches block. Item 4
(custom_extraction_instructions text design) is the remaining intellectual
work; sidecar and worker plumbing is now ready for it.
- BulkEpisodeRequest: new optional custom_extraction_instructions field
with comment noting graphiti-core inserts it into extract_nodes/extract_edges
prompts only, NOT dedupe prompts (verified by reading prompts directory)
- EpisodeRequest: new optional saga field, plumbed through to add_episode
for upcoming Stage 3 single-episode pathway
- Both handlers use conditional kwargs construction so existing callers
see no behavioral change
Phase A item 1 of three. Items 2 (stage2_worker) and 3 (stage3_worker) follow.
Mirrors stage2_worker v2.1 (da98019) resilience fixes:
- Absolute paths for /usr/bin/sudo and /bin/systemctl
- Log stdout/stderr when sidecar restart fails
- Reset consecutive_failures even when wedge recovery fails (prevents
permanent stuck state if restart itself is broken)
Three classes of silent failure converted to clean terminal states:
- Mistral timeout: previously left rows in zombie state (started_at set,
failed_at null, attempts incremented past retry threshold, row invisible
to selection query). Now sets failed_at with reason
'mistral_timeout_after_300s'. Surfaced 2026-05-01 when 17 documents
accumulated in this state during the Stage 3 saga deadlock incident.
- Mistral parse failure: run_mistral returns {'error': 'parse_failed'} on
JSON decode failure but process_one wasn't checking, so empty orientation
('Active frames: . Frame relationships: ...') was shipped to Stage 3.
This is F22 from the 2026-04-30 code review. Now sets failed_at with
reason 'mistral_parse_failure'.
- Wedge recovery hammering: consecutive_failures was only reset on
successful Ollama restart. With the sudo path bug (also fixed here),
recovery always failed, so every subsequent failure re-attempted restart.
Now resets the counter regardless and logs the failure visibly.
Also: subprocess.run now uses absolute paths (/usr/bin/sudo,
/bin/systemctl) instead of relying on PATH, fixing the 'No such file or
directory: sudo' error that broke Stage 2's recover_wedge() since
deployment. F45-adjacent — sudoers entries were added 2026-05-01 but the
PATH issue was masking that fix.
Worker version bumped to 2.1 to match Stage 3's resilience patch level.
Production incident 2026-05-01: F14 re-cascade attempt surfaced three
compounding issues in cascade resilience.
stage3_worker.py changes:
- MAX_CHUNKS_PER_SAGA=10 — large documents split into multiple bulk
commits, all sharing the same saga tag for Graphiti document linking.
Original implementation sent all chunks as one saga; 17-19 chunk sagas
deadlocked sidecar's Python-side coordination.
- recover_wedge() function — restarts aaronai-graphiti.service when
consecutive_failures hits threshold. Mirrors Stage 2 pattern.
- run() loop adds consecutive_failures counter with threshold-2
escalation. Resolves F28 + F29 from code review.
- Worker version bumped 2.0 -> 2.1.
- post_bulk() helper extracts shared HTTP POST + error handling.
Outside-repo changes (system config, separately documented):
- WatchdogSec=600 commented in stage2 + stage3 systemd unit files.
Workers have no sd_notify support; per-request timeouts in code
handle the actual failure modes.
- /etc/sudoers.d/aaron-aaronai created with NOPASSWD entries for
systemctl restart ollama and restart aaronai-graphiti.service.
Stage 2's existing recover_wedge() was silently broken since
deployment due to this gap.
.gitignore — added rules for *.bak files, runtime artifacts
(watcher_heartbeat, dreamer_state.json, corpus_integrity_report.json,
watcher_state.json, watcher_status.json), Python cruft, virtual env,
.env, editor/OS files, and Aaron AI runtime data (conversations.db,
sessions.db, memory.md, settings.json).
Untracked 11 files that shouldn't have been committed in 465f2f7
(this morning): backup files and runtime artifacts.
Re-cascading Shop Class (414KB) and BirdAI-Experiments-Log.md (192KB)
through the patched worker after re-extracting full text from disk.
Cascade in progress at commit time.
- api.py: strip CV pinning workaround (parity violation, see architecture doc)
- dream.py: F1 — retrieve_graphiti() now accepts excluded_sources, over-fetches
3x and filters in-process. Was silently dropping the parameter; would have
confounded E3 with broken cross-stage exclusion in Graphiti arm.
- watcher.py + ingest.py: F14 — drop full_text[:50000] truncation. Was
propagating through entire cascade. Postgres TEXT can hold up to 1GB.
- corpus_integrity.py: F37 — same truncation, third path now clean.
Backups: api.py.bak.*, dream.py.bak.*, watcher.py.bak.*, ingest.py.bak.*,
corpus_integrity.py.bak.* timestamped pre-fix.
Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected
by F14, 414KB).