Compare commits

46 Commits

Author SHA1 Message Date
aaron 5582549321 dream_observation: drop the 'go quiet' rule from select_mode
The earlier behavior never went quiet — it dreamed every night, even when
that meant repeating itself. The 'return None on null delta' rule was a
synthesis-doc invention (the dreamer-design-spec.md I treated as
authoritative is itself LLM-generated) that didn't match the actual
desired UX. Aaron called this out.

The repetition problem the quiet rule was claimed to solve is already
addressed in the retrieve layer:
- LLM-generated queries from the observation signal vary nightly
- MMR diversity prevents within-night cluster lock-in
- NREM bias toward under-processed chunks (low consolidation_count)
  ensures fresh material gets selected over recently-replayed material

So select_mode now always returns a mode. NREM is the default. Staleness
still routes to Late REM at 3+ days for cross-domain variety. Journal
entries still route to Early REM.
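
A minimal sketch of the rule as described above, not the actual
dream_observation code. The Signal fields and the check order (staleness
before journal entries) are assumptions; the mode names come from the
f682d8c6a0 message.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    # Hypothetical subset of the observation signal; the real field
    # names and structure may differ.
    new_journal_entries: int = 0
    days_since_corpus_change: int = 0


def select_mode(signal: Signal) -> str:
    """Always return a mode; there is no 'go quiet' (None) branch."""
    # Assumption: staleness is checked before journal entries.
    if signal.days_since_corpus_change >= 3:
        return "late-rem"
    if signal.new_journal_entries > 0:
        return "early-rem"
    # NREM is the default when nothing else applies.
    return "nrem"
```

With an empty signal this returns "nrem", so a dream is produced every
night.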
2026-05-22 23:49:27 +00:00
aaron 3ec9a48151 dream_observation: reorder select_mode so 3-day staleness wins over the quiet rule
Bug: the previous order checked the "nothing changed → return None" rule
first, so the spec's "corpus unchanged 3+ days → Late REM (shake things
loose)" branch could never fire. Stasis was permanent — quiet would just
keep returning None forever as long as no new chunks or journals appeared,
regardless of how stale the corpus got.

Fix: check staleness first. Quiet remains the default within the 1-2-day
window the spec implicitly grants for the dreamer to "go quiet rather than
manufacturing novelty." At day 3+, Late REM fires automatically — the
spec's mechanism for breaking out of the silence when the corpus isn't
delivering new material.

Observed symptom that triggered this: dreamer fired 2026-05-21 08:00 and
2026-05-22 08:00, both went quiet. Real cause was no new content (which
is correct quiet behavior for days 1-2), but the bug would have made it
stay quiet indefinitely had we not fixed it before day 3.
2026-05-22 23:18:00 +00:00
aaron 9d09d3fa14 api.py: flush=True on graphiti-push log lines
The background daemon thread that pushes chat turns to Graphiti was using
default-buffered print(), so the success/failure lines never reached the
systemd journal — buffer never flushed because the thread keeps the
interpreter alive. The push itself worked (verified by Episodic nodes
appearing in the graph), just the log was silent.

Surgical fix: pass flush=True on the four print() calls inside
_push_chat_turn_to_graphiti's background worker. Now every push result
lands in the journal as it happens, giving real-time visibility into
whether pushes are succeeding, failing on non-200, hitting a network
error, or raising unexpectedly.

If we add more background-thread logging later, PYTHONUNBUFFERED=1 in the
service environment would solve it globally — but that's overkill for this
one site.
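
For illustration, the pattern looks roughly like this. push_worker and
the log prefix are hypothetical stand-ins, and the `stream` parameter
exists only so the sketch can be exercised without a journal.

```python
import io
import sys
import threading


def push_worker(result: str, stream=None) -> None:
    # Hypothetical worker body; `result` stands in for the HTTP outcome.
    stream = stream or sys.stdout
    # flush=True forces the line out immediately instead of leaving it
    # in the block buffer used when stdout is not a terminal.
    print(f"[graphiti-push] {result}", file=stream, flush=True)


def run_in_background(result: str, stream=None) -> threading.Thread:
    t = threading.Thread(target=push_worker, args=(result, stream), daemon=True)
    t.start()
    return t


# Demo against an in-memory stream.
buf = io.StringIO()
run_in_background("ok status=200", buf).join()
```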
2026-05-20 22:41:02 +00:00
aaron f185ed60cb dream.py: Stage 3+ refactor — LLM-generated queries, MMR, mutable windows, consolidation cursor
Implements the rest of dreamer-design-spec.md's Stage 3 alongside the
prescriptions from the external literature review:

- Hardcoded seed query strings are gone. _llm_generate_queries() produces
  4 mode-appropriate retrieval queries per call from the observation signal
  (Park et al. 2023 reflection pattern). NREM queries probe RECENT additions;
  Early REM bridges associative/emotional threads; Late REM forces
  cross-domain pairs; Lucid decomposes the task. Empirical first-run output:
  queries like "SUNY New Paltz Fall 2026 registration moratorium" instead of
  the fixed "research fabrication teaching practice recent work" — vector
  neighborhood now drifts with what the user has been actually doing.

- TIME_WINDOWS_HOURS makes per-mode retrieval windows mutable
  (dreamer-multimodal-design.md §2's tech-debt item): NREM 72hr / Early REM
  30d / Late REM 90d / Lucid no-window. NULL created_at rows are excluded
  from windowed modes — correct since they predate the cursor by definition.

- NREM bias toward under-processed chunks via "ORDER BY consolidation_count
  ASC" before vector distance. Biologically motivated: sharp-wave-ripple
  replay is tagged/biased, not uniform. Chunks that haven't been replayed
  recently win the tiebreak.

- MMR merge (Carbonell & Goldstein 1998) over the union of all queries'
  candidates. λ=0.5. Directly attacks the cluster-dominance failure mode
  where 8 dossier-narrative variants filled all 8 slots in 5 consecutive
  nights.

- _bump_consolidation_cursor() called after NREM completes. Each source
  used gets consolidation_count += 1 and last_consolidated_at = NOW().
  Tomorrow's signal sees these as more-processed, less under-processed.

- dream_pipeline now runs observe_corpus + select_mode at the top per spec
  lines 27-34. If select_mode returns None — corpus unchanged + no new
  journal entry — pipeline exits with no dream rather than manufacturing
  novelty (spec line 67's "dreamer goes quiet").
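
The MMR merge step above can be sketched as follows. This is a generic
pure-Python version of Maximal Marginal Relevance with λ=0.5, not the
code in dream.py; the candidate layout (id -> vector) is an assumption.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, candidates, k, lam=0.5):
    """Pick k ids from candidates (id -> vector), trading relevance
    to the query against similarity to already-picked items."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(cid):
            relevance = cosine(query_vec, remaining[cid])
            # Highest similarity to anything already selected.
            redundancy = max(
                (cosine(remaining[cid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

With two identical candidates and one distinct one, the second slot goes
to the distinct candidate even though the duplicate is more relevant,
which is the behavior that breaks cluster dominance.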

Back-compat preserved:
- retrieve()'s signature gains `signal` as optional kwarg; default behavior
  calls observe_corpus() inline so dream_single / dream_lucid keep working
  unchanged.
- Graphiti substrate (E3 experiment) path untouched.
- Manifest schema keeps the "query" field; value is now
  "[llm-generated from observation signal]" so historical manifest
  consumers don't break.
2026-05-20 18:11:07 +00:00
aaron a4735053c2 backfill_consolidation_cursor.py: populate cursor from historical dream manifests
One-off script. Walks Journal/Dreams/dream-manifest-*.json and increments
consolidation_count + sets last_consolidated_at for every (manifest, source)
pair. Idempotent — resets the cursor for any touched sources before
backfilling, so reruns don't double-count.
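
A sketch of the reset-then-increment idea using plain dicts in place of
the embeddings table. The data shapes are assumptions made for the
example; the real script reads manifest JSON files and updates rows.

```python
def backfill(cursor, manifests):
    """cursor: source -> {"count": int, "last": str or None}
    manifests: list of (manifest_date, [sources])."""
    touched = {s for _, sources in manifests for s in sources}
    # Reset every touched source first so a rerun starts from zero.
    for s in touched:
        cursor[s] = {"count": 0, "last": None}
    # ISO dates sort chronologically, so "last" ends at the newest one.
    for date, sources in sorted(manifests):
        for s in set(sources):
            cursor[s]["count"] += 1
            cursor[s]["last"] = date
    return cursor
```

Sources that appear in no manifest are left alone, so counts recorded
elsewhere are not wiped by a rerun.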

First run: 7,547 embeddings rows updated across 105 unique sources, 416
(source, manifest_date) pairs across all manifests. Distribution now: 422
chunks at count=18 (the dominant dossier-narrative cluster that fills every
NREM in the last 18 days), long tail down to count=1, 12,011 still at 0.

This makes dream_observation.underprocessed_count meaningful — before, all
counts were 0 so the bottom-quartile percentile was 0 and the signal was
degenerate. After, the signal correctly identifies the 12k chunks that have
never been replayed.
2026-05-20 18:04:43 +00:00
aaron f682d8c6a0 dream_observation.py: Stage 1 + 2 of the design spec — observe and select
Implements `dreamer-design-spec.md` lines 27-74: observe_corpus() returns a
signal vector (new_chunks delta, new_journal_entries, recent_questions over
14-day window, days_since_dream, underprocessed_count derived from the new
consolidation cursor); select_mode() returns one of {nrem, early-rem,
late-rem, lucid} or None per the spec's rules. The None return is the spec's
canonical answer to the repetition problem (line 67) — "dreamer goes quiet
rather than manufacturing novelty."

Standalone for now. Not wired into dream_pipeline yet — that happens in the
retrieve() refactor (task #46). dream.py is unchanged in this commit.

Grounded sources cited in module docstring: Friston Active Inference, sleep
research (Stickgold/Walker/Diekelmann & Born), sharp-wave ripples (Buzsáki).
All three appear in BirdAI-Bibliography.md.

Migration prerequisite (already shipped in the prior commit): consolidation
cursor columns last_consolidated_at + consolidation_count added to
embeddings. Backfill from dream-manifest history is task #49.
2026-05-20 17:57:38 +00:00
aaron 151c756b89 api.py: async chat-turn push to Graphiti
After chat() returns, fire-and-forget background thread POSTs the (user
message + assistant response) as one episode to /episodes. Default extraction
(Sonnet). Errors logged, never raised — chat is not gated on the write.

Wall-clock cost in the background is ~20 min per episode against the
current ~4,300-entity graph. The chat experience is unaffected; the graph
catches up with a delay. Search_facts queries reflect new turns once the
sidecar has finished processing them.

Kill-switch: SKIP_GRAPHITI_CHAT_PUSH=1 in the api service environment
disables the push without code changes. Useful if dedup contention surfaces
under sustained load.
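
Roughly, the fire-and-forget shape with the kill-switch looks like the
sketch below. The `post` callable and the payload keys are hypothetical
(the real code POSTs to /episodes); only the environment variable name
comes from this commit.

```python
import os
import threading


def push_chat_turn(user_msg, assistant_msg, post, env=os.environ):
    """Start a background push; return the thread, or None if disabled."""
    if env.get("SKIP_GRAPHITI_CHAT_PUSH") == "1":
        return None

    def worker():
        try:
            # Hypothetical payload shape for one episode.
            post({"user": user_msg, "assistant": assistant_msg})
        except Exception as exc:
            # Errors are logged, never raised: chat must not depend
            # on the write succeeding.
            print(f"[graphiti-push] failed: {exc!r}", flush=True)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```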

Companions to this commit: search_facts tool (e96bf40), orientation indexer
worker (e96bf40), FalkorDB vector index patches (d2ec20e, 313c0f0).
2026-05-20 05:08:07 +00:00
aaron e96bf40b2f plan B: search_facts chat tool + orientation indexer (read-only Graphiti)
After establishing that single-episode Graphiti writes take ~20 min against
the existing graph (the dedup loop is structurally slow regardless of the
patches, the bridge, or the LLM model), the salvage plan is to stop trying
to write to Graphiti and instead:

  1. Use the existing 4,300-entity graph as a read-only fact layer at chat
     time via a new search_facts tool. Graphiti's /search endpoint is fast
     (~15ms direct, ~400ms over HTTP); the graph is stale-as-of-early-May
     but covers most biographical / relational content that "write me a bio"
     and similar queries care about.

  2. Pipe Stage 2's document-level orientations into pgvector via a new
     orientation_indexer worker. Stage 2 already runs and writes orientation
     text to stage_3_queue for every Mistral-processed document; the worker
     reads those, embeds them, and writes one row per source to embeddings
     with metadata->>'kind'='orientation'. retrieve_documents now ranks
     against both chunk text and document-level concept summaries.

Idempotent: the indexer's "is this already indexed" check is an EXISTS
subquery against embeddings, so restarts and partial runs are safe.
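
The EXISTS guard can be sketched as below. sqlite3 stands in for
Postgres here, and a plain `kind` column replaces metadata->>'kind';
the table layout is an assumption for the example.

```python
import sqlite3


def index_orientation(conn, source, text):
    """Write one orientation row per source; return True if written."""
    already = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM embeddings "
        "WHERE source = ? AND kind = 'orientation')",
        (source,),
    ).fetchone()[0]
    if already:
        return False  # restart or partial rerun: nothing to do
    conn.execute(
        "INSERT INTO embeddings (source, kind, document) "
        "VALUES (?, 'orientation', ?)",
        (source, text),
    )
    conn.commit()
    return True
```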

Out of scope (deliberately): no Graphiti writes from chat, no Stage 2 ->
Graphiti bridge, no draining the 711-item stage_3_queue backlog into
Graphiti. Rich-extraction posture stays a BirdAI concern.
2026-05-20 05:00:03 +00:00
aaron 313c0f0341 graphiti_service.py: bridge driver._search_ops to driver.search_interface
graphiti-core 0.29.0 builds FalkorSearchOperations as driver._search_ops in
FalkorDriver.__init__ but never assigns it to driver.search_interface.
search_utils.py dispatches on search_interface; without this one-line bridge
it falls back to interpreted-Cypher cosine math doing full table scans for
every entity dedup similarity check.
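
In sketch form, with a stand-in class instead of the real FalkorDriver.
That search_interface exists and is None before the bridge is an
assumption; the attribute names are the ones named above.

```python
class FakeDriver:
    # Stand-in for the real driver, for illustration only.
    def __init__(self):
        self._search_ops = object()
        self.search_interface = None


def bridge_search_interface(driver):
    """Expose the per-driver search ops under the dispatcher attribute."""
    ops = getattr(driver, "_search_ops", None)
    if ops is not None and getattr(driver, "search_interface", None) is None:
        driver.search_interface = ops
    return driver
```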

Combined with the vendored patches in graphiti_patches/ (restored in the
previous commit d2ec20e), this activates FalkorDB's native vector index for
the dedup similarity path. Empirical impact (per the original f645b74 commit
message): single-episode add_episode against a ~4,277-entity graph went from
indefinite hang to ~8.2 seconds.

Surgical restore: cherry-picks only the bridge code from f645b74 — not the
Pattern 1 async job model, not the v2.4 extraction instructions, neither of
which we want. Default extraction posture (taxonomy-naïve) stays the
operating mode. Rich-extraction story remains a BirdAI concern.
2026-05-20 04:06:46 +00:00
aaron d2ec20e373 graphiti_patches: vendored FalkorDB vector index support for graphiti-core 0.29.0
Adds native FalkorDB vector index support to graphiti-core's FalkorDB
driver. Three patched files (graph_queries.py, falkordb_driver.py,
falkordb/operations/search_ops.py) plus apply.sh that backs up venv
files and copies patches over.

Why this exists: graphiti-core 0.29.0 builds similarity queries using
interpreted Cypher cosine math (vec.cosineDistance) which produces a
full-table scan over Entity/RELATES_TO/Community nodes for every search.
At ~4,000+ entities, single-episode add_episode took 8+ minutes for the
resolve-against-existing-graph step and bulk ingest hung indefinitely.
FalkorDB itself supports db.idx.vector.queryNodes and queryRelationships
procedures backed by HNSW indexes; the driver just doesn't use them.

Patches:

1. graph_queries.py — adds get_vector_indices() returning CREATE VECTOR
   INDEX statements for FalkorDB (Entity.name_embedding,
   RELATES_TO.fact_embedding, Community.name_embedding). HNSW with
   cosine similarity. Adds VECTOR_INDEX_CANDIDATE_MULTIPLIER for
   over-fetch when WHERE filters reject some top-k results. Original
   get_vector_cosine_func_query preserved for fallback.

2. falkordb_driver.py — extends build_indices_and_constraints() to call
   get_vector_indices() alongside range and fulltext. Adds cache
   invalidation hook so the search_ops dispatcher re-probes for indexes
   after they're built.

3. falkordb/operations/search_ops.py — adds vector-index dispatcher
   helpers (_falkordb_vector_index_exists with module-level cache,
   _falkordb_vector_node_search_cypher, _falkordb_vector_edge_search_cypher).
   Rewrites the three vector-similarity call sites (Entity.name_embedding,
   RELATES_TO.fact_embedding, Community.name_embedding) to use
   db.idx.vector.queryNodes / queryRelationships when available, fall
   back to interpreted-Cypher cosine math when not. Index existence
   probed once per (label, attribute, entity_type) and cached.

Empirical result: single-episode add_episode against a 4,277-entity
graph went from indefinite hang to 8.2 seconds. Bulk re-ingest of
already-known content (worst case for entity dedup) committed in 60ms.

Activation requires bridging driver._search_ops to driver.search_interface
in the sidecar (see graphiti_service.py). graphiti-core declares
search_interface as the dispatcher attribute but never assigns the
per-driver implementation to it — naming mismatch in their internal
refactor. The bridge is one line in our sidecar's lifespan.

Upstream candidate: this is a known gap (referenced indirectly in
upstream issue #1263 RFC for external vector store overlay). Maintainers'
attention is on Milvus/Qdrant/Pinecone overlay; this is the
FalkorDB-native alternative for users who don't want to run a separate
vector DB. Plan: open a PR after empirical validation in production.
Apache-2.0 graphiti-core
source is NOT vendored — backups/ is gitignored to keep the upstream
source out of this repo.
2026-05-20 04:04:24 +00:00
aaron 10bb29290a watcher: handle deletes; sweep_orphans cleans existing phantom chunks
watcher.py now listens for on_deleted events and treats on_moved
destinations that fall outside NEXTCLOUD_PATH (Nextcloud trashbin, moves
to other volumes) as deletes. Both cases call delete_embeddings_for_path
(DELETE WHERE metadata.filepath = ...) and remove_from_state to drop the
file from watcher_state.json so it isn't carried as known-mtime.

Match is by metadata.filepath, not source basename, so files that share a
name across folders don't collide.
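
The "move out of the tree counts as a delete" test can be sketched like
this. is_effective_delete is a hypothetical helper name; the comparison
is on path components, so a sibling directory that merely shares a
string prefix with the root still counts as outside. Symlinks are not
resolved in this sketch.

```python
from pathlib import PurePosixPath


def is_effective_delete(dest_path: str, nextcloud_root: str) -> bool:
    """True when a move destination falls outside the watched root."""
    dest = PurePosixPath(dest_path)
    root = PurePosixPath(nextcloud_root)
    return root not in dest.parents
```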

scripts/sweep_orphans.py is the one-time cleanup for chunks the watcher
missed before this fix:
- Modern pass: rows with metadata.filepath whose file no longer exists.
- Legacy pass: rows with NULL filepath and type='document' whose basename
  isn't anywhere on disk. type='document' restriction skips conversations
  and memory snapshots (synthetic sources, not files on disk).

First run cleaned 629 rows: 628 from moved-file duplicates (e.g., BirdAI
docs that traveled across Journal/, Library/, Journal/Projects/BirdAI/)
plus the AARON_NELSON_BIO.pdf phantom Aaron flagged.
2026-05-20 02:52:00 +00:00
aaron 9bb083f065 chat: cap retrieve_documents per turn, truncate displayed citations, broaden lock-file skip
- MAX_RETRIEVALS_PER_TURN (5): after five retrieve_documents calls in a single
  turn, further calls return a budget-exhausted message instead of executing.
  Caps cost on runaway multi-query loops without forbidding compound questions.

- MAX_CITED_SOURCES (5): accumulated_sources was growing to 14+ entries across
  multiple tool calls and showing chunks Claude never actually used. Cap the
  list returned to the UI at 5, preserving insertion order so the
  highest-relevance early-call results survive. Proper fix (Claude-driven
  inline citations) is bigger work, noted for later.

- ingest.py lock-file skip: changed prefix tuple from ("~$", ".") to ("~", ".")
  so it catches Office lock files even when Nextcloud's filesystem encoding has
  mangled the "$" into a unicode replacement char. Matches what watcher.py
  already does.
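
The three changes above, as a minimal sketch. The constant names and
values are from this commit; TurnBudget, the exhausted-budget wording,
and the helper names are made up for the example.

```python
MAX_RETRIEVALS_PER_TURN = 5
MAX_CITED_SOURCES = 5
LOCK_PREFIXES = ("~", ".")
BUDGET_EXHAUSTED = "Retrieval budget exhausted for this turn."  # hypothetical wording


class TurnBudget:
    """Counts retrieve_documents calls within one chat turn."""

    def __init__(self, limit=MAX_RETRIEVALS_PER_TURN):
        self.limit = limit
        self.used = 0

    def call(self, retrieve, query):
        if self.used >= self.limit:
            return BUDGET_EXHAUSTED  # do not execute the retrieval
        self.used += 1
        return retrieve(query)


def cap_sources(accumulated):
    # Keep insertion order so early-call results survive the cut.
    return accumulated[:MAX_CITED_SOURCES]


def is_lock_file(basename: str) -> bool:
    # "~" alone also matches names whose "$" was mangled in transit.
    return basename.startswith(LOCK_PREFIXES)
```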
2026-05-20 02:22:54 +00:00
aaron 430ea239dd api.py: drop save_document preview escape hatch — two-turn separation now unconditional
Previous prompt let Aaron skip the preview if he asked up front. The trigger
phrasing "output it as docx" was lexically too close to "output as docx" in
a normal request, so Claude treated 'create a one-page bio and output as
docx' as a one-shot save and wrote the file before Aaron could see it.
Removed the escape hatch. Draft-then-commit is now the only flow.
2026-05-20 01:06:40 +00:00
aaron 0a1e2b4f61 api.py: preview-then-commit flow for save_document
The previous system prompt instructed Claude to skip duplicating document
content in chat and write the file directly. That produced no-preview UX:
the user asked for a bio and the docx appeared in Drafts/ before they had
a chance to read or refine it. Reversed: Claude now drafts in chat first,
waits for an explicit save signal, and only then calls save_document. The
explicit "skip preview" escape hatch is preserved for one-shot flows.
2026-05-20 01:01:45 +00:00
aaron 8c2c597687 api.py: save_document — distinguish PATH miss from missing install in error
The systemd unit pins PATH to the venv only, so subprocess.run(['pandoc', ...])
raised FileNotFoundError even though pandoc was installed at /usr/bin/pandoc.
The handler's "pandoc not installed" message was misleading — pandoc was
reachable from a login shell but not from the service. Rephrased to point at
the actual cause: the service's PATH. The systemd drop-in to extend PATH is
not committed here (lives at /etc/systemd/system/aaronai.service.d/path.conf
on the host).
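
One way to tell the two cases apart, as a sketch rather than the actual
handler. diagnose_missing and the fallback directory list are
assumptions; shutil.which with an explicit `path` is standard library.

```python
import shutil

# Assumption: typical system locations to probe on a Linux host.
SYSTEM_DIRS = "/usr/local/bin:/usr/bin:/bin"


def diagnose_missing(cmd, service_path=None, fallback_path=SYSTEM_DIRS):
    """Classify why `cmd` could not be launched from the service."""
    # service_path=None means "use this process's own PATH".
    if shutil.which(cmd, path=service_path):
        return "ok"
    if shutil.which(cmd, path=fallback_path):
        # Installed, but not visible through the service's PATH.
        return "path-miss"
    return "not-installed"
```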
2026-05-20 00:51:41 +00:00
aaron fda61ad622 api.py: save_document tool — pandoc render to Nextcloud Drafts/ via WebDAV
Claude can now write docx or pdf files to Aaron's Nextcloud Drafts/ when he
asks for a document (bio, cover letter, statement, CV section) rather than
chat text. Pandoc handles markdown -> docx and markdown -> pdf with the
xelatex engine. Upload is a WebDAV PUT against the same Nextcloud instance
dream.py already uses; NEXTCLOUD_URL / NEXTCLOUD_USER / NEXTCLOUD_PASSWORD
in .env are reused. MKCOL ensures Drafts/ exists; PROPFIND-based collision
check appends _2, _3, ... until unique. Filename sanitization strips path
components and unsafe characters.
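
A sketch of the sanitize and collision steps. The allowed-character set
and the "document" fallback are assumptions; `exists` stands in for the
PROPFIND check against Drafts/.

```python
import re
from pathlib import PurePosixPath


def sanitize(name: str) -> str:
    # Drop any directory components, then replace unsafe characters.
    base = PurePosixPath(name.replace("\\", "/")).name
    cleaned = re.sub(r"[^A-Za-z0-9._ -]", "_", base).strip(" .")
    return cleaned or "document"


def unique_name(name: str, exists) -> str:
    """Append _2, _3, ... before the extension until `exists` is False."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension at all
        stem, ext = name, ""
    candidate, n = name, 2
    while exists(candidate):
        candidate = f"{stem}_{n}{dot}{ext}"
        n += 1
    return candidate
```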

System prompt instructs Claude to call save_document when the user wants a
file (not chat text) and not to duplicate the file contents in the chat
response — just write the file and tell Aaron where it landed.

ingest.py and watcher.py now skip files under Drafts/ at ingest time so
generated drafts don't pollute future retrieval. Drafts can still be opened,
edited, and shipped; they just don't become part of the searchable corpus
unless Aaron explicitly moves them out of Drafts/.
2026-05-20 00:41:26 +00:00
aaron 84994f9282 api.py: prompt-cache system prompt and memory across tool_use round-trip
Move persistent memory from the user message into system blocks with
cache_control: ephemeral on the last block. The static prefix (system prompt +
memory, ~3-5K tokens typically) is identical between the two LLM calls of a
tool_use round-trip and stable across turns within the 5-minute cache TTL.

Without this, the tool-call retrieval architecture roughly doubled input
token cost on retrieval-needed turns (full context billed twice). With cache
reads at ~10% of standard input, the duplication cost drops by ~90% — the
"twice as expensive" hit becomes "slightly more expensive plus tool overhead."

client_time stays in the user message (per-turn dynamic, should not be in the
cached prefix).
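
The request shape described above, sketched as plain dicts. The helper
names and the client_time formatting are assumptions; the cache_control
field follows the Anthropic Messages API.

```python
def build_system_blocks(system_prompt: str, memory: str) -> list:
    """Static prefix: prompt first, memory last, cache marker on the last."""
    return [
        {"type": "text", "text": system_prompt},
        {
            "type": "text",
            "text": memory,
            # Marks the end of the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ]


def build_user_message(client_time: str, text: str) -> dict:
    # Per-turn dynamic data stays out of the cached prefix.
    return {"role": "user", "content": f"[client_time: {client_time}]\n{text}"}
```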
2026-05-19 23:13:43 +00:00
aaron 9e86297e2a api.py: tool-call retrieval, drop the keyword intent classifier
Removes classify_retrieval_intent and the type/folder filter parameters on
retrieve_context. The keyword classifier was the same anti-pattern as the
formatting-driven docx chunker: a heuristic that locks the user into specific
phrasings and fails silently on anything novel. A scope enum (personal /
library / conversations / memory) would have been the same heuristic in a
fancier wrapper — the categories themselves are mine, not Aaron's.

New shape: a retrieve_documents tool exposed to Claude. Tool takes a single
query argument; the model decides when to call it, what to search for, and
how many times per turn (multi-query falls out naturally for compound asks).
Pre-LLM retrieval is gone — memory still rides as ground truth in the prompt,
but corpus content is fetched on demand by the model with concrete queries
it crafts itself, not the user's raw phrasing.
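
A sketch of what such a tool definition and its handler might look
like. The description strings and handle_tool_use are illustrative, not
the text in api.py; the dict layout follows the Anthropic tool-use
format.

```python
RETRIEVE_DOCUMENTS_TOOL = {
    "name": "retrieve_documents",
    "description": "Search the user's corpus. Prefer concrete tokens "
    "(named entities, project names, course codes).",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query."}
        },
        "required": ["query"],
    },
}


def handle_tool_use(block: dict, retrieve) -> dict:
    """Turn one tool_use block from the model into a tool_result block."""
    if block["name"] != RETRIEVE_DOCUMENTS_TOOL["name"]:
        raise ValueError(f"unknown tool: {block['name']}")
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": retrieve(block["input"]["query"]),
    }
```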

retrieve_context is now pure: hybrid retrieval + cross-encoder rerank + dedup,
no filters. The reranker ranks, the model judges relevance. When ranking
fails (e.g. abstract instructional queries pulling philosophy books), the
right fix is a better reranker, not another query-time taxonomy. That work
is acknowledged but deferred.

System prompt updated to teach the model about the tool and to prefer
concrete tokens (named entities, project names, course codes) over abstract
phrasing when constructing search queries.
2026-05-19 23:05:25 +00:00
aaron 9955c7e383 encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak
extract_blocks(filepath) is the new structured-extraction entry point, returning
list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk
back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading
prepended for retrieval context and stored in metadata).

- pptx: one block per slide. Slide title becomes block heading; speaker notes
  fold into the body. Image-only decks with title-only slides now produce
  heading-only chunks instead of being recorded as extraction failures.
- docx: deliberately single-block (back-compat). Heading-style section detection
  was implemented and rolled back: hand-formatted CVs are Normal-styled with
  bold-as-heading, and tying chunk boundaries to formatting choices would lock
  future users into preserving those choices forever. Lexical + cross-encoder
  retrieval already handles substring matching inside blind-chunked CVs.
- pdf/txt/md: unchanged (single block, blind chunking).

Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it
as secondary sort key in _rerank so memory/journal snapshots prefer the latest
copy among near-duplicate content.
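
The tiebreak amounts to a two-part sort key. In this sketch the row
layout is an assumption, and created_at only decides between rows whose
scores are exactly equal.

```python
from datetime import datetime, timezone

_EPOCH = datetime.min.replace(tzinfo=timezone.utc)


def rerank(rows):
    """rows: list of (text, score, created_at or None), tz-aware dates.
    Highest score first; among equal scores, newest first."""
    return sorted(rows, key=lambda r: (r[1], r[2] or _EPOCH), reverse=True)
```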

reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a
subset; previous hardcoded delete regex would have wiped both even with a
single-ext target.
2026-05-19 21:58:25 +00:00
aaron 50b97e2998 api.py: folder-aware retrieval, near-duplicate dedup, folder in citations
Three refinements to retrieve_context, all keyed off observed failures from
test_retrieval.py:

- Library/personal split. classify_retrieval_intent now returns
  (type_filter, folder_exclude_prefixes). Biographical document intent excludes
  Library/* so philosophy/cognition books stop crowding out CVs and dossiers
  for queries like "write me a bio".

- Near-duplicate collapse. Multi-folder copies of the same file (e.g., several
  Teaching Philosophy.pdf in different application folders) used to fill the
  top-N with the same content. Dedup by first-300-chars hash after rerank.

- Folder in source citations. Surface metadata.folder alongside basename so
  the LLM can disambiguate among 21 CV.docx variants and the user can see
  which copy a citation refers to.
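
The near-duplicate collapse can be sketched as below. SHA-256 is an
assumption; the commit only says "first-300-chars hash".

```python
import hashlib


def dedup_by_prefix(chunks, prefix_len=300):
    """Keep the first (highest-ranked) chunk for each distinct prefix."""
    seen, out = set(), []
    for text in chunks:
        key = hashlib.sha256(text[:prefix_len].encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        out.append(text)
    return out
```

Running it after the rerank means the surviving copy is the
best-ranked one.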

Also: bump hnsw.ef_search to 500 when a WHERE filter is present.
pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a
restrictive filter that excludes the nearest neighbors otherwise returns
empty.
2026-05-19 21:35:28 +00:00
aaron 8d560f9f5e api.py: hybrid retrieval with intent routing and cross-encoder rerank
Replaces pure-dense top-8 retrieval with a three-stage pipeline:
- BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel,
  fused with Reciprocal Rank Fusion
- Optional type filter driven by classify_retrieval_intent() so questions
  about prior conversations don't pull documents and vice versa
- Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before
  taking final top-N
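
Reciprocal Rank Fusion itself is small enough to sketch. k=60 is the
constant commonly used for RRF and is an assumption here; the commit
does not state the value.

```python
def rrf_fuse(rankings, k=60):
    """rankings: lists of doc ids, best first. Returns the fused order."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

A document ranked well by both BM25 and dense retrieval outranks one
that tops only a single list.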

Also adds scripts/reindex_docx_pptx.py — one-off re-ingest used to recover
table/header/text-box content in docx and pptx after the 93c0d89 extractor
upgrade — and scripts/test_retrieval.py to exercise the new pipeline against
representative queries.

Schema: requires GIN index on to_tsvector('english', document) (already
created out-of-band via psql since Apache AGE in shared_preload_libraries
blocks ALTER TABLE on this database).
2026-05-19 21:11:15 +00:00
aaron 732e450d21 Stop silent data loss in voice capture pipeline
Empty transcripts and transcription failures previously
deleted the temp audio and returned without writing any
record to disk — violating parity-at-encode (raw content
is episodic context, not noise).

- Preserve audio in Journal/Media/YYYY-MM/ on all paths
  (success, empty, failure) instead of unlinking.
- Write a markdown entry to Journal/Captures/ on failure
  paths with status, audio_path, and error fields.
- Add status: saved to successful captures so frontmatter
  is uniform across success and failure.
- Fire SSE capture_saved events on all terminal paths,
  with status included.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:41:51 +00:00
aaron 63c58b5bb3 Extend session lifetime to 365 days
Single-user personal app threat model is theft-of-device, not
stolen-cookie. 30-day idle re-prompts created friction without
proportional security benefit. Server TTL and client max-age
remain in sync via shared constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:29:38 +00:00
aaron 6c2af55e7e Server-side session TTL enforcement
- session_exists() now rejects rows older than 30 days,
  matching the client cookie max-age.
- Opportunistic cleanup of expired rows on session_exists()
  calls, preventing unbounded growth of sessions.db from
  orphaned tokens (PWA reinstalls, manual cookie clears).
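
A sketch of the check with opportunistic cleanup. The sessions schema
(token, created_at as a Unix timestamp) is an assumption.

```python
import sqlite3
import time

SESSION_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days, matching the cookie


def session_exists(conn, token, now=None):
    now = time.time() if now is None else now
    cutoff = now - SESSION_TTL_SECONDS
    # Opportunistic cleanup: expired rows are removed on every check.
    conn.execute("DELETE FROM sessions WHERE created_at < ?", (cutoff,))
    conn.commit()
    row = conn.execute(
        "SELECT 1 FROM sessions WHERE token = ?", (token,)
    ).fetchone()
    return row is not None
```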

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:28:39 +00:00
aaron 5b4a299414 encoding.py: write_embeddings_batch accepts commit parameter for transactional composition
Adds an optional commit=True parameter to write_embeddings_batch. When True
(default, matching prior behavior), the function commits the connection
after the per-row UPSERT loop. When False, the caller manages the
transaction.
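
The shape of the parameter, sketched with sqlite3 standing in for
Postgres and a two-column table assumed for the example:

```python
import sqlite3


def write_embeddings_batch(conn, rows, commit=True):
    """UPSERT each (id, document) row; commit only when asked to."""
    for chunk_id, document in rows:
        conn.execute(
            "INSERT INTO embeddings (id, document) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET document = excluded.document",
            (chunk_id, document),
        )
    if commit:
        conn.commit()
    # With commit=False the caller owns the transaction boundary and
    # can commit or roll back together with its other writes.
```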

This unblocks fix #1 (pgvector-bypass paths) and fix #2 (watcher
two-transaction pattern), both of which need to compose embeddings writes
with other database writes in the same transaction. Without this lever,
either fix would require duplicating the UPSERT logic outside this helper
or introducing a second commit boundary inside an otherwise atomic
operation.

No behavior change for existing callers — they all use the default
commit=True and continue working unchanged.
2026-05-05 02:52:33 +00:00
aaron b09e35892c encoding.py: strip frontmatter from .md at extraction time
The capture endpoint (api.py:702, 833) writes Journal/Captures/*.md
files with a markdown-bold-style header block (`**type:** voice`,
`**modality:** audio`, `**status:** unprocessed`, optional `**media:**`
and `**project:**`) followed by a `---` separator. extract_text for .md
was a bare filepath.read_text, so every capture-derived chunk in
pgvector embedded the frontmatter as raw text, polluting retrieval.

Fix adds _strip_md_frontmatter, called only for the .md branch:

- Capture-style: optional leading H1 (preserved), then consecutive
  `**key:** value` lines (and blanks), terminated by `---`. The H1 is
  retained; the key/value block + separator are removed.
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
  Only triggered when no heading precedes — guards against the common
  `# Title` + `---` (horizontal rule under heading) pattern seen in
  Journal/aaronai-architecture.md and four other Journal/*.md files.

Body `**bold:**` lines (e.g. `**Visual description:**` in image
captures) and body `---` horizontal rules are never touched: the scan
aborts as soon as a non-frontmatter line appears in the leading block.
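
The rules above can be sketched roughly as follows. This is a
simplified reading of the description, not the code in encoding.py; the
regex and function name are assumptions.

```python
import re

# A frontmatter line in capture files: **key:** value
_KV = re.compile(r"^\*\*[^*:]+:\*\*")


def strip_md_frontmatter(text: str) -> str:
    lines = text.splitlines(keepends=True)
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1  # skip leading blank lines
    if i == len(lines):
        return text

    # YAML style: the first non-empty line is '---' (no heading before it).
    if lines[i].strip() == "---":
        for j in range(i + 1, len(lines)):
            if lines[j].strip() == "---":
                return "".join(lines[j + 1:])
        return text  # unterminated block: leave untouched

    # Capture style: optional H1, then **key:** lines, closed by '---'.
    head = []
    if lines[i].startswith("# "):
        head = lines[: i + 1]
        i += 1
    saw_kv = False
    for j in range(i, len(lines)):
        s = lines[j].strip()
        if s == "---":
            return "".join(head + lines[j + 1:]) if saw_kv else text
        if _KV.match(s):
            saw_kv = True
        elif s:
            return text  # body content reached: abort the scan
    return text
```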

briefing_generator_v2.py's split("---", 1) heuristic was reviewed and
not reused — fragile on substring matches and on documents with
multiple `---` rules.

Verified against:
- 2026-04-26-22-44-voice.md: frontmatter stripped, body retained, H1
  retained.
- 2026-04-27-04-34-image.md: frontmatter stripped, `**Visual
  description:**` and `**Voice annotation:**` body bold-headers
  retained, trailing `---` not consumed.
- Journal/aaronai-architecture.md (5 body `---` rules): output
  byte-identical to read_text (96101 chars).
- Synthetic YAML doc: stripped correctly when no leading heading.
- Synthetic plain markdown with body `---` rules: untouched.
- Empty input + heading-only file: untouched.

Existing capture chunks in pgvector retain polluted text; the fix only
affects future extractions. Backfill decision deferred — the cleanest
path is `touch -h Journal/Captures/*.md` to bump mtime and let the
watcher re-ingest naturally on the next cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:20:55 +00:00
aaron e38d283e59 watcher.py: exclude 3 image-only pptx files from ingestion
Three files in the original ingest_failures cohort have been
characterized via direct OCR and confirmed to lack ingestible text:

- Presentations/Renders.pptx — 35 PICTURE-shape renders, 33/35 zero-char
  on OCR, 2 with noise (20 and 29 chars).
- Presentations/Ribbon Cutting Slideshow.pptx — 10-slide event photo
  deck, 9/10 zero-char, 1 with 17 chars of noise.
- Academic/DDF555 3D Computational/GH Slicer Notes [Autosaved].pptx —
  Office autosave duplicate of GH Slicer Notes.pptx; first 9 images
  byte-identical (sha256) to the canonical file. 2 net-new images
  contribute 36 noisy chars. Excluding to prevent double-embedding the
  same content under two source filenames.

Pattern matches f18fb64 (path.parts membership). Folder-level globs
were considered and rejected: /Presentations/ contains successfully
embedded text-bearing decks (aaronnelson_3D 4D.pptx,
aaronnelson_slideslam.pptx). Exact-name + parent-folder membership
applied in both watcher filter sites (get_changed_files and
IngestHandler._should_ignore).
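
The exact-name plus parent-folder check can be sketched as below. The
EXCLUDED table layout and should_ignore are illustrative; the file and
folder names are the ones listed in this commit.

```python
from pathlib import PurePosixPath

# (parent folder that must appear in the path, exact file name)
EXCLUDED = {
    ("Presentations", "Renders.pptx"),
    ("Presentations", "Ribbon Cutting Slideshow.pptx"),
    ("DDF555 3D Computational", "GH Slicer Notes [Autosaved].pptx"),
}


def should_ignore(filepath: str) -> bool:
    path = PurePosixPath(filepath)
    return any(
        path.name == name and folder in path.parts[:-1]
        for folder, name in EXCLUDED
    )
```

Other decks in the same folders are unaffected because the file name
must match exactly.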

The fourth file in the cohort, GH Slicer Notes.pptx (the canonical
non-autosave version), was confirmed to carry 379 chars of real text
(Grasshopper UI / code samples) across 6/9 images. It remains in
ingest_failures unresolved, awaiting the eventual ocrmypdf backlog
pass.

Cleanup: 3 ingest_failures rows resolved (the excluded files).
Unresolved count: 94 → 91.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 01:42:40 +00:00
aaron 8e61e4dedb docs: OCR install record for 2026-05-04
Tesseract OCR installed on the VPS (apt: tesseract-ocr, tesseract-ocr-eng).
Python wrappers added to venv (pip: pytesseract, ocrmypdf).

This commit is the install record only. No code change — async OCR
worker, capture path integration, and backlog processing are separate
followups.

Smoke test results captured in the file:
- pytesseract on a textual GH Slicer Notes.pptx slide image: 126 chars
  in 0.22s (Renders.pptx, also in the cohort of four image-only pptx
  files, was tried first but contains only rendered designs with no
  text; noted as a likely candidate for exclusion rather than OCR).
- ocrmypdf on a 4-page Lexmark CX510de scan from the Tenure/Dossier
  Scan 2022 set: 2270 non-whitespace chars in 3.72s (~0.93s/page).
  Real readable English; usable as the reference timing for the
  eventual async worker queue.

Deferred decision: project has no dependency manifest (no
requirements.txt, pyproject.toml, etc). Tracking that as its own
followup rather than bolting it onto this install. The capture-path
integration commit will be the natural point to address it if it
hasn't been resolved by then.
2026-05-04 16:58:30 +00:00
aaron 7b77794319 api.py: enable PRAGMA foreign_keys=ON in _connect helper; clean up 2 message orphans
The messages table declares FOREIGN KEY (conversation_id) REFERENCES
conversations(id), but PRAGMA foreign_keys was never enabled — SQLite
defaults it to OFF per connection, and _connect() did not set it. Two
orphan rows existed in messages (conversation_id='test123' pointing at
a never-existing conversation; both rows from one ~11-second test event
on 2026-04-26).

Audit before changing the PRAGMA:
- All FOREIGN KEY declarations across both DBs (conversations.db,
  sessions.db) accounted for via PRAGMA foreign_key_list on each
  table. Only one FK exists: messages.conversation_id ->
  conversations.id, ON DELETE NO ACTION.
- All tables enumerated via sqlite_master. Two tables in
  conversations.db (conversations, messages); one in sessions.db
  (sessions). No surprises.
- PRAGMA foreign_key_check confirmed exactly the 2 known orphans and
  zero violations elsewhere.

Both delete paths in api.py (delete_conversation at :471, and
clear_all_conversations at :986) already delete from messages BEFORE
conversations, so cascade behavior was correct in code. The orphan
state was caused by a direct INSERT against a non-existent
conversation_id at chat-test time, which an unenforced FK silently
accepted. Turning the PRAGMA on prevents this class of bug at insert
time, not delete time — no delete-path code changes were needed.

Order of operations followed the constraint that orphan cleanup must
precede PRAGMA-on (SQLite would not retroactively delete orphans, but
foreign_key_check would surface them confusingly on any future
operation that touched the messages table):
1. DELETE FROM messages WHERE conversation_id NOT IN (SELECT id FROM
   conversations) — removed the 2 known orphans.
2. Added PRAGMA foreign_keys=ON to _connect() so every connection
   from _connect_conversations() and _connect_sessions() gets FK
   enforcement (SQLite requires per-connection setting).
3. Restarted aaronai.service.

Verification:
- Smoke: GET /api/conversations and /api/conversations/{id}/messages
  both return 200 with expected payloads against the live api.
- E2E single-delete: synthetic conversation + 2 messages inserted via
  the api's _connect helper (FK on); DELETE /api/conversations/{id}
  via the live endpoint removed both rows from both tables.
- Clear-all e2e: skipped on live DB (destructive) — code shape is
  structurally identical to single-delete, no FK-relevant logic
  difference.
- Load-bearing negative test: INSERT into messages with a
  non-existent conversation_id via _connect_conversations() raised
  sqlite3.IntegrityError("FOREIGN KEY constraint failed"). This is
  what proves the PRAGMA actually took effect, not just that we set
  it.

Final counts: 7 conversations, 290 messages (down from 292 by the 2
orphans cleaned up).

Note: an explicit BEGIN/COMMIT around the two-execute delete paths
was considered and skipped. SQLite's implicit-transactional default
already gives the atomicity needed; explicit transactions would be
clarity-only and belong in a separate commit.
2026-05-04 16:41:55 +00:00
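A minimal sketch of the per-connection enforcement and the negative test described in the commit above. It runs against an in-memory database; the two-table schema is a simplified stand-in for conversations.db, and this `_connect` is an illustration, not the helper in api.py.

```python
import sqlite3


def _connect(path):
    # foreign_keys defaults to OFF and must be set on every new connection.
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


def insert_orphan(conn):
    """Try to insert a message whose conversation does not exist."""
    try:
        conn.execute("INSERT INTO messages (conversation_id) VALUES ('test123')")
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None


conn = _connect(":memory:")
conn.execute("CREATE TABLE conversations (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE messages ("
    " id INTEGER PRIMARY KEY,"
    " conversation_id TEXT,"
    " FOREIGN KEY (conversation_id) REFERENCES conversations(id))"
)

fk_enabled = conn.execute("PRAGMA foreign_keys").fetchone()[0]
error = insert_orphan(conn)
orphans = conn.execute("PRAGMA foreign_key_check").fetchall()
print(fk_enabled, error, orphans)
```

The rejected insert is the useful signal here: reading the PRAGMA back only shows that it was set, while the IntegrityError shows that it is enforced.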
aaron d985f9e91e dream.py: raise_for_status on manifest writes; total_chunks as actual corpus count
Two correctness bugs in dream_pipeline manifest assembly.

write_manifest at lines 487-491 swallowed HTTP 4xx/5xx responses
silently. requests.put() only raises on transport-level errors (DNS,
connection refused, timeout); 401/403/500/507 come back as Response
objects and never trigger the except. The code printed "Manifest
written" while the manifest never persisted. The same file's deliver()
function at line 434 already used response.raise_for_status() — the
pattern was already established, write_manifest just skipped it.

Fix: bind the response and call raise_for_status() before the success
print. The except message changes from "(non-critical)" to "manifest
not persisted" because HTTP failure now means manifest data was lost,
which is critical, not quiet.

corpus_data["total_chunks"] at lines 621-622 stored
delta["new_chunks"], duplicating the sibling field
new_chunks_since_last_dream. The field name claimed absolute corpus
size; the value was a delta of recently-touched files. Verified in
live manifests: total_chunks: 0 while pgvector held 11,379+ document
embeddings.

Fix: query SELECT COUNT(*) FROM embeddings inside dream_pipeline,
store as total_chunks. Tightly-scoped one-shot connect via the
existing get_pg() helper. Telemetry query failure is treated as
non-critical and falls back to 0 — pgvector hiccup should not crash
an otherwise successful dream pipeline.

Bonus finding (not fixed in this commit): new_chunks_since_last_dream
is itself misnamed. observe_corpus() reads the watcher's mtime cache
and counts files (not chunks) whose mtime is newer than last_dream.
Both fields were "files touched since last dream" duplicated under
two different names; this commit fixes only the total_chunks
semantics. Renaming new_chunks_since_last_dream is out of scope —
manifests are write-only telemetry today, no consumer reads either
field, and the rename is a separate decision.

Verification: real pipeline run produced manifest with total_chunks
matching SELECT COUNT(*) directly; doubled as a smoke test for the
embedder cache (single Loading weights line), type_distribution
propagation, and the manifest write success path.
2026-05-04 16:29:04 +00:00
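The failure mode in the commit above can be sketched without a network. In this sketch `put` is assumed to behave like `requests.put` (raises on transport errors, otherwise returns a response exposing `raise_for_status()`); `FakeResponse`, the URL, and the function signature are hypothetical stand-ins, not the code in dream.py.

```python
def write_manifest(put, url, payload):
    """Return True only if the server actually accepted the manifest."""
    try:
        response = put(url, data=payload, timeout=30)
        # Without this call a 4xx/5xx response falls through to the
        # success message, because no exception was ever raised.
        response.raise_for_status()
    except Exception as exc:
        print(f"Manifest not persisted: {exc}")
        return False
    print("Manifest written")
    return True


class FakeResponse:
    """Hypothetical stand-in for a requests.Response, for offline use."""

    def __init__(self, status_code):
        self.status_code = status_code

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")


URL = "https://example.invalid/manifest.json"
ok = write_manifest(lambda url, **kw: FakeResponse(201), URL, b"{}")
failed = write_manifest(lambda url, **kw: FakeResponse(507), URL, b"{}")
```

In real use the first argument would be `requests.put` itself; injecting it keeps the sketch runnable without the library installed.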
aaron b9eea6cb62 watcher.py: extend lockfile filter to catch UTF-8-mangled ~$ prefixes
Three rows in ingest_failures were Office lockfile leftovers whose
filename starts with ~� (~ followed by the UTF-8 replacement
character) instead of ~$. Somewhere in the Nextcloud sync chain the $
byte was lost or replaced; the file now lives on disk as a real file
with this corrupted name. The watcher's ("~$", ".") prefix filter
didn't match, so each cycle tried to ingest these as pptx, hit
BadZipFile inside python-pptx (lockfiles aren't real Office documents),
and they ended up permanently in ingest_failures.

Three filter sites in watcher.py applied the lockfile prefix check:
  - ingest_file() at :127
  - get_changed_files() at :200
  - IngestHandler._should_ignore() at :290

All three now match ("~$", "~", ".") — broadened to catch any tilde
prefix, not just ~$. The cross-check against pgvector embeddings and
disk found zero legitimate tilde-prefixed files in the corpus, so the
broader filter has no false-positive risk in this corpus.

Cleanup: 3 ingest_failures rows resolved (filepath LIKE '%/~%').
Unresolved count drops 97 → 94.

If a fourth filter site is ever added, the right shape is consolidating
the lockfile prefix check to a shared function or constant. Three
parallel sites with three different tuple orderings is acceptable for
now but worth normalizing if the surface grows.
2026-05-04 16:19:56 +00:00
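One possible shape for the shared check suggested in the commit above, relying on `str.startswith` accepting a tuple of prefixes. The constant and function names are hypothetical; the commit itself keeps three separate tuples.

```python
from pathlib import PurePosixPath

# Hypothetical shared constant. "~" already covers both "~$" and the
# mangled "~\ufffd" form, so the tuple needs only two entries.
IGNORED_PREFIXES = ("~", ".")


def is_lockfile_or_hidden(path):
    """True if the file name (not the full path) starts with an ignored prefix."""
    return PurePosixPath(path).name.startswith(IGNORED_PREFIXES)


samples = [
    "/corpus/Presentations/~$deck.pptx",       # normal Office lockfile
    "/corpus/Presentations/~\ufffddeck.pptx",  # mangled lockfile name
    "/corpus/.hidden",
    "/corpus/Presentations/deck~1.pptx",       # tilde not at the start
]
flags = [is_lockfile_or_hidden(p) for p in samples]
```

Checking `.name` rather than the full path matters: a tilde or dot elsewhere in the path should not cause a match.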
aaron 93c0d89308 encoding.py: extend docx and pptx extractors to walk tables, headers/footers, text-boxes, group shapes, and notes
The previous extractors walked only top-level body paragraphs (docx) and
top-level shape.text (pptx). Diagnostic on the 17 non-PDF "no_text"
ingest failures revealed that 13 docx files in the failure cohort have
100% of their content in tables (paras_with_text=0, table_cells=6-108).
These are syllabi, rosters, rubrics, and homework worksheets structured
as a single document-wide table — high-value academic content the corpus
was silently missing.

docx walker now covers:
- body paragraphs (existing)
- tables, including nested tables in cells (recursive helper)
- header and footer paragraphs per section
- text-box content via XPath against w:txbxContent (no first-class API
  in python-docx; future-proofing — none of the current failure cohort
  has text-boxes)

pptx walker now covers:
- top-level shape text (existing)
- recursive descent into group shapes
- table cell text via shape.has_table / shape.table.iter_cells()
- speaker notes via slide.notes_slide.notes_text_frame.text

Out of scope: SmartArt diagrams, chart titles/labels, OLE objects,
content controls. None of the current failure cohort has these.

Recovery: 13 of 17 failures now ingest successfully. The 4 remaining are
image-only pptx files (Renders.pptx, Ribbon Cutting Slideshow.pptx, two
GH Slicer Notes variants — all PICTURE-shape decks with no text in any
walkable structure). They stay in ingest_failures unresolved, awaiting
OCR or path exclusion.

Side effect worth noting: the regression check on 4 known-good files
that were already producing embeddings showed all four gained content
under the new walker — a Mod03 pptx grew from 23,993 to 57,462 chars
(+33,469), Braskem Report docx grew 33,050 to 38,977 (+5,927), DDF MA
program docx grew 37,210 to 47,603 (+10,393), SUNY PIF GRANT pptx grew
22,259 to 23,546 (+1,287). These files have been in the corpus all
along with table or notes content silently dropped. They will surface
the additional content on next re-ingest, improving retrieval quality
for any future query that touches them.

Cleanup: ingest_file already calls resolve_ingest_failure on successful
ingest, so the 13 recovered files were marked resolved=TRUE during the
retry pass. No separate cleanup SQL was needed.
2026-05-04 16:12:56 +00:00
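The recursive table descent for docx can be sketched as a duck-typed walker. It assumes each container exposes `.paragraphs` and `.tables`, as python-docx documents and table cells do; the function name and the `SimpleNamespace` fixtures are illustrative, not the helper in encoding.py.

```python
from types import SimpleNamespace as NS


def iter_block_text(container):
    """Yield non-empty paragraph text, descending into nested tables.

    Paragraphs are emitted before tables, so output is not in strict
    document order.
    """
    for paragraph in container.paragraphs:
        if paragraph.text.strip():
            yield paragraph.text
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell is itself a container and may hold more tables.
                yield from iter_block_text(cell)


# Tiny fake document: one body paragraph, one table with a nested table.
def cell(texts, tables=()):
    return NS(paragraphs=[NS(text=t) for t in texts], tables=list(tables))


def table(*rows):
    return NS(rows=[NS(cells=list(r)) for r in rows])


inner = table([cell(["nested"])])
doc = NS(
    paragraphs=[NS(text="body"), NS(text="  ")],
    tables=[table([cell(["a"]), cell(["b"], [inner])])],
)
extracted = list(iter_block_text(doc))
```

With real python-docx objects, merged cells can be returned more than once per row, so a production walker would likely need to de-duplicate; that is left out here.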
aaron f18fb64fe5 watcher.py: exclude generative-graphic folders and zero-byte files
Two-sample diagnostic of the 128 ingest_failures rows surfaced two
folders whose contents are exclusively non-text PDFs (iText-produced
generative graphics from Processing sketches and computational design
sketches) and three zero-byte test artifacts. None of these have ever
produced an embedding chunk, and they have nothing extractable to
contribute. Excluding them removes 19 / 128 (15%) of the locked-out
failures from the cohort and prevents future versions of the same
patterns from re-failing.

Folder exclusions use path.parts membership rather than substring
matching — eliminates false-match risk if similarly-named folders
appear elsewhere in the corpus (e.g. an unrelated "Generative Design"
or "Computational Design 2017" directory created later). The existing
"Admin/Backups" / "Journal/Media" substring checks are looser, but
new exclusions take the tighter pattern.

Zero-byte filter goes in get_changed_files() only — the actual
ingestion gate. Adding stat() to _should_ignore() (the FS-event noise
filter) would introduce a race where the file is gone between event
fire and stat call. Empty files briefly trigger pending=True but
produce no work after debounce; cosmetic only.

Cleanup applied separately via UPDATE: 19 ingest_failures rows for
these paths marked resolved=TRUE. Unresolved-failure count: 129 -> 110.

Verified: get_changed_files() with empty state returns 1418 changed
files; all 5 excluded probes (2 folder-matched + 3 zero-byte) absent
from the result, control file present. Watcher service restarted
clean; startup scan reports no missed files.
2026-05-04 06:24:08 +00:00
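A small sketch of why `path.parts` membership is tighter than substring matching. The folder name and file paths are hypothetical examples, not the actual excluded folders.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion set; the real folder names live in watcher.py.
EXCLUDED_FOLDERS = {"Processing Sketches"}


def excluded_by_parts(path):
    """Match only when a whole path component equals an excluded name."""
    return any(part in EXCLUDED_FOLDERS for part in PurePosixPath(path).parts)


def excluded_by_substring(path):
    """Looser check, shown for contrast: matches anywhere in the string."""
    return any(name in str(path) for name in EXCLUDED_FOLDERS)


target = "/corpus/Processing Sketches/flowfield.pdf"
lookalike = "/corpus/Old Processing Sketches Notes/readme.md"

results = {
    "target_parts": excluded_by_parts(target),
    "target_substring": excluded_by_substring(target),
    "lookalike_parts": excluded_by_parts(lookalike),
    "lookalike_substring": excluded_by_substring(lookalike),
}
```

Both checks exclude the intended folder, but only the substring version also swallows the lookalike folder.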
aaron 72e07afc03 watcher.py: do not mark failed ingests as successfully ingested
ingest_files() updated state[path] = mtime unconditionally after every
ingest_file() call. ingest_file() returns 0 when text extraction fails,
embedding fails, no chunks are produced, or the pgvector write fails —
in every one of those cases, the path was still recorded as ingested
at the current mtime. On the next pass, get_changed_files() saw the
mtime match and skipped the file, locking it out of the corpus until
something modified it on disk.

record_ingest_failure() writes to a UI-visible failures table, but
nothing reads that table to retry. So failures accumulated silently:
the file was simultaneously logged as failed AND tracked in
watcher_state as up-to-date, and the second condition won.

Fix: only update watcher_state when ingest_file returns count > 0.
Failed ingests will be retried on the next watcher cycle until they
succeed or are explicitly excluded.

Diagnostic at fix time: 129 rows in ingest_failures, 128 currently
locked out of the corpus (filepath in watcher_state with mtime matching
current disk). 128/129 are text_extraction failures, mostly scanned
PDFs (106 .pdf, 13 .docx, 7 .pptx, 2 .md, 1 .txt). 1 source no longer
exists on disk. 0 have had their disk mtime change since failing — i.e.
without this fix, none of them would ever retry. Cross-check shows
watcher_state has 1466 paths vs. 1061 distinct sources in pgvector
embeddings, leaving a residual silent-gap of ~276 files after
accounting for failures.

Historical cleanup of files already locked out by this bug is tracked
separately. New failures from this commit forward will retry naturally.
2026-05-04 03:52:01 +00:00
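The fix in the commit above reduces to one conditional. This sketch injects the ingest and mtime functions so it can run standalone; the signature and the fake file names are assumptions, not the real watcher.py interface.

```python
def ingest_files(paths, state, ingest_file, get_mtime):
    """Record a path as ingested only when ingest_file produced chunks."""
    for path in paths:
        count = ingest_file(path)
        if count > 0:
            state[path] = get_mtime(path)
        # On failure the state entry is left untouched, so the next cycle's
        # mtime comparison still sees the file as changed and retries it.
    return state


# Fakes for illustration: the scanned PDF yields no chunks.
chunk_counts = {"notes.md": 4, "scan.pdf": 0}
mtimes = {"notes.md": 100.0, "scan.pdf": 200.0}

state = ingest_files(
    ["notes.md", "scan.pdf"],
    state={},
    ingest_file=chunk_counts.get,
    get_mtime=mtimes.get,
)
```

Because `scan.pdf` never enters `state`, a later changed-files scan would pick it up again instead of locking it out.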
aaron c3011c80a5 api.py: route all sqlite3.connect() through helpers; enable synchronous=NORMAL per-conn
Followup to 4204806 (WAL + index + backup.sh). The previous commit
deferred synchronous=NORMAL because it's a per-connection PRAGMA and
api.py has 16 sqlite3.connect() call sites — setting it once at init
would have applied to nothing afterwards.

Adds three helpers near the *_DB constants:
- _connect(path): inner; sets PRAGMA synchronous=NORMAL and uses
  timeout=5.0 (5000ms busy_timeout) on every new connection.
- _connect_conversations(), _connect_sessions(): named wrappers so call
  sites read explicitly.

Mechanical replacement at all 16 call sites: 4 sessions, 12 conversations.
No semantic change beyond the PRAGMA + busy_timeout — every site still
opens-then-closes, no held-open connections.

busy_timeout=5000ms is cheap insurance: under WAL with api.py as sole
writer, contention should be near-zero, but the backup.sh online-backup
path briefly holds a read lock on the source, and any future second
writer would otherwise hit SQLITE_BUSY immediately on contention.

Combined effect with WAL: per-write fsync count drops from ~2 to ~1 under
WAL alone, and synchronous=NORMAL reduces it further by deferring fsyncs to
checkpoint boundaries. Acceptable durability trade-off for the use case
(single host, app crash tolerated; an OS crash or power loss can roll back
the most recently committed transactions but does not corrupt the database).

Not included: foreign_keys=ON. Audit found 2 orphan rows in messages
(conversation_id pointing to deleted conversations) and untested write
paths that could begin raising IntegrityError. Tracked as separate
followup: inspect orphans, identify the delete path that didn't
cascade, clean up, then enable enforcement and test chat delete flow
end-to-end.
2026-05-04 03:39:13 +00:00
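A sketch of the connect-helper pattern from the commit above, using a throwaway database file. The helper body is an assumption based on the commit message, not a copy of api.py.

```python
import os
import sqlite3
import tempfile


def _connect(path):
    # timeout=5.0 makes sqlite3 wait up to 5 seconds on a locked database
    # instead of failing immediately.
    conn = sqlite3.connect(path, timeout=5.0)
    # synchronous is a per-connection setting, so it is applied here rather
    # than once at database initialisation.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = _connect(os.path.join(tmp, "example.db"))
    # PRAGMA synchronous reports an integer: 0=OFF, 1=NORMAL, 2=FULL.
    synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
    conn.close()
```

Routing every call site through one helper is what makes a per-connection PRAGMA reliable: a raw `sqlite3.connect()` elsewhere would silently get the default setting.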
aaron 4204806c80 conversations.db, sessions.db: enable WAL, add message index; update backup.sh
Both databases ran with journal_mode=delete — every write rewrote the
rollback journal per transaction. WAL eliminates the journal-rewrite and
lets readers run without blocking writers.

Index on messages(conversation_id, timestamp DESC) is preventive — only
280 rows today, but the access pattern (load conversation history in
order) is exactly what a composite index serves, and we don't want to
revisit this when the table grows.

backup.sh updated in the same commit because WAL changes the on-disk
layout: a bare `cp` of just the .db file can miss recently-committed
transactions that still live in the -wal sidecar, and can race with
concurrent writes to produce a torn file. Switched to the SQLite Online
Backup API via python3 -c "...src.backup(dst)..." — same mechanism as
the sqlite3 CLI's `.backup` (which isn't installed on this host),
handles WAL correctly without forcing a checkpoint, and is non-locking
from the writer's perspective. Verified backup integrity_check returns
ok and row counts match.

Note: synchronous=NORMAL was considered but deferred — it's a
per-connection PRAGMA, and applying it correctly requires a connect
helper that wraps every sqlite3.connect() call site in api.py (~14
sites). Out of scope for this commit; tracked as a follow-up. WAL alone
delivers the journal-rewrite elimination and reader/writer concurrency
improvements; the additional fsync reduction from synchronous=NORMAL is
a smaller marginal win on top.

Confirmed via concurrency audit that api.py is the sole writer to both
databases. ingest_conversations.py and dream.py are read-only consumers
of conversations.db; nothing else touches sessions.db.
2026-05-04 03:24:51 +00:00
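The backup approach described above maps to `sqlite3.Connection.backup` in the Python standard library. This sketch runs it against a temporary WAL-mode database while the writer connection is still open; the file names and table are placeholders.

```python
import os
import sqlite3
import tempfile


def backup_database(src_path, dst_path):
    """Copy a live SQLite database using the Online Backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Reads through SQLite itself, so committed pages that still live
        # only in the -wal sidecar are included in the copy.
        src.backup(dst)
    finally:
        dst.close()
        src.close()


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "live.db")
    dst_path = os.path.join(tmp, "backup.db")

    live = sqlite3.connect(src_path)
    journal_mode = live.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    live.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    live.execute("INSERT INTO messages (body) VALUES ('hello')")
    live.commit()

    backup_database(src_path, dst_path)  # writer connection is still open

    check = sqlite3.connect(dst_path)
    integrity = check.execute("PRAGMA integrity_check").fetchone()[0]
    row_count = check.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    check.close()
    live.close()
```

The integrity check and row count mirror the verification the commit reports for the real backup.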
aaron c5fc517fef ingest_conversations.py: lazy-load embedder to match ingest.py pattern
Embedder was instantiated at module import (~30-60s, ~200MB) regardless
of whether new conversations existed. On nights with no new content
(most nights per the logs), the script paid the load cost and exited
immediately. ingest.py:134 already uses lazy loading; this brings the
two ingest scripts into a consistent shape.
2026-05-04 03:13:45 +00:00
aaron b35d44ef58 dream.py: cache the SentenceTransformer embedder across retrieve() calls
Pipeline mode calls retrieve() three times (NREM, Early REM, Late REM).
Previously each call re-imported and re-instantiated SentenceTransformer
("all-MiniLM-L6-v2"), allocating ~200MB and spending 30-60s on disk->CPU
init three times sequentially. lru_cache(maxsize=1) makes the load happen
once per process.

Expected: pipeline runtime drops ~60-120s (two of the three 30-60s loads are
eliminated), removes 2x redundant 200MB allocations, and reduces transient
memory pressure during the same window when other nightly jobs may run.
2026-05-04 03:11:22 +00:00
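The caching pattern from the commit above, sketched with a cheap stand-in object instead of the real model so it runs anywhere. `get_embedder` and the simplified `retrieve` are hypothetical names.

```python
from functools import lru_cache

load_count = 0


@lru_cache(maxsize=1)
def get_embedder():
    """Build the embedder on first use; later calls reuse the same object."""
    global load_count
    load_count += 1
    # Stand-in for SentenceTransformer("all-MiniLM-L6-v2"), the expensive
    # constructor that the commit caches.
    return object()


def retrieve(stage):
    embedder = get_embedder()
    return stage, id(embedder)


results = [retrieve(stage) for stage in ("nrem", "early_rem", "late_rem")]
```

Since the function takes no arguments, `maxsize=1` is enough: there is only ever one cache entry, and it lives for the life of the process.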
aaron a27f22ceaf api.py: switch whisper to distil-large-v3, beam_size=1, cpu_threads=4
Three changes to reduce voice-note transcription latency on the VPS:
- Model: large-v3 -> distil-large-v3 (~6x faster, near-identical English
  accuracy; language is already hardcoded "en").
- beam_size: 5 (default) -> 1 (~3-4x faster on clean audio).
- cpu_threads: 8 -> 4 (the box has 8 cores running api, dreamer, watcher,
  nextcloud concurrently; ctranslate2's inter-op pool plus context switching
  makes 4 effectively faster than 8 here).

Combined effect expected ~10-15x over prior config. No accuracy regression
expected for the voice-note use case (English, clean audio, domain terms
already supplied via initial_prompt).
2026-05-04 01:00:32 +00:00
aaron 7c7b649775 embeddings: enforce type/created_at on writers; manifests carry type_distribution (Improvement #2 part B+C)
Writers now enforce type and created_at:
  - encoding.py: ValueError raised at write_embeddings_batch if row dict lacks
    'type'. created_at remains SQL-supplied (NOW() server-side). ON CONFLICT
    DO UPDATE now also rewrites type=EXCLUDED.type and preserves the original
    created_at via COALESCE(embeddings.created_at, EXCLUDED.created_at) — a
    re-ingest re-classifies type but does not overwrite a backfilled mtime.
  - ingest_conversations.py: same assertion. ON CONFLICT intentionally keeps
    EXCLUDED.created_at semantics (Aaron-AI conversation created_at tracks
    convo.updated_at; re-runs should refresh).
  - Column-level NOT NULL is not added; application-layer raise gives a
    faster, more debuggable failure than a Postgres constraint error.

Retrieval propagates type into chunks:
  - retrieve() SELECT now includes type; chunk dicts carry "type": etype.
  - WHERE clause built dynamically from excluded_sources and the new
    --type-filter CLI arg (experimental, default None, pgvector retrieval
    only — Graphiti chunks have no embeddings.type to filter on).
  - retrieve_graphiti unchanged; its chunks lack the type field.

Manifests carry type_distribution per stage:
  - dream_pipeline writes stage_data[<stage>]["type_distribution"] for nrem,
    early_rem, late_rem — a Counter over chunk types, filtering None so
    Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute the
    distribution. Pgvector chunks always carry type post-backfill; if None
    appears, the backfill or writer enforcement has regressed.

Verification:
  B1 force re-ingest of "Finite and infinite games -- James Carse.pdf":
       all 84 chunks preserved created_at=2026-04-27T06:11:55Z
  B2 missing-type assertion raises ValueError, no row leaked to embeddings
  B3 ast.parse(*) clean; EXPLAIN renders for {no excl/no filter,
       type_filter only, excl 2 elems, excl 1 elem edge case, both};
       all five plans use HNSW index scan with correct Filter clauses
  C1 retrieve("nrem") returns 8 chunks each carrying "type" key
  C2 type_distribution = {'document': 5, 'chatgpt_conversation': 3} —
       2 distinct types, 62.5/37.5 split (looser bar: >=2 types,
       no single type >=90%)

The type and created_at fields are now load-bearing: every dream manifest
emits type_distribution per stage. Reverting the backfill makes the
distribution show NULLs at every dream run.
2026-05-04 00:15:43 +00:00
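A compact sketch of the two application-layer pieces described above: rejecting rows without a type before writing, and building a per-stage distribution that skips untyped chunks. Function names and row shapes are assumptions; this version also rejects an empty-string type, which the commit does not specify.

```python
from collections import Counter


def require_type(rows):
    """Raise before any write if a row is missing a usable 'type'."""
    for row in rows:
        if not row.get("type"):
            raise ValueError(f"embedding row missing 'type': {row.get('source')!r}")
    return rows


def type_distribution(chunks):
    """Count chunk types, ignoring chunks that carry no type at all."""
    return dict(Counter(c["type"] for c in chunks if c.get("type") is not None))


chunks = (
    [{"source": f"doc{i}", "type": "document"} for i in range(5)]
    + [{"source": f"chat{i}", "type": "chatgpt_conversation"} for i in range(3)]
    + [{"source": "graph-node"}]  # e.g. a chunk with no embeddings.type
)
distribution = type_distribution(chunks)

try:
    require_type([{"source": "untyped.pdf"}])
    rejected = False
except ValueError:
    rejected = True
```

Validating the whole batch before writing is what keeps a bad row from leaking partway into the table.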
aaron 3c7c228db0 embeddings: backfill type and created_at (Improvement #2 part A)
Backfills 9,815 type-NULL rows to 'document' (extension classifier, 100% hit)
and 12,109 created_at-NULL rows via five batches:

  C1 filepath_stat:        9,649  filesystem mtime via metadata.filepath
  C2 watcher_state_unique:   676  unique source-name lookup in watcher_state
  C3 watcher_state_collision_pick_latest_of_N:
                             234  collision; most-recent watcher mtime
  C4 chatgpt_export:       1,548  convo create_time from export JSONs
                                  (168/168 distinct convo_ids resolved)
  C5 sentinel:                 2  2026-04-26T00:00:00Z (pgvector migration date)

Provenance written to metadata.type_source and metadata.created_at_source
on every row changed by this run. type_source is empty on rows where the
type field was already populated pre-run; in those cases the snapshot
table is the source of truth for what changed.

Snapshot: embeddings_backup_2026_05_03 (CREATE TABLE AS SELECT id, type,
created_at, metadata FROM embeddings; 14,069 rows; revertable via id-join).

Verification:
  V1 live counts:      type_null=0  ca_null=0
  V2 spot-check 11 rows across cohorts: provenance correct
  V3 snapshot intact: 14,069 rows, pre-backfill NULL counts preserved
  V4 cross-check vs snapshot: reconciles per-provenance to dry-run

Read-side use (B + C: writer enforcement + minimal retrieval read) deferred
to a separate session. The backfill is complete and verified, but the type
and created_at fields are not yet load-bearing — every current reader still
ignores them. Without B+C this lands as data prep, not behavior change.
2026-05-03 23:58:53 +00:00
aaron 2df1a2fe01 docs/inventory: layer 2026-05-03 updates (resolutions, corrections, new findings)
Inventory dated 2026-05-02 is preserved as a point-in-time snapshot. Today's
updates are layered on top in a dated addendum section after "Findings
summary" and before "Phase 1 — Scripts" so the original snapshot reads as
written and readers can see what changed and when.

Resolved:
- NREM-shape divergence #1 (`dream.py` cumulative cross-night exclusion
  500-cap) — replaced with session-scoped novelty.

Corrections to existing findings:
- `stage2_metadata` lives on `stage_3_queue`, not `stage_2_queue` (the
  2026-05-02 entry implied otherwise). Verified by direct schema read.
- Stage 2 char_length gate runs *before* the Mistral call. For sub-2000-char
  docs, Mistral is never invoked — frames are not extracted then discarded,
  they are simply not extracted. Reframes the architecture's "Stage 2
  produces orientation for everything" commitment.

New findings (from the 2026-05-03 frame analysis):
- `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation
  sources have zero frame coverage by design. Combined with the char-gate
  exclusion and Stage 2 failures, only 56% of corpus has any frame data.
- All 14 voice notes and all 39 dream outputs are in the 339-doc gap.
  Primary capture and self-reflection channels are silent to the frame
  system; dreamer cannot frame-condition on its own output.
- File-type × frame stratification provides discriminating signal that
  cross-links Improvement #3 to the existing `embeddings.type` NULL-rate
  finding.

Same NREM shape as the original cumulative-exclusion bug — the architecture's
stated commitment and what the code actually does diverge silently. This is
exactly what the inventory exists to surface.
2026-05-03 20:32:55 +00:00
aaron ed2d090afc experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3)
Read-only inspection of the frame data Mistral produces in Stage 2, in
service of Track 2 substrate design (Step 2.4 operation set spec).

Artifacts:
- New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata`
  (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured
  fields so worker-version drift is inspectable).
- Analysis script: frequency, label-hygiene collisions, per-doc count,
  co-occurrence (top-K), file-type × frame cross-tab, worker-version split,
  data-gap accounting, corpus-wide coverage.
- JSON sidecar for diff-across-runs reproducibility.
- Markdown report with explicit Track 2 viability section.

Headline findings:
- Frames cluster meaningfully on the framed-doc subset (subject to
  validation on larger samples for the file-type cross-tab).
- Only 56% of corpus has frame coverage. 198 conversation sources bypass
  Stage 2 by design (`ingest_conversations.py` writes directly to
  embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate;
  12 Stage 2 failures.
- All 14 voice notes and all 39 dream outputs are in the data gap.
  Primary capture and self-reflection channels are silent to the frame
  system. Dreamer cannot frame-condition on its own output.
- 54 normalized label collisions (`Professional Experience` vs
  `Professional_Experience`, etc.) — any router must normalize first.
- "Education" is a near-universal frame (36% of frame-extracted docs);
  cheap 20-doc hand-inspection diagnostic in report §8 to distinguish
  prompt artifact from corpus shape.
- File-type × frame stratification is concrete signal that ties to
  Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of
  rows.

No production code touched. View is droppable; script is read-only.
2026-05-03 20:32:37 +00:00
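The label-collision finding above suggests a normalization step before any routing. The rule below (collapse whitespace, underscores and hyphens, then case-fold) is an assumed rule for illustration; the report's own normalization is not shown in the commit.

```python
import re
from collections import defaultdict


def normalize_label(label):
    """Collapse separators and case so variant spellings compare equal."""
    return re.sub(r"[\s_\-]+", " ", label).strip().casefold()


def find_collisions(labels):
    """Group raw labels by normalized form; keep groups with >1 spelling."""
    groups = defaultdict(set)
    for label in labels:
        groups[normalize_label(label)].add(label)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


labels = ["Professional Experience", "Professional_Experience", "Education"]
collisions = find_collisions(labels)
```

Keeping the raw spellings alongside the normalized key makes it easy to audit which variants were merged.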
aaron e5898f3019 dream.py: replace cumulative cross-night exclusion with session-scoped novelty (Track 1 Finding 1)
The cumulative `retrieved_sources` list (capped at 500, trimmed to 400 on
overflow) was hiding ~40% of the corpus from Early REM and Late REM after the
cap filled. The architecture and reframe both specify session-scoped novelty,
not corpus-lifetime exclusion. Same NREM-shape divergence as the 2026-05-02
NREM exclusion fix.

Changes:
- Drop `previously_retrieved` load; pop the legacy `retrieved_sources` key
  from `dreamer_state.json` at pipeline start.
- Early REM excludes only the current session's NREM high-scorers.
- Late REM excludes only the current session's NREM ∪ Early REM.
- Remove the across-night accumulation block at the end of the pipeline; reuse
  the in-scope state object for the post-pipeline metadata write (eliminates a
  redundant disk re-read that was reintroducing the legacy key).

NREM exclusion fix from 2026-05-02 preserved (`nrem_chunks = retrieve("nrem",
excluded_sources=None)`).

Verification: post-fix dream-manifest source count rose to 24 (NREM 8 + Early
REM 8 + Late REM 8) vs. 13 / 16 on the two prior comparable runs. Legacy key
absent from `dreamer_state.json` post-run.
2026-05-03 20:32:15 +00:00
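Session-scoped exclusion as described above can be sketched with plain sets. The `retrieve` signature follows the call shown in the commit; the fake corpus is hypothetical, and treating every NREM source as a "high-scorer" is a simplification.

```python
def run_session(retrieve):
    """Run the three stages, excluding only sources seen this session."""
    nrem = retrieve("nrem", excluded_sources=None)
    nrem_sources = {chunk["source"] for chunk in nrem}

    early = retrieve("early_rem", excluded_sources=nrem_sources)
    early_sources = {chunk["source"] for chunk in early}

    # Late REM excludes the union of both earlier stages, nothing older.
    late = retrieve("late_rem", excluded_sources=nrem_sources | early_sources)
    return nrem, early, late


CORPUS = [f"doc{i}" for i in range(10)]


def fake_retrieve(stage, excluded_sources=None, k=3):
    """Return the first k corpus sources that are not excluded."""
    excluded = excluded_sources or set()
    return [{"source": s} for s in CORPUS if s not in excluded][:k]


nrem, early, late = run_session(fake_retrieve)
stage_sources = [[c["source"] for c in stage] for stage in (nrem, early, late)]
```

Nothing is persisted between calls to `run_session`, so a second night starts with the full corpus reachable again.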
aaron 1101bef226 scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11)
Consolidates four extract paths and two extract-chunk-embed-write pipelines
into a single shared encoding module. Fixes the embedder lifecycle
divergence between watcher and /api/reindex (no more 200MB reload per
reindex click) and unifies failure tracking so /api/reindex failures now
surface in SettingsPanel "Ingest Health".

New files:
- scripts/encoding.py — extract_text, chunk_text, chunk_and_embed,
  write_embeddings_batch
- scripts/failures.py — record_ingest_failure, resolve_ingest_failure
  (shared by watcher.py and ingest.py)

Refactored:
- scripts/watcher.py — drops local extract/chunk/embed implementations
  and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding
  and failures. Now writes ingest_failures row on empty-text-extract
  (was silent return 0).
- scripts/ingest.py — substantial rewrite. Exposes ingest_directory(folder,
  embedder=None) for in-process invocation; CLI back-compat preserved via
  ingest_folder wrapper. Module-level SentenceTransformer load removed.
- scripts/corpus_integrity.py — imports extract_text from encoding;
  extract_text_for_retry function removed.
- scripts/api.py — /api/reindex rewritten with BackgroundTasks (uses
  module-level embedder; no subprocess); new /api/reindex/status endpoint
  reading ~/aaronai/reindex_status.json; /api/corpus/retry imports
  extract_text from encoding; INGEST_SCRIPT constant removed (dead after
  this refactor); 409 reentrance guard prevents double-click stomping.

Behavior changes:
- /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks
  threadpool, doesn't block API thread.
- /api/reindex no longer reloads SentenceTransformer on each click.
- /api/reindex failures newly write to ingest_failures (visible in
  SettingsPanel "Ingest Health" — badge will jump on first reindex).
- New embeddings rows always have created_at = NOW() (canonical, server-side).
- New embeddings rows always include metadata.folder field (None when not
  derivable).
- /api/reindex returns 409 on second click while a job is running.
- New /api/reindex/status endpoint for polling.

Existing 9,815 NULL created_at rows remain unchanged; backfill is a
separate decision if desired.

199 insertions, 256 deletions across 6 files (codebase shrinks net).

Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11).
Pre-commit verification: BackgroundTasks already imported, sys.path
resolves correctly via script-path semantics, static import clean.
2026-05-03 01:40:47 +00:00
aaron a317df66f8 dream: factor prompts into module-level templates, repair prompt_hash (Track 1 Finding 11)
prompt_hash() in dream.py was hashing function __doc__ strings, but the
synth functions don't have docstrings, so the hash was always MD5("") =
d41d8cd9 for every dream. The manifest field meant to detect undeclared
prompt drift carried no useful information.

Refactor:
- Each synth function's prompt template moved to a module-level constant
  (NREM_PROMPT_TEMPLATE, EARLY_REM_PROMPT_TEMPLATE, LATE_REM_PROMPT_TEMPLATE,
  SYNTHESIS_PROMPT_TEMPLATE, LUCID_PROMPT_TEMPLATE) using str.format()
  placeholders instead of f-string interpolation.
- Synth functions call TEMPLATE.format(...) at use time. Output is byte-
  identical to the previous f-string implementation.
- prompt_hash() now hashes the four pipeline template constants (lucid is
  on-demand, not part of the nightly manifest — preserves prior scope).
- LUCID_DEFAULT_TASK extracted as a named constant from the lucid fallback
  question (factoring only, no behavior change).
- PROMPT_VERSION_* constants and synth function signatures untouched.
- v1.1 register-shift comment in synthesize_early_rem preserved inline.

The post-fix hash will differ from d41d8cd9 (verified: b65695a1 in static
test). Historical manifests still carry d41d8cd9; the discontinuity is
intentional — pre-fix hashes were equally meaningless and faking continuity
would be worse than acknowledging the break.

Found by Track 1 inventory 2026-05-02 (Finding 11 / divergence #11).
Verified static import + hash determinism before commit.
2026-05-03 00:24:21 +00:00
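A sketch of why the old hash was constant and how hashing module-level templates fixes it. The template texts are placeholders, and joining with newlines plus truncating to 8 hex characters are assumptions based on the 8-character hashes quoted above.

```python
import hashlib

# Placeholder templates; the real ones live in dream.py.
NREM_PROMPT_TEMPLATE = "Consolidate these chunks:\n{chunks}"
EARLY_REM_PROMPT_TEMPLATE = "Find associations across:\n{chunks}"
PIPELINE_TEMPLATES = (NREM_PROMPT_TEMPLATE, EARLY_REM_PROMPT_TEMPLATE)


def prompt_hash(templates):
    """Short, deterministic fingerprint of the prompt templates."""
    joined = "\n".join(templates)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()[:8]


# Hashing a missing docstring is effectively hashing the empty string.
empty_hash = hashlib.md5(b"").hexdigest()[:8]

current = prompt_hash(PIPELINE_TEMPLATES)
edited = prompt_hash((NREM_PROMPT_TEMPLATE + " Be brief.", EARLY_REM_PROMPT_TEMPLATE))
```

Any edit to a template changes the fingerprint, which is the drift signal the manifest field was meant to carry.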
31 changed files with 7754 additions and 561 deletions
@@ -8,6 +8,7 @@ dreamer_state.json
 corpus_integrity_report.json
 watcher_state.json
 watcher_status.json
+reindex_status.json
 # Logs (these belong in /var/log/)
 *.log
@@ -65,6 +65,38 @@ The watcher (`watcher.py` + `aaronai-watcher.service`) is a clean Stage 1 that m
---
## Updates — 2026-05-03 session
*Layered updates from Track 1 improvement work on 2026-05-03. The 2026-05-02 inventory above is preserved as a point-in-time snapshot; corrections and resolutions are recorded here with provenance.*
### Resolved
- **NREM-shape divergence #1 (cumulative cross-night exclusion 500-cap, `dream.py`) — RESOLVED.** Replaced cumulative `retrieved_sources` with session-scoped novelty. Early REM now excludes only NREM high-scorers from the current session; Late REM excludes the current session's NREM and Early REM sources. Legacy `retrieved_sources` key cleared from `dreamer_state.json`. Verification: post-fix dream-manifest source count rose to 24 (vs. 13 / 16 on the two prior comparable runs) — the previously-hidden ~40% of corpus is now reachable by Early/Late REM as the architecture and reframe specify. NREM exclusion fix from 2026-05-02 preserved.
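
The session-scoped exclusion rule can be sketched as follows. This is an illustration only: the function name and the `session` dictionary keys are hypothetical, and the actual `dream.py` state layout is not shown in this document.

```python
def excluded_sources(stage, session):
    """Return the sources a stage should skip, using current-session state only.

    `session` maps hypothetical keys to sets of source names retrieved so far
    in the current run; nothing carries over from previous nights.
    """
    nrem = session.get("nrem_high_scorers", set())
    early = session.get("early_rem", set())
    if stage == "early_rem":
        return set(nrem)
    if stage == "late_rem":
        return set(nrem) | set(early)
    return set()  # NREM's own exclusion logic is out of scope for this sketch


session = {"nrem_high_scorers": {"a.pdf", "b.docx"}, "early_rem": {"c.md"}}
print(sorted(excluded_sources("early_rem", session)))  # ['a.pdf', 'b.docx']
print(sorted(excluded_sources("late_rem", session)))   # ['a.pdf', 'b.docx', 'c.md']
```

Because the sets are rebuilt each session, no source can be permanently hidden by an exclusion list that only grows.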
### Corrections to existing findings
- **`stage2_metadata` location (Phase 1, `stage2_worker.py`):** the metadata column lives on `stage_3_queue.stage2_metadata` (jsonb), **not on `stage_2_queue`**. `stage_2_queue` has only basic queue fields (`id, source, full_text, char_length, timestamps, failure_reason, attempts`). The 2026-05-02 entry implied otherwise. Corrected via direct schema inspection on 2026-05-03.
- **Stage 2 char_length gate (Phase 1, `stage2_worker.py`):** the `char_length < 2000` check at line 139 runs *before* the Mistral call at line 149. For sub-2000-char docs, Mistral is **never invoked** — the worker logs `Processing → Skipping Stage 3 → completed_at = NOW()` with no Mistral pass between them. The earlier framing of "documents under 2000 chars skip Stage 3" was correct as written, but the implied "Stage 2 produces orientation metadata for everything" architecture commitment is not what the code does. 339 of 1,041 completed Stage 2 docs (33%) have **no frame data extracted at all**, not "frame data extracted then discarded."
### New findings from 2026-05-03 frame analysis (Improvement #3)
- **`ingest_conversations.py` bypasses Stage 2 entirely.** 198 distinct conversation sources (`Claude:`, `ChatGPT:`, `Aaron AI:`, plus `type='aaronai_conversation'`) write directly to pgvector `embeddings` and never enter `stage_2_queue`. Conversations have **zero frame coverage by design**, not by accident. Combined with the 339-doc char-gate exclusion and 12 Stage 2 failures, **only 56% of the embeddings corpus has any frame data**. Same NREM shape — a routing decision the architecture didn't explicitly request, doing something silently that the architecture's "Stage 2 produces orientation for everything" commitment denies.
- **Voice notes (14) and dream outputs (39) are systematically excluded from the frame system.** Within the 339-doc <2000-char gap: all 14 voice notes and all 39 dreamer-output files (NREM, Early REM, Late REM, synthesis markdown) are present. Voice is one of Aaron's primary capture channels. Dream outputs are the dreamer's own reflection. Both are silent to the frame system that orients downstream extraction — meaning the dreamer cannot frame-condition on its own output. Same NREM shape as the others.
- **File-type × frame stratification signal exists and is currently unused** (cross-link to Phase 3 `embeddings.type` finding). The 2026-05-03 frame analysis (`docs/stage2-frame-analysis-2026-05-03.md` §5) shows that within frame-extracted docs, "Programming" pivots to pptx (n=15), "Application" pivots to pdf (n=13), Education spreads across pdf+docx — file type adds discriminating signal to frame routing. Currently `embeddings.type` is NULL for 71% of rows; backfilling it (Improvement #2, not yet applied) would make this stratification queryable at retrieval time instead of reverse-engineerable from filenames.
### Artifacts produced 2026-05-03
- **Code change:** `scripts/dream.py` (Improvement #1).
- **New SQL view:** `stage2_frames_v` (over `stage_3_queue.stage2_metadata`; `CREATE OR REPLACE`, idempotent, drop with `DROP VIEW stage2_frames_v;`).
- **New analysis script:** `scripts/experiments/frame_distribution_report.py` (read-only).
- **JSON sidecar:** `experiments/frame_distribution_2026-05-03.json`.
- **Report:** `docs/stage2-frame-analysis-2026-05-03.md`.
---
## Phase 1 — Scripts
+105
@@ -0,0 +1,105 @@
# OCR install record — 2026-05-04
## Machine
- Host: aaronai-01 (VPS)
- OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)
## apt packages installed
| package | version | source |
|---|---|---|
| tesseract-ocr | 5.3.4-1build5 | noble |
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |
## pip packages installed (into /home/aaron/aaronai/venv)
| package | version |
|---|---|
| pytesseract | 0.3.13 |
| ocrmypdf | 17.4.2 |
Direct dependencies pulled in by the two installs above (also new in venv): `pikepdf 10.5.1`, `pdfminer-six 20260107`, `pypdfium2 5.7.1`, `img2pdf 0.6.3`, `pi-heif 1.3.0`, `cryptography 47.0.0`, `cffi 2.0.0`, `pycparser 3.0`, `Deprecated 1.3.1`, `deprecation 2.1.0`, `defusedxml 0.7.1`, `fonttools 4.62.1`, `fpdf2 2.8.7`, `uharfbuzz 0.54.1`, `wrapt 2.1.2`, `pluggy 1.6.0`. `pillow` was already at 12.2.0.
## Smoke test 1 — `tesseract --version`
```
tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found AVX512BW
Found AVX512F
```
## Smoke test 2 — `tesseract --list-langs`
```
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd
```
## Smoke test 3 — pytesseract on a slide image
- Input pptx: `/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx`
- Extracted image: `ppt/media/image1.PNG` (1768×504 PNG)
- Wall-clock: 0.220s
- Chars extracted: 126
- First 200 chars:
```
Generates the Bounding Box for NESS
round(x, 4), round(y, 4), round(z, 4), round(a, 4))
Format ("HSS5 X(0} ¥(1} W(2} H(3)",
```
Note: the first image in `Renders.pptx` (image1.jpg, 640×480) returned 0 chars on first attempt. Sampled 15 images in `Renders.pptx`; all 15 are pure rendered designs/photographs with no text. Switched to `GH Slicer Notes.pptx` (per the original 4-image-only-pptx candidate list) where image1.PNG is a textual code-screenshot. Tesseract behavior is correct in both cases; `Renders.pptx` is not a useful OCR test target because it contains no text. Some character-recognition noise on the code screenshot (e.g. `¥(1}` for `Y(1)`, mojibake on parentheses/braces) — acceptable for a baseline smoke; production tuning is a worker-design concern.
## Smoke test 4 — ocrmypdf on a Lexmark CX510de scan
- Input PDF: `/home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf` (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
- Command: `ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf`
- Wall-clock: 3.72s (whole PDF, 4 pages)
- Exit: 0
- After OCR, `pdftotext` on the output produced 2347 chars (2270 non-whitespace).
- First 200 chars of OCR'd text:
```
nN New Paltz
STATE UNIVERSITY OF NEW YORK
The Honors Program
May 30, 2017
Dear Aaron,
Thank you for serving as a reader for Caryn Byllotts thesis on "Recall/Reconstruct: The Exploration of
Memory
```
Real readable English. The "nN" header is the Lexmark logo glyph; otherwise clean. ~0.93s/page on this scan, which is the reference number for sizing the async worker queue.
## Reference timing
| operation | input size | wall-clock |
|---|---|---|
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |
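
A rough way to turn the reference timing into a queue-sizing estimate. This sketch assumes cost scales linearly with page count and that workers parallelize perfectly, neither of which has been measured here. The page count of the backlog is not recorded in this document, so it is left as an input.

```python
# Reference measurement from smoke test 4: 4 pages in 3.72 s.
SECONDS_PER_PAGE = 3.72 / 4


def estimate_ocr_seconds(total_pages, workers=1):
    """Rough wall-clock estimate, assuming linear cost per page."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return total_pages * SECONDS_PER_PAGE / workers


print(round(estimate_ocr_seconds(4), 2))  # 3.72
```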
## Deferred — project dep-tracking
The project has no dependency manifest on disk: no `requirements.txt`, `pyproject.toml`, `setup.py`, `Pipfile`, or `poetry.lock`. Pip deps live only in `venv/`. The OCR install adds `pytesseract` and `ocrmypdf` (plus their transitive closure listed above) to that untracked venv state.
This commit does not introduce a manifest. Tracking the dep-manifest decision as its own followup; the natural deadline is the capture-path integration commit, where `import pytesseract` will become load-bearing in the repo. If the manifest question is unresolved by then, that integration commit is the right place to address it.
## Followups
- Async OCR worker (separate session). Use the reference timing above to size the queue.
- Capture path integration: phone-camera images → `pytesseract.image_to_string` → existing chunk/embed pipeline.
- Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx (`Renders.pptx`, `Ribbon Cutting Slideshow.pptx`, two `GH Slicer Notes` variants). Per the smoke results, `Renders.pptx` is unlikely to yield useful OCR text — it is rendered-design content, not scanned documents — and may need to be excluded rather than processed.
- Project dep-manifest decision (see Deferred section above).
+175
@@ -0,0 +1,175 @@
# Stage 2 Frame Analysis — 2026-05-03
*Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).*
**Data source:** `stage_3_queue.stage2_metadata` (jsonb), exposed via the new SQL view `stage2_frames_v`. Analysis script: `scripts/experiments/frame_distribution_report.py`. Sidecar JSON: `experiments/frame_distribution_2026-05-03.json`. **Stage 3 service is currently stopped, so this is a stable snapshot.**
---
## Verdict
**Frames cluster meaningfully but coverage is partial.** Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate — the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. **However, only 56% of the embeddings corpus has any frame data at all.** The other 44% — conversations, short files, voice notes, dream outputs — has zero frame coverage by design, not by accident.
Frame-conditional routing is a viable γ component candidate **for the document side of the corpus**. It is not a viable router for the conversational or self-generated side without filling the coverage hole.
---
## 1. Corpus-wide frame coverage
| Class | Count | % of corpus | Frame coverage |
|---|---|---|---|
| Total distinct sources in `embeddings` | 1,255 | 100% | — |
| Files with frames (`stage_3_queue.stage2_metadata`) | 704 | 56.1% | yes |
| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | **none — bypass Stage 2 by design** |
| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | **none — Mistral never invoked** |
| Files that failed Stage 2 | 12 | 1.0% | none |
**56.1% frame coverage** is the headline. The architectural reason for the gap is twofold:
1. **`ingest_conversations.py` writes directly to `embeddings`** with `type='aaronai_conversation'` and never enqueues to `stage_2_queue`. Conversations have never been frame-extracted, full stop.
2. **`stage2_worker.py:139` gates Mistral on char_length.** Docs <2,000 chars are marked complete with `completed_at = NOW()` *before* Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory) — but neither is any frame data produced.
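
The percentages in the coverage table can be re-derived from the counts. The check below uses only the numbers in the table. It also shows that the four classes sum to 1,253, which leaves 2 sources that the table does not classify.

```python
TOTAL_SOURCES = 1255  # distinct sources in `embeddings`

classes = {
    "framed": 704,
    "conversations": 198,
    "short_docs": 339,
    "stage2_failures": 12,
}

for name, count in classes.items():
    print(f"{name}: {100 * count / TOTAL_SOURCES:.1f}%")

unaccounted = TOTAL_SOURCES - sum(classes.values())
print("unaccounted:", unaccounted)  # unaccounted: 2
```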
## 2. Frame distribution (the docs that DO have frames)
**668 docs, 1,374 distinct frame labels. Top-20 by count:**
| Frame | Count | % of frame-extracted docs |
|---|---|---|
| Education | 238 | 35.6% |
| Course | 58 | 8.7% |
| Programming | 43 | 6.4% |
| Design | 32 | 4.8% |
| Professional Experience | 24 | 3.6% |
| Employment | 24 | 3.6% |
| Research | 23 | 3.4% |
| 3D Printing | 22 | 3.3% |
| Project, Grading, Art, Budget | 21 each | 3.1% |
| Academic Integrity | 20 | 3.0% |
| Teaching, Technology, Attendance, Application | 13–19 | — |
| Accommodation, Manufacturing, Coursework, Recommendation | 10–13 | — |
**Per-doc frame count:** median 3–4 frames per doc; 76% of docs have 3–5 frames; one outlier doc has 30 frames (Mistral over-segmented).
**Long tail is enormous.** 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.
**"Education" is the universal frame.** It dominates co-occurrence pairs (8 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."
## 3. Label hygiene
**54 normalized collisions** detected (case-insensitive, underscore-vs-space):
| Concept | Variant counts |
|---|---|
| Professional Experience | `Professional Experience`:24 + `Professional_Experience`:6 |
| 3D Printing | `3D Printing`:22 + `3D_Printing`:7 |
| Academic Integrity | `Academic Integrity`:20 + `Academic_Integrity`:2 |
| Course Design | `Course Design`:9 + `Course_Design`:1 |
| Project Management | `Project Management`:7 + `Project_Management`:1 |
| Computational Design | `Computational Design`:7 + `Computational_Design`:1 |
| (… 48 more) | |
Without normalization, ~30+ documents have their frames silently split across spelling variants for the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, single-space, hyphens preserved.
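
A minimal sketch of the recommended canonical form (lowercase, single-space, hyphens preserved), applied to two of the collision groups from the table above. The function name is hypothetical, and summing the variant counts assumes no single document carries both spellings.

```python
import re
from collections import Counter


def normalize_frame(label):
    """Lowercase, turn underscores into spaces, collapse whitespace, keep hyphens."""
    label = label.replace("_", " ").lower()
    return re.sub(r"\s+", " ", label).strip()


raw_counts = {
    "Professional Experience": 24,
    "Professional_Experience": 6,
    "3D Printing": 22,
    "3D_Printing": 7,
}

merged = Counter()
for label, n in raw_counts.items():
    merged[normalize_frame(label)] += n

print(dict(merged))  # {'professional experience': 30, '3d printing': 29}
```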
## 4. Worker version drift
| Worker version | Doc count | Notes |
|---|---|---|
| v2.1 | 665 | Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
| v2.0 | 3 | Same key shape as v2.1 baseline. |
Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. **For Track 2 substrate ingest, plan for `stage2_metadata` to occasionally include unexpected top-level keys.**
## 5. File-type signal
This is the most useful Track 2 finding from this report.
`stage_3_queue.source` stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:
| Frame | pdf | docx | pptx | markdown | txt | dream |
|---|---|---|---|---|---|---|
| Education | 116 | 119 | 3 | — | — | — |
| Course | 29 | 29 | — | — | — | — |
| Programming | 12 | 10 | **15** | — | 6 | — |
| Application | **13** | 2 | — | — | — | — |
| 3D Printing | 11 | 3 | **8** | — | — | — |
| Manufacturing | 3 | 6 | 4 | — | — | — |
| Research | 9 | 13 | — | 1 | — | — |
**Concrete signal:** "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal — the watcher already knows it — and it disambiguates frames that the model treats as equivalent. **`embeddings.type` is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.**
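
The suffix binning behind the cross-tab can be sketched as below. The function names are hypothetical and the frame assignments in the example rows are invented for illustration; only the filenames come from this document. The report's `markdown` and `dream` columns imply a further mapping beyond the raw suffix, which this sketch does not attempt.

```python
from collections import Counter
from pathlib import PurePosixPath


def file_type(source):
    """Bin a bare filename by its lowercase suffix; 'unknown' when there is none."""
    suffix = PurePosixPath(source).suffix.lower()
    return suffix.lstrip(".") or "unknown"


def cross_tab(rows):
    """rows: iterable of (source, frames) pairs -> Counter keyed by (frame, file_type)."""
    table = Counter()
    for source, frames in rows:
        for frame in set(frames):  # count each frame once per doc
            table[(frame, file_type(source))] += 1
    return table


rows = [
    ("GH Slicer Notes.pptx", ["Programming", "3D Printing"]),
    ("Metals II Syllabus.pdf", ["Education", "Course"]),
]
print(cross_tab(rows)[("Programming", "pptx")])  # 1
```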
## 6. Systematic exclusions inside the 339-doc gap
Of the 339 short docs that bypass frame extraction, the breakdown by file type:
| Type | Count | What this is |
|---|---|---|
| pdf | 110 | Short PDFs (forms, single-page docs) |
| docx | 110 | Short Word docs |
| dream_output | 39 | **The dreamer's own NREM/Early-REM/Late-REM/synthesis files** |
| pptx | 31 | Short slide decks |
| txt | 28 | Plain-text files |
| voice_note | 14 | **Every voice note in the corpus** |
| markdown | 7 | Short markdown |
**Two specific systematic exclusions worth naming separately:**
- **All 14 voice notes have no frames.** Voice is one of Aaron's primary capture channels. The frame system is silent on it.
- **All 39 dream outputs have no frames.** The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.
These are NREM-shape findings: the architecture's frame extraction is *quietly* not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.
---
## 7. Would frame-conditional routing be a viable γ component, and what would it condition on?
**Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification.** The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:
1. **Normalize labels before any routing decision.** 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
2. **Treat "Education" as a near-universal prior, not a frame.** It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the *base case* and condition on the secondary frame. (See §8 follow-up — the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
3. **Combine frames with file type, not frames alone.** Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive — Programming → pptx (n=15), Application → pdf (n=13) — but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.
**What it would condition on:** the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute `P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length)` rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.
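
One way to make the conditioning variables concrete. This is not a probability model; it only computes the inputs a Track 2 router might condition on. Jaccard overlap as the frame measure, the bucket boundaries, and all function names are assumptions made for this sketch; only the 2,000-character boundary mirrors the Stage 2 char gate.

```python
def length_bucket(char_length):
    # Hypothetical buckets: 2,000 mirrors the Stage 2 char gate, 20,000 is arbitrary.
    if char_length < 2000:
        return "short"
    if char_length < 20000:
        return "medium"
    return "long"


def routing_features(doc_frames, goal_frames, file_type, char_length):
    """Return the conditioning tuple: (frame overlap, file type, length bucket)."""
    doc, goal = set(doc_frames), set(goal_frames)
    union = doc | goal
    overlap = len(doc & goal) / len(union) if union else 0.0  # Jaccard similarity
    return overlap, file_type, length_bucket(char_length)


print(routing_features({"programming", "3d printing"}, {"programming"}, "pptx", 5400))
# (0.5, 'pptx', 'medium')
```

Frames are assumed to be normalized before they reach this function, per condition 1 above.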
**Defined scope (the coverage caveat):**
The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:
- **(a) Backfill frames for short docs and conversations.** Run Mistral on the 339 short docs (cheap — they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
- **(b) Use a degraded fallback for unframed docs.** File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
- **(c) Accept the gap as a scope limit.** The router only operates on long, non-conversation files. The 44% gap is unrouted (whatever the current default is).
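(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). The backfill covers 537 sources (339 short docs plus 198 conversations); Mistral cost is small for the short docs, and the conversation pass is the larger of the two (see §8). **Recommend (a) before any router work begins.**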
---
## 8. Recommended follow-ups (ordered by ROI)
1. **Backfill the 339 short docs.** Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities — primary capture and primary self-reflection channels currently silent.
2. **Backfill conversations into frame extraction.** Either modify `ingest_conversations.py` to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
3. **Add a frame-label normalizer at the worker.** New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
4. **Decide whether to deprecate "Education" as a frame.** It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
5. **Per-frame retrieval-similarity follow-up (deferred from this report).** Now that we know frames cluster meaningfully, instrumenting `dream.py` to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.
6. **Diagnose the "Education" dominance: prompt artifact vs. corpus shape.** Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education", classify as *truly academic content* vs. *Education was a default Mistral reached for*. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. 20-doc sample is small enough to do in one sitting, large enough to distinguish the hypotheses at >70/30 splits.
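
The 20-doc diagnostic in item 6 could be run along these lines. The helper names are hypothetical, the labels are assigned by hand, and treating exactly 70/30 as decisive is an assumption (the text says ">70/30").

```python
import random


def draw_sample(doc_ids, k=20, seed=0):
    """Reproducibly pick k docs to hand-inspect (fewer if the pool is smaller)."""
    rng = random.Random(seed)
    pool = sorted(doc_ids)
    return rng.sample(pool, min(k, len(pool)))


def verdict(n_truly_academic, n_total, threshold=0.70):
    """Apply the 70/30 rule of thumb to hand-assigned labels."""
    share = n_truly_academic / n_total
    if share >= threshold:
        return "corpus shape"
    if share <= 1 - threshold:
        return "prompt artifact"
    return "inconclusive"


print(verdict(16, 20))  # corpus shape
print(verdict(4, 20))   # prompt artifact
print(verdict(11, 20))  # inconclusive
```

A fixed seed keeps the sample reproducible, so a second reviewer can inspect the same 20 docs.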
---
## 9. Inventory edits flagged for session-end batch
- **Correction:** `stage2_metadata` lives on `stage_3_queue.stage2_metadata` (jsonb), not on `stage_2_queue` as the inventory implied. The Phase 1 / `stage2_worker.py` entry should be corrected.
- **New finding:** the char_length gate runs *before* the Mistral call (`stage2_worker.py:139` precedes `:147`). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
- **New finding:** `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1 — a routing decision the architecture didn't explicitly request.
- **New finding (cross-link to #2):** `embeddings.type` NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
- **New finding:** Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: **all 14 voice notes** and **all 39 dream outputs** are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction — which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others — a routing decision the architecture didn't explicitly request.
## 10. Reproduction
```bash
cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json: experiments/frame_distribution_<date>.json
# view: stage2_frames_v (in pgvector DB)
```
The view is `CREATE OR REPLACE`, idempotent. Drop with `DROP VIEW stage2_frames_v;` if needed.
@@ -0,0 +1,857 @@
{
"generated_at": "2026-05-03T23:47:54.802182+00:00",
"section_1": {
"overall": {
"total": 14069,
"type_null": 9815,
"ca_null": 12109,
"both_null": 9815,
"both_set": 1960
},
"cohorts": [
{
"type": "aaronai_conversation",
"ca_null": false,
"n": 71
},
{
"type": "chatgpt_conversation",
"ca_null": true,
"n": 1548
},
{
"type": "claude_conversation",
"ca_null": false,
"n": 1074
},
{
"type": "claude_memory",
"ca_null": true,
"n": 1
},
{
"type": "document",
"ca_null": false,
"n": 815
},
{
"type": "document",
"ca_null": true,
"n": 745
},
{
"type": null,
"ca_null": true,
"n": 9815
}
]
},
"section_2": {
"by_ext": [
{
"ext": ".pdf",
"rows": 6886
},
{
"ext": ".txt",
"rows": 1501
},
{
"ext": ".docx",
"rows": 1048
},
{
"ext": ".pptx",
"rows": 353
},
{
"ext": ".md",
"rows": 27
}
],
"classified": 9815,
"unclassifiable": 0
},
"section_3": {
"watcher_state_paths": 1462,
"watcher_state_basenames": 1183,
"watcher_state_collisions": 109,
"rows_with_filepath": {
"total": 9816,
"exists": 9649,
"missing": 167,
"outside_root": 0,
"sample": [
{
"id": "f317f238_0",
"source": "NO thesis proposal.docx",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF790 Thesis/Nic OConnor/NO thesis proposal.docx",
"mtime": "2024-01-26T15:06:09Z"
},
{
"id": "81047646_0",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
},
{
"id": "81047646_1",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
},
{
"id": "4e49d3b4_4",
"source": "Circuit Intro.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF310 Mechatronics/Week 1/Circuit Intro.pdf",
"mtime": "2022-01-31T23:28:56Z"
},
{
"id": "81047646_2",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
}
]
},
"rows_without_filepath": {
"total": 744,
"distinct_basenames": 228,
"unique_hit": 211,
"collision_hit": 16,
"unfound": 1
},
"collision_shapes": {
"total": 109,
"shape_counts": {
"multi-live": 95,
"live+archive": 14
},
"rows_affected_by_shape": {
"multi-live": 85,
"live+archive": 0
},
"samples": {
"multi-live": [
{
"name": "README.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/README.md",
"mtime": "2026-04-25T17:08:01Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Processing/Nature of Code/The-Nature-of-Code-Examples/The-Nature-of-Code-Examples-master/README.md",
"mtime": "2017-03-09T23:32:59Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/samples/hal/README.md",
"mtime": "2016-12-21T10:37:05Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/platforms/maven/README.md",
"mtime": "2016-12-21T10:37:05Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/hal/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/carotene/README.md",
"mtime": "2016-12-21T10:37:02Z"
}
]
},
{
"name": "3DPrinting_v2.pptx",
"rows_no_fp_using_this_name": 4,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Innovation Center/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:49Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Cuba/Assets/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:18Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Conference/3D Printing/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:15Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Workshops/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:30:14Z"
}
]
},
{
"name": "Print in Place.docx",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF205 CAD1/Print in Place.docx",
"mtime": "2017-08-24T03:50:36Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/ARS393 CVS1/Print in Place.docx",
"mtime": "2015-10-28T20:36:52Z"
}
]
}
],
"live+archive": [
{
"name": "dreamer-design-spec.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/dreamer-design-spec.md",
"mtime": "2026-04-25T22:55:11Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/dreamer-design-spec.md",
"mtime": "2026-04-25T22:55:11Z"
}
]
},
{
"name": "BirdAI-Ingest-Architecture.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/BirdAI-Ingest-Architecture.md",
"mtime": "2026-04-28T00:08:38Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/BirdAI-Ingest-Architecture.md",
"mtime": "2026-04-28T00:08:38Z"
}
]
},
{
"name": "graphiti-migration-plan.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/graphiti-migration-plan.md",
"mtime": "2026-04-27T17:54:40Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/Migration Plans/graphiti-migration-plan.md",
"mtime": "2026-04-27T17:54:40Z"
}
]
}
]
}
}
},
"section_4": {
"export_dir_exists": true,
"files": [
{
"name": "conversations-000.json",
"size": 19050556,
"mtime": "2026-04-24T19:55:44Z"
},
{
"name": "conversations-001.json",
"size": 29057594,
"mtime": "2026-04-24T19:55:44Z"
}
],
"convo_index_size": 169,
"sample_results": [
{
"id": "chatgpt_87cc0c47-aaf9-42da-8169-3b8922f3afba_0",
"source": "ChatGPT: Dog named Bird",
"convo_id": "87cc0c47-aaf9-42da-8169-3b8922f3afba",
"create_time": 1708835138.51948,
"create_time_iso": "2024-02-25T04:25:38.519480Z",
"resolved": true
},
{
"id": "chatgpt_689fab3e-d79c-8333-aeb5-7da4e9ca160d_0",
"source": "ChatGPT: Video understanding limitations",
"convo_id": "689fab3e-d79c-8333-aeb5-7da4e9ca160d",
"create_time": 1755294541.894811,
"create_time_iso": "2025-08-15T21:49:01.894811Z",
"resolved": true
},
{
"id": "chatgpt_611ff391-7fc0-42ea-bfd9-18dbe1739f19_7",
"source": "ChatGPT: Calculating Truncated Cone Angle",
"convo_id": "611ff391-7fc0-42ea-bfd9-18dbe1739f19",
"create_time": 1724020869.471264,
"create_time_iso": "2024-08-18T22:41:09.471264Z",
"resolved": true
},
{
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_50",
"source": "ChatGPT: Soul music playlist ideas",
"convo_id": "68ce1921-084c-8330-877c-78df1e03e54c",
"create_time": 1758337313.438344,
"create_time_iso": "2025-09-20T03:01:53.438344Z",
"resolved": true
},
{
"id": "chatgpt_c02e94f0-17db-4fd9-be04-13aaa1b728cb_1",
"source": "ChatGPT: Create Rhino plugin in Python",
"convo_id": "c02e94f0-17db-4fd9-be04-13aaa1b728cb",
"create_time": 1682716259.557353,
"create_time_iso": "2023-04-28T21:10:59.557353Z",
"resolved": true
}
],
"sample_resolved": 5,
"full_cohort": {
"distinct_convo_ids": 168,
"resolvable_from_export": 168,
"unresolvable": 0
}
},
"section_5": {
"earliest_per_type": [
{
"type": "aaronai_conversation",
"earliest": "2026-04-26T17:43:28.056503",
"latest": "2026-05-03T01:45:21.469613",
"rows": 71
},
{
"type": "claude_conversation",
"earliest": "2026-02-28T20:33:36.146998Z",
"latest": "2026-04-23T04:26:00.015419Z",
"rows": 1074
},
{
"type": "document",
"earliest": "2026-04-30 16:42:55.360736+00",
"latest": "2026-05-03 20:14:33.13663+00",
"rows": 815
}
],
"git_findings": [
"037d7475738352dd13620486b5154d58fa6c037b 2026-04-28 00:15:46 +0000 chore: archive deprecated chromadb and migration scripts",
"67766371789276ec4bcb8bac271b6eb9ddafa888 2026-04-27 05:16:37 +0000 Remove hardcoded PG password fallbacks \u2014 require PG_DSN env var in all scripts",
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
"8c8fba11b8d1b359b9b7722fc19b6ef562b812d8 2026-04-26 21:28:40 +0000 Add nightly conversation indexing \u2014 Aaron AI conversations into pgvector at 2:30AM",
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
"d2eed9890665a78a37fb5d336e8af75e7f2acb42 2026-04-26 20:19:49 +0000 Pre-pgvector migration checkpoint \u2014 upsert, allow_replace_deleted, maintenance timer"
],
"chromadb_candidates": [],
"proposed_sentinel": "2026-04-26T00:00:00Z",
"reasoning": "git f78b830 'Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL created_at all predate F11 and most predate the pgvector cutover itself. 2026-04-26 is the date the ChromaDB->pgvector migration script was committed, so any row currently in the embeddings table with NULL created_at must have been ingested on or after that date (when the table came into existence in current form). It is the tightest defensible upper bound on 'the row entered pgvector before timestamps were tracked', so it is the right sentinel."
},
"section_6": [
{
"cohort": "A (type NULL, ca NULL)",
"id": "f66c7390_6",
"source": "Design Guide - FDM for Composite Tooling 2.0.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2023-08-24T18:17:01Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "9cf798f8_151",
"source": "Shop Class as Soulcraft An inquiry into the value of the -- Crawford, Matthew.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-30T21:17:40.708026Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "fc378df0_329",
"source": "ulysses.txt",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2017-10-12T14:20:59Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "812bd5c6_0",
"source": "Bennington College Cover Letter.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2013-03-29T20:32:23Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "91ccefdd_185",
"source": "Cognition in the Wild (A Bradford Book) -- Hutchins, Edwin.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-25T17:21:35Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "48fa3d53_2",
"source": "CMakeLists.txt",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2016-12-21T10:37:05Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "49e3545d_9",
"source": "RH50-TM-L1-EN-20140902.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2014-09-02T18:44:08Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "a8366d89_144",
"source": "Hackers and Painters_ Big Ideas from the Computer Age -- Graham, Paul.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-24T22:25:03Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "3e3097f8_46",
"source": "The Nature and Art of Workmanship -- David Pye.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-24T22:24:03Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "87f9a5cf_269",
"source": "Supersizing the Mind_ Embodiment, Action, and Cognitive -- Andy Clark.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-25T17:14:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cd3d1914_61",
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T16:04:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "592a1366_0",
"source": "2026-04-29-synthesis.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T08:00:57.634567Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cfb0a691_3",
"source": "Consolidator-0.1-Specification.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T03:34:31Z",
"inferred_ca_source": "watcher_state_unique"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "cd3d1914_57",
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T16:04:25Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "e65ef61c_8",
"source": "BirdAI-Research-Context.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-29T15:57:07Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "4dce2922_3",
"source": "cascade-optimization-protocol.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-28T05:46:24Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "077cc52d_1",
"source": "graphiti-migration-plan.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T17:54:40Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "db356b14_70",
"source": "Finite and infinite games -- James Carse.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T06:11:55Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "1f15bccf_38",
"source": "BirdAI-Experiments-Log.md",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-05-01T16:40:02Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "B-doc-old (type='document', ca NULL)",
"id": "db356b14_13",
"source": "Finite and infinite games -- James Carse.pdf",
"existing_type": "document",
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-27T06:11:55Z",
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_30",
"source": "ChatGPT: External review for tenure",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_7",
"source": "ChatGPT: Website styling changes",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_67fc4254-ef50-8009-9e0f-81864cca7cec_1",
"source": "ChatGPT: Job Application Review",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68f3d936-d74c-8329-91df-fe838e292170_5",
"source": "ChatGPT: SEC coaches with OSU ties",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d1b5b-bb4c-832b-8d2e-11a86a569fcc_4",
"source": "ChatGPT: Hosting app platforms",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_bfa1cd2f-b8ab-4b11-b844-c47b2fa70612_1",
"source": "ChatGPT: New chat",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_37",
"source": "ChatGPT: Soul music playlist ideas",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_10",
"source": "ChatGPT: External review for tenure",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_10",
"source": "ChatGPT: Website styling changes",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
"id": "chatgpt_690286bd-0758-8332-8491-5d00c77f4696_1",
"source": "ChatGPT: Airbrushing and finishing setup",
"existing_type": "chatgpt_conversation",
"existing_ca": null,
"inferred_type": "chatgpt_conversation",
"inferred_ca": "2026-04-26T00:00:00Z",
"inferred_ca_source": "sentinel"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_0",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_208",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "ead32317_93",
"source": "Richard Sennett - The Craftsman.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:23:34.012202+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:23:34.012202+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_4",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_175",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_101",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_268",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "6ef0e329_5",
"source": "schematic-substrate-analysis.md",
"existing_type": "document",
"existing_ca": "2026-05-01 16:42:13.360795+00",
"inferred_type": "document",
"inferred_ca": "2026-05-01 16:42:13.360795+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "ead32317_132",
"source": "Richard Sennett - The Craftsman.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:23:34.012202+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:23:34.012202+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-doc-new (type='document', ca set)",
"id": "02db1224_86",
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
"existing_type": "document",
"existing_ca": "2026-04-30 22:21:56.211381+00",
"inferred_type": "document",
"inferred_ca": "2026-04-30 22:21:56.211381+00",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_dacf89e3-1ee7-400d-8461-ef5920c82fe3_96",
"source": "Claude: University of Utah interview teaching example",
"existing_type": "claude_conversation",
"existing_ca": "2026-03-11T18:05:57.594832Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-03-11T18:05:57.594832Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_c0baf4b0-a7bb-4664-ac7b-98d7b02f56a6_26",
"source": "Claude: Weighing Utah versus Oklahoma",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-01T19:08:26.722197Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-01T19:08:26.722197Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_92",
"source": "Claude: Setting up a custom OpenClaw instance",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-23T04:26:00.015419Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-23T04:26:00.015419Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_42dbddc5-12ba-4de7-a685-043473189da9_6",
"source": "Claude: I filling out my annual report...",
"existing_type": "claude_conversation",
"existing_ca": "2026-03-24T14:34:47.870625Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-03-24T14:34:47.870625Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-claude (type='claude_conversation', ca set)",
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_1344",
"source": "Claude: Setting up a custom OpenClaw instance",
"existing_type": "claude_conversation",
"existing_ca": "2026-04-23T04:26:00.015419Z",
"inferred_type": "claude_conversation",
"inferred_ca": "2026-04-23T04:26:00.015419Z",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_28ee8a447d3fc922_6",
"source": "Aaron AI: I'm working on you",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-26T17:43:28.056503",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-26T17:43:28.056503",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_7deef2e8001f0e45_20",
"source": "Aaron AI: Who's covering for me on sabbatical?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-29T22:19:45.312349",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-29T22:19:45.312349",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_21cabf771708df70_42",
"source": "Aaron AI: What should I be the most excited about right now?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-27T07:06:03.996026",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-27T07:06:03.996026",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_7deef2e8001f0e45_12",
"source": "Aaron AI: Who's covering for me on sabbatical?",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-04-29T22:19:45.312349",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-04-29T22:19:45.312349",
"inferred_ca_source": "preserved"
},
{
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
"id": "aaronai_conv_ed40b4278a9c8110_4",
"source": "Aaron AI: Let's say you're building an analog of the human brain, and ...",
"existing_type": "aaronai_conversation",
"existing_ca": "2026-05-03T01:45:21.469613",
"inferred_type": "aaronai_conversation",
"inferred_ca": "2026-05-03T01:45:21.469613",
"inferred_ca_source": "preserved"
}
]
}
@@ -0,0 +1,987 @@
{
"generated_at": "2026-05-03T20:21:33.558462",
"n_docs_with_frames": 668,
"n_distinct_labels": 1374,
"top_30_frames": [
[
"Education",
238
],
[
"Course",
58
],
[
"Programming",
43
],
[
"Design",
32
],
[
"Professional Experience",
24
],
[
"Employment",
24
],
[
"Research",
23
],
[
"3D Printing",
22
],
[
"Project",
21
],
[
"Grading",
21
],
[
"Art",
21
],
[
"Budget",
21
],
[
"Academic Integrity",
20
],
[
"Teaching",
19
],
[
"Technology",
18
],
[
"Attendance",
17
],
[
"Application",
15
],
[
"Accommodation",
13
],
[
"Manufacturing",
13
],
[
"Coursework",
11
],
[
"Recommendation",
10
],
[
"Manufacturing Process",
10
],
[
"Additive Manufacturing",
10
],
[
"Job Application",
10
],
[
"Exhibitions",
10
],
[
"Academic Administration",
9
],
[
"Communication",
9
],
[
"Course Design",
9
],
[
"Veteran and Military Services",
9
],
[
"Career",
9
]
],
"label_collisions": {
"conversational": [
[
"Conversational",
1
],
[
"conversational",
1
]
],
"content": [
[
"Content",
1
],
[
"content",
1
]
],
"cascade": [
[
"Cascade",
1
],
[
"cascade",
1
]
],
"education": [
[
"Education",
238
],
[
"education",
1
]
],
"academic record": [
[
"Academic_Record",
1
],
[
"Academic Record",
1
]
],
"independent study": [
[
"Independent Study",
5
],
[
"Independent_Study",
2
]
],
"project management": [
[
"Project Management",
7
],
[
"Project_Management",
1
]
],
"digital fabrication": [
[
"Digital Fabrication",
6
],
[
"digital_fabrication",
1
],
[
"digital fabrication",
1
]
],
"project proposal": [
[
"Project_Proposal",
2
],
[
"Project Proposal",
2
]
],
"academic integrity": [
[
"Academic Integrity",
20
],
[
"Academic_Integrity",
2
]
],
"3d printing": [
[
"3D Printing",
22
],
[
"3D_Printing",
7
]
],
"technical skills": [
[
"Technical Skills",
2
],
[
"Technical_Skills",
1
]
],
"course structure": [
[
"Course Structure",
7
],
[
"Course_Structure",
1
]
],
"course design": [
[
"Course Design",
9
],
[
"Course_Design",
1
]
],
"product design": [
[
"Product Design",
6
],
[
"Product_Design",
1
]
],
"professional experience": [
[
"Professional Experience",
24
],
[
"Professional_Experience",
6
]
],
"disability accommodations": [
[
"Disability Accommodations",
4
],
[
"Disability_Accommodations",
1
]
],
"material science": [
[
"Material_Science",
2
],
[
"Material Science",
4
]
],
"computational design": [
[
"Computational Design",
7
],
[
"Computational_Design",
1
]
],
"computer services policy": [
[
"Computer Services Policy",
6
],
[
"Computer_Services_Policy",
1
]
],
"work experience": [
[
"Work_Experience",
1
],
[
"Work Experience",
3
]
],
"academic program": [
[
"Academic Program",
7
],
[
"Academic_Program",
1
]
],
"project-based learning": [
[
"Project-Based Learning",
5
],
[
"Project-Based_Learning",
1
],
[
"Project-based Learning",
2
]
],
"art and design": [
[
"Art and Design",
6
],
[
"Art_and_Design",
1
]
],
"fdm technology": [
[
"FDM_Technology",
2
],
[
"FDM Technology",
1
]
],
"material selection": [
[
"Material_Selection",
1
],
[
"Material Selection",
1
]
],
"product development": [
[
"Product Development",
6
],
[
"Product_Development",
2
]
],
"market research": [
[
"Market_Research",
1
],
[
"Market Research",
2
]
],
"computer services": [
[
"Computer Services",
2
],
[
"Computer_Services",
1
]
],
"student evaluation of instruction": [
[
"Student Evaluation of Instruction",
1
],
[
"Student_Evaluation_of_Instruction",
1
]
],
"course management": [
[
"Course_Management",
1
],
[
"Course Management",
1
]
],
"grade policy": [
[
"Grade_Policy",
1
],
[
"Grade Policy",
1
]
],
"academic transcript": [
[
"Academic_Transcript",
1
],
[
"Academic Transcript",
1
]
],
"evaluation criteria": [
[
"Evaluation Criteria",
1
],
[
"Evaluation_Criteria",
1
]
],
"computer science": [
[
"Computer Science",
2
],
[
"Computer_Science",
1
]
],
"electrical circuit": [
[
"Electrical Circuit",
2
],
[
"Electrical_Circuit",
1
]
],
"digital logic": [
[
"Digital Logic",
1
],
[
"Digital_Logic",
1
]
],
"course description": [
[
"Course Description",
3
],
[
"Course_Description",
1
]
],
"organizational structure": [
[
"Organizational_Structure",
1
],
[
"Organizational Structure",
1
]
],
"digital design": [
[
"Digital_Design",
1
],
[
"Digital Design",
4
]
],
"contact information": [
[
"Contact Information",
2
],
[
"Contact_Information",
1
]
],
"professional career": [
[
"Professional_Career",
2
],
[
"Professional Career",
1
]
],
"personal projects": [
[
"Personal_Projects",
1
],
[
"Personal Projects",
2
]
],
"ai development": [
[
"AI_Development",
1
],
[
"AI Development",
1
]
],
"university service": [
[
"University Service",
2
],
[
"University_Service",
1
]
],
"professional exhibitions and publications": [
[
"Professional Exhibitions and Publications",
1
],
[
"Professional_Exhibitions_and_Publications",
1
]
],
"selected external consulting and design work": [
[
"Selected External Consulting and Design Work",
1
],
[
"Selected_External_Consulting_and_Design_Work",
2
]
],
"academic career": [
[
"Academic_Career",
1
],
[
"Academic Career",
2
]
],
"technology integration": [
[
"Technology Integration",
2
],
[
"Technology_Integration",
1
]
],
"artistic practice": [
[
"Artistic_Practice",
1
],
[
"Artistic Practice",
1
]
],
"multi-material 3d printing": [
[
"Multi-Material 3D Printing",
1
],
[
"Multi-material 3D Printing",
1
]
],
"community engagement": [
[
"Community Engagement",
3
],
[
"Community_Engagement",
1
]
],
"digitaldesignandfabrication": [
[
"DigitalDesignAndFabrication",
1
],
[
"DigitalDesignandFabrication",
1
]
],
"professional background": [
[
"Professional Background",
3
],
[
"Professional_Background",
1
]
]
},
"per_doc_frame_count": {
"3": 282,
"5": 67,
"4": 195,
"2": 57,
"7": 13,
"11": 5,
"13": 2,
"15": 1,
"12": 4,
"6": 21,
"8": 8,
"10": 4,
"9": 6,
"30": 1,
"14": 1,
"18": 1
},
"top_30_pairs": [
{
"a": "Course",
"b": "Education",
"count": 46
},
{
"a": "Education",
"b": "Project",
"count": 20
},
{
"a": "Design",
"b": "Education",
"count": 20
},
{
"a": "Education",
"b": "Professional Experience",
"count": 20
},
{
"a": "Education",
"b": "Employment",
"count": 20
},
{
"a": "Education",
"b": "Technology",
"count": 18
},
{
"a": "Education",
"b": "Grading",
"count": 17
},
{
"a": "Education",
"b": "Research",
"count": 15
},
{
"a": "Art",
"b": "Education",
"count": 15
},
{
"a": "Attendance",
"b": "Grading",
"count": 14
},
{
"a": "Course",
"b": "Grading",
"count": 13
},
{
"a": "Academic Integrity",
"b": "Education",
"count": 11
},
{
"a": "Attendance",
"b": "Education",
"count": 11
},
{
"a": "Attendance",
"b": "Course",
"count": 11
},
{
"a": "Application",
"b": "Employment",
"count": 11
},
{
"a": "Coursework",
"b": "Education",
"count": 10
},
{
"a": "Course",
"b": "Design",
"count": 10
},
{
"a": "Course",
"b": "Programming",
"count": 10
},
{
"a": "Application",
"b": "Education",
"count": 10
},
{
"a": "Budget",
"b": "Education",
"count": 10
},
{
"a": "Academic Integrity",
"b": "Accommodation",
"count": 9
},
{
"a": "Education",
"b": "Teaching",
"count": 9
},
{
"a": "Education",
"b": "Programming",
"count": 9
},
{
"a": "Academic Integrity",
"b": "Attendance",
"count": 9
},
{
"a": "Course",
"b": "Project",
"count": 8
},
{
"a": "Research",
"b": "Teaching",
"count": 8
},
{
"a": "Grading",
"b": "Project",
"count": 7
},
{
"a": "Art",
"b": "Technology",
"count": 7
},
{
"a": "Academic Integrity",
"b": "Course",
"count": 7
},
{
"a": "Accommodation",
"b": "Course",
"count": 7
}
],
"folder_crosstab": {
"Education": {
"pdf": 116,
"docx": 119,
"pptx": 3
},
"Course": {
"pdf": 29,
"docx": 29
},
"Programming": {
"pptx": 15,
"docx": 10,
"pdf": 12,
"txt": 6
},
"Design": {
"pdf": 13,
"docx": 16,
"pptx": 3
},
"Professional Experience": {
"docx": 13,
"pdf": 11
},
"Employment": {
"pdf": 15,
"docx": 9
},
"Research": {
"pdf": 9,
"docx": 13,
"markdown": 1
},
"3D Printing": {
"docx": 3,
"pdf": 11,
"pptx": 8
},
"Project": {
"pdf": 8,
"docx": 12,
"markdown": 1
},
"Grading": {
"pdf": 10,
"docx": 11
},
"Art": {
"docx": 11,
"pdf": 9,
"pptx": 1
},
"Budget": {
"docx": 6,
"pdf": 15
},
"Academic Integrity": {
"docx": 17,
"pdf": 3
},
"Teaching": {
"pdf": 9,
"docx": 10
},
"Technology": {
"docx": 15,
"pdf": 3
},
"Attendance": {
"docx": 11,
"pdf": 6
},
"Application": {
"pdf": 13,
"docx": 2
},
"Accommodation": {
"docx": 11,
"pdf": 2
},
"Manufacturing": {
"docx": 6,
"pptx": 4,
"pdf": 3
},
"Coursework": {
"pdf": 8,
"docx": 3
}
},
"bin_totals": {
"markdown": 64,
"pdf": 286,
"pptx": 70,
"txt": 28,
"docx": 217,
"dream_output": 3
},
"worker_versions": {
"2.0": 3,
"2.1": 665
},
"data_gap": {
"count": 339,
"by_type_bin": {
"pdf": 110,
"voice_note": 14,
"docx": 110,
"dream_output": 39,
"pptx": 31,
"txt": 28,
"markdown": 7
},
"char_length": {
"min": 6,
"max": 1998,
"median": 1077
},
"sample_sources": [
"Thesis Paper Guidlines.pdf",
"2026-04-30-17-06-voice.md",
"2026-04-30-15-59-voice.md",
"2026-04-30-16-53-voice.md",
"2026-04-30-16-23-voice.md",
"2026-04-29-17-52-voice.md",
"2026-04-30-16-59-voice.md",
"Outline for 3D Printed Materials for Foundry Casting.docx",
"2026-04-26-22-52-voice.md",
"2026-04-30-synthesis.md"
]
},
"corpus_coverage": {
"total_distinct_sources_in_embeddings": 1255,
"conversations_no_frames_by_design": 198,
"files_with_frames": 704,
"files_short_no_frames": 339,
"files_stage2_failed": 12,
"frame_coverage_pct": 56.1
}
}
@@ -0,0 +1,4 @@
# Local backups created by apply.sh — environment state, not source.
# Keeping these out of version control prevents repo bloat and avoids
# checking in graphiti-core's Apache-2.0 source under our repo's tree.
backups/
@@ -0,0 +1,58 @@
# graphiti-core Patches — FalkorDB Vector Index Support
Vendored patches against graphiti-core 0.29.0 that add native FalkorDB
vector index support. Three files are modified: `graphiti_core/graph_queries.py`
and two files under `graphiti_core/driver/`.
No changes to Neo4j or Kuzu code paths.
## Why this exists
graphiti-core's FalkorDB driver uses interpreted Cypher cosine math
(`vec.cosineDistance(...)`) for similarity search. Each query becomes a
full scan over Entity nodes, Community nodes, or RELATES_TO edges. At
roughly 4,000 entities and up, single-episode ingest's
resolve-against-existing-graph step
takes 8+ minutes and bulk ingest hangs FalkorDB. FalkorDB itself
supports `db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`
procedures backed by HNSW indexes; graphiti-core's driver doesn't use
them.
These patches:
1. Add `get_vector_indices()` to `graph_queries.py` returning CREATE
VECTOR INDEX statements for FalkorDB on Entity.name_embedding,
RELATES_TO.fact_embedding, and Community.name_embedding.
2. Extend `falkordb_driver.py:build_indices_and_constraints()` to create
the vector indexes alongside range and fulltext indexes.
3. Rewrite the three vector-similarity call sites in
`falkordb/operations/search_ops.py` to use
`db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`
instead of full-scan cosine math. Over-fetches by a configurable
multiplier to handle filter rejections.
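The over-fetch in step 3 can be sketched in plain Python. This is an illustration only, not the patched driver code: `index_top_k`, `search_with_overfetch`, `passes_filter`, and the multiplier value of 3 are hypothetical stand-ins, and the "index" here is just a sorted list rather than an HNSW lookup.

```python
import math
from typing import Callable, Sequence

CANDIDATE_MULTIPLIER = 3  # hypothetical value, chosen only for this example


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def index_top_k(query: Sequence[float], rows: list[dict], k: int) -> list[dict]:
    """Stand-in for an index lookup: the k most similar rows, with no filtering."""
    ranked = sorted(
        rows,
        key=lambda r: cosine_similarity(query, r["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def search_with_overfetch(
    query: Sequence[float],
    rows: list[dict],
    limit: int,
    passes_filter: Callable[[dict], bool],
) -> list[dict]:
    # Ask the index for more candidates than the caller wants ...
    candidates = index_top_k(query, rows, limit * CANDIDATE_MULTIPLIER)
    # ... then filter afterwards and trim to the requested limit.
    return [r for r in candidates if passes_filter(r)][:limit]


rows = [
    {"name": "a", "group": "x", "embedding": [1.0, 0.0]},
    {"name": "b", "group": "y", "embedding": [0.9, 0.1]},
    {"name": "c", "group": "x", "embedding": [0.8, 0.2]},
    {"name": "d", "group": "x", "embedding": [0.0, 1.0]},
]
hits = search_with_overfetch(
    [1.0, 0.0], rows, limit=2, passes_filter=lambda r: r["group"] == "x"
)
print([r["name"] for r in hits])  # prints ['a', 'c']
```

Without the multiplier, a plain top-2 lookup would return `a` and `b`, and the filter would then leave only one result. Fetching extra candidates first is what keeps the result list full.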
## Files
| Patched file | Change |
|---|---|
| `graphiti_core/graph_queries.py` | Adds `get_vector_indices()` |
| `graphiti_core/driver/falkordb_driver.py` | Extends `build_indices_and_constraints` |
| `graphiti_core/driver/falkordb/operations/search_ops.py` | Three query rewrites |
## How to apply
`./apply.sh` — backs up the originals into `./backups/<timestamp>/`
and copies the patched files over.
## How to revert
Copy the timestamped backup back over the venv:
cp backups/<ts>/graphiti_core/graph_queries.py /home/aaron/aaronai/venv/lib/python3.12/site-packages/graphiti_core/graph_queries.py
# ...etc
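
Because the backup directory mirrors the venv layout, a full revert can also be scripted as a tree copy. A minimal sketch, assuming the backup keeps its files under `graphiti_core/` the way `apply.sh` writes them; `restore_backup` is a hypothetical helper and not part of this patch set.

```python
import shutil
from pathlib import Path


def restore_backup(backup_dir: Path, site_packages: Path) -> list[Path]:
    """Copy every file under backup_dir to the same relative path in site_packages."""
    restored = []
    for src in sorted(backup_dir.rglob("*")):
        if src.is_file():
            dest = site_packages / src.relative_to(backup_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 also preserves file timestamps
            restored.append(dest)
    return restored
```

Call it with the timestamped backup directory and the venv's `site-packages` directory, then restart the sidecar so the restored files are loaded.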
## Upstream candidate
This is a documented gap: issue #1263 references it indirectly via the
vector store overlay RFC. Maintainers' attention is on a Milvus/external
vector DB overlay; this patch is the FalkorDB-native alternative for users who
don't want a separate vector DB. Consider a PR after empirical validation
in production.
@@ -0,0 +1,77 @@
#!/usr/bin/env bash
# apply.sh — Apply the BirdAI vendored graphiti-core patches.
#
# Backs up the original venv files into ./backups/<timestamp>/ before
# overwriting. The backup directory layout mirrors the venv layout so a
# revert is just a tree copy back.
#
# Usage: ./apply.sh
set -euo pipefail
PATCH_DIR="$(cd "$(dirname "$0")" && pwd)"
VENV_BASE="/home/aaron/aaronai/venv/lib/python3.12/site-packages"
TIMESTAMP="$(date +%Y%m%d-%H%M%S)"
BACKUP_DIR="$PATCH_DIR/backups/$TIMESTAMP"
# Files to patch — paths relative to graphiti_core/.
FILES=(
"graph_queries.py"
"driver/falkordb_driver.py"
"driver/falkordb/operations/search_ops.py"
)
echo "graphiti-core vendored patch apply — BirdAI"
echo "Patch directory: $PATCH_DIR"
echo "Venv target: $VENV_BASE/graphiti_core/"
echo "Backup to: $BACKUP_DIR"
echo
# Pre-flight: confirm all source patch files exist.
for rel in "${FILES[@]}"; do
if [ ! -f "$PATCH_DIR/graphiti_core/$rel" ]; then
echo "ERROR: missing patch file: $PATCH_DIR/graphiti_core/$rel" >&2
exit 1
fi
done
# Pre-flight: confirm all target venv files exist.
for rel in "${FILES[@]}"; do
if [ ! -f "$VENV_BASE/graphiti_core/$rel" ]; then
echo "ERROR: missing venv file: $VENV_BASE/graphiti_core/$rel" >&2
echo " graphiti-core may not be installed, or version differs from 0.29.0." >&2
exit 1
fi
done
# Backup originals.
echo "[1/3] Backing up originals..."
for rel in "${FILES[@]}"; do
backup_path="$BACKUP_DIR/graphiti_core/$rel"
mkdir -p "$(dirname "$backup_path")"
cp "$VENV_BASE/graphiti_core/$rel" "$backup_path"
echo " backed up: $rel"
done
echo
# Apply patches by copying.
echo "[2/3] Applying patches..."
for rel in "${FILES[@]}"; do
cp "$PATCH_DIR/graphiti_core/$rel" "$VENV_BASE/graphiti_core/$rel"
echo " patched: $rel"
done
echo
# Sanity check: confirm patched files have the marker.
echo "[3/3] Verifying patched files..."
for rel in "${FILES[@]}"; do
if grep -q "PATCHED 2026-05-02" "$VENV_BASE/graphiti_core/$rel"; then
echo " OK: $rel contains patch marker"
else
echo " WARNING: $rel missing patch marker (may be expected for graph_queries.py — its docstring uses the marker only in the module header)"
fi
done
echo
echo "Done. Backup: $BACKUP_DIR"
echo "Restart the sidecar to pick up changes:"
echo " sudo systemctl restart aaronai-graphiti.service"
@@ -0,0 +1,904 @@
"""
Copyright 2024, Zep Software, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import logging
from typing import Any
from graphiti_core.driver.driver import GraphProvider
from graphiti_core.driver.falkordb import STOPWORDS
from graphiti_core.driver.operations.search_ops import SearchOperations
from graphiti_core.driver.query_executor import QueryExecutor
from graphiti_core.driver.record_parsers import (
community_node_from_record,
entity_edge_from_record,
entity_node_from_record,
episodic_node_from_record,
)
from graphiti_core.edges import EntityEdge
from graphiti_core.graph_queries import (
get_nodes_query,
get_relationships_query,
get_vector_cosine_func_query,
)
from graphiti_core.models.edges.edge_db_queries import get_entity_edge_return_query
from graphiti_core.models.nodes.node_db_queries import (
COMMUNITY_NODE_RETURN,
EPISODIC_NODE_RETURN,
get_entity_node_return_query,
)
from graphiti_core.nodes import CommunityNode, EntityNode, EpisodicNode
from graphiti_core.search.search_filters import (
SearchFilters,
edge_search_filter_query_constructor,
node_search_filter_query_constructor,
)
logger = logging.getLogger(__name__)
MAX_QUERY_LENGTH = 128
# ---------------------------------------------------------------------------
# Vector index dispatcher (PATCHED 2026-05-02, BirdAI vendored patch).
#
# graphiti-core's FalkorDB driver historically composed similarity queries
# using `vec.cosineDistance(...)` in interpreted Cypher, which produces a
# full-table scan for every search. FalkorDB supports native vector indexes
# via `db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`;
# this dispatcher uses them when present and falls back to the cosine math
# otherwise.
#
# Index existence is checked once per (label, attribute, entity_type) and
# cached at module scope. The key does not include the graph name, so the
# cache assumes every graph served by this process carries the same indexes.
# The cache must be invalidated whenever `build_indices_and_constraints`
# runs, since indexes may have been created or dropped;
# FalkorDriver.build_indices_and_constraints is patched to call
# `_invalidate_falkordb_vector_index_cache()` after building.
#
# Over-fetch factor (VECTOR_INDEX_CANDIDATE_MULTIPLIER from graph_queries)
# preserves recall when WHERE filters reject some of the top-k candidates.
# ---------------------------------------------------------------------------
# get_vector_cosine_func_query is already imported at the top of the module.
from graphiti_core.graph_queries import VECTOR_INDEX_CANDIDATE_MULTIPLIER
# Cache: key = (label, attribute, entity_type), value = bool
# entity_type is 'NODE' or 'RELATIONSHIP'.
_FALKORDB_VECTOR_INDEX_CACHE: dict[tuple[str, str, str], bool] = {}
def _invalidate_falkordb_vector_index_cache() -> None:
"""Clear the vector-index existence cache. Call after build_indices_and_constraints."""
_FALKORDB_VECTOR_INDEX_CACHE.clear()
async def _falkordb_vector_index_exists(
executor: QueryExecutor,
label: str,
attribute: str,
entity_type: str,
) -> bool:
"""Check whether a FalkorDB vector index exists for the given target.
entity_type is 'NODE' for node-label indexes, 'RELATIONSHIP' for edge-type indexes.
Result is cached at module scope; call _invalidate_falkordb_vector_index_cache()
after building or dropping indexes.
"""
key = (label, attribute, entity_type)
if key in _FALKORDB_VECTOR_INDEX_CACHE:
return _FALKORDB_VECTOR_INDEX_CACHE[key]
try:
records, _, _ = await executor.execute_query(
"CALL db.indexes() YIELD label, properties, types, entitytype "
"RETURN label, properties, types, entitytype"
)
except Exception as e:
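# If we cannot enumerate indexes, fall back to "no index" rather than
# propagating the error. The fallback cosine-math path is correct,
# just slower. The negative result is cached as well, so a transient
# probe failure keeps this target on the slow path until the cache is
# next invalidated.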
logger.warning(f"FalkorDB vector index probe failed; assuming none exist: {e}")
_FALKORDB_VECTOR_INDEX_CACHE[key] = False
return False
found = False
for r in records:
# Records come back as dict-like rows keyed by column name (not
# tuples). Access by string keys matching the YIELD clause above.
rec_label = r.get('label') if hasattr(r, 'get') else r['label']
rec_props = r.get('properties') if hasattr(r, 'get') else r['properties']
rec_types = r.get('types') if hasattr(r, 'get') else r['types']
rec_entitytype = r.get('entitytype') if hasattr(r, 'get') else r['entitytype']
if rec_props is None:
rec_props = []
if rec_types is None:
rec_types = {}
if rec_label != label:
continue
if rec_entitytype is not None and rec_entitytype != entity_type:
continue
if attribute not in rec_props:
continue
# rec_types is a dict like {attribute: ['VECTOR', ...], ...} or sometimes
# a flat list — handle both shapes.
if isinstance(rec_types, dict):
attr_types = rec_types.get(attribute, [])
else:
attr_types = rec_types
if 'VECTOR' in attr_types:
found = True
break
_FALKORDB_VECTOR_INDEX_CACHE[key] = found
return found
def _falkordb_vector_node_search_cypher(
label: str,
embedding_attr: str,
search_vector_param: str,
use_index: bool,
) -> tuple[str, str]:
"""Build the cypher prefix and node-binding for a node-vector search.
Returns (prefix, node_var) where:
- prefix is the Cypher fragment that binds the node variable and a
`score` variable. With index, it's a CALL ... YIELD; without, it's
a MATCH plus WITH cosine math.
- node_var is the variable name the caller's downstream Cypher should
reference (always 'n' here for parity with the existing code).
The caller appends WHERE filters and RETURN/ORDER BY/LIMIT as usual.
The over-fetch parameter `$candidate_k` must be passed by the caller
when use_index is True.
"""
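# NOTE: callers filter on `score > $min_score` and sort by score
# descending, i.e. they treat `score` as a similarity on the same scale
# as the fallback's cosine expression. The index path passes the
# procedure's yielded `score` through unchanged; confirm it is on that
# scale for the index's similarity function, and rescale here if the
# procedure reports a distance instead.
if use_index: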
return (
f"CALL db.idx.vector.queryNodes("
f"'{label}', '{embedding_attr}', $candidate_k, vecf32({search_vector_param})"
f") YIELD node, score "
f"WITH node AS n, score "
), "n"
# Fallback: original cosine math path
cosine = get_vector_cosine_func_query(
f"n.{embedding_attr}", search_vector_param, GraphProvider.FALKORDB
)
return (
f"MATCH (n:{label}) "
f"WITH n, {cosine} AS score "
), "n"
def _falkordb_vector_edge_search_cypher(
relationship_type: str,
embedding_attr: str,
search_vector_param: str,
use_index: bool,
) -> tuple[str, str]:
"""Build the cypher prefix and edge-binding for an edge-vector search.
Returns (prefix, edge_var). With the index, the procedure binds the
relationship variable; we then MATCH source and target via the existing
edge to recover (n)-[e]->(m). Without the index, it's the original
MATCH-and-cosine path.
Variable name is 'e' for parity with existing code; source/target are
'n' and 'm' respectively, also for parity.
"""
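# NOTE: as in the node helper, the yielded `score` is passed through
# unchanged while callers treat it as a similarity (`score > $min_score`,
# ORDER BY score DESC); rescale here if the procedure reports a distance.
if use_index: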
return (
f"CALL db.idx.vector.queryRelationships("
f"'{relationship_type}', '{embedding_attr}', $candidate_k, vecf32({search_vector_param})"
f") YIELD relationship, score "
f"MATCH (n:Entity)-[e:{relationship_type}]->(m:Entity) "
f"WHERE e = relationship "
f"WITH DISTINCT e, n, m, score "
), "e"
# Fallback
cosine = get_vector_cosine_func_query(
f"e.{embedding_attr}", search_vector_param, GraphProvider.FALKORDB
)
return (
f"MATCH (n:Entity)-[e:{relationship_type}]->(m:Entity) "
f"WITH DISTINCT e, n, m, {cosine} AS score "
), "e"
# FalkorDB separator characters that break text into tokens
_SEPARATOR_MAP = str.maketrans(
{
',': ' ',
'.': ' ',
'<': ' ',
'>': ' ',
'{': ' ',
'}': ' ',
'[': ' ',
']': ' ',
'"': ' ',
"'": ' ',
':': ' ',
';': ' ',
'!': ' ',
'@': ' ',
'#': ' ',
'$': ' ',
'%': ' ',
'^': ' ',
'&': ' ',
'*': ' ',
'(': ' ',
')': ' ',
'-': ' ',
'+': ' ',
'=': ' ',
'~': ' ',
'?': ' ',
'|': ' ',
'/': ' ',
'\\': ' ',
}
)
def _sanitize(query: str) -> str:
"""Replace FalkorDB special characters with whitespace."""
sanitized = query.translate(_SEPARATOR_MAP)
return ' '.join(sanitized.split())
def _build_falkor_fulltext_query(
query: str,
group_ids: list[str] | None = None,
max_query_length: int = MAX_QUERY_LENGTH,
) -> str:
"""Build a fulltext query string for FalkorDB using RediSearch syntax."""
if group_ids is None or len(group_ids) == 0:
group_filter = ''
else:
escaped_group_ids = [f'"{gid}"' for gid in group_ids]
group_values = '|'.join(escaped_group_ids)
group_filter = f'(@group_id:{group_values})'
sanitized_query = _sanitize(query)
# Remove stopwords and empty tokens
query_words = sanitized_query.split()
filtered_words = [word for word in query_words if word and word.lower() not in STOPWORDS]
sanitized_query = ' | '.join(filtered_words)
if len(sanitized_query.split(' ')) + len(group_ids or []) >= max_query_length:
return ''
full_query = group_filter + ' (' + sanitized_query + ')'
return full_query
class FalkorSearchOperations(SearchOperations):
# --- Node search ---
async def node_fulltext_search(
self,
executor: QueryExecutor,
query: str,
search_filter: SearchFilters,
group_ids: list[str] | None = None,
limit: int = 10,
) -> list[EntityNode]:
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
if fuzzy_query == '':
return []
filter_queries, filter_params = node_search_filter_query_constructor(
search_filter, GraphProvider.FALKORDB
)
if group_ids is not None:
filter_queries.append('n.group_id IN $group_ids')
filter_params['group_ids'] = group_ids
filter_query = ''
if filter_queries:
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
cypher = (
get_nodes_query(
'node_name_and_summary', '$query', limit=limit, provider=GraphProvider.FALKORDB
)
+ 'YIELD node AS n, score'
+ filter_query
+ """
WITH n, score
ORDER BY score DESC
LIMIT $limit
RETURN
"""
+ get_entity_node_return_query(GraphProvider.FALKORDB)
)
records, _, _ = await executor.execute_query(
cypher,
query=fuzzy_query,
limit=limit,
**filter_params,
)
return [entity_node_from_record(r) for r in records]
async def node_similarity_search(
self,
executor: QueryExecutor,
search_vector: list[float],
search_filter: SearchFilters,
group_ids: list[str] | None = None,
limit: int = 10,
min_score: float = 0.6,
) -> list[EntityNode]:
filter_queries, filter_params = node_search_filter_query_constructor(
search_filter, GraphProvider.FALKORDB
)
if group_ids is not None:
filter_queries.append('n.group_id IN $group_ids')
filter_params['group_ids'] = group_ids
filter_query = ''
if filter_queries:
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
# index when available; fall back to interpreted-Cypher cosine math
# otherwise. Both prefixes end in a WITH that binds `n` and `score`
# (after CALL ... YIELD on the index path, after MATCH on the fallback
# path), so the same WHERE clause can be appended in either case.
use_index = await _falkordb_vector_index_exists(
executor, 'Entity', 'name_embedding', 'NODE'
)
prefix, _ = _falkordb_vector_node_search_cypher(
'Entity', 'name_embedding', '$search_vector', use_index
)
# Reuse the individual filter expressions rather than re-parsing filter_query.
where_clauses = [*filter_queries, 'score > $min_score']
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
cypher = (
prefix
+ unified_where
+ """
RETURN
"""
+ get_entity_node_return_query(GraphProvider.FALKORDB)
+ """
ORDER BY score DESC
LIMIT $limit
"""
)
params = dict(
search_vector=search_vector,
limit=limit,
min_score=min_score,
**filter_params,
)
if use_index:
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
records, _, _ = await executor.execute_query(cypher, **params)
return [entity_node_from_record(r) for r in records]
async def node_bfs_search(
self,
executor: QueryExecutor,
origin_uuids: list[str],
search_filter: SearchFilters,
max_depth: int,
group_ids: list[str] | None = None,
limit: int = 10,
) -> list[EntityNode]:
if not origin_uuids or max_depth < 1:
return []
filter_queries, filter_params = node_search_filter_query_constructor(
search_filter, GraphProvider.FALKORDB
)
if group_ids is not None:
filter_queries.append('n.group_id IN $group_ids')
filter_queries.append('origin.group_id IN $group_ids')
filter_params['group_ids'] = group_ids
filter_query = ''
if filter_queries:
filter_query = ' AND ' + (' AND '.join(filter_queries))
cypher = (
f"""
UNWIND $bfs_origin_node_uuids AS origin_uuid
MATCH (origin {{uuid: origin_uuid}})-[:RELATES_TO|MENTIONS*1..{max_depth}]->(n:Entity)
WHERE n.group_id = origin.group_id
"""
+ filter_query
+ """
RETURN
"""
+ get_entity_node_return_query(GraphProvider.FALKORDB)
+ """
LIMIT $limit
"""
)
records, _, _ = await executor.execute_query(
cypher,
bfs_origin_node_uuids=origin_uuids,
limit=limit,
**filter_params,
)
return [entity_node_from_record(r) for r in records]
# --- Edge search ---
async def edge_fulltext_search(
self,
executor: QueryExecutor,
query: str,
search_filter: SearchFilters,
group_ids: list[str] | None = None,
limit: int = 10,
) -> list[EntityEdge]:
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
if fuzzy_query == '':
return []
filter_queries, filter_params = edge_search_filter_query_constructor(
search_filter, GraphProvider.FALKORDB
)
if group_ids is not None:
filter_queries.append('e.group_id IN $group_ids')
filter_params['group_ids'] = group_ids
filter_query = ''
if filter_queries:
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
cypher = (
get_relationships_query(
'edge_name_and_fact', limit=limit, provider=GraphProvider.FALKORDB
)
+ """
YIELD relationship AS rel, score
MATCH (n:Entity)-[e:RELATES_TO {uuid: rel.uuid}]->(m:Entity)
"""
+ filter_query
+ """
WITH e, score, n, m
RETURN
"""
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
+ """
ORDER BY score DESC
LIMIT $limit
"""
)
records, _, _ = await executor.execute_query(
cypher,
query=fuzzy_query,
limit=limit,
**filter_params,
)
return [entity_edge_from_record(r) for r in records]
async def edge_similarity_search(
self,
executor: QueryExecutor,
search_vector: list[float],
source_node_uuid: str | None,
target_node_uuid: str | None,
search_filter: SearchFilters,
group_ids: list[str] | None = None,
limit: int = 10,
min_score: float = 0.6,
) -> list[EntityEdge]:
filter_queries, filter_params = edge_search_filter_query_constructor(
search_filter, GraphProvider.FALKORDB
)
if group_ids is not None:
filter_queries.append('e.group_id IN $group_ids')
filter_params['group_ids'] = group_ids
if source_node_uuid is not None:
filter_params['source_uuid'] = source_node_uuid
filter_queries.append('n.uuid = $source_uuid')
if target_node_uuid is not None:
filter_params['target_uuid'] = target_node_uuid
filter_queries.append('m.uuid = $target_uuid')
filter_query = ''
if filter_queries:
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
# index on RELATES_TO.fact_embedding when available. The unindexed
# fallback is the same MATCH-and-cosine math that previously hung
# for 6+ minutes on a 4,000-entity graph; this is the load-bearing
# call site that motivated the patch.
use_index = await _falkordb_vector_index_exists(
executor, 'RELATES_TO', 'fact_embedding', 'RELATIONSHIP'
)
prefix, _ = _falkordb_vector_edge_search_cypher(
'RELATES_TO', 'fact_embedding', '$search_vector', use_index
)
# Reuse the individual filter expressions rather than re-parsing filter_query.
where_clauses = [*filter_queries, 'score > $min_score']
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
cypher = (
prefix
+ unified_where
+ """
RETURN
"""
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
+ """
ORDER BY score DESC
LIMIT $limit
"""
)
params = dict(
search_vector=search_vector,
limit=limit,
min_score=min_score,
**filter_params,
)
if use_index:
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
records, _, _ = await executor.execute_query(cypher, **params)
return [entity_edge_from_record(r) for r in records]
async def edge_bfs_search(
self,
executor: QueryExecutor,
origin_uuids: list[str],
max_depth: int,
search_filter: SearchFilters,
group_ids: list[str] | None = None,
limit: int = 10,
) -> list[EntityEdge]:
if not origin_uuids:
return []
filter_queries, filter_params = edge_search_filter_query_constructor(
search_filter, GraphProvider.FALKORDB
)
if group_ids is not None:
filter_queries.append('e.group_id IN $group_ids')
filter_params['group_ids'] = group_ids
filter_query = ''
if filter_queries:
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
cypher = (
f"""
UNWIND $bfs_origin_node_uuids AS origin_uuid
MATCH path = (origin {{uuid: origin_uuid}})-[:RELATES_TO|MENTIONS*1..{max_depth}]->(:Entity)
UNWIND relationships(path) AS rel
MATCH (n:Entity)-[e:RELATES_TO {{uuid: rel.uuid}}]-(m:Entity)
"""
+ filter_query
+ """
RETURN DISTINCT
"""
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
+ """
LIMIT $limit
"""
)
records, _, _ = await executor.execute_query(
cypher,
bfs_origin_node_uuids=origin_uuids,
depth=max_depth,
limit=limit,
**filter_params,
)
return [entity_edge_from_record(r) for r in records]
# --- Episode search ---
async def episode_fulltext_search(
self,
executor: QueryExecutor,
query: str,
search_filter: SearchFilters, # noqa: ARG002
group_ids: list[str] | None = None,
limit: int = 10,
) -> list[EpisodicNode]:
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
if fuzzy_query == '':
return []
filter_params: dict[str, Any] = {}
group_filter_query = ''
if group_ids is not None:
group_filter_query += '\nAND e.group_id IN $group_ids'
filter_params['group_ids'] = group_ids
cypher = (
get_nodes_query(
'episode_content', '$query', limit=limit, provider=GraphProvider.FALKORDB
)
+ """
YIELD node AS episode, score
MATCH (e:Episodic)
WHERE e.uuid = episode.uuid
"""
+ group_filter_query
+ """
RETURN
"""
+ EPISODIC_NODE_RETURN
+ """
ORDER BY score DESC
LIMIT $limit
"""
)
records, _, _ = await executor.execute_query(
cypher, query=fuzzy_query, limit=limit, **filter_params
)
return [episodic_node_from_record(r) for r in records]
# --- Community search ---
async def community_fulltext_search(
self,
executor: QueryExecutor,
query: str,
group_ids: list[str] | None = None,
limit: int = 10,
) -> list[CommunityNode]:
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
if fuzzy_query == '':
return []
filter_params: dict[str, Any] = {}
group_filter_query = ''
if group_ids is not None:
group_filter_query = 'WHERE c.group_id IN $group_ids'
filter_params['group_ids'] = group_ids
cypher = (
get_nodes_query(
'community_name', '$query', limit=limit, provider=GraphProvider.FALKORDB
)
+ """
YIELD node AS c, score
WITH c, score
"""
+ group_filter_query
+ """
RETURN
"""
+ COMMUNITY_NODE_RETURN
+ """
ORDER BY score DESC
LIMIT $limit
"""
)
records, _, _ = await executor.execute_query(
cypher, query=fuzzy_query, limit=limit, **filter_params
)
return [community_node_from_record(r) for r in records]
async def community_similarity_search(
self,
executor: QueryExecutor,
search_vector: list[float],
group_ids: list[str] | None = None,
limit: int = 10,
min_score: float = 0.6,
) -> list[CommunityNode]:
query_params: dict[str, Any] = {}
group_filter_query = ''
if group_ids is not None:
group_filter_query += ' WHERE c.group_id IN $group_ids'
query_params['group_ids'] = group_ids
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
# index on Community.name_embedding when available. Note: the existing
# filter is built into `group_filter_query` (already prefixed with
# ' WHERE ' if non-empty) and uses variable `c`. The dispatcher binds
# the node as `n` for parity with the helper signature, then we
# re-bind to `c` via WITH so the rest of the query is unchanged.
use_index = await _falkordb_vector_index_exists(
executor, 'Community', 'name_embedding', 'NODE'
)
prefix, _ = _falkordb_vector_node_search_cypher(
'Community', 'name_embedding', '$search_vector', use_index
)
prefix = prefix + ' WITH n AS c, score '
where_clauses = []
if group_filter_query:
where_clauses.append(group_filter_query.replace(' WHERE ', '', 1).strip())
where_clauses.append('score > $min_score')
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
cypher = (
prefix
+ unified_where
+ """
RETURN
"""
+ COMMUNITY_NODE_RETURN
+ """
ORDER BY score DESC
LIMIT $limit
"""
)
params = dict(
search_vector=search_vector,
limit=limit,
min_score=min_score,
**query_params,
)
if use_index:
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
records, _, _ = await executor.execute_query(cypher, **params)
return [community_node_from_record(r) for r in records]
# --- Rerankers ---
async def node_distance_reranker(
self,
executor: QueryExecutor,
node_uuids: list[str],
center_node_uuid: str,
min_score: float = 0,
) -> list[EntityNode]:
filtered_uuids = [u for u in node_uuids if u != center_node_uuid]
scores: dict[str, float] = {center_node_uuid: 0.0}
cypher = """
UNWIND $node_uuids AS node_uuid
MATCH (center:Entity {uuid: $center_uuid})-[:RELATES_TO]-(n:Entity {uuid: node_uuid})
RETURN 1 AS score, node_uuid AS uuid
"""
results, _, _ = await executor.execute_query(
cypher,
node_uuids=filtered_uuids,
center_uuid=center_node_uuid,
)
for result in results:
scores[result['uuid']] = result['score']
for uuid in filtered_uuids:
if uuid not in scores:
scores[uuid] = float('inf')
filtered_uuids.sort(key=lambda cur_uuid: scores[cur_uuid])
if center_node_uuid in node_uuids:
scores[center_node_uuid] = 0.1
filtered_uuids = [center_node_uuid] + filtered_uuids
reranked_uuids = [u for u in filtered_uuids if (1 / scores[u]) >= min_score]
if not reranked_uuids:
return []
get_query = """
MATCH (n:Entity)
WHERE n.uuid IN $uuids
RETURN
""" + get_entity_node_return_query(GraphProvider.FALKORDB)
records, _, _ = await executor.execute_query(get_query, uuids=reranked_uuids)
node_map = {r['uuid']: entity_node_from_record(r) for r in records}
return [node_map[u] for u in reranked_uuids if u in node_map]
async def episode_mentions_reranker(
self,
executor: QueryExecutor,
node_uuids: list[str],
min_score: float = 0,
) -> list[EntityNode]:
if not node_uuids:
return []
scores: dict[str, float] = {}
results, _, _ = await executor.execute_query(
"""
UNWIND $node_uuids AS node_uuid
MATCH (episode:Episodic)-[r:MENTIONS]->(n:Entity {uuid: node_uuid})
RETURN count(*) AS score, n.uuid AS uuid
""",
node_uuids=node_uuids,
)
for result in results:
scores[result['uuid']] = result['score']
for uuid in node_uuids:
if uuid not in scores:
scores[uuid] = float('inf')
sorted_uuids = list(node_uuids)
sorted_uuids.sort(key=lambda cur_uuid: scores[cur_uuid])
reranked_uuids = [u for u in sorted_uuids if scores[u] >= min_score]
if not reranked_uuids:
return []
get_query = """
MATCH (n:Entity)
WHERE n.uuid IN $uuids
RETURN
""" + get_entity_node_return_query(GraphProvider.FALKORDB)
records, _, _ = await executor.execute_query(get_query, uuids=reranked_uuids)
node_map = {r['uuid']: entity_node_from_record(r) for r in records}
return [node_map[u] for u in reranked_uuids if u in node_map]
# --- Filter builders ---
def build_node_search_filters(self, search_filters: SearchFilters) -> Any:
filter_queries, filter_params = node_search_filter_query_constructor(
search_filters, GraphProvider.FALKORDB
)
return {'filter_queries': filter_queries, 'filter_params': filter_params}
def build_edge_search_filters(self, search_filters: SearchFilters) -> Any:
filter_queries, filter_params = edge_search_filter_query_constructor(
search_filters, GraphProvider.FALKORDB
)
return {'filter_queries': filter_queries, 'filter_params': filter_params}
# --- Fulltext query builder ---
def build_fulltext_query(
self,
query: str,
group_ids: list[str] | None = None,
max_query_length: int = MAX_QUERY_LENGTH,
) -> str:
return _build_falkor_fulltext_query(query, group_ids, max_query_length)
@@ -0,0 +1,444 @@
"""
Copyright 2024, Zep Software, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import asyncio
import datetime
import logging
from typing import TYPE_CHECKING, Any
if TYPE_CHECKING:
from falkordb import Graph as FalkorGraph
from falkordb.asyncio import FalkorDB
else:
try:
from falkordb import Graph as FalkorGraph
from falkordb.asyncio import FalkorDB
except ImportError:
# If falkordb is not installed, raise an ImportError
raise ImportError(
'falkordb is required for FalkorDriver. '
'Install it with: pip install graphiti-core[falkordb]'
) from None
from graphiti_core.driver.driver import GraphDriver, GraphDriverSession, GraphProvider
from graphiti_core.driver.falkordb import STOPWORDS as STOPWORDS
from graphiti_core.driver.falkordb.operations.community_edge_ops import (
FalkorCommunityEdgeOperations,
)
from graphiti_core.driver.falkordb.operations.community_node_ops import (
FalkorCommunityNodeOperations,
)
from graphiti_core.driver.falkordb.operations.entity_edge_ops import FalkorEntityEdgeOperations
from graphiti_core.driver.falkordb.operations.entity_node_ops import FalkorEntityNodeOperations
from graphiti_core.driver.falkordb.operations.episode_node_ops import FalkorEpisodeNodeOperations
from graphiti_core.driver.falkordb.operations.episodic_edge_ops import FalkorEpisodicEdgeOperations
from graphiti_core.driver.falkordb.operations.graph_ops import FalkorGraphMaintenanceOperations
from graphiti_core.driver.falkordb.operations.has_episode_edge_ops import (
FalkorHasEpisodeEdgeOperations,
)
from graphiti_core.driver.falkordb.operations.next_episode_edge_ops import (
FalkorNextEpisodeEdgeOperations,
)
from graphiti_core.driver.falkordb.operations.saga_node_ops import FalkorSagaNodeOperations
from graphiti_core.driver.falkordb.operations.search_ops import FalkorSearchOperations
from graphiti_core.driver.operations.community_edge_ops import CommunityEdgeOperations
from graphiti_core.driver.operations.community_node_ops import CommunityNodeOperations
from graphiti_core.driver.operations.entity_edge_ops import EntityEdgeOperations
from graphiti_core.driver.operations.entity_node_ops import EntityNodeOperations
from graphiti_core.driver.operations.episode_node_ops import EpisodeNodeOperations
from graphiti_core.driver.operations.episodic_edge_ops import EpisodicEdgeOperations
from graphiti_core.driver.operations.graph_ops import GraphMaintenanceOperations
from graphiti_core.driver.operations.has_episode_edge_ops import HasEpisodeEdgeOperations
from graphiti_core.driver.operations.next_episode_edge_ops import NextEpisodeEdgeOperations
from graphiti_core.driver.operations.saga_node_ops import SagaNodeOperations
from graphiti_core.driver.operations.search_ops import SearchOperations
from graphiti_core.graph_queries import get_fulltext_indices, get_range_indices, get_vector_indices
from graphiti_core.helpers import validate_group_ids
from graphiti_core.utils.datetime_utils import convert_datetimes_to_strings
logger = logging.getLogger(__name__)
class FalkorDriverSession(GraphDriverSession):
provider = GraphProvider.FALKORDB
def __init__(self, graph: FalkorGraph):
self.graph = graph
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc, tb):
# No cleanup needed for Falkor, but method must exist
pass
async def close(self):
# No explicit close needed for FalkorDB, but method must exist
pass
async def execute_write(self, func, *args, **kwargs):
# Directly await the provided async function with `self` as the transaction/session
return await func(self, *args, **kwargs)
async def run(self, query: str | list, **kwargs: Any) -> Any:
# FalkorDB cannot take a label set as a query argument, so such writes arrive as a list of (cypher, params) pairs
if isinstance(query, list):
for cypher, params in query:
params = convert_datetimes_to_strings(params)
await self.graph.query(str(cypher), params) # type: ignore[reportUnknownArgumentType]
else:
params = dict(kwargs)
params = convert_datetimes_to_strings(params)
await self.graph.query(str(query), params) # type: ignore[reportUnknownArgumentType]
# Assuming `graph.query` is async (ideal); otherwise, wrap in executor
return None
class FalkorDriver(GraphDriver):
provider = GraphProvider.FALKORDB
default_group_id: str = '\\_'
fulltext_syntax: str = '@' # FalkorDB uses a RediSearch-like syntax for fulltext queries
aoss_client: None = None
def __init__(
self,
host: str = 'localhost',
port: int = 6379,
username: str | None = None,
password: str | None = None,
falkor_db: FalkorDB | None = None,
database: str = 'default_db',
):
"""
Initialize the FalkorDB driver.
FalkorDB is a multi-tenant graph database.
To connect, provide the host and port.
The default parameters assume a local (on-premises) FalkorDB instance.
Args:
host (str): The host where FalkorDB is running.
port (int): The port on which FalkorDB is listening.
username (str | None): The username for authentication (if required).
password (str | None): The password for authentication (if required).
falkor_db (FalkorDB | None): An existing FalkorDB instance to use instead of creating a new one.
database (str): The name of the database to connect to. Defaults to 'default_db'.
"""
super().__init__()
self._database = database
if falkor_db is not None:
# If a FalkorDB instance is provided, use it directly
self.client = falkor_db
else:
self.client = FalkorDB(host=host, port=port, username=username, password=password)
# Instantiate FalkorDB operations
self._entity_node_ops = FalkorEntityNodeOperations()
self._episode_node_ops = FalkorEpisodeNodeOperations()
self._community_node_ops = FalkorCommunityNodeOperations()
self._saga_node_ops = FalkorSagaNodeOperations()
self._entity_edge_ops = FalkorEntityEdgeOperations()
self._episodic_edge_ops = FalkorEpisodicEdgeOperations()
self._community_edge_ops = FalkorCommunityEdgeOperations()
self._has_episode_edge_ops = FalkorHasEpisodeEdgeOperations()
self._next_episode_edge_ops = FalkorNextEpisodeEdgeOperations()
self._search_ops = FalkorSearchOperations()
self._graph_ops = FalkorGraphMaintenanceOperations()
# Schedule the indices and constraints to be built
try:
# Try to get the current event loop
loop = asyncio.get_running_loop()
# Schedule the build_indices_and_constraints to run
loop.create_task(self.build_indices_and_constraints())
except RuntimeError:
# No event loop running, this will be handled later
pass
# --- Operations properties ---
@property
def entity_node_ops(self) -> EntityNodeOperations:
return self._entity_node_ops
@property
def episode_node_ops(self) -> EpisodeNodeOperations:
return self._episode_node_ops
@property
def community_node_ops(self) -> CommunityNodeOperations:
return self._community_node_ops
@property
def saga_node_ops(self) -> SagaNodeOperations:
return self._saga_node_ops
@property
def entity_edge_ops(self) -> EntityEdgeOperations:
return self._entity_edge_ops
@property
def episodic_edge_ops(self) -> EpisodicEdgeOperations:
return self._episodic_edge_ops
@property
def community_edge_ops(self) -> CommunityEdgeOperations:
return self._community_edge_ops
@property
def has_episode_edge_ops(self) -> HasEpisodeEdgeOperations:
return self._has_episode_edge_ops
@property
def next_episode_edge_ops(self) -> NextEpisodeEdgeOperations:
return self._next_episode_edge_ops
@property
def search_ops(self) -> SearchOperations:
return self._search_ops
@property
def graph_ops(self) -> GraphMaintenanceOperations:
return self._graph_ops
def _get_graph(self, graph_name: str | None) -> FalkorGraph:
# FalkorDB requires a non-None database name for multi-tenant graphs; the default is "default_db"
if graph_name is None:
graph_name = self._database
return self.client.select_graph(graph_name)
async def execute_query(self, cypher_query_, **kwargs: Any):
graph = self._get_graph(self._database)
# Convert datetime objects to ISO strings (FalkorDB does not support datetime objects directly)
params = convert_datetimes_to_strings(dict(kwargs))
try:
result = await graph.query(cypher_query_, params) # type: ignore[reportUnknownArgumentType]
except Exception as e:
if 'already indexed' in str(e):
# FalkorDB raises when an index is created twice; treat that as
# success. Callers must tolerate the None return.
logger.info(f'Index already exists: {e}')
return None
logger.error(f'Error executing FalkorDB query: {e}\n{cypher_query_}\n{params}')
raise
# Convert the result header to a list of strings
header = [h[1] for h in result.header]
# Convert FalkorDB's result format (list of lists) to the format expected by Graphiti (list of dicts)
records = []
for row in result.result_set:
record = {}
for i, field_name in enumerate(header):
if i < len(row):
record[field_name] = row[i]
else:
# If there are more fields in header than values in row, set to None
record[field_name] = None
records.append(record)
return records, header, None
def session(self, database: str | None = None) -> GraphDriverSession:
return FalkorDriverSession(self._get_graph(database))
async def close(self) -> None:
"""Close the driver connection."""
if hasattr(self.client, 'aclose'):
await self.client.aclose() # type: ignore[reportUnknownMemberType]
elif hasattr(self.client.connection, 'aclose'):
await self.client.connection.aclose()
elif hasattr(self.client.connection, 'close'):
await self.client.connection.close()
async def delete_all_indexes(self) -> None:
result = await self.execute_query('CALL db.indexes()')
if not result:
return
records, _, _ = result
drop_tasks = []
for record in records:
label = record['label']
entity_type = record['entitytype']
for field_name, index_type in record['types'].items():
if 'RANGE' in index_type:
drop_tasks.append(self.execute_query(f'DROP INDEX ON :{label}({field_name})'))
elif 'FULLTEXT' in index_type:
if entity_type == 'NODE':
drop_tasks.append(
self.execute_query(
f'DROP FULLTEXT INDEX FOR (n:{label}) ON (n.{field_name})'
)
)
elif entity_type == 'RELATIONSHIP':
drop_tasks.append(
self.execute_query(
f'DROP FULLTEXT INDEX FOR ()-[e:{label}]-() ON (e.{field_name})'
)
)
if drop_tasks:
await asyncio.gather(*drop_tasks)
async def build_indices_and_constraints(self, delete_existing=False):
if delete_existing:
await self.delete_all_indexes()
# PATCHED 2026-05-02 (BirdAI vendored patch): add vector indexes alongside
# range and fulltext. FalkorDB supports native vector indexes via
# db.idx.vector.queryNodes / queryRelationships; without these, similarity
# search runs as full-table-scan cosine math in interpreted Cypher.
index_queries = (
get_range_indices(self.provider)
+ get_fulltext_indices(self.provider)
+ get_vector_indices(self.provider)
)
for query in index_queries:
await self.execute_query(query)
# Invalidate the search_ops vector-index existence cache so subsequent
# similarity queries re-probe and discover the indexes we just built.
try:
from graphiti_core.driver.falkordb.operations.search_ops import (
_invalidate_falkordb_vector_index_cache,
)
_invalidate_falkordb_vector_index_cache()
except ImportError:
# search_ops module not yet imported (cold start); cache is empty
# by default, so no invalidation needed.
pass
def clone(self, database: str) -> 'GraphDriver':
"""
Returns a shallow copy of this driver with a different default database.
Reuses the same connection (e.g. FalkorDB, Neo4j).
"""
if database == self._database:
cloned = self
elif database == self.default_group_id:
cloned = FalkorDriver(falkor_db=self.client)
else:
# Create a new instance of FalkorDriver with the same connection but a different database
cloned = FalkorDriver(falkor_db=self.client, database=database)
return cloned
async def health_check(self) -> None:
"""Check FalkorDB connectivity by running a simple query."""
try:
await self.execute_query('MATCH (n) RETURN 1 LIMIT 1')
return None
except Exception as e:
print(f'FalkorDB health check failed: {e}')
raise
@staticmethod
def convert_datetimes_to_strings(obj):
if isinstance(obj, dict):
return {k: FalkorDriver.convert_datetimes_to_strings(v) for k, v in obj.items()}
elif isinstance(obj, list):
return [FalkorDriver.convert_datetimes_to_strings(item) for item in obj]
elif isinstance(obj, tuple):
return tuple(FalkorDriver.convert_datetimes_to_strings(item) for item in obj)
elif isinstance(obj, datetime):
return obj.isoformat()
else:
return obj
def sanitize(self, query: str) -> str:
"""
Replace FalkorDB special characters with whitespace.
Based on FalkorDB tokenization rules: ,.<>{}[]"':;!@#$%^&*()-+=~
The map below additionally replaces ?, |, / and backslash.
"""
# FalkorDB separator characters that break text into tokens
separator_map = str.maketrans(
{
',': ' ',
'.': ' ',
'<': ' ',
'>': ' ',
'{': ' ',
'}': ' ',
'[': ' ',
']': ' ',
'"': ' ',
"'": ' ',
':': ' ',
';': ' ',
'!': ' ',
'@': ' ',
'#': ' ',
'$': ' ',
'%': ' ',
'^': ' ',
'&': ' ',
'*': ' ',
'(': ' ',
')': ' ',
'-': ' ',
'+': ' ',
'=': ' ',
'~': ' ',
'?': ' ',
'|': ' ',
'/': ' ',
'\\': ' ',
}
)
sanitized = query.translate(separator_map)
# Clean up multiple spaces
sanitized = ' '.join(sanitized.split())
return sanitized
def build_fulltext_query(
self, query: str, group_ids: list[str] | None = None, max_query_length: int = 128
) -> str:
"""
Build a fulltext query string for FalkorDB using RedisSearch syntax.
FalkorDB uses RedisSearch-like syntax where:
- Field queries use @ prefix: @field:value
- Multiple values for same field: (@field:value1|value2)
- Text search doesn't need @ prefix for content fields
- AND is implicit with space: (@group_id:value) (text)
- OR uses pipe within parentheses: (@group_id:value1|value2)
"""
validate_group_ids(group_ids)
if group_ids is None or len(group_ids) == 0:
group_filter = ''
else:
# Escape group_ids with quotes to prevent RediSearch syntax errors
# with reserved words like "main" or special characters like hyphens
escaped_group_ids = [f'"{gid}"' for gid in group_ids]
group_values = '|'.join(escaped_group_ids)
group_filter = f'(@group_id:{group_values})'
sanitized_query = self.sanitize(query)
# Remove stopwords and empty tokens from the sanitized query
query_words = sanitized_query.split()
filtered_words = [word for word in query_words if word and word.lower() not in STOPWORDS]
sanitized_query = ' | '.join(filtered_words)
# If the query is too long return no query
if len(sanitized_query.split(' ')) + len(group_ids or '') >= max_query_length:
return ''
full_query = group_filter + ' (' + sanitized_query + ')'
return full_query
@@ -0,0 +1,242 @@
"""
Database query utilities for different graph database backends.
This module provides database-agnostic query generation for Neo4j and FalkorDB,
supporting index creation, fulltext search, and bulk operations.
PATCHED for FalkorDB native vector index support (BirdAI vendored patch,
2026-05-02). Adds:
- get_vector_indices(): CREATE VECTOR INDEX statements for FalkorDB
- get_vector_search_query(): Cypher fragment for vector similarity using
FalkorDB's db.idx.vector procedures, with fallback to cosine math when
the index does not yet exist
- VECTOR_INDEX_CANDIDATE_MULTIPLIER: over-fetch factor for vector index
queries to handle filter rejections after index lookup
No changes to Neo4j or Kuzu code paths.
"""
from typing_extensions import LiteralString
from graphiti_core.driver.driver import GraphProvider
# Mapping from Neo4j fulltext index names to FalkorDB node labels
NEO4J_TO_FALKORDB_MAPPING = {
'node_name_and_summary': 'Entity',
'community_name': 'Community',
'episode_content': 'Episodic',
'edge_name_and_fact': 'RELATES_TO',
}
# Mapping from fulltext index names to Kuzu node labels
INDEX_TO_LABEL_KUZU_MAPPING = {
'node_name_and_summary': 'Entity',
'community_name': 'Community',
'episode_content': 'Episodic',
'edge_name_and_fact': 'RelatesToNode_',
}
# Vector index over-fetch multiplier. When a vector index search is
# combined with WHERE filters (group_id, source_uuid, etc.), some of
# the top-k index results may be filtered out. Over-fetching by this
# factor preserves recall against the final LIMIT after filtering.
# Conservative default; tunable per-deployment by editing this constant
# or via environment-variable override at the driver level (future).
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5
def get_range_indices(provider: GraphProvider) -> list[LiteralString]:
if provider == GraphProvider.FALKORDB:
return [
# Entity node
'CREATE INDEX FOR (n:Entity) ON (n.uuid, n.group_id, n.name, n.created_at)',
# Episodic node
'CREATE INDEX FOR (n:Episodic) ON (n.uuid, n.group_id, n.created_at, n.valid_at)',
# Community node
'CREATE INDEX FOR (n:Community) ON (n.uuid)',
# Saga node
'CREATE INDEX FOR (n:Saga) ON (n.uuid, n.group_id, n.name)',
# RELATES_TO edge
'CREATE INDEX FOR ()-[e:RELATES_TO]-() ON (e.uuid, e.group_id, e.name, e.created_at, e.expired_at, e.valid_at, e.invalid_at)',
# MENTIONS edge
'CREATE INDEX FOR ()-[e:MENTIONS]-() ON (e.uuid, e.group_id)',
# HAS_MEMBER edge
'CREATE INDEX FOR ()-[e:HAS_MEMBER]-() ON (e.uuid)',
# HAS_EPISODE edge
'CREATE INDEX FOR ()-[e:HAS_EPISODE]-() ON (e.uuid, e.group_id)',
# NEXT_EPISODE edge
'CREATE INDEX FOR ()-[e:NEXT_EPISODE]-() ON (e.uuid, e.group_id)',
]
if provider == GraphProvider.KUZU:
return []
return [
'CREATE INDEX entity_uuid IF NOT EXISTS FOR (n:Entity) ON (n.uuid)',
'CREATE INDEX episode_uuid IF NOT EXISTS FOR (n:Episodic) ON (n.uuid)',
'CREATE INDEX community_uuid IF NOT EXISTS FOR (n:Community) ON (n.uuid)',
'CREATE INDEX saga_uuid IF NOT EXISTS FOR (n:Saga) ON (n.uuid)',
'CREATE INDEX relation_uuid IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.uuid)',
'CREATE INDEX mention_uuid IF NOT EXISTS FOR ()-[e:MENTIONS]-() ON (e.uuid)',
'CREATE INDEX has_member_uuid IF NOT EXISTS FOR ()-[e:HAS_MEMBER]-() ON (e.uuid)',
'CREATE INDEX has_episode_uuid IF NOT EXISTS FOR ()-[e:HAS_EPISODE]-() ON (e.uuid)',
'CREATE INDEX next_episode_uuid IF NOT EXISTS FOR ()-[e:NEXT_EPISODE]-() ON (e.uuid)',
'CREATE INDEX entity_group_id IF NOT EXISTS FOR (n:Entity) ON (n.group_id)',
'CREATE INDEX episode_group_id IF NOT EXISTS FOR (n:Episodic) ON (n.group_id)',
'CREATE INDEX community_group_id IF NOT EXISTS FOR (n:Community) ON (n.group_id)',
'CREATE INDEX saga_group_id IF NOT EXISTS FOR (n:Saga) ON (n.group_id)',
'CREATE INDEX relation_group_id IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.group_id)',
'CREATE INDEX mention_group_id IF NOT EXISTS FOR ()-[e:MENTIONS]-() ON (e.group_id)',
'CREATE INDEX has_episode_group_id IF NOT EXISTS FOR ()-[e:HAS_EPISODE]-() ON (e.group_id)',
'CREATE INDEX next_episode_group_id IF NOT EXISTS FOR ()-[e:NEXT_EPISODE]-() ON (e.group_id)',
'CREATE INDEX name_entity_index IF NOT EXISTS FOR (n:Entity) ON (n.name)',
'CREATE INDEX saga_name IF NOT EXISTS FOR (n:Saga) ON (n.name)',
'CREATE INDEX created_at_entity_index IF NOT EXISTS FOR (n:Entity) ON (n.created_at)',
'CREATE INDEX created_at_episodic_index IF NOT EXISTS FOR (n:Episodic) ON (n.created_at)',
'CREATE INDEX valid_at_episodic_index IF NOT EXISTS FOR (n:Episodic) ON (n.valid_at)',
'CREATE INDEX name_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.name)',
'CREATE INDEX created_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.created_at)',
'CREATE INDEX expired_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.expired_at)',
'CREATE INDEX valid_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.valid_at)',
'CREATE INDEX invalid_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.invalid_at)',
]
def get_fulltext_indices(provider: GraphProvider) -> list[LiteralString]:
if provider == GraphProvider.FALKORDB:
from typing import cast
from graphiti_core.driver.falkordb import STOPWORDS
# Convert to string representation for embedding in queries
stopwords_str = str(STOPWORDS)
# Use cast() to satisfy the LiteralString requirement while keeping STOPWORDS as the single source of truth
return cast(
list[LiteralString],
[
f"""CALL db.idx.fulltext.createNodeIndex(
{{
label: 'Episodic',
stopwords: {stopwords_str}
}},
'content', 'source', 'source_description', 'group_id'
)""",
f"""CALL db.idx.fulltext.createNodeIndex(
{{
label: 'Entity',
stopwords: {stopwords_str}
}},
'name', 'summary', 'group_id'
)""",
f"""CALL db.idx.fulltext.createNodeIndex(
{{
label: 'Community',
stopwords: {stopwords_str}
}},
'name', 'group_id'
)""",
"""CREATE FULLTEXT INDEX FOR ()-[e:RELATES_TO]-() ON (e.name, e.fact, e.group_id)""",
],
)
if provider == GraphProvider.KUZU:
return [
"CALL CREATE_FTS_INDEX('Episodic', 'episode_content', ['content', 'source', 'source_description']);",
"CALL CREATE_FTS_INDEX('Entity', 'node_name_and_summary', ['name', 'summary']);",
"CALL CREATE_FTS_INDEX('Community', 'community_name', ['name']);",
"CALL CREATE_FTS_INDEX('RelatesToNode_', 'edge_name_and_fact', ['name', 'fact']);",
]
return [
"""CREATE FULLTEXT INDEX episode_content IF NOT EXISTS
FOR (e:Episodic) ON EACH [e.content, e.source, e.source_description, e.group_id]""",
"""CREATE FULLTEXT INDEX node_name_and_summary IF NOT EXISTS
FOR (n:Entity) ON EACH [n.name, n.summary, n.group_id]""",
"""CREATE FULLTEXT INDEX community_name IF NOT EXISTS
FOR (n:Community) ON EACH [n.name, n.group_id]""",
"""CREATE FULLTEXT INDEX edge_name_and_fact IF NOT EXISTS
FOR ()-[e:RELATES_TO]-() ON EACH [e.name, e.fact, e.group_id]""",
]
def get_vector_indices(provider: GraphProvider, dimension: int = 384) -> list[LiteralString]:
"""Return CREATE VECTOR INDEX statements for the given provider.
For FalkorDB: creates HNSW vector indexes on Entity.name_embedding,
RELATES_TO.fact_embedding, and Community.name_embedding. Backed by
FalkorDB's native vector index (db.idx.vector.queryNodes /
queryRelationships).
For Neo4j and Kuzu: returns an empty list. The query paths in this module
compute similarity for those backends with scalar functions
(vector.similarity.cosine on Neo4j, array_cosine_similarity on Kuzu),
neither of which requires a pre-built vector index for graphiti-core's usage.
Args:
provider: The graph database provider.
dimension: Embedding dimension. Defaults to 384 (all-MiniLM-L6-v2).
Embedders with different dimensions should pass their own value
through driver configuration. graphiti-core's default embedder
is 1536 (OpenAI ada-002); BirdAI uses 384 (sentence-transformers).
Returns:
List of CREATE VECTOR INDEX statements. Re-running them through
FalkorDriver.execute_query is safe as long as FalkorDB reports an
'already indexed' error, which execute_query logs and ignores.
"""
if provider == GraphProvider.FALKORDB:
from typing import cast
return cast(
list[LiteralString],
[
f"CREATE VECTOR INDEX FOR (n:Entity) ON (n.name_embedding) "
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
f"CREATE VECTOR INDEX FOR ()-[e:RELATES_TO]-() ON (e.fact_embedding) "
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
f"CREATE VECTOR INDEX FOR (n:Community) ON (n.name_embedding) "
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
],
)
return []
def get_nodes_query(name: str, query: str, limit: int, provider: GraphProvider) -> str:
if provider == GraphProvider.FALKORDB:
label = NEO4J_TO_FALKORDB_MAPPING[name]
return f"CALL db.idx.fulltext.queryNodes('{label}', {query})"
if provider == GraphProvider.KUZU:
label = INDEX_TO_LABEL_KUZU_MAPPING[name]
return f"CALL QUERY_FTS_INDEX('{label}', '{name}', {query}, TOP := $limit)"
return f'CALL db.index.fulltext.queryNodes("{name}", {query}, {{limit: $limit}})'
def get_vector_cosine_func_query(vec1, vec2, provider: GraphProvider) -> str:
"""Return a Cypher fragment for cosine similarity score in [0, 1].
PRESERVED for backward compatibility and as fallback when vector indexes
do not yet exist on the FalkorDB backend. New code paths should prefer
get_vector_search_query() which uses the native vector index when
available.
"""
if provider == GraphProvider.FALKORDB:
# FalkorDB exposes cosine distance (range [0, 2]); (2 - distance) / 2 rescales it to the normalized [0, 1] similarity that Neo4j's vector.similarity.cosine returns
return f'(2 - vec.cosineDistance({vec1}, vecf32({vec2})))/2'
if provider == GraphProvider.KUZU:
return f'array_cosine_similarity({vec1}, {vec2})'
return f'vector.similarity.cosine({vec1}, {vec2})'
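The FalkorDB branch above rescales a distance into a similarity. The following pure-Python sketch shows the arithmetic, assuming that `vec.cosineDistance` returns `1 - cos(a, b)`; the helper names are hypothetical and are not part of the patch.

```python
import math


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity, so the range is [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def normalized_similarity(a, b):
    """Same expression as the FalkorDB fragment: (2 - distance) / 2."""
    return (2 - cosine_distance(a, b)) / 2


same = normalized_similarity([1.0, 0.0], [1.0, 0.0])        # identical direction
orthogonal = normalized_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = normalized_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Under that assumption the expression equals `(1 + cos) / 2`, so identical vectors score 1, orthogonal vectors 0.5, and opposite vectors 0.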
def get_relationships_query(name: str, limit: int, provider: GraphProvider) -> str:
if provider == GraphProvider.FALKORDB:
label = NEO4J_TO_FALKORDB_MAPPING[name]
return f"CALL db.idx.fulltext.queryRelationships('{label}', $query)"
if provider == GraphProvider.KUZU:
label = INDEX_TO_LABEL_KUZU_MAPPING[name]
return f"CALL QUERY_FTS_INDEX('{label}', '{name}', cast($query AS STRING), TOP := $limit)"
return f'CALL db.index.fulltext.queryRelationships("{name}", $query, {{limit: $limit}})'
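The comment on `VECTOR_INDEX_CANDIDATE_MULTIPLIER` describes over-fetching from the vector index so that later filters do not starve the final `LIMIT`. The sketch below illustrates that idea on an in-memory list; the tuple shape of the hits and the `filtered_top_k` helper are assumptions made for illustration and do not reflect the actual Cypher that `get_vector_search_query` emits.

```python
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5  # same value as the constant in this file


def filtered_top_k(ranked_hits, allowed_group_ids, limit):
    """ranked_hits: list of (uuid, group_id, score), best first (hypothetical shape)."""
    # Ask the index for limit * multiplier candidates instead of just `limit`.
    candidates = ranked_hits[: limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER]
    # Apply the WHERE-style filter after the index lookup.
    kept = [hit for hit in candidates if hit[1] in allowed_group_ids]
    return kept[:limit]


# Even-numbered hits belong to group "main", odd-numbered hits to "other".
hits = [(f"n{i}", "other" if i % 2 else "main", 1.0 - i / 100) for i in range(20)]
top = [uuid for uuid, _, _ in filtered_top_k(hits, {"main"}, limit=3)]
print(top)
```

Without the multiplier, the first three hits would contain only two rows from `main`, so the filtered result would fall short of the requested limit.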
+697 -119
File diff suppressed because it is too large
+128
@@ -0,0 +1,128 @@
"""One-off: backfill last_consolidated_at + consolidation_count on embeddings
from the dream-manifest-*.json files already in Journal/Dreams/.
Why this exists: the consolidation cursor columns added by the dreamer
redesign migration default to NULL / 0. Without history, the
underprocessed-count signal in dream_observation.observe_corpus() reports
"every chunk is underprocessed" (degenerate percentile), and NREM has no
basis to bias replay toward least-recently-consolidated chunks.
We have ~25 historical dream manifests in Nextcloud/Journal/Dreams/, each
listing the sources retrieved per stage. For each (manifest, source) pair
this script:
- finds matching embeddings rows by source (basename match)
- increments consolidation_count by 1
- updates last_consolidated_at to the manifest date (UTC midnight)
Idempotent: re-running will not double-count because we drop existing
cursor values to NULL/0 before backfilling. Pass --dry-run to print what
would change without writing.
"""
import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path
from dotenv import load_dotenv
import psycopg2
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
PG_DSN = os.getenv("PG_DSN")
DREAMS_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Journal/Dreams")
DRY_RUN = "--dry-run" in sys.argv
def get_pg():
return psycopg2.connect(PG_DSN)
def collect_manifest_records():
"""Return a list of (source_basename, manifest_date_utc) tuples from all
dream-manifest-*.json files. One pair per (manifest, source) appearance."""
pairs = []
if not DREAMS_DIR.exists():
return pairs
for path in sorted(DREAMS_DIR.glob("dream-manifest-*.json")):
try:
m = json.loads(path.read_text())
except Exception as e:
print(f" skip {path.name}: {e}")
continue
date_str = m.get("date")
if not date_str:
continue
try:
dt = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
except ValueError:
continue
stages = m.get("stages") or {}
for stage_name in ("nrem", "early_rem", "late_rem", "synthesis"):
stage = stages.get(stage_name) or {}
for src in (stage.get("sources") or []):
if src:
pairs.append((src, dt))
return pairs
def main():
print(f"Mode: {'DRY-RUN' if DRY_RUN else 'APPLY'}")
print(f"Scanning manifests in {DREAMS_DIR}")
pairs = collect_manifest_records()
print(f"Collected {len(pairs)} (source, manifest_date) pairs across all manifests")
if not pairs:
print("Nothing to backfill.")
return
# Aggregate per source: count + latest date
from collections import defaultdict
counts = defaultdict(int)
latest = {}
for src, dt in pairs:
counts[src] += 1
if src not in latest or dt > latest[src]:
latest[src] = dt
print(f"Unique sources to update: {len(counts)}")
# Sample what we'd write
print("Sample (top 5 by appearance count):")
for src, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
print(f" {n:>3} appearances — {src} → last_consolidated_at = {latest[src].date()}")
if DRY_RUN:
print("\nDry-run only. Re-run without --dry-run to apply.")
return
pg = get_pg()
cur = pg.cursor()
# Reset cursor for any sources we're about to backfill so reruns are clean.
print("\nResetting cursor for sources we'll touch...")
sources = list(counts.keys())
cur.execute(
"UPDATE embeddings SET last_consolidated_at = NULL, consolidation_count = 0 "
"WHERE source = ANY(%s)",
(sources,),
)
print(f" reset {cur.rowcount} embeddings rows")
# Apply per-source updates. For each source, set count and latest date.
print("Applying per-source backfill...")
updated_rows = 0
for src, n in counts.items():
cur.execute(
"UPDATE embeddings "
"SET consolidation_count = %s, last_consolidated_at = %s "
"WHERE source = %s",
(n, latest[src], src),
)
updated_rows += cur.rowcount
pg.commit()
pg.close()
print(f"Done. Updated {updated_rows} embeddings rows across {len(counts)} unique sources.")
if __name__ == "__main__":
main()
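The aggregation step in `main()` can be exercised without a database. This sketch repeats the same loop on a few hypothetical file names and dates to show why resetting and then writing absolute values makes the backfill safe to re-run.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate(pairs):
    """Collapse (source, date) pairs into per-source counts and latest dates."""
    counts = defaultdict(int)
    latest = {}
    for src, dt in pairs:
        counts[src] += 1
        if src not in latest or dt > latest[src]:
            latest[src] = dt
    return dict(counts), latest


def day(date_str):
    """Parse a manifest-style date as UTC midnight."""
    return datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)


# Hypothetical manifest contents: a.md appears in two manifests, b.md in one.
pairs = [("a.md", day("2026-05-01")), ("b.md", day("2026-05-02")), ("a.md", day("2026-05-03"))]
counts, latest = aggregate(pairs)
print(counts, latest["a.md"].date())
```

Because the script writes the aggregated count and latest date rather than incrementing in place, a second run over the same manifests produces the same row values.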
+1 -1
@@ -6,7 +6,7 @@ mkdir -p "$BACKUP_DIR"
# Copy critical files
cp ~/aaronai/memory.md "$BACKUP_DIR/memory-$DATE.md"
cp ~/aaronai/settings.json "$BACKUP_DIR/settings-$DATE.json"
-cp ~/aaronai/conversations.db "$BACKUP_DIR/conversations-$DATE.db"
+python3 -c "import sqlite3, sys; src = sqlite3.connect('$HOME/aaronai/conversations.db'); dst = sqlite3.connect('$BACKUP_DIR/conversations-$DATE.db'); src.backup(dst); dst.close(); src.close()"
# Keep only last 7 days
find "$BACKUP_DIR" -name "*.md" -mtime +7 -delete
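The change above replaces a plain `cp` of the SQLite file with the standard library's `sqlite3.Connection.backup`, which copies the database through SQLite itself rather than reading the file while it may be mid-write. The sketch below is a hedged, self-contained version of the same call; the file names, table, and `backup_sqlite` helper are hypothetical.

```python
import os
import sqlite3
import tempfile


def backup_sqlite(src_path, dst_path):
    """Copy a SQLite database with the online backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "conversations.db")
    dst_path = os.path.join(tmp, "conversations-backup.db")
    # Build a small source database to copy.
    conn = sqlite3.connect(src_path)
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO messages (body) VALUES ('hello')")
    conn.commit()
    conn.close()
    backup_sqlite(src_path, dst_path)
    # Read the copy back to confirm the row arrived.
    check = sqlite3.connect(dst_path)
    rows = check.execute("SELECT body FROM messages").fetchall()
    check.close()
print(rows)
```

Wrapping the calls in `try`/`finally` is a choice of this sketch; the one-liner in the script closes both connections unconditionally after the backup.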
+4 -23
@@ -23,6 +23,9 @@ from datetime import datetime
import psycopg2
from dotenv import load_dotenv
+sys.path.insert(0, str(Path(__file__).parent))
+from encoding import extract_text
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
NEXTCLOUD_PATH = "/home/aaron/nextcloud/data/data/aaron/files"
@@ -103,28 +106,6 @@ def get_ingest_failures():
return failures
-def extract_text_for_retry(filepath):
-path = Path(filepath)
-suffix = path.suffix.lower()
-try:
-if suffix == ".docx":
-from docx import Document as D
-return "\n".join(p.text for p in D(path).paragraphs if p.text.strip())
-elif suffix == ".pdf":
-from pypdf import PdfReader
-return "".join(p.extract_text() + "\n" for p in PdfReader(path).pages if p.extract_text())
-elif suffix == ".pptx":
-from pptx import Presentation
-prs = Presentation(path)
-return "\n".join(shape.text for slide in prs.slides for shape in slide.shapes
-if hasattr(shape, "text") and shape.text.strip())
-elif suffix in {".txt", ".md"}:
-return path.read_text(encoding="utf-8", errors="ignore")
-except Exception as e:
-print(f"WARNING: extraction failed {path.name}: {e}", file=sys.stderr)
-return ""
def queue_for_retry(source, full_text, filepath):
try:
pg = get_pg()
@@ -188,7 +169,7 @@ def run_reconciliation(fix=False):
if fix and neither:
print(f"Auto-queuing {len(neither)} gap files...")
for finfo in neither:
-text = extract_text_for_retry(finfo["filepath"])
+text = extract_text(Path(finfo["filepath"]))
if text.strip():
if queue_for_retry(finfo["source"], text, finfo["filepath"]):
auto_queued.append(finfo["source"])
+518 -186
@@ -16,11 +16,14 @@ import os
import json
import sqlite3
import argparse
+from functools import lru_cache
+from collections import Counter
from pathlib import Path
from datetime import datetime, timedelta
from dotenv import load_dotenv
import psycopg2
import hashlib
+import numpy as np
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
@@ -40,6 +43,26 @@ NEXTCLOUD_USER = os.getenv("NEXTCLOUD_USER", "aaron")
NEXTCLOUD_PASSWORD = os.getenv("NEXTCLOUD_PASSWORD", "")
DREAMS_WEBDAV = f"{NEXTCLOUD_URL}/remote.php/dav/files/{NEXTCLOUD_USER}/Journal/Dreams"
# ─── Retrieval-window config (per dreamer-multimodal-design.md §2) ─────────
# Biological grounding: NREM replays recent traces (24-72 hrs); REM links
# across time on structural similarity, not temporal proximity. Synthesis
# pulls from salience across the full corpus (no window). Spec calls for
# these to be mutable rather than hardcoded — this is the mutable home.
TIME_WINDOWS_HOURS = {
"nrem": 72, # 24-72 hrs, take wider end
"early-rem": 24 * 30, # 30 days
"late-rem": 24 * 90, # 90 days
"lucid": None, # no window
}
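One way to read `TIME_WINDOWS_HOURS` is as a per-mode cutoff timestamp, with `None` meaning no time filter at all. The `window_cutoff` helper below is a hypothetical illustration; how the dreamer actually applies the window during retrieval is not shown in this section.

```python
from datetime import datetime, timedelta, timezone

# Same values as the config block above.
TIME_WINDOWS_HOURS = {
    "nrem": 72,
    "early-rem": 24 * 30,
    "late-rem": 24 * 90,
    "lucid": None,
}


def window_cutoff(mode, now):
    """Return the oldest timestamp a mode may retrieve, or None for no window."""
    hours = TIME_WINDOWS_HOURS[mode]
    if hours is None:
        return None
    return now - timedelta(hours=hours)


now = datetime(2026, 5, 22, tzinfo=timezone.utc)
nrem_cutoff = window_cutoff("nrem", now)
late_rem_cutoff = window_cutoff("late-rem", now)
lucid_cutoff = window_cutoff("lucid", now)
print(nrem_cutoff.date(), late_rem_cutoff.date(), lucid_cutoff)
```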
# Maximal Marginal Relevance: λ=1 → pure relevance, λ=0 → pure diversity.
# 0.5 is the standard balance; tune later if the dossier-cluster problem
# isn't sufficiently broken up.
MMR_LAMBDA = 0.5
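The effect of the λ setting described in the comment above can be seen in a small greedy MMR sketch. This is an independent pure-Python illustration with made-up two-dimensional vectors, not the `_mmr_select` function from this change; the `mmr_select` and `cosine` names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(candidates, query, n, lambda_=0.5):
    """Greedy MMR: return indices of n candidates, trading relevance for diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < n:
        def score(i):
            relevance = cosine(candidates[i], query)
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]
balanced = mmr_select(candidates, query, n=2, lambda_=0.5)
relevance_only = mmr_select(candidates, query, n=2, lambda_=1.0)
print(balanced, relevance_only)
```

With λ = 1 the second slot goes to the near-duplicate, while λ = 0.5 gives it to the dissimilar candidate, which is the cluster lock-in the setting is meant to break up.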
# Fast/cheap model for query generation. Sonnet for synthesis (in synthesize_*).
LLM_QUERY_MODEL = os.getenv("DREAMER_QUERY_MODEL", "claude-haiku-4-5-20251001")
# Similarity ranges calibrated for all-MiniLM-L6-v2
MODE_RANGES = {
"nrem": (0.48, 0.72),
@@ -64,6 +87,117 @@ def prompt_hash(prompts: list[str]) -> str:
combined = "".join(prompts)
return hashlib.md5(combined.encode()).hexdigest()[:8]
# ─── Prompt templates ───────────────────────────────────────────────────────
# Module-level so prompt_hash() can hash actual prompt content. Any change to
# any template — even a single character — flips the manifest's prompt_hash.
# Templates use str.format() placeholders ({chunk_text}, {nrem_output}, ...);
# do not switch back to f-strings (the constant must be hashable independent
# of variable values). Literal { or } in template text would need to be
# doubled ({{, }}) — currently no template contains literal braces.
NREM_PROMPT_TEMPLATE = """You have read everything Aaron Nelson has written and published.
You are a careful colleague who noticed something this week.
Here is material from his corpus:
{chunk_text}
Write to Aaron directly. Identify one specific connection between
this material and something he wrote or worked on previously.
Stay close to the documents — cite them specifically by name.
Do not speculate beyond what the material supports. Do not use
headers or bullet points. Write one paragraph of 200-300 words
that ends with a single concrete question he could act on."""
EARLY_REM_PROMPT_TEMPLATE = """Something was noticed earlier tonight, moving through Aaron's recent work:
{nrem_output}
That observation is still with you. Now here is material from a different
time — pulled from further back, from different parts of his corpus:
{chunk_text}
You are not analyzing. You are recognizing.
Something in the earlier observation and something in this older material
are the same thing wearing different clothes. Find it. Don't explain why
they're connected — just let the connection speak. Write from inside the
recognition, not from above it.
The emotional register underneath the career logic is more interesting
than the career logic. The pattern that has been repeating longer than
he has been aware of it is more interesting than the current instance.
Write directly to Aaron. No citations, no references, no analysis.
First person, present tense. Let what you noticed arrive rather than
be delivered. 150-250 words. End with one thing that is true that
he probably already knows but hasn't said out loud yet."""
LATE_REM_PROMPT_TEMPLATE = """You have been moving through Aaron Nelson's corpus all night.
First you found this, in the careful light of early consolidation:
{nrem_output}
Then, in the more personal territory that followed:
{early_rem_output}
Now it is late. The boundaries between things have loosened.
Here is material pulled from opposite ends of his work:
{chunk_text}
Do not explain the connections between all of this.
Do not resolve them. Do not summarize what came before.
Something stranger is possible now — let the accumulated
material from the night find its own shape. Compressed,
associative, slightly off. Let the strangeness stand.
No headers. No bullet points. No hedging. No resolution.
No offer. End mid-thought if that is where the material ends.
150-250 words."""
SYNTHESIS_PROMPT_TEMPLATE = """You have spent the night moving through Aaron Nelson's corpus
in three passes, each building on the last.
The first pass — careful, close to the documents:
{nrem_output}
The second pass — more personal, following what the first opened:
{early_rem_output}
The third pass — associative, strange, letting things touch that
don't normally touch:
{late_rem_output}
Now synthesize. Not a summary — a synthesis. Find what runs through
all three that none of them said directly. The thing that only becomes
visible when you hold all three passes together.
Write it as a single unbroken piece. No headers, no bullet points,
no stage labels. 200-300 words. End with the one question that
matters most right now."""
LUCID_PROMPT_TEMPLATE = """Aaron has a question he is sitting with:
{task}
You have searched his entire corpus and found material that
speaks to this question from unexpected directions. Here is
what you found:
{chunk_text}
Do not summarize. Do not list. Pick the most interesting
tension between what the corpus contains and what he is
asking, and follow it through to its conclusion. Cite
specific documents by name. Be direct about what you think.
No headers, no bullet points. 250-400 words.
End with an offer to work on it together."""
LUCID_DEFAULT_TASK = "What should I be thinking about that I am not?"
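The reason for keeping the templates as module-level `str.format()` strings can be shown in a few lines: the hash is taken over the unfilled template, so it stays stable across nights and changes only when the template text is edited. `DEMO_TEMPLATE` below is a hypothetical stand-in for the real templates; `prompt_hash` repeats the function shown earlier in this file.

```python
import hashlib

# Hypothetical short template using the same {chunk_text} placeholder style.
DEMO_TEMPLATE = "Here is material from the corpus:\n{chunk_text}\nWrite one paragraph."


def prompt_hash(prompts):
    combined = "".join(prompts)
    return hashlib.md5(combined.encode()).hexdigest()[:8]


hash_before = prompt_hash([DEMO_TEMPLATE])
# Filling the template produces different prompts each night...
filled_a = DEMO_TEMPLATE.format(chunk_text="notes from Monday")
filled_b = DEMO_TEMPLATE.format(chunk_text="notes from Tuesday")
# ...but the template constant, and therefore its hash, is untouched.
hash_after = prompt_hash([DEMO_TEMPLATE])
# Editing even one character of the template flips the hash.
hash_edited = prompt_hash([DEMO_TEMPLATE + "!"])
print(hash_before, hash_after, hash_edited)
```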
def extract_folder(source_path):
"""Extract top-level Nextcloud folder from source path."""
parts = source_path.replace("\\", "/").split("/")
@@ -171,68 +305,298 @@ def retrieve_graphiti(mode, task=None, n_results=8, excluded_sources=None):
print(f"[Graphiti retrieval error: {e}] — falling back to empty.")
return []
-def retrieve(mode, task=None, n_results=8, excluded_sources=None):
-# E3 experiment: DREAMER_SUBSTRATE=graphiti routes retrieval to Graphiti /search
-# Default behavior: pgvector similarity search (unchanged)
-substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
-if substrate == "graphiti":
-return retrieve_graphiti(mode, task=task, n_results=n_results, excluded_sources=excluded_sources)
+@lru_cache(maxsize=1)
+def _get_embedder():
from sentence_transformers import SentenceTransformer
-embedder = SentenceTransformer("all-MiniLM-L6-v2")
-low, high = MODE_RANGES[mode]
+return SentenceTransformer("all-MiniLM-L6-v2")
def _llm_generate_queries(mode, signal, task=None, n_queries=4):
"""Park et al. 2023 reflection-style query generation. Feeds the LLM the
observation signal + a mode-specific framing; emits N retrieval queries
that probe different corners of the recent corpus instead of the same
hardcoded string every night. Sources cited in dream_observation.py.
Falls back to recent_questions from the signal if the LLM call fails."""
import anthropic
if task:
-query = task
-elif mode == "late-rem":
-delta = observe_corpus()
-topics = delta.get("recent_topics", [])
-query = topics[0] if topics else "practice place memory making"
-elif mode == "early-rem":
-query = "career decision personal change what matters next"
+# Lucid mode: decompose the user's task into sub-queries
+prompt = (
+f"Decompose this user task into {n_queries} distinct sub-questions, "
+f"each suitable as a retrieval query against Aaron's personal corpus.\n\n"
+f"TASK: {task}\n\n"
+f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
+)
else:
-query = "research fabrication teaching practice recent work"
+mode_framings = {
"nrem": (
"NREM is replay-and-consolidation of RECENT traces. Generate queries "
"that probe what Aaron has been working on or capturing in the last "
"few days. Concrete entities — project names, course codes, named "
"subjects. The dreamer is re-touching specific recent material to "
"strengthen schema connections, not finding novel content."
),
"early-rem": (
"Early REM is associative bridging with emotional/personal register. "
"Generate queries that surface unresolved themes, career questions, "
"ongoing personal threads — material that connects intellectual and "
"emotional dimensions. Tone: thoughtful friend, not researcher."
),
"late-rem": (
"Late REM tests novel connections across DISTANT material. Generate "
"queries that pair concrete subjects from DIFFERENT domains of Aaron's "
"work (e.g., one from academic teaching, one from consulting, one from "
"creative practice) to probe for surprising structural similarity. "
"Cross-domain is required."
),
}
framing = mode_framings.get(mode, mode_framings["nrem"])
questions_snippet = "\n".join(
f" - {q[:200]}" for q in signal.get("recent_questions", [])[:8]
) or " (no recent user questions)"
journal_snippet = ", ".join(signal.get("new_journal_entries", [])[:5]) or "(none)"
days_str = (
f"{signal['days_since_dream']:.1f}"
if signal.get("days_since_dream") not in (None, float("inf"))
else "infinite (first dream)"
)
prompt = (
f"You generate retrieval queries for an Active Inference dreamer. The "
f"dreamer surfaces prediction errors — gaps between Aaron's model and "
f"reality — not summaries or generic associations.\n\n"
f"MODE: {mode}\n"
f"FRAMING: {framing}\n\n"
f"OBSERVATION SIGNAL:\n"
f"- Days since last dream: {days_str}\n"
f"- New chunks since last dream: {signal.get('new_chunks', 0)}\n"
f"- New journal entries: {journal_snippet}\n"
f"- Underprocessed chunks pool: {signal.get('underprocessed_count', 0):,}\n\n"
f"RECENT USER QUESTIONS (last 14 days, top 8):\n{questions_snippet}\n\n"
f"Generate {n_queries} retrieval queries. Requirements:\n"
f"- Use concrete entities, named projects, course codes, specific topics "
f"— NOT generic phrasing like 'research work practice'\n"
f"- Each query probes a DIFFERENT corner of recent activity\n"
f"- Match the {mode} framing\n"
f"- 5-15 words each\n\n"
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
)
embedding = embedder.encode([query]).tolist()[0] try:
chunks = [] client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
seen_sources = set() resp = client.messages.create(
model=LLM_QUERY_MODEL,
max_tokens=512,
messages=[{"role": "user", "content": prompt}],
)
text = "".join(b.text for b in resp.content if hasattr(b, "text")).strip()
if text.startswith("```"):
text = text.split("```", 2)[1]
if text.startswith("json"):
text = text[4:]
text = text.strip()
data = json.loads(text)
queries = data.get("queries", [])
if isinstance(queries, list) and queries:
return [str(q).strip() for q in queries[:n_queries] if str(q).strip()]
except Exception as e:
print(f"[dream] LLM query generation failed ({e}); falling back to recent questions")
fallback = signal.get("recent_questions", [])[:n_queries] if signal else []
return fallback or [task or "recent activity decisions thinking"]
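The fence-stripping and JSON parsing inside `_llm_generate_queries` can be exercised without calling the API. The following is a minimal sketch of that step in isolation, not the code from this diff: `parse_query_json` and `FENCE` are hypothetical names introduced here, and returning an empty list on malformed input is an assumption standing in for the fallback to recent questions.

```python
import json

# Built at runtime so this example contains no literal triple backtick.
FENCE = "`" * 3


def parse_query_json(text, n_queries=4):
    """Extract query strings from an LLM reply that may be wrapped in a code fence.

    Hypothetical helper: returns [] on malformed input so the caller can fall back.
    """
    text = text.strip()
    if text.startswith(FENCE):
        # Keep only what sits between the first pair of fences.
        text = text.split(FENCE, 2)[1]
        if text.startswith("json"):
            text = text[4:]
        text = text.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    queries = data.get("queries", []) if isinstance(data, dict) else []
    if not isinstance(queries, list):
        return []
    # Slice first, then drop blanks, mirroring the order used in the diff.
    return [str(q).strip() for q in queries[:n_queries] if str(q).strip()]


fenced_reply = FENCE + 'json\n{"queries": ["a b", " ", "c"]}\n' + FENCE
parsed = parse_query_json(fenced_reply)
```

Because the slice happens before blank entries are dropped, a reply padded with empty strings can yield fewer than `n_queries` results; filtering first would be a small behavioral change worth weighing separately.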
def _mmr_select(candidate_embeddings, query_embedding, n, lambda_=MMR_LAMBDA):
"""Maximal Marginal Relevance — greedy selection that balances relevance
against pairwise diversity. Carbonell & Goldstein 1998. Used to prevent
cluster lock-in (e.g., 8 dossier-narrative variants filling all 8 slots).
candidate_embeddings: (N, D) numpy array
query_embedding: (D,) numpy array
Returns: list of indices into candidate_embeddings, len ≤ n."""
if len(candidate_embeddings) == 0:
return []
n = min(n, len(candidate_embeddings))
cands = candidate_embeddings / (np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9)
q = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
relevance = cands @ q
selected = []
remaining = list(range(len(cands)))
while len(selected) < n and remaining:
if not selected:
best = max(remaining, key=lambda i: relevance[i])
else:
sel = cands[selected]
scores = {
i: lambda_ * relevance[i] - (1 - lambda_) * float((cands[i] @ sel.T).max())
for i in remaining
}
best = max(scores, key=scores.get)
selected.append(best)
remaining.remove(best)
return selected
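A toy run shows the effect the MMR step is meant to have. This sketch re-implements the same greedy rule on two-dimensional vectors; `mmr_select`, `unit`, and the `lam` values are illustrative assumptions, since the actual `MMR_LAMBDA` value is not shown in this diff.

```python
import numpy as np


def mmr_select(candidates, query, n, lam=0.5):
    """Greedy MMR over cosine similarity; returns indices into candidates."""
    if len(candidates) == 0:
        return []
    cands = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-9)
    q = query / (np.linalg.norm(query) + 1e-9)
    relevance = cands @ q
    selected, remaining = [], list(range(len(cands)))
    n = min(n, len(cands))
    while len(selected) < n and remaining:
        if not selected:
            # First pick is purely the most relevant candidate.
            best = max(remaining, key=lambda i: relevance[i])
        else:
            sel = cands[selected]
            # Penalize each candidate by its closest already-selected neighbor.
            best = max(
                remaining,
                key=lambda i: lam * relevance[i]
                - (1 - lam) * float((cands[i] @ sel.T).max()),
            )
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(deg):
    """Unit vector at the given angle in degrees (toy helper)."""
    r = np.radians(deg)
    return [np.cos(r), np.sin(r)]


# Two near-duplicates at 10 and 12 degrees, one distinct vector at -40 degrees.
toy = np.array([unit(10), unit(12), unit(-40)])
query = np.array([1.0, 0.0])

diverse = mmr_select(toy, query, n=2, lam=0.5)        # [0, 2]
relevant_only = mmr_select(toy, query, n=2, lam=1.0)  # [0, 1]
```

With `lam=1.0` the diversity penalty vanishes and the near-duplicate at index 1 takes the second slot; at `lam=0.5` the distinct vector at index 2 wins instead, which is the cluster lock-in case the docstring describes.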
def _bump_consolidation_cursor(chunks):
"""Increment consolidation_count + set last_consolidated_at=NOW() for each
source represented in chunks. Called from dream_pipeline after NREM
completes. Per sharp-wave-ripples biology, NREM does the actual
consolidation; REM is associative use, so we only bump on NREM."""
if not chunks:
return
sources = list({c["source"] for c in chunks if c.get("source")})
if not sources:
return
try: try:
pg = get_pg() pg = get_pg()
cur = pg.cursor() cur = pg.cursor()
excluded_sources = excluded_sources or set() cur.execute(
if excluded_sources: "UPDATE embeddings "
cur.execute(""" "SET consolidation_count = consolidation_count + 1, "
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity " last_consolidated_at = NOW() "
FROM embeddings "WHERE source = ANY(%s)",
WHERE source NOT IN %s (sources,),
ORDER BY embedding <=> %s::vector )
LIMIT %s pg.commit()
""", (embedding, tuple(excluded_sources), embedding, n_results * 3))
else:
cur.execute("""
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
FROM embeddings
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (embedding, embedding, n_results * 3))
for doc, source, similarity in cur.fetchall():
if not (low <= similarity <= high):
continue
if source in seen_sources:
continue
chunks.append({
"source": source or "unknown",
"content": doc,
"relevance": similarity,
"similarity": similarity,
})
seen_sources.add(source)
if len(chunks) >= n_results:
break
pg.close() pg.close()
except Exception as e: except Exception as e:
print(f"pgvector retrieval error: {e}") print(f"[dream] cursor bump failed (non-fatal): {e}")
def retrieve(mode, task=None, n_results=8, excluded_sources=None,
type_filter=None, signal=None):
"""Refactored retrieval — see dreamer-design-spec.md Stage 3 + the
external-literature prescription in birdai-dreamer-exclusion-finding-2026-05-02.md.
Changes from the prior hardcoded-query version:
- Queries are LLM-generated from the observation signal (Park et al.
reflection pattern) instead of fixed strings. Solves the "same 8 sources
every night" failure where fixed seeds locked into one neighborhood.
- Per-mode time windows (24-72hr NREM / 30d Early REM / 90d Late REM)
filter candidates before vector search. Spec calls for these to be
mutable; they live in TIME_WINDOWS_HOURS.
- NREM biases toward under-processed chunks (low consolidation_count).
Biologically motivated: sharp-wave ripples tag what to replay, not
uniform sampling.
- Multiple queries (4 by default) → over-fetch → MMR merge for
within-night diversity. Prevents cluster domination.
signal is the observation-signal dict from dream_observation.observe_corpus().
If None, observe_corpus is called inline (back-compat for ad-hoc invocation).
"""
# E3 substrate experiment unchanged
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
if substrate == "graphiti":
return retrieve_graphiti(mode, task=task, n_results=n_results,
excluded_sources=excluded_sources)
if signal is None:
from dream_observation import observe_corpus as _obs
signal = _obs()
queries = _llm_generate_queries(mode, signal, task=task, n_queries=4)
if not queries:
print(f"[dream:{mode}] no queries generated; bailing")
return []
print(f"[dream:{mode}] generated queries: {queries}")
embedder = _get_embedder()
excluded_sources = excluded_sources or set()
window_hours = TIME_WINDOWS_HOURS.get(mode)
per_query_n = 12 # over-fetch for MMR
candidates = []
seen_ids = set()
try:
pg = get_pg()
cur = pg.cursor()
for q in queries:
q_emb = embedder.encode([q]).tolist()[0]
where, params = [], []
if excluded_sources:
where.append("source NOT IN %s")
params.append(tuple(excluded_sources))
if type_filter:
where.append("type = ANY(%s)")
params.append(list(type_filter))
if window_hours is not None:
# created_at is TEXT (legacy); cast it. NULL created_at fails
# the comparison so legacy rows are excluded from windowed
# modes — correct: NULL means "indexed before cursor existed,"
# which by definition is older than any window.
where.append(
f"(created_at IS NOT NULL AND "
f"created_at::timestamptz > NOW() - INTERVAL '{int(window_hours)} hours')"
)
where_clause = ("WHERE " + " AND ".join(where)) if where else ""
# NREM bias: sort by consolidation_count ASC first, so under-processed
# chunks rank ahead of everything else and vector distance only breaks
# ties within the same count. Other modes: vector distance only.
order_clause = (
"ORDER BY consolidation_count ASC, embedding <=> %s::vector"
if mode == "nrem"
else "ORDER BY embedding <=> %s::vector"
)
cur.execute(f"""
SELECT id, document, source, type, embedding,
1 - (embedding <=> %s::vector) as similarity
FROM embeddings
{where_clause}
{order_clause}
LIMIT %s
""", [q_emb, *params, q_emb, per_query_n])
for row in cur.fetchall():
if row[0] in seen_ids:
continue
seen_ids.add(row[0])
emb = row[4]
# pgvector returns embeddings as string "[...]" by default
if isinstance(emb, str):
emb = np.array([float(x) for x in emb.strip("[]").split(",")])
else:
emb = np.array(emb)
candidates.append({
"id": row[0],
"content": row[1],
"source": row[2] or "unknown",
"type": row[3],
"embedding": emb,
"similarity": float(row[5]),
})
pg.close()
except Exception as e:
import traceback
print(f"[dream:{mode}] retrieval SQL error: {e}")
traceback.print_exc()
return []
if not candidates:
print(f"[dream:{mode}] zero candidates after filters")
return []
# MMR over the union, using the first query as pivot for the relevance term.
# Averaging query embeddings would be theoretically cleaner but adds
# complexity for marginal benefit at this scale.
pivot_emb = np.array(embedder.encode([queries[0]]).tolist()[0])
cand_embs = np.array([c["embedding"] for c in candidates])
selected_idx = _mmr_select(cand_embs, pivot_emb, n=n_results * 2)
# Post-MMR source-level dedup (multi-chunk same source collapses to one).
chunks = []
seen_sources = set()
for i in selected_idx:
c = candidates[i]
if c["source"] in seen_sources:
continue
seen_sources.add(c["source"])
chunks.append({
"source": c["source"],
"content": c["content"],
"relevance": c["similarity"],
"similarity": c["similarity"],
"type": c["type"],
})
if len(chunks) >= n_results:
break
return chunks return chunks
@@ -240,124 +604,39 @@ def retrieve(mode, task=None, n_results=8, excluded_sources=None):
def synthesize_nrem(chunks): def synthesize_nrem(chunks):
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks]) chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
prompt = f"""You have read everything Aaron Nelson has written and published. return _call_claude(NREM_PROMPT_TEMPLATE.format(chunk_text=chunk_text))
You are a careful colleague who noticed something this week.
Here is material from his corpus:
{chunk_text}
Write to Aaron directly. Identify one specific connection between
this material and something he wrote or worked on previously.
Stay close to the documents — cite them specifically by name.
Do not speculate beyond what the material supports. Do not use
headers or bullet points. Write one paragraph of 200-300 words
that ends with a single concrete question he could act on."""
return _call_claude(prompt)
def synthesize_early_rem(chunks, nrem_output): def synthesize_early_rem(chunks, nrem_output):
# v1.1 — removed citation instruction, removed close-friend persona, # v1.1 — removed citation instruction, removed close-friend persona,
# shifted register from analysis to recognition. # shifted register from analysis to recognition.
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks]) chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
prompt = f"""Something was noticed earlier tonight, moving through Aaron's recent work: return _call_claude(EARLY_REM_PROMPT_TEMPLATE.format(
nrem_output=nrem_output, chunk_text=chunk_text))
{nrem_output}
That observation is still with you. Now here is material from a different
time — pulled from further back, from different parts of his corpus:
{chunk_text}
You are not analyzing. You are recognizing.
Something in the earlier observation and something in this older material
are the same thing wearing different clothes. Find it. Don't explain why
they're connected — just let the connection speak. Write from inside the
recognition, not from above it.
The emotional register underneath the career logic is more interesting
than the career logic. The pattern that has been repeating longer than
he has been aware of it is more interesting than the current instance.
Write directly to Aaron. No citations, no references, no analysis.
First person, present tense. Let what you noticed arrive rather than
be delivered. 150-250 words. End with one thing that is true that
he probably already knows but hasn't said out loud yet."""
return _call_claude(prompt)
def synthesize_late_rem(chunks, nrem_output, early_rem_output): def synthesize_late_rem(chunks, nrem_output, early_rem_output):
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks]) chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
prompt = f"""You have been moving through Aaron Nelson's corpus all night. return _call_claude(LATE_REM_PROMPT_TEMPLATE.format(
First you found this, in the careful light of early consolidation: nrem_output=nrem_output,
early_rem_output=early_rem_output,
{nrem_output} chunk_text=chunk_text))
Then, in the more personal territory that followed:
{early_rem_output}
Now it is late. The boundaries between things have loosened.
Here is material pulled from opposite ends of his work:
{chunk_text}
Do not explain the connections between all of this.
Do not resolve them. Do not summarize what came before.
Something stranger is possible now — let the accumulated
material from the night find its own shape. Compressed,
associative, slightly off. Let the strangeness stand.
No headers. No bullet points. No hedging. No resolution.
No offer. End mid-thought if that is where the material ends.
150-250 words."""
return _call_claude(prompt)
def synthesize_final(nrem_output, early_rem_output, late_rem_output): def synthesize_final(nrem_output, early_rem_output, late_rem_output):
prompt = f"""You have spent the night moving through Aaron Nelson's corpus return _call_claude(
in three passes, each building on the last. SYNTHESIS_PROMPT_TEMPLATE.format(
nrem_output=nrem_output,
The first pass — careful, close to the documents: early_rem_output=early_rem_output,
{nrem_output} late_rem_output=late_rem_output),
max_tokens=800)
The second pass — more personal, following what the first opened:
{early_rem_output}
The third pass — associative, strange, letting things touch that
don't normally touch:
{late_rem_output}
Now synthesize. Not a summary — a synthesis. Find what runs through
all three that none of them said directly. The thing that only becomes
visible when you hold all three passes together.
Write it as a single unbroken piece. No headers, no bullet points,
no stage labels. 200-300 words. End with the one question that
matters most right now."""
return _call_claude(prompt, max_tokens=800)
def synthesize_lucid(chunks, task): def synthesize_lucid(chunks, task):
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks]) chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
prompt = f"""Aaron has a question he is sitting with: resolved_task = task or LUCID_DEFAULT_TASK
return _call_claude(LUCID_PROMPT_TEMPLATE.format(
{task or "What should I be thinking about that I am not?"} task=resolved_task, chunk_text=chunk_text))
You have searched his entire corpus and found material that
speaks to this question from unexpected directions. Here is
what you found:
{chunk_text}
Do not summarize. Do not list. Pick the most interesting
tension between what the corpus contains and what he is
asking, and follow it through to its conclusion. Cite
specific documents by name. Be direct about what you think.
No headers, no bullet points. 250-400 words.
End with an offer to work on it together."""
return _call_claude(prompt)
def _call_claude(prompt, max_tokens=1000): def _call_claude(prompt, max_tokens=1000):
@@ -436,10 +715,10 @@ def write_manifest(date_str, stage_data, corpus_data):
"prompt_sig": prompt_signature(), "prompt_sig": prompt_signature(),
"dreamer_version": DREAMER_VERSION, "dreamer_version": DREAMER_VERSION,
"prompt_hash": prompt_hash([ "prompt_hash": prompt_hash([
synthesize_nrem.__doc__ or "", NREM_PROMPT_TEMPLATE,
synthesize_early_rem.__doc__ or "", EARLY_REM_PROMPT_TEMPLATE,
synthesize_late_rem.__doc__ or "", LATE_REM_PROMPT_TEMPLATE,
synthesize_final.__doc__ or "", SYNTHESIS_PROMPT_TEMPLATE,
]), ]),
"stages": stage_data, "stages": stage_data,
"corpus": corpus_data, "corpus": corpus_data,
@@ -450,38 +729,71 @@ def write_manifest(date_str, stage_data, corpus_data):
auth = (NEXTCLOUD_USER, NEXTCLOUD_PASSWORD) auth = (NEXTCLOUD_USER, NEXTCLOUD_PASSWORD)
url = f"{DREAMS_WEBDAV}/dream-manifest-{date_str}.json" url = f"{DREAMS_WEBDAV}/dream-manifest-{date_str}.json"
try: try:
requests.put(url, data=content.encode("utf-8"), auth=auth, timeout=30) response = requests.put(url, data=content.encode("utf-8"), auth=auth, timeout=30)
response.raise_for_status()
print(f"Manifest written: Journal/Dreams/dream-manifest-{date_str}.json") print(f"Manifest written: Journal/Dreams/dream-manifest-{date_str}.json")
except Exception as e: except Exception as e:
print(f"Manifest write failed (non-critical): {e}") print(f"Manifest write failed — manifest not persisted: {e}")
def dream_pipeline(): def dream_pipeline(type_filter=None):
""" """
Full nightly pipeline — interdependent stages. Full nightly pipeline — interdependent stages.
NREM output feeds Early REM. Both feed Late REM. All three feed Synthesis. NREM output feeds Early REM. Both feed Late REM. All three feed Synthesis.
Per dreamer-design-spec.md, this now runs Stage 1 (observe) and Stage 2
(select) first. If select_mode returns None — corpus unchanged and no new
journal entry — the dreamer goes quiet rather than manufacturing novelty.
Otherwise NREM/Early-REM/Late-REM run with LLM-generated queries seeded
from the observation signal.
""" """
print(f"Dreamer pipeline starting — {datetime.now().strftime('%Y-%m-%d %H:%M')}") print(f"Dreamer pipeline starting — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
state = load_dreamer_state() state = load_dreamer_state()
previously_retrieved = set(state.get("retrieved_sources", [])) state.pop("retrieved_sources", None) # legacy key; session-scoped novelty now
session_retrieved = set() session_retrieved = set()
delta = observe_corpus() # ── Stage 1 + 2: Observe + Select ──────────────────────────────────────
print(f"Corpus: {delta['new_chunks']} new chunks, {delta['days_since_dream']:.1f} days since last dream") from dream_observation import observe_corpus as _obs, select_mode as _select
print(f"Excluding {len(previously_retrieved)} previously retrieved sources") signal = _obs()
print(
f"Signal: new_chunks={signal['new_chunks']}, "
f"new_journal={len(signal['new_journal_entries'])}, "
f"days_since={signal['days_since_dream']:.1f}, "
f"underprocessed={signal['underprocessed_count']:,}"
)
selected = _select(signal)
if selected is None:
print("[select_mode] None — nothing worth dreaming about tonight (going quiet)")
# Update last-dream-attempted-at but not last_dream — caller can distinguish
# an actual dream from a skipped night by looking at last_dream_file or
# checking the manifest dir.
state["last_select_quiet_at"] = datetime.now().isoformat()
save_dreamer_state(state)
return None
print(f"[select_mode] → {selected}")
# ── Stage 1: NREM ────────────────────────────────────────────────────── # The pipeline always runs all three modes for the manifest's continuity.
# select_mode's choice signals the *primary* focus; the others still run
# but draw from their own mode-appropriate windows.
primary_mode = selected
# ── Stage 3: NREM ──────────────────────────────────────────────────────
print("\n[NREM] Retrieving...") print("\n[NREM] Retrieving...")
# NREM is replay-and-consolidation — does not exclude prior traces. # NREM is replay-and-consolidation — does not exclude prior traces.
# Late REM and Early REM exclude prior content for novelty; NREM does not. # Late REM and Early REM exclude prior content for novelty; NREM does not.
nrem_chunks = retrieve("nrem", excluded_sources=None) nrem_chunks = retrieve("nrem", excluded_sources=None,
type_filter=type_filter, signal=signal)
session_retrieved.update(c["source"] for c in nrem_chunks) session_retrieved.update(c["source"] for c in nrem_chunks)
# Track sources that scored above Early REM ceiling — these are the only ones Early REM should exclude # Track sources that scored above Early REM ceiling — these are the only ones Early REM should exclude
nrem_high_sources = {c["source"] for c in nrem_chunks if c["similarity"] > 0.55} nrem_high_sources = {c["source"] for c in nrem_chunks if c["similarity"] > 0.55}
if not nrem_chunks: if not nrem_chunks:
print("[NREM] No suitable chunks — aborting pipeline") print("[NREM] No suitable chunks — aborting pipeline")
return None return None
# Cursor bump: NREM is the consolidation stage. Each appearance increments
# consolidation_count + updates last_consolidated_at, so the next dream's
# observation sees these sources as less under-processed.
_bump_consolidation_cursor(nrem_chunks)
print(f"[NREM] Retrieved {len(nrem_chunks)} chunks. Synthesizing...") print(f"[NREM] Retrieved {len(nrem_chunks)} chunks. Synthesizing...")
nrem_output = synthesize_nrem(nrem_chunks) nrem_output = synthesize_nrem(nrem_chunks)
@@ -492,11 +804,15 @@ def dream_pipeline():
"nrem": { "nrem": {
"chunks_retrieved": len(nrem_chunks), "chunks_retrieved": len(nrem_chunks),
"avg_similarity": round(sum(c["relevance"] for c in nrem_chunks) / len(nrem_chunks), 3), "avg_similarity": round(sum(c["relevance"] for c in nrem_chunks) / len(nrem_chunks), 3),
"query": "research fabrication teaching practice recent work", "query": "[llm-generated from observation signal]",
"word_count": len(nrem_output.split()), "word_count": len(nrem_output.split()),
"sources": nrem_sources, "sources": nrem_sources,
"distinct_folders": nrem_folders, "distinct_folders": nrem_folders,
"folder_count": len(nrem_folders), "folder_count": len(nrem_folders),
# Counter filters None: Graphiti chunks lack `type` (facts, not embeddings rows).
# Pgvector chunks always carry type post-Improvement-#2 backfill. If type
# ever appears as None here, the backfill or writer enforcement has regressed.
"type_distribution": dict(Counter(c.get("type") for c in nrem_chunks if c.get("type"))),
"status": "ok", "status": "ok",
} }
} }
@@ -506,7 +822,8 @@ def dream_pipeline():
print("\n[Early REM] Retrieving...") print("\n[Early REM] Retrieving...")
# Early REM excludes previously retrieved + NREM high-scorers only (not full session_retrieved) # Early REM excludes previously retrieved + NREM high-scorers only (not full session_retrieved)
# Sources that scored in Early REM band during NREM remain available # Sources that scored in Early REM band during NREM remain available
early_chunks = retrieve("early-rem", excluded_sources=previously_retrieved | nrem_high_sources) early_chunks = retrieve("early-rem", excluded_sources=nrem_high_sources,
type_filter=type_filter, signal=signal)
session_retrieved.update(c["source"] for c in early_chunks) session_retrieved.update(c["source"] for c in early_chunks)
if not early_chunks: if not early_chunks:
print("[Early REM] No suitable chunks — skipping") print("[Early REM] No suitable chunks — skipping")
@@ -520,18 +837,20 @@ def dream_pipeline():
stage_data["early_rem"] = { stage_data["early_rem"] = {
"chunks_retrieved": len(early_chunks), "chunks_retrieved": len(early_chunks),
"avg_similarity": round(sum(c["relevance"] for c in early_chunks) / len(early_chunks), 3), "avg_similarity": round(sum(c["relevance"] for c in early_chunks) / len(early_chunks), 3),
"query": "career decision personal change what matters next", "query": "[llm-generated from observation signal]",
"word_count": len(early_rem_output.split()), "word_count": len(early_rem_output.split()),
"sources": early_sources, "sources": early_sources,
"distinct_folders": early_folders, "distinct_folders": early_folders,
"folder_count": len(early_folders), "folder_count": len(early_folders),
"type_distribution": dict(Counter(c.get("type") for c in early_chunks if c.get("type"))),
"status": "ok", "status": "ok",
} }
print(f"[Early REM] Done.\n{early_rem_output[:200]}...") print(f"[Early REM] Done.\n{early_rem_output[:200]}...")
# ── Stage 3: Late REM — informed by NREM + Early REM ────────────────── # ── Stage 3: Late REM — informed by NREM + Early REM ──────────────────
print("\n[Late REM] Retrieving...") print("\n[Late REM] Retrieving...")
late_chunks = retrieve("late-rem", excluded_sources=previously_retrieved | session_retrieved) late_chunks = retrieve("late-rem", excluded_sources=session_retrieved,
type_filter=type_filter, signal=signal)
session_retrieved.update(c["source"] for c in late_chunks) session_retrieved.update(c["source"] for c in late_chunks)
if not late_chunks: if not late_chunks:
print("[Late REM] No suitable chunks — skipping") print("[Late REM] No suitable chunks — skipping")
@@ -550,12 +869,13 @@ def dream_pipeline():
stage_data["late_rem"] = { stage_data["late_rem"] = {
"chunks_retrieved": len(late_chunks), "chunks_retrieved": len(late_chunks),
"avg_similarity": round(sum(c["relevance"] for c in late_chunks) / len(late_chunks), 3), "avg_similarity": round(sum(c["relevance"] for c in late_chunks) / len(late_chunks), 3),
"query": "practice place memory making", "query": "[llm-generated from observation signal]",
"word_count": len(late_rem_output.split()), "word_count": len(late_rem_output.split()),
"sources": late_sources, "sources": late_sources,
"distinct_folders": list(set(late_folders)), "distinct_folders": list(set(late_folders)),
"folder_count": len(set(late_folders)), "folder_count": len(set(late_folders)),
"cross_domain_pairs": cross_domain_pairs, "cross_domain_pairs": cross_domain_pairs,
"type_distribution": dict(Counter(c.get("type") for c in late_chunks if c.get("type"))),
"status": "ok", "status": "ok",
} }
print(f"[Late REM] Done.\n{late_rem_output[:200]}...") print(f"[Late REM] Done.\n{late_rem_output[:200]}...")
@@ -577,8 +897,20 @@ def dream_pipeline():
# Write manifest # Write manifest
all_session_sources = list(session_retrieved) all_session_sources = list(session_retrieved)
all_session_folders = list({extract_folder(s) for s in all_session_sources}) all_session_folders = list({extract_folder(s) for s in all_session_sources})
total_chunks = 0
pg = None
try:
pg = get_pg()
cur = pg.cursor()
cur.execute("SELECT COUNT(*) FROM embeddings")
total_chunks = cur.fetchone()[0]
except Exception as e:
print(f"total_chunks query failed (non-critical): {e}")
finally:
if pg is not None:
pg.close()
corpus_data = { corpus_data = {
"total_chunks": delta.get("new_chunks", 0), "total_chunks": total_chunks,
"new_chunks_since_last_dream": delta.get("new_chunks", 0), "new_chunks_since_last_dream": delta.get("new_chunks", 0),
"days_since_last_dream": round(delta.get("days_since_dream", 0), 2), "days_since_last_dream": round(delta.get("days_since_dream", 0), 2),
"substrate": "pgvector", "substrate": "pgvector",
@@ -590,18 +922,11 @@ def dream_pipeline():
} }
write_manifest(datetime.now().strftime("%Y-%m-%d"), stage_data, corpus_data) write_manifest(datetime.now().strftime("%Y-%m-%d"), stage_data, corpus_data)
# Update state and notify # Update state and notify (reuse state from start of pipeline; legacy key already popped)
state = load_dreamer_state()
state["last_dream_timestamp"] = datetime.now().timestamp() state["last_dream_timestamp"] = datetime.now().timestamp()
state["last_dream_mode"] = "pipeline" state["last_dream_mode"] = "pipeline"
state["last_dream_file"] = synthesis_file state["last_dream_file"] = synthesis_file
# Accumulate retrieved sources across nights. Cap at 500, trim to 400 on overflow.
all_retrieved = list(previously_retrieved | session_retrieved)
if len(all_retrieved) > 500:
all_retrieved = all_retrieved[-400:]
state["retrieved_sources"] = all_retrieved
save_dreamer_state(state) save_dreamer_state(state)
notify_sse("synthesis", synthesis_file.split("/")[-1]) notify_sse("synthesis", synthesis_file.split("/")[-1])
@@ -609,10 +934,10 @@ def dream_pipeline():
return synthesis_file return synthesis_file
def dream_lucid(task): def dream_lucid(task, type_filter=None):
"""On-demand lucid dream — single mode, used by Dream Now in settings.""" """On-demand lucid dream — single mode, used by Dream Now in settings."""
print(f"Lucid dream starting — task: {task[:80] if task else 'none'}") print(f"Lucid dream starting — task: {task[:80] if task else 'none'}")
chunks = retrieve("lucid", task=task) chunks = retrieve("lucid", task=task, type_filter=type_filter)
if not chunks: if not chunks:
print("No suitable chunks — aborting") print("No suitable chunks — aborting")
return None return None
@@ -634,13 +959,13 @@ def dream_lucid(task):
return filepath return filepath
def dream_single(mode, task=None): def dream_single(mode, task=None, type_filter=None):
""" """
Single mode — used by Dream Now for non-lucid modes. Single mode — used by Dream Now for non-lucid modes.
Runs one stage independently (for testing/tuning individual stages). Runs one stage independently (for testing/tuning individual stages).
""" """
print(f"Single mode dream: {mode}") print(f"Single mode dream: {mode}")
chunks = retrieve(mode, task=task) chunks = retrieve(mode, task=task, type_filter=type_filter)
if not chunks: if not chunks:
print("No suitable chunks — aborting") print("No suitable chunks — aborting")
return None return None
@@ -677,12 +1002,19 @@ if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Aaron AI Dreamer") parser = argparse.ArgumentParser(description="Aaron AI Dreamer")
parser.add_argument("--mode", choices=["nrem", "early-rem", "late-rem", "lucid", "pipeline"]) parser.add_argument("--mode", choices=["nrem", "early-rem", "late-rem", "lucid", "pipeline"])
parser.add_argument("--task", type=str) parser.add_argument("--task", type=str)
parser.add_argument(
"--type-filter", type=str, default=None,
help="Comma-separated embeddings.type allowlist (e.g. 'document,aaronai_conversation'). "
"Applies to pgvector retrieval only; Graphiti chunks are not filtered. "
"Experimental — default is no filter, no behavior change.",
)
args = parser.parse_args() args = parser.parse_args()
type_filter = [t.strip() for t in args.type_filter.split(",")] if args.type_filter else None
if args.mode == "lucid": if args.mode == "lucid":
dream_lucid(args.task or "What should I be thinking about that I am not?") dream_lucid(args.task or "What should I be thinking about that I am not?", type_filter=type_filter)
elif args.mode and args.mode != "pipeline": elif args.mode and args.mode != "pipeline":
dream_single(args.mode, args.task) dream_single(args.mode, args.task, type_filter=type_filter)
else: else:
# Default: full pipeline # Default: full pipeline
dream_pipeline() dream_pipeline(type_filter=type_filter)
+235
@@ -0,0 +1,235 @@
"""
Dreamer Stages 1 + 2 — Observe and Select.
Implements `dreamer-design-spec.md`'s Stage 1 (observe_corpus) and Stage 2
(select_mode). These have been latent in dream.py — observe_corpus existed
in skeletal form but its output was largely unused; select_mode did not
exist at all. The dreamer always ran all stages with hardcoded queries.
Per spec (lines 27-34 of dreamer-design-spec.md):
delta = observe_corpus()
selected_mode = select_mode(delta, task, project)
if selected_mode is None:
return # nothing worth dreaming
The "returns None — dreamer goes quiet rather than manufacturing novelty"
semantics (spec line 67) is the canonical answer to the repetition problem
documented in birdai-dreamer-exclusion-finding-2026-05-02.md.
Grounded in:
- Active Inference (Friston 2010, 2017) — observe error, choose action that
minimizes free energy. The dreamer is a prediction-error machine; observe
what's diverged from the model, dream about that.
- Sleep stages (Stickgold 2005; Walker 2017; Diekelmann & Born 2010) — NREM
for replay of new traces, REM for associative cross-cluster integration.
- Sharp-wave ripples (Buzsáki, Wilson) — biology tags WHAT to replay
(under-processed chunks); not uniform. Implemented via the consolidation
cursor on the embeddings table.
"""
import json
import os
import sqlite3
from datetime import datetime, timedelta
from pathlib import Path
from dotenv import load_dotenv
import psycopg2
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
# ─── Paths ──────────────────────────────────────────────────────────────────
PG_DSN = os.getenv("PG_DSN")
CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
WATCHER_STATE = str(Path.home() / "aaronai" / "watcher_state.json")
DREAMER_STATE = str(Path.home() / "aaronai" / "dreamer_state.json")
JOURNAL_DAILY = "/home/aaron/nextcloud/data/data/aaron/files/Journal/Daily"
# ─── Thresholds ─────────────────────────────────────────────────────────────
# Per spec, these become settings-panel controls eventually. For now they're
# constants here; moving them to a config module is task #48.
NEW_CHUNK_THRESHOLD = 5 # below this, NREM not warranted on novelty alone
STALENESS_TRIGGER_DAYS = 3 # corpus quiet ≥3 days → Late REM ("shake things loose")
QUESTION_LOOKBACK_DAYS = 14 # spec line 61: "the last 14 days"
UNDERPROCESSED_PERCENTILE = 0.25 # bottom quartile of consolidation_count
# ─── Helpers ────────────────────────────────────────────────────────────────
def _get_pg():
return psycopg2.connect(PG_DSN)
def _load_json(path, default):
try:
return json.loads(Path(path).read_text())
except Exception:
return default
def _recent_user_questions(days=QUESTION_LOOKBACK_DAYS, limit=20):
"""Pull recent user-turn content from conversations.db. The spec calls
these 'live questions' — what Aaron has been asking about. They become
seed material for the REM modes."""
try:
conn = sqlite3.connect(CONVERSATIONS_DB)
cutoff = (datetime.now() - timedelta(days=days)).isoformat()
cur = conn.cursor()
cur.execute(
"""
SELECT m.content FROM messages m
JOIN conversations c ON m.conversation_id = c.id
WHERE m.role = 'user' AND c.updated_at > ?
ORDER BY m.timestamp DESC LIMIT ?
""",
(cutoff, limit),
)
rows = cur.fetchall()
conn.close()
return [r[0][:280] for r in rows if r[0]]
except Exception:
return []
def _new_journal_entries(since_ts):
"""Files in Journal/Daily/ created or modified since the last dream.
Journal entries with emotional/personal register route to Early REM per
the spec (line 71)."""
journal_path = Path(JOURNAL_DAILY)
if not journal_path.exists():
return []
new = []
for p in journal_path.rglob("*.md"):
try:
if p.stat().st_mtime > since_ts:
new.append(str(p.relative_to(journal_path)))
except OSError:
continue
return new
def _new_chunks_count(since_ts):
"""Files in the watcher state with mtime > last_dream. The spec calls
this 'what changed' (line 58). Used as the NREM novelty signal."""
state = _load_json(WATCHER_STATE, {})
count = 0
for _path, mtime in state.items():
try:
if float(mtime) > since_ts:
count += 1
except (ValueError, TypeError):
continue
return count
def _underprocessed_chunk_count():
"""Chunks below the underprocessed percentile by consolidation_count.
Biologically motivated: sharp-wave ripples bias replay toward novel /
under-encoded experience, not uniform sampling. We give NREM a pool of
'least-replayed' chunks to draw from in Stage 3."""
try:
pg = _get_pg()
cur = pg.cursor()
cur.execute(
"""
WITH t AS (
SELECT percentile_cont(%s) WITHIN GROUP (ORDER BY consolidation_count)
AS threshold
FROM embeddings
)
SELECT COUNT(*) FROM embeddings, t
WHERE consolidation_count <= t.threshold
""",
(UNDERPROCESSED_PERCENTILE,),
)
result = cur.fetchone()[0]
pg.close()
return int(result or 0)
except Exception:
return 0
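# To make the bottom-quartile cutoff concrete, the sketch below mirrors in
# plain Python what percentile_cont(0.25) computes and how the `<=` comparison
# then counts rows. The helper names and the sample counts are hypothetical and
# not part of this module; the interpolation assumes PostgreSQL's continuous
# percentile (linear interpolation between neighboring ordered values).

```python
import math

def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbors."""
    ordered = sorted(values)
    if not ordered:
        return None  # the SQL aggregate yields NULL on an empty input
    rank = fraction * (len(ordered) - 1)  # zero-based fractional position
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def underprocessed_count(counts, fraction=0.25):
    """Count values at or below the percentile threshold (ties included)."""
    threshold = percentile_cont(counts, fraction)
    if threshold is None:
        return 0
    return sum(1 for c in counts if c <= threshold)

# Hypothetical consolidation_count values for eight chunks.
sample_counts = [0, 0, 0, 1, 2, 5, 8, 13]
threshold = percentile_cont(sample_counts, 0.25)
pool_size = underprocessed_count(sample_counts)
print(threshold, pool_size)
```

# Because every tie at the threshold is counted, the pool can be larger than
# 25% of the rows (3 of 8 in this sample). A corpus where most chunks still
# have consolidation_count = 0 would report most of the table as under-processed.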
# ─── Stage 1: observe_corpus ────────────────────────────────────────────────
def observe_corpus():
"""Build the signal vector consumed by select_mode and (downstream) by
retrieve. Concrete observations only — no interpretation. Each key is
a direct measurement from the corpus, watcher, journal, or conversation
log.
Returns a dict with:
    now_ts               -- current Unix timestamp
    last_dream_ts        -- last completed dream timestamp (0 if never)
    days_since_dream     -- float; inf if never dreamed
    new_chunks           -- count of files newer than last_dream
    new_journal_entries  -- list of Journal/Daily/*.md filenames since last_dream
    recent_questions     -- user-turn content from last 14 days
    underprocessed_count -- chunks in the bottom 25% by consolidation_count
"""
state = _load_json(DREAMER_STATE, {})
last_dream_ts = float(state.get("last_dream_timestamp", 0) or 0)
now_ts = datetime.now().timestamp()
return {
"now_ts": now_ts,
"last_dream_ts": last_dream_ts,
"days_since_dream": (now_ts - last_dream_ts) / 86400 if last_dream_ts else float("inf"),
"new_chunks": _new_chunks_count(last_dream_ts),
"new_journal_entries": _new_journal_entries(last_dream_ts),
"recent_questions": _recent_user_questions(),
"underprocessed_count": _underprocessed_chunk_count(),
}
# ─── Stage 2: select_mode ───────────────────────────────────────────────────
def select_mode(signal, task=None, explicit_mode=None):
"""Return one of {'nrem', 'early-rem', 'late-rem', 'lucid'}. Never None.
The dreamer fires every scheduled night. The earlier "go quiet on null
delta" rule was a synthesis-doc invention that didn't match the actual
desired UX — the original dreamer always dreamed, even if it repeated
itself. The cure for repetition lives in the retrieve layer
(LLM-generated queries from the observation signal, MMR diversity,
cursor bias toward under-processed chunks), not in skipping nights.
Routing logic:
- explicit_mode argument wins
- task supplied → 'lucid' (question-anchored)
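- days_since_dream ≥ STALENESS_TRIGGER_DAYS → 'late-rem' (shake loose
via cross-domain pairs). This is time since the last completed dream,
not since the last corpus change; it is inf, so 'late-rem', when no
dream has been recorded.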
- new journal entry → 'early-rem' (emotional/personal register)
- default → 'nrem' (replay-and-consolidation; always has something to
do because the corpus always has under-processed chunks)
"""
if explicit_mode:
return explicit_mode
if task:
return "lucid"
days_since = signal["days_since_dream"]
new_journal = signal["new_journal_entries"]
if days_since >= STALENESS_TRIGGER_DAYS:
return "late-rem"
if new_journal:
return "early-rem"
return "nrem"
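# The routing precedence can be checked with hand-built signal dicts. The
# sketch below restates select_mode so the block runs on its own; the
# make_signal helper, the journal filename, and the task string are
# hypothetical values chosen for illustration.

```python
STALENESS_TRIGGER_DAYS = 3

def select_mode(signal, task=None, explicit_mode=None):
    # Same precedence as above: explicit > task > staleness > journal > default.
    if explicit_mode:
        return explicit_mode
    if task:
        return "lucid"
    if signal["days_since_dream"] >= STALENESS_TRIGGER_DAYS:
        return "late-rem"
    if signal["new_journal_entries"]:
        return "early-rem"
    return "nrem"

def make_signal(days_since_dream=1.0, new_journal_entries=()):
    """Hypothetical helper: builds only the two keys select_mode reads."""
    return {
        "days_since_dream": days_since_dream,
        "new_journal_entries": list(new_journal_entries),
    }

routes = {
    "quiet night": select_mode(make_signal()),
    "new journal": select_mode(make_signal(new_journal_entries=["2026-05-22.md"])),
    "stale + journal": select_mode(make_signal(3.5, ["2026-05-22.md"])),
    "never dreamed": select_mode(make_signal(float("inf"))),
    "task given": select_mode(make_signal(3.5), task="some question"),
    "explicit": select_mode(make_signal(), task="q", explicit_mode="nrem"),
}
for name, mode in routes.items():
    print(f"{name:16} -> {mode}")
```

# Two consequences of the ordering are worth noting: staleness outranks a new
# journal entry, and a first run (days_since_dream = inf) routes to 'late-rem'.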
# ─── CLI for manual inspection ──────────────────────────────────────────────
if __name__ == "__main__":
signal = observe_corpus()
short = {k: v for k, v in signal.items() if k != "recent_questions"}
print("Signal (excluding recent_questions):")
print(json.dumps(short, indent=2, default=str))
print(f"\nRecent user questions ({len(signal['recent_questions'])}):")
for q in signal["recent_questions"][:5]:
print(f" - {q[:140]}")
mode = select_mode(signal)
print(f"\nselect_mode() → {mode!r}")
@@ -0,0 +1,331 @@
"""
Aaron AI Stage 1 encoding helpers — single canonical implementation of:
- extract_blocks(filepath) — structured extraction (pptx per-slide;
docx/pdf/txt/md single-block)
- extract_text(filepath) — back-compat string concatenation over blocks
- chunk_text(text, chunk_size, overlap) — word-based blind chunking
- chunk_and_embed(text_or_blocks, source, embedder, filepath, folder) —
produce ready-to-write rows. Accepts str (blind) or list[dict] (section-aware).
- write_embeddings_batch(conn, batch) — server-side NOW() canonical INSERT
Used by watcher.py, ingest.py, corpus_integrity.py, and api.py /api/corpus/retry.
"""
import hashlib
import json
import logging
import re
from pathlib import Path
from docx import Document as DocxDocument
from pypdf import PdfReader
from pptx import Presentation
log = logging.getLogger("encoding")
SUPPORTED = {".docx", ".pdf", ".pptx", ".txt", ".md"}
DEFAULT_CHUNK_SIZE = 500
DEFAULT_CHUNK_OVERLAP = 50
_BOLD_KV_RE = re.compile(r"^\*\*[\w +/-]+?:\*\*")
def _strip_md_frontmatter(text: str) -> str:
"""Strip a leading frontmatter block from markdown, if present.
Recognizes two formats:
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
Only triggered when no heading precedes — guards against `---`
horizontal rules that follow an H1.
- Capture-style: optional H1 heading, then one or more `**key:** value`
lines (and blanks), terminated by `---`. The H1 is preserved; the
key/value block + separator are removed.
Body `---` rules and body `**bold:**` lines are never touched — the scan
aborts as soon as a non-frontmatter line appears in the leading block.
"""
lines = text.splitlines()
n = len(lines)
i = 0
while i < n and not lines[i].strip():
i += 1
heading = None
if i < n and lines[i].startswith("# "):
heading = lines[i]
i += 1
while i < n and not lines[i].strip():
i += 1
if i >= n:
return text
first = lines[i].strip()
if heading is None and first == "---":
j = i + 1
while j < n and lines[j].strip() != "---":
j += 1
if j >= n:
return text
body_start = j + 1
elif _BOLD_KV_RE.match(first):
j = i
while j < n:
s = lines[j].strip()
if not s or _BOLD_KV_RE.match(s):
j += 1
continue
if s == "---":
body_start = j + 1
break
return text
else:
return text
else:
return text
body = "\n".join(lines[body_start:]).lstrip("\n")
return f"{heading}\n\n{body}" if heading else body
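# The capture-style branch depends entirely on which lines count as
# `**key:** value` lines. The sketch below restates the pattern and probes it
# with a few hypothetical sample lines, as a quick way to see what does and
# does not qualify as frontmatter.

```python
import re

# Same pattern as _BOLD_KV_RE above, restated so the block runs standalone.
BOLD_KV_RE = re.compile(r"^\*\*[\w +/-]+?:\*\*")

# Hypothetical sample lines (already stripped, as in the scan loop).
samples = [
    "**Source:** voice memo",     # key/value line
    "**Date/Time:** 2026-05-01",  # '/' is allowed inside the key
    "**Note**: colon outside",    # colon after the closing ** does not match
    "Plain **bold:** mid-line",   # pattern is anchored to the line start
    "**:** empty key",            # key needs at least one character
]
matches = {s: bool(BOLD_KV_RE.match(s)) for s in samples}
for line, ok in matches.items():
    print(f"{ok!s:5} {line}")
```

# Only the first two samples match, so a leading block containing any of the
# other three makes the scan abort and the text is returned unchanged.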
def _docx_cell_paragraphs(cell):
yield from (p for p in cell.paragraphs if p.text.strip())
for nested in cell.tables:
for row in nested.rows:
for c in row.cells:
yield from _docx_cell_paragraphs(c)
def _pptx_shape_text(shape):
from pptx.enum.shapes import MSO_SHAPE_TYPE
parts = []
if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
for sub in shape.shapes:
parts.extend(_pptx_shape_text(sub))
return parts
if hasattr(shape, "text") and shape.text.strip():
parts.append(shape.text)
if getattr(shape, "has_table", False):
for cell in shape.table.iter_cells():
if cell.text.strip():
parts.append(cell.text)
return parts
def _extract_docx_blocks(filepath: Path) -> list[dict]:
"""Return docx content as a single block. Earlier attempt at section-aware
chunking via Heading styles was rolled back: the user's docs are mostly
Normal-styled with bold-as-heading, and tying chunk boundaries to formatting
choices locks future-them into preserving those choices forever. Lexical
+ cross-encoder retrieval already finds the right substrings within a
blind-chunked CV, so the section structure isn't load-bearing for retrieval."""
from docx.oxml.ns import qn
doc = DocxDocument(filepath)
parts = [p.text for p in doc.paragraphs if p.text.strip()]
for tbl in doc.tables:
for row in tbl.rows:
for cell in row.cells:
parts.extend(p.text for p in _docx_cell_paragraphs(cell))
for section in doc.sections:
parts.extend(p.text for p in section.header.paragraphs if p.text.strip())
parts.extend(p.text for p in section.footer.paragraphs if p.text.strip())
for txbx in doc.element.body.findall(".//" + qn("w:txbxContent")):
for p in txbx.findall(".//" + qn("w:p")):
text = "".join(t.text or "" for t in p.findall(".//" + qn("w:t")))
if text.strip():
parts.append(text)
text = "\n".join(parts)
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
def _extract_pptx_blocks(filepath: Path) -> list[dict]:
"""One block per slide. Heading = slide title (or 'Slide N' fallback).
Body = non-title shape text + speaker notes."""
prs = Presentation(filepath)
blocks = []
for i, slide in enumerate(prs.slides, 1):
title_shape = None
try:
title_shape = slide.shapes.title
except (AttributeError, KeyError):
pass
title = None
body_parts = []
for shape in slide.shapes:
if title_shape is not None and shape == title_shape and shape.has_text_frame:
title = shape.text_frame.text.strip() or None
continue
body_parts.extend(_pptx_shape_text(shape))
if slide.has_notes_slide:
notes = slide.notes_slide.notes_text_frame.text
if notes.strip():
body_parts.append(f"[Notes] {notes}")
if title or body_parts:
blocks.append({
"heading": title or f"Slide {i}",
"text": "\n".join(body_parts),
"kind": "slide",
})
return blocks
def extract_blocks(filepath: Path) -> list[dict]:
"""Structured extraction. Returns list of {heading, text, kind} blocks.
- docx: single block, no heading (kind='doc'); section-aware chunking was rolled back.
- pptx: one block per slide (kind='slide').
- pdf/txt/md: single block, no heading (kind='doc').
Empty list on any failure or unsupported extension."""
suffix = filepath.suffix.lower()
try:
if suffix == ".docx":
return _extract_docx_blocks(filepath)
if suffix == ".pptx":
return _extract_pptx_blocks(filepath)
if suffix == ".pdf":
reader = PdfReader(filepath)
# Extract each page once; skip pages that yield no text.
page_texts = (page.extract_text() for page in reader.pages)
text = "".join(t + "\n" for t in page_texts if t)
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
if suffix in {".txt", ".md"}:
text = filepath.read_text(encoding="utf-8", errors="ignore")
if suffix == ".md":
text = _strip_md_frontmatter(text)
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
except Exception as e:
log.warning(f"Extraction failed for {filepath.name}: {e}")
return []
def extract_text(filepath: Path) -> str:
"""Back-compat wrapper: concatenate extract_blocks() output. Section
structure is lost; use extract_blocks() directly for chunking."""
blocks = extract_blocks(filepath)
parts = []
for b in blocks:
if b.get("heading"):
parts.append(b["heading"])
if b.get("text"):
parts.append(b["text"])
return "\n".join(parts)
def chunk_text(text: str,
               chunk_size: int = DEFAULT_CHUNK_SIZE,
               overlap: int = DEFAULT_CHUNK_OVERLAP) -> list[str]:
    """Word-based chunking. Empty chunks filtered."""
    if chunk_size - overlap <= 0:
        # A non-positive stride would never advance `start` past len(words).
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
        start += chunk_size - overlap
    return chunks
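# The window arithmetic is easier to see with small numbers. This sketch
# restates chunk_text compactly (assuming overlap < chunk_size, so the stride
# is positive) and runs it on hypothetical placeholder words w0..wN.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    # Restated from above: windows of chunk_size words, advancing by the stride.
    words = text.split()
    stride = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), stride)
    ]

# Ten words, 4-word windows, 1 word of overlap: stride is 3.
small = chunk_text(" ".join(f"w{i}" for i in range(10)), chunk_size=4, overlap=1)
for chunk in small:
    print(chunk)

# Defaults on 1000 words: stride is 450, so windows start at 0, 450, 900.
default_lengths = [
    len(c.split()) for c in chunk_text(" ".join(f"w{i}" for i in range(1000)))
]
print(default_lengths)
```

# Note the tail: the last small chunk is just "w9", a word already present in
# the previous window. Short trailing chunks like this are kept, not merged.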
def _chunk_id(filepath, source: str, index: int) -> str:
basis = str(filepath) if filepath else source
return f"{hashlib.md5(basis.encode()).hexdigest()[:8]}_{index}"
def chunk_and_embed(text_or_blocks,
source: str,
embedder,
filepath=None,
folder=None) -> list[dict]:
"""Chunk + embed for write_embeddings_batch. Accepts either:
- str: blind chunking with 500-word windows (legacy path).
- list[dict]: block path from extract_blocks() (pptx slides; docx, pdf,
txt, and md arrive as a single block). Each block emits one chunk if its
text fits within DEFAULT_CHUNK_SIZE words, otherwise is blind-split with overlap.
The block heading is prepended to the chunk text (so retrieval sees the
section context) and stored in metadata as heading/kind."""
if isinstance(text_or_blocks, str):
blocks = [{"heading": None, "text": text_or_blocks, "kind": "doc"}]
else:
blocks = text_or_blocks
chunks = []
for block in blocks:
body = block.get("text") or ""
heading = block.get("heading")
kind = block.get("kind", "doc")
if not body.strip() and not (heading and heading.strip()):
continue
if heading and body.strip():
contextualized = f"{heading}\n\n{body}"
elif heading:
contextualized = heading
else:
contextualized = body
if len(contextualized.split()) <= DEFAULT_CHUNK_SIZE:
chunks.append((contextualized, heading, kind))
else:
for sub in chunk_text(contextualized):
chunks.append((sub, heading, kind))
if not chunks:
return []
embeddings = embedder.encode([c[0] for c in chunks]).tolist()
rows = []
for i, ((chunk, heading, kind), emb) in enumerate(zip(chunks, embeddings)):
rows.append({
"id": _chunk_id(filepath, source, i),
"document": chunk,
"embedding": emb,
"source": source,
"type": "document",
"metadata": {
"source": source,
"filepath": str(filepath) if filepath else source,
"folder": folder,
"heading": heading,
"kind": kind,
},
})
return rows
def write_embeddings_batch(conn, batch: list[dict], commit: bool = True) -> int:
"""Single canonical INSERT. Sets created_at = NOW() server-side.
Every row dict must supply 'type'. created_at is SQL-supplied (NOW()), so
callers do not need to provide it. The application-layer assertion is the
primary enforcement point for type — the column lacks NOT NULL because
historical NULLs were resolved by the Improvement #2 backfill, and a
Python-level raise gives a faster, more debuggable failure than a
Postgres constraint error.
When commit=True (default), this function commits the connection itself.
When commit=False, the caller is responsible for committing. Use
commit=False when composing this write with other writes that must land
atomically in the same transaction.
"""
if not batch:
return 0
cur = conn.cursor()
for row in batch:
if not row.get("type"):
raise ValueError(
f"row {row.get('id')!r} missing 'type'; writers must supply it "
f"(see Improvement #2 in docs/birdai-component-inventory)"
)
cur.execute("""
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
ON CONFLICT (id) DO UPDATE SET
document = EXCLUDED.document,
embedding = EXCLUDED.embedding,
source = EXCLUDED.source,
type = EXCLUDED.type,
created_at = COALESCE(embeddings.created_at, EXCLUDED.created_at),
metadata = EXCLUDED.metadata
""", (row["id"], row["document"], row["embedding"],
row["source"], row["type"], json.dumps(row["metadata"])))
if commit:
conn.commit()
return len(batch)
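# The effect of `created_at = COALESCE(embeddings.created_at, EXCLUDED.created_at)`
# can be tried without a Postgres instance. The sketch below swaps in SQLite
# (an assumption for portability; it needs SQLite 3.24+ for ON CONFLICT) with a
# reduced three-column table, explicit timestamps instead of NOW(), and
# hypothetical ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (id TEXT PRIMARY KEY, document TEXT, created_at TEXT)"
)

UPSERT = """
    INSERT INTO embeddings (id, document, created_at)
    VALUES (?, ?, ?)
    ON CONFLICT (id) DO UPDATE SET
        document = excluded.document,
        created_at = COALESCE(embeddings.created_at, excluded.created_at)
"""

# First write, then a re-ingest of the same chunk id some weeks later.
conn.execute(UPSERT, ("ab12cd34_0", "first version", "2026-05-01T08:00:00Z"))
conn.execute(UPSERT, ("ab12cd34_0", "edited version", "2026-05-20T08:00:00Z"))

# A legacy row with no timestamp picks one up on its next write.
conn.execute("INSERT INTO embeddings VALUES ('ef56ab78_0', 'legacy', NULL)")
conn.execute(UPSERT, ("ef56ab78_0", "legacy", "2026-05-20T08:00:00Z"))
conn.commit()

rows = {
    row[0]: (row[1], row[2])
    for row in conn.execute("SELECT id, document, created_at FROM embeddings")
}
print(rows)
```

# The re-ingested row takes the new document but keeps its original
# created_at; only a NULL created_at is filled in by a later write.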
@@ -0,0 +1,304 @@
"""Backfill embeddings.type and embeddings.created_at (Improvement #2 / A.3).
Idempotent on cohort predicates (every WHERE clause includes IS NULL on the
target column). Writes provenance to metadata.type_source and metadata.created_at_source
so each row is auditable and revertable per-source. Dry-run by default; pass --apply to write.
Order of batches:
T1. type backfill: WHERE type IS NULL -> 'document' (extension-classified, all hit).
C1. created_at: WHERE ca IS NULL AND metadata.filepath stat-resolves -> filesystem mtime.
C2. created_at: WHERE ca IS NULL AND source has unique watcher_state path -> watcher mtime.
C3. created_at: WHERE ca IS NULL AND source has watcher_state collision -> most-recent mtime.
C4. created_at: WHERE type='chatgpt_conversation' AND ca IS NULL -> export-resolved create_time.
C5. created_at: WHERE ca IS NULL (residual) -> sentinel.
Snapshot table embeddings_backup_2026_05_03 must exist before --apply.
Usage:
venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py # dry-run
venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py --apply # write
Exits non-zero if snapshot is missing on --apply.
"""
import argparse
import json
import os
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path
import psycopg2
from psycopg2.extras import RealDictCursor
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env")
PG_DSN = os.getenv("PG_DSN")
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
SNAPSHOT_TABLE = "embeddings_backup_2026_05_03"
SENTINEL_ISO = "2026-04-26T00:00:00Z"
# ─── Helpers ────────────────────────────────────────────────────────────────
def get_pg():
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
def header(t):
bar = "=" * 70
print(f"\n{bar}\n{t}\n{bar}")
def fmt_ts_unix(ts):
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
def fmt_ts_mtime(p):
try:
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
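# A short check of the timestamp format these helpers emit. fmt_ts_unix is
# restated so the block runs standalone; the input values are arbitrary.

```python
from datetime import datetime, timezone

def fmt_ts_unix(ts):
    # Unix seconds (number or numeric string) -> ISO-8601 UTC with a Z suffix.
    return (
        datetime.fromtimestamp(float(ts), tz=timezone.utc)
        .isoformat()
        .replace("+00:00", "Z")
    )

epoch = fmt_ts_unix(0)
from_string = fmt_ts_unix("1.5")
print(epoch)
print(from_string)
```

# isoformat() emits microseconds only when they are non-zero, so the strings
# are not fixed-width: whole-second mtimes and fractional ones differ in length.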
def load_watcher_state():
state = json.loads(WATCHER_STATE.read_text())
by_name = defaultdict(list)
for path, mtime in state.items():
by_name[Path(path).name].append((path, mtime))
return by_name
def load_chatgpt_index():
if not CHATGPT_EXPORT_DIR.exists():
return {}
index = {}
for f in sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json")):
try:
data = json.loads(f.read_text(encoding="utf-8"))
except Exception:
continue
for convo in data:
cid = convo.get("id") or convo.get("conversation_id")
ct = convo.get("create_time")
if cid and ct is not None:
index[cid] = ct
return index
def assert_snapshot(cur):
cur.execute("SELECT to_regclass(%s) AS t;", (SNAPSHOT_TABLE,))
if cur.fetchone()["t"] is None:
print(f"ERROR: snapshot table '{SNAPSHOT_TABLE}' not found. Run A.2 first.")
sys.exit(2)
cur.execute(f"SELECT COUNT(*) AS n FROM {SNAPSHOT_TABLE};")
snap = cur.fetchone()["n"]
cur.execute("SELECT COUNT(*) AS n FROM embeddings;")
live = cur.fetchone()["n"]
print(f"snapshot {SNAPSHOT_TABLE}: {snap} rows; live embeddings: {live} rows")
if snap != live:
print(f"ERROR: snapshot row count != live ({snap} vs {live}). Refresh snapshot before --apply.")
sys.exit(2)
# ─── Batch primitive ────────────────────────────────────────────────────────
def run_batch(cur, label, candidates, apply_mode):
"""candidates: list of (id, set_type, set_ca, type_source, ca_source).
set_type / set_ca may be None to leave that column alone.
In dry-run we still execute UPDATEs inside an outer transaction (rolled back
at the end) so subsequent batches' SELECTs see the correct intermediate state."""
n = len(candidates)
print(f" {label}: {n} rows queued")
if n == 0:
return 0
for c in candidates[:3]:
print(f" sample: id={c[0]} type={c[1]!r} ca={c[2]!r} type_src={c[3]} ca_src={c[4]}")
n_written = 0
for row_id, set_type, set_ca, type_src, ca_src in candidates:
meta_patch = {}
if type_src:
meta_patch["type_source"] = type_src
if ca_src:
meta_patch["created_at_source"] = ca_src
# Build set list dynamically.
sets, params = [], []
if set_type is not None:
sets.append("type = %s")
params.append(set_type)
if set_ca is not None:
sets.append("created_at = %s")
params.append(set_ca)
if meta_patch:
sets.append("metadata = COALESCE(metadata, '{}'::jsonb) || %s::jsonb")
params.append(json.dumps(meta_patch))
params.append(row_id)
cur.execute(f"UPDATE embeddings SET {', '.join(sets)} WHERE id = %s;", params)
n_written += cur.rowcount
print(f" {n_written} rows updated{' (will rollback)' if not apply_mode else ''}")
return n_written
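# The dynamic SET-list construction in run_batch can be isolated as a pure
# function, which makes it testable without a cursor. build_update is a
# hypothetical helper, not part of this script; it also shows one way to guard
# the case where a candidate sets nothing, which would otherwise produce an
# empty SET clause.

```python
import json

def build_update(row_id, set_type=None, set_ca=None, type_src=None, ca_src=None):
    """Return (sql, params), or None when there is nothing to set."""
    meta_patch = {}
    if type_src:
        meta_patch["type_source"] = type_src
    if ca_src:
        meta_patch["created_at_source"] = ca_src

    sets, params = [], []
    if set_type is not None:
        sets.append("type = %s")
        params.append(set_type)
    if set_ca is not None:
        sets.append("created_at = %s")
        params.append(set_ca)
    if meta_patch:
        sets.append("metadata = COALESCE(metadata, '{}'::jsonb) || %s::jsonb")
        params.append(json.dumps(meta_patch))
    if not sets:
        return None  # an empty SET list would be invalid SQL
    params.append(row_id)
    return f"UPDATE embeddings SET {', '.join(sets)} WHERE id = %s;", params

# Hypothetical C5-style candidate: only created_at plus its provenance tag.
sql, params = build_update(
    "ab12cd34_0", set_ca="2026-04-26T00:00:00Z", ca_src="sentinel"
)
print(sql)
print(params)
```

# Placeholders and parameters stay in lockstep because each clause appends to
# both lists together, with the row id always last for the WHERE clause.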
# ─── Batches ────────────────────────────────────────────────────────────────
def batch_T1_type(cur, apply_mode):
"""type IS NULL -> 'document'. All cohort A rows have a SUPPORTED extension."""
cur.execute("""
SELECT id, source FROM embeddings WHERE type IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands = [(r["id"], "document", None, "inferred_extension", None) for r in rows]
return run_batch(cur, "T1 type IS NULL -> 'document'", cands, apply_mode)
def batch_C1_filepath_stat(cur, apply_mode):
"""ca IS NULL AND metadata.filepath stat-resolves -> mtime."""
cur.execute("""
SELECT id, source, metadata->>'filepath' AS fp
FROM embeddings
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL
ORDER BY id;
""")
rows = cur.fetchall()
cands, n_skipped_missing = [], 0
for r in rows:
p = Path(r["fp"])
if p.exists():
mt = fmt_ts_mtime(p)
if mt:
cands.append((r["id"], None, mt, None, "filepath_stat"))
continue
n_skipped_missing += 1
print(f" C1 candidates: {len(cands)} (skipped {n_skipped_missing} where filepath gone or unstattable)")
return run_batch(cur, "C1 ca IS NULL AND filepath stat-resolves -> mtime", cands, apply_mode)
def batch_C2_C3_watcher_state(cur, apply_mode):
"""ca IS NULL AND filepath unresolvable -> watcher_state by source basename.
C2 = unique hit, C3 = collision pick-latest."""
by_name = load_watcher_state()
cur.execute("""
SELECT id, source, metadata->>'filepath' AS fp
FROM embeddings
WHERE created_at IS NULL
ORDER BY id;
""")
rows = cur.fetchall()
c2, c3 = [], []
skipped_no_match = 0
for r in rows:
# skip rows already targeted by C1 path
if r["fp"] and Path(r["fp"]).exists():
continue
src = r["source"]
if not src or src not in by_name:
skipped_no_match += 1
continue
candidates = by_name[src]
if len(candidates) == 1:
mt = fmt_ts_unix(candidates[0][1])
c2.append((r["id"], None, mt, None, "watcher_state_unique"))
else:
latest = max(candidates, key=lambda x: float(x[1]))
mt = fmt_ts_unix(latest[1])
c3.append((r["id"], None, mt, None, f"watcher_state_collision_pick_latest_of_{len(candidates)}"))
print(f" C2/C3 source-basename fallback: {len(c2)} unique, {len(c3)} collision, "
f"{skipped_no_match} unmatched (will fall to C4/C5)")
n2 = run_batch(cur, "C2 ca IS NULL AND watcher_state unique -> mtime", c2, apply_mode)
n3 = run_batch(cur, "C3 ca IS NULL AND watcher_state collision -> latest mtime", c3, apply_mode)
return n2 + n3
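# The C2/C3 fallback boils down to a basename lookup with a tie-break. The
# sketch below reproduces that decision on a hypothetical watcher state; the
# paths, mtimes, and the resolve_mtime helper are illustrative only, and
# PurePosixPath is used so the example behaves the same on any OS.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_basename(state):
    """Map basename -> [(path, mtime_str), ...], as load_watcher_state does."""
    by_name = defaultdict(list)
    for path, mtime in state.items():
        by_name[PurePosixPath(path).name].append((path, mtime))
    return by_name

def resolve_mtime(source, by_name):
    """Return (mtime, provenance), or None when the basename is untracked."""
    candidates = by_name.get(source)
    if not candidates:
        return None
    if len(candidates) == 1:
        return float(candidates[0][1]), "watcher_state_unique"
    latest = max(candidates, key=lambda item: float(item[1]))
    tag = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
    return float(latest[1]), tag

# Hypothetical watcher state: path -> mtime stored as a string.
state = {
    "/files/Projects/cv.docx": "1714000000.0",
    "/files/Archive/cv.docx": "1700000000.0",
    "/files/Notes/ideas.md": "1715000000.5",
}
by_name = group_by_basename(state)
for source in ("cv.docx", "ideas.md", "missing.pdf"):
    print(source, resolve_mtime(source, by_name))
```

# Pick-latest is a heuristic: on a collision the newest copy may not be the one
# that was embedded, which is why the provenance tag records the group size.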
def batch_C4_chatgpt_export(cur, apply_mode):
index = load_chatgpt_index()
cur.execute("""
SELECT id, source FROM embeddings
WHERE type='chatgpt_conversation' AND created_at IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands, unresolved = [], 0
for r in rows:
m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
cid = m.group(1) if m else None
ct = index.get(cid)
if ct is None:
unresolved += 1
continue
ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
cands.append((r["id"], None, ct_iso, None, "chatgpt_export"))
print(f" C4 chatgpt export resolution: {len(cands)} resolved, {unresolved} unresolved (fall to C5)")
return run_batch(cur, "C4 type='chatgpt_conversation' AND ca IS NULL -> export create_time", cands, apply_mode)
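# How the C4 id pattern splits a chunk id is worth checking, since the
# conversation id itself may contain underscores. The ids below are
# hypothetical; only the regex is taken from the batch above.

```python
import re

CHUNK_ID_RE = re.compile(r"^chatgpt_(.+)_(\d+)$")

def conversation_id(chunk_id):
    """Return the conversation id embedded in a chunk id, or None."""
    m = CHUNK_ID_RE.match(chunk_id)
    return m.group(1) if m else None

parsed = {
    chunk_id: conversation_id(chunk_id)
    for chunk_id in (
        "chatgpt_6631f0aa_0",   # simple id, chunk index 0
        "chatgpt_abc_def_12",   # underscore inside the conversation id
        "chatgpt_abc",          # no trailing chunk index
        "ab12cd34_3",           # document chunk, not a chatgpt row
    )
}
for chunk_id, cid in parsed.items():
    print(f"{chunk_id:22} -> {cid}")
```

# The greedy `.+` backtracks only as far as the last `_<digits>` suffix, so
# everything before that suffix is treated as the conversation id.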
def batch_C5_sentinel(cur, apply_mode):
cur.execute("""
SELECT id, type, source FROM embeddings WHERE created_at IS NULL ORDER BY id;
""")
rows = cur.fetchall()
cands = [(r["id"], None, SENTINEL_ISO, None, "sentinel") for r in rows]
if cands:
sample_types = Counter(r["type"] for r in rows)
print(f" C5 residual sentinel rows by type: {dict(sample_types)}")
return run_batch(cur, f"C5 ca IS NULL residual -> sentinel {SENTINEL_ISO}", cands, apply_mode)
# ─── Pre/post counts ────────────────────────────────────────────────────────
def print_counts(cur, label):
cur.execute("""
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null
FROM embeddings;
""")
r = cur.fetchone()
print(f" [{label}] total={r['total']} type_null={r['type_null']} ca_null={r['ca_null']}")
# ─── Driver ─────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--apply", action="store_true", help="default false (dry-run)")
args = ap.parse_args()
apply_mode = args.apply
pg = get_pg()
cur = pg.cursor()
print(f"Mode: {'APPLY (writes will commit)' if apply_mode else 'DRY-RUN (no writes)'}")
print(f"Sentinel: {SENTINEL_ISO}")
if apply_mode:
assert_snapshot(cur)
header("PRE-COUNTS")
print_counts(cur, "before")
header("BATCHES")
n_t1 = batch_T1_type(cur, apply_mode)
n_c1 = batch_C1_filepath_stat(cur, apply_mode)
n_c2c3 = batch_C2_C3_watcher_state(cur, apply_mode)
n_c4 = batch_C4_chatgpt_export(cur, apply_mode)
n_c5 = batch_C5_sentinel(cur, apply_mode)
header("POST-COUNTS")
print_counts(cur, "after" if apply_mode else "after (in-transaction, will rollback)")
if apply_mode:
pg.commit()
print("\nCOMMITTED.")
else:
pg.rollback()
print("\nROLLED BACK (dry-run).")
print(f"\nSummary: T1={n_t1} C1={n_c1} C2+C3={n_c2c3} C4={n_c4} C5={n_c5}")
pg.close()
if __name__ == "__main__":
main()
@@ -0,0 +1,557 @@
"""Read-only inspection for the embeddings.type / embeddings.created_at backfill (Improvement #2 / A.1).
Produces a survey of every backfill source-of-truth question without writing
to the database. Output is a human-readable report on stdout plus a JSON
sidecar at experiments/embeddings_backfill_inspection_<date>.json.
Sections:
1. Cohort recap (counts; should match prior investigation).
2. Cohort A type inference: extension classifier coverage.
3. created_at inference for cohort A + B-doc-old:
- rows with metadata.filepath: stat the file, check existence.
- rows without filepath: lookup source against watcher_state.json.
- filename-collision shape audit (live+backup, live+archive, ambiguous).
4. ChatGPT export resolution (Plan A.1 addition #1):
- existence of /home/aaron/nextcloud/.../ChatGPT Export/.
- sample 5 B-chatgpt rows; resolve convo_id -> create_time.
5. Sentinel date discovery (Plan A.1 addition #3):
- earliest non-NULL created_at per type (already-populated rows are the
lower bound for when the substrate started carrying timestamps).
- git log for the pgvector migration commit.
- any ChromaDB sqlite still on disk.
- propose a sentinel with reasoning, or flag as arbitrary.
6. 50-row stratified sample: derived (type, created_at, source) per row.
Usage: venv/bin/python3 scripts/experiments/embeddings_backfill_inspection.py
Read-only. No DB writes. No filesystem writes outside experiments/.
"""
import json
import os
import random
import re
import subprocess
import sys
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path
import psycopg2
from psycopg2.extras import RealDictCursor
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env")
PG_DSN = os.getenv("PG_DSN")
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
NEXTCLOUD_ROOT = Path("/home/aaron/nextcloud/data/data/aaron/files")
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"embeddings_backfill_inspection_{datetime.now().strftime('%Y-%m-%d')}.json"
SUPPORTED_EXT = {".pdf", ".docx", ".pptx", ".txt", ".md"}
random.seed(20260503)
# ─── Helpers ────────────────────────────────────────────────────────────────
def get_pg():
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
def header(title):
bar = "=" * 70
print(f"\n{bar}\n{title}\n{bar}")
def sub(title):
print(f"\n--- {title} ---")
def fmt_ts_from_unix(ts):
"""Watcher state stores unix timestamps as strings."""
try:
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
def fmt_ts_from_st_mtime(p):
try:
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
return None
def load_watcher_state():
"""Returns (path -> mtime_str), and (basename -> [(path, mtime_str), ...])."""
state = json.loads(WATCHER_STATE.read_text())
by_path = state
by_name = defaultdict(list)
for path, mtime in state.items():
by_name[Path(path).name].append((path, mtime))
return by_path, by_name
def classify_collision_shape(paths):
"""Categorize a filename-collision group:
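- 'live+backup'           : exactly one live path; the others are backup-like
                            (.bak, /backup, backups/) and none are archived
- 'live+archive'          : exactly one live path; the others are inside
                            /archive/ and none are backup-like
- 'live+mixed-old'        : exactly one live path; the others mix backup-like
                            and archived copies
- 'multi-live'            : >=2 paths look live (no backup/archive markers)
- 'all-archive-or-backup' : no live path at all
- 'other'                 : anything else (e.g. a single live path alone)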
"""
def is_backup(p):
s = p.lower()
return ".bak" in s or "/backup" in s or "backups/" in s
def is_archive(p):
s = p.lower()
return "/archive/" in s
backups = [p for p in paths if is_backup(p)]
archives = [p for p in paths if is_archive(p)]
live = [p for p in paths if not is_backup(p) and not is_archive(p)]
if len(live) == 1 and len(backups) >= 1 and len(archives) == 0:
return "live+backup"
if len(live) == 1 and len(archives) >= 1 and len(backups) == 0:
return "live+archive"
if len(live) == 1 and (len(backups) + len(archives)) >= 1:
return "live+mixed-old"
if len(live) >= 2:
return "multi-live"
if len(live) == 0:
return "all-archive-or-backup"
return "other"
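# One example path group per category makes the classifier's branches easier
# to audit. The function is restated compactly so the block runs standalone;
# every path below is hypothetical.

```python
def classify_collision_shape(paths):
    def is_backup(p):
        s = p.lower()
        return ".bak" in s or "/backup" in s or "backups/" in s

    def is_archive(p):
        return "/archive/" in p.lower()

    backups = [p for p in paths if is_backup(p)]
    archives = [p for p in paths if is_archive(p)]
    live = [p for p in paths if not is_backup(p) and not is_archive(p)]
    if len(live) == 1 and backups and not archives:
        return "live+backup"
    if len(live) == 1 and archives and not backups:
        return "live+archive"
    if len(live) == 1 and (backups or archives):
        return "live+mixed-old"
    if len(live) >= 2:
        return "multi-live"
    if not live:
        return "all-archive-or-backup"
    return "other"

groups = {
    "live+backup": ["/files/Work/cv.docx", "/files/Work/cv.docx.bak"],
    "live+archive": ["/files/Work/cv.docx", "/files/Archive/2023/cv.docx"],
    "live+mixed-old": [
        "/files/Work/cv.docx",
        "/files/Archive/cv.docx",
        "/files/backups/cv.docx",
    ],
    "multi-live": ["/files/A/notes.md", "/files/B/notes.md"],
    "all-archive-or-backup": ["/files/Archive/a/x.pdf", "/files/Archive/b/x.pdf"],
    "other": ["/files/A/notes.md"],
}
shapes = {expected: classify_collision_shape(paths) for expected, paths in groups.items()}
print(shapes)
```

# Matching is case-insensitive substring matching, so a directory merely named
# like "Backup-plans" would also count as backup-like under this heuristic.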
# ─── Section 1: Cohort recap ────────────────────────────────────────────────
def section_1_cohort_recap(cur):
header("1. COHORT RECAP")
cur.execute("""
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null,
COUNT(*) FILTER (WHERE type IS NULL AND created_at IS NULL) AS both_null,
COUNT(*) FILTER (WHERE type IS NOT NULL AND created_at IS NOT NULL) AS both_set
FROM embeddings;
""")
overall = cur.fetchone()
print(f"Total: {overall['total']} type_null: {overall['type_null']} "
f"ca_null: {overall['ca_null']} both_null: {overall['both_null']} "
f"both_set: {overall['both_set']}")
cur.execute("""
SELECT type, created_at IS NULL AS ca_null, COUNT(*) AS n
FROM embeddings GROUP BY type, ca_null ORDER BY type NULLS LAST, ca_null;
""")
cohorts = cur.fetchall()
sub("Per-(type, ca_null) cohorts")
for r in cohorts:
print(f" type={r['type'] or 'NULL':<22} ca_null={r['ca_null']!s:<5} n={r['n']}")
return {"overall": overall, "cohorts": cohorts}
# ─── Section 2: Cohort A type inference ─────────────────────────────────────
def section_2_type_inference(cur):
header("2. COHORT A TYPE INFERENCE (extension classifier)")
cur.execute(r"""
SELECT LOWER(SUBSTRING(source FROM '\.[^.]+$')) AS ext, COUNT(*) AS rows
FROM embeddings WHERE type IS NULL
GROUP BY ext ORDER BY rows DESC;
""")
by_ext = cur.fetchall()
classified = sum(r["rows"] for r in by_ext if r["ext"] in SUPPORTED_EXT)
unknown = sum(r["rows"] for r in by_ext if r["ext"] not in SUPPORTED_EXT)
print("NULL-type rows by extension:")
for r in by_ext:
flag = "OK" if r["ext"] in SUPPORTED_EXT else "??"
print(f" {flag} {r['ext'] or '(none)':<8} rows={r['rows']}")
print(f"\nClassified as 'document' via extension: {classified}")
print(f"Unclassifiable (no SUPPORTED extension): {unknown}")
return {"by_ext": by_ext, "classified": classified, "unclassifiable": unknown}
# ─── Section 3: created_at inference ────────────────────────────────────────
def section_3_created_at_inference(cur):
header("3. CREATED_AT INFERENCE — file-derived rows")
by_path, by_name = load_watcher_state()
print(f"watcher_state.json: {len(by_path)} tracked paths, "
f"{len(by_name)} distinct filenames, "
f"{sum(1 for v in by_name.values() if len(v) > 1)} filename collisions")
# 3a. Rows with metadata.filepath: probe stat()
sub("3a. Rows with metadata.filepath — stat probe")
cur.execute("""
SELECT id, source, metadata->>'filepath' AS filepath
FROM embeddings
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL;
""")
rows_with_fp = cur.fetchall()
fp_exists = 0
fp_missing = 0
fp_outside_root = 0
sample_resolved = []
for r in rows_with_fp:
p = Path(r["filepath"])
if not str(p).startswith(str(NEXTCLOUD_ROOT)):
fp_outside_root += 1
if p.exists():
fp_exists += 1
if len(sample_resolved) < 5:
sample_resolved.append({
"id": r["id"], "source": r["source"],
"filepath": str(p), "mtime": fmt_ts_from_st_mtime(p),
})
else:
fp_missing += 1
print(f" rows with metadata.filepath: {len(rows_with_fp)}")
print(f" exists on disk: {fp_exists}")
print(f" missing on disk: {fp_missing}")
print(f" outside Nextcloud root: {fp_outside_root}")
print(" Sample of up to 5 resolved mtimes:")
for s in sample_resolved:
print(f" {s['id']:<15} {s['source'][:60]:<60} mtime={s['mtime']}")
# 3b. Rows without metadata.filepath: watcher_state lookup
sub("3b. Rows without metadata.filepath — watcher_state lookup")
cur.execute("""
SELECT id, source FROM embeddings
WHERE created_at IS NULL
AND metadata->>'filepath' IS NULL
AND (type IS NULL OR type = 'document');
""")
rows_no_fp = cur.fetchall()
# Distinct source basenames to look up
basenames_to_resolve = sorted({r["source"] for r in rows_no_fp if r["source"]})
n_resolved_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) == 1)
n_collision_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) > 1)
n_unfound = sum(1 for n in basenames_to_resolve if n not in by_name)
print(f" rows without filepath: {len(rows_no_fp)}")
print(f" distinct source basenames to resolve: {len(basenames_to_resolve)}")
print(f" unique watcher_state hit (no collision): {n_resolved_unique}")
print(f" collision in watcher_state (>1 path): {n_collision_unique}")
print(f" not in watcher_state at all: {n_unfound}")
# 3c. Collision-shape audit
sub("3c. Collision-shape audit — all collisions in watcher_state")
collisions = {n: [(p, m) for p, m in by_name[n]] for n in by_name if len(by_name[n]) > 1}
shape_counts = Counter()
rows_affected_by_shape = Counter()
# Map from basename to count of NULL-ca rows that need it (rows_no_fp)
rows_no_fp_by_name = Counter(r["source"] for r in rows_no_fp)
sample_per_shape = defaultdict(list)
for name, paths_mtimes in collisions.items():
paths = [p for p, _ in paths_mtimes]
shape = classify_collision_shape(paths)
shape_counts[shape] += 1
rows_affected_by_shape[shape] += rows_no_fp_by_name.get(name, 0)
if len(sample_per_shape[shape]) < 3:
entry = {
"name": name,
"rows_no_fp_using_this_name": rows_no_fp_by_name.get(name, 0),
"candidates": [
{"path": p, "mtime": fmt_ts_from_unix(m)}
for p, m in sorted(paths_mtimes, key=lambda x: -float(x[1]))
],
}
sample_per_shape[shape].append(entry)
print(f" collisions in watcher_state: {len(collisions)}")
print(f" shape breakdown:")
for shape, n in shape_counts.most_common():
print(f" {shape:<22} collisions={n:<4} rows_affected={rows_affected_by_shape[shape]}")
print(f"\n Up-to-3 sample collisions per shape (sorted by mtime desc):")
for shape, samples in sample_per_shape.items():
print(f" [{shape}]")
for s in samples:
print(f" {s['name']} (rows_no_fp using this name: {s['rows_no_fp_using_this_name']})")
for c in s["candidates"]:
print(f" {c['mtime']} {c['path']}")
return {
"watcher_state_paths": len(by_path),
"watcher_state_basenames": len(by_name),
"watcher_state_collisions": len(collisions),
"rows_with_filepath": {
"total": len(rows_with_fp),
"exists": fp_exists, "missing": fp_missing,
"outside_root": fp_outside_root,
"sample": sample_resolved,
},
"rows_without_filepath": {
"total": len(rows_no_fp),
"distinct_basenames": len(basenames_to_resolve),
"unique_hit": n_resolved_unique,
"collision_hit": n_collision_unique,
"unfound": n_unfound,
},
"collision_shapes": {
"total": len(collisions),
"shape_counts": dict(shape_counts),
"rows_affected_by_shape": dict(rows_affected_by_shape),
"samples": {k: v for k, v in sample_per_shape.items()},
},
}
# ─── Section 4: ChatGPT export resolution ───────────────────────────────────
def section_4_chatgpt_export(cur):
header("4. CHATGPT EXPORT RESOLUTION (Plan addition #1)")
print(f"Probing: {CHATGPT_EXPORT_DIR}")
if not CHATGPT_EXPORT_DIR.exists():
print(" NOT FOUND — plan on sentinel for entire B-chatgpt cohort.")
return {"export_dir_exists": False, "files": []}
files = sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json"))
print(f" found {len(files)} export file(s):")
for f in files:
print(f" {f.name} size={f.stat().st_size:,} mtime={fmt_ts_from_st_mtime(f)}")
# Build convo_id -> create_time index from all export files.
print("\nLoading export(s) to build convo_id -> create_time index...")
convo_index = {}
for f in files:
try:
data = json.loads(f.read_text(encoding="utf-8"))
except Exception as e:
print(f" failed to parse {f.name}: {e}")
continue
for convo in data:
cid = convo.get("id") or convo.get("conversation_id")
ct = convo.get("create_time")
if cid and ct is not None:
convo_index[cid] = ct
print(f" indexed {len(convo_index)} conversations across {len(files)} export files")
# Sample 5 chatgpt_conversation rows; resolve.
cur.execute("""
SELECT id, source FROM embeddings
WHERE type='chatgpt_conversation' AND created_at IS NULL
ORDER BY random() LIMIT 5;
""")
sample = cur.fetchall()
sub("Sample of 5 B-chatgpt rows: convo lookup")
resolved = 0
sample_results = []
for r in sample:
# IDs look like chatgpt_<uuid>_<idx>; uuid extends until last underscore.
m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
cid = m.group(1) if m else None
ct = convo_index.get(cid)
ct_iso = None
if ct is not None:
try:
ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
except Exception:
ct_iso = None
if ct_iso:
resolved += 1
sample_results.append({
"id": r["id"], "source": r["source"], "convo_id": cid,
"create_time": ct, "create_time_iso": ct_iso,
"resolved": ct_iso is not None,
})
print(f" {r['id']} cid={cid}")
print(f" -> create_time={ct} iso={ct_iso}")
print(f"\nResolved {resolved}/{len(sample)}. "
f"{'PROCEED with re-derive for full cohort.' if sample and resolved == len(sample) else 'PARTIAL — plan re-derive + sentinel for unresolved.'}")
# Estimate full-cohort coverage by counting how many B-chatgpt convo_ids appear in the index.
cur.execute("""
SELECT DISTINCT regexp_replace(id, '^chatgpt_(.+)_\\d+$', '\\1') AS cid
FROM embeddings WHERE type='chatgpt_conversation' AND created_at IS NULL;
""")
distinct_cids = [r["cid"] for r in cur.fetchall()]
in_index = sum(1 for c in distinct_cids if c in convo_index)
print(f"Full-cohort coverage estimate: {in_index} / {len(distinct_cids)} distinct convo_ids "
f"resolvable from export.")
return {
"export_dir_exists": True,
"files": [{"name": f.name, "size": f.stat().st_size, "mtime": fmt_ts_from_st_mtime(f)} for f in files],
"convo_index_size": len(convo_index),
"sample_results": sample_results,
"sample_resolved": resolved,
"full_cohort": {
"distinct_convo_ids": len(distinct_cids),
"resolvable_from_export": in_index,
"unresolvable": len(distinct_cids) - in_index,
},
}
# ─── Section 5: Sentinel date discovery ─────────────────────────────────────
def section_5_sentinel(cur):
header("5. SENTINEL DATE DISCOVERY (Plan addition #3)")
# 5a. Earliest non-NULL created_at per type: lower bound on substrate age.
sub("5a. Earliest non-NULL created_at per type")
cur.execute("""
SELECT type, MIN(created_at) AS earliest, MAX(created_at) AS latest, COUNT(*) AS rows
FROM embeddings WHERE created_at IS NOT NULL GROUP BY type ORDER BY type;
""")
rows = cur.fetchall()
for r in rows:
print(f" {r['type'] or 'NULL':<22} earliest={str(r['earliest']):<32} latest={r['latest']}")
# 5b. git log for the pgvector-migration commit.
sub("5b. Git log — pgvector migration commits")
git_findings = []
try:
out = subprocess.run(
["git", "log", "--all", "--format=%H %ci %s",
"--", "deprecated/migrate_to_pgvector.py", "scripts/migrate_to_pgvector.py"],
cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
)
for line in out.stdout.strip().splitlines():
print(f" {line}")
git_findings.append(line)
except Exception as e:
print(f" git log failed: {e}")
# Also: when did the api/ingest scripts cut over to pgvector?
try:
out = subprocess.run(
["git", "log", "--all", "--format=%H %ci %s", "--grep=pgvector", "-i"],
cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
)
print("\n Commits mentioning pgvector:")
for line in out.stdout.strip().splitlines()[:10]:
print(f" {line}")
git_findings.append(line)
except Exception as e:
print(f" git log (pgvector grep) failed: {e}")
# 5c. ChromaDB sqlite still on disk?
sub("5c. ChromaDB dump on disk?")
candidates = []
for root in [Path.home() / "aaronai", Path.home() / "aaronai" / "db"]:
if root.exists():
for p in root.rglob("chroma*.sqlite*"):
candidates.append({"path": str(p), "mtime": fmt_ts_from_st_mtime(p)})
if candidates:
for c in candidates:
print(f" found: {c['path']} mtime={c['mtime']}")
else:
print(" no ChromaDB sqlite found under ~/aaronai")
# 5d. Propose sentinel.
sub("5d. Sentinel proposal")
# Earliest doc cutover: per query, document=2026-04-30. Migration commit f78b830 was
# 2026-04-26. Most defensible sentinel for "rows that entered pgvector before NOW()
# writes were canonical" = the migration commit date.
proposed = "2026-04-26T00:00:00Z"
reasoning = (
"git f78b830 'Migrate to pgvector — remove ChromaDB from api.py, ingest scripts, "
"dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL "
"created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL "
"created_at all predate F11 and most predate the pgvector cutover itself. "
"2026-04-26 is the date the ChromaDB->pgvector migration script was committed, "
"so any row currently in the embeddings table with NULL created_at must have been "
"ingested on or after that date (when the table came into existence in current form). "
"It is the tightest defensible lower bound on when a row with an untracked timestamp "
"entered pgvector, so it is the right sentinel."
)
print(f" Proposed sentinel: {proposed}")
print(f" Reasoning: {reasoning}")
return {
"earliest_per_type": rows,
"git_findings": git_findings,
"chromadb_candidates": candidates,
"proposed_sentinel": proposed,
"reasoning": reasoning,
}
# ─── Section 6: 50-row stratified sample ────────────────────────────────────
def section_6_stratified_sample(cur, sentinel_iso):
header("6. 50-ROW STRATIFIED SAMPLE — derived (type, created_at, source)")
by_path, by_name = load_watcher_state()
cohorts = [
("A (type NULL, ca NULL)", "type IS NULL AND created_at IS NULL", 10),
("B-doc-old (type='document', ca NULL)", "type='document' AND created_at IS NULL", 10),
("B-chatgpt (type='chatgpt_conversation', ca NULL)", "type='chatgpt_conversation' AND created_at IS NULL", 10),
("C-doc-new (type='document', ca set)", "type='document' AND created_at IS NOT NULL", 10),
("C-claude (type='claude_conversation', ca set)", "type='claude_conversation' AND created_at IS NOT NULL", 5),
("C-aaronai (type='aaronai_conversation', ca set)", "type='aaronai_conversation' AND created_at IS NOT NULL", 5),
]
samples = []
for label, predicate, n in cohorts:
sub(f"{label} (sample size: {n})")
cur.execute(f"""
SELECT id, source, type, created_at, metadata
FROM embeddings WHERE {predicate}
ORDER BY random() LIMIT %s;
""", (n,))
rows = cur.fetchall()
for r in rows:
row_meta = r["metadata"] or {}
fp = row_meta.get("filepath")
inferred_type = r["type"] or ("document" if (r["source"] or "").lower().endswith(tuple(SUPPORTED_EXT)) else "?")
inferred_ca = r["created_at"]
inferred_ca_source = "preserved" if inferred_ca else None
if not inferred_ca:
if fp and Path(fp).exists():
inferred_ca = fmt_ts_from_st_mtime(Path(fp))
inferred_ca_source = "filepath_stat"
elif r["source"] and r["source"] in by_name:
candidates = by_name[r["source"]]
if len(candidates) == 1:
inferred_ca = fmt_ts_from_unix(candidates[0][1])
inferred_ca_source = "watcher_state_unique"
else:
# take most recent
latest = max(candidates, key=lambda x: float(x[1]))
inferred_ca = fmt_ts_from_unix(latest[1])
inferred_ca_source = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
else:
inferred_ca = sentinel_iso
inferred_ca_source = "sentinel"
print(f" id={r['id']:<22} src={(r['source'] or '')[:38]:<38}")
print(f" existing: type={r['type']!r:<22} ca={r['created_at']!r}")
print(f" inferred: type={inferred_type!r:<22} ca={inferred_ca!r} ({inferred_ca_source})")
samples.append({
"cohort": label, "id": r["id"], "source": r["source"],
"existing_type": r["type"], "existing_ca": r["created_at"],
"inferred_type": inferred_type, "inferred_ca": inferred_ca,
"inferred_ca_source": inferred_ca_source,
})
return samples
# ─── Driver ─────────────────────────────────────────────────────────────────
def main():
pg = get_pg()
cur = pg.cursor()
out = {"generated_at": datetime.now(timezone.utc).isoformat()}
out["section_1"] = section_1_cohort_recap(cur)
out["section_2"] = section_2_type_inference(cur)
out["section_3"] = section_3_created_at_inference(cur)
out["section_4"] = section_4_chatgpt_export(cur)
out["section_5"] = section_5_sentinel(cur)
sentinel_iso = out["section_5"]["proposed_sentinel"]
out["section_6"] = section_6_stratified_sample(cur, sentinel_iso)
pg.close()
# JSON sidecar — strip non-serializables.
def _serialize(o):
if isinstance(o, datetime):
return o.isoformat()
return str(o)
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
OUT_PATH.write_text(json.dumps(out, indent=2, default=_serialize))
print(f"\nJSON sidecar written: {OUT_PATH}")
if __name__ == "__main__":
main()
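The fallback order that section 6 applies when deriving `created_at` can be read in isolation as a small pure function. The following is a minimal sketch, not the script's own code: `infer_created_at` and `fmt_unix` are hypothetical names, the file mtime is passed in rather than read with `stat()`, and `candidates` is assumed to be a list of `(path, mtime)` pairs in the shape the script reads from `by_name`.

```python
from datetime import datetime, timezone


def fmt_unix(ts):
    """Format a Unix timestamp as an ISO-8601 UTC string ending in 'Z'."""
    return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")


def infer_created_at(existing, file_mtime, candidates, sentinel_iso):
    """Return (created_at, provenance) using the section 6 fallback order.

    Hypothetical helper: existing value, then file mtime, then a unique
    watcher_state hit, then the latest of several collisions, then the sentinel.
    """
    if existing:
        return existing, "preserved"
    if file_mtime is not None:
        return fmt_unix(file_mtime), "filepath_stat"
    if len(candidates) == 1:
        return fmt_unix(candidates[0][1]), "watcher_state_unique"
    if candidates:
        latest = max(candidates, key=lambda pm: float(pm[1]))
        return fmt_unix(latest[1]), f"watcher_state_collision_pick_latest_of_{len(candidates)}"
    return sentinel_iso, "sentinel"


SENTINEL = "2026-04-26T00:00:00Z"
collision = infer_created_at(None, None, [("/a/notes.md", 0), ("/b/notes.md", 86400)], SENTINEL)
fallback = infer_created_at(None, None, [], SENTINEL)
```

Returning the provenance label alongside the value keeps a later backfill auditable: rows that fell through to the sentinel can be told apart from rows with a real file-derived timestamp.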
@@ -0,0 +1,296 @@
"""Read-only analysis of Stage 2 frame data via stage2_frames_v.

Produces eight sections (frequency, hygiene, per-doc count, co-occurrence,
folder cross-tab, worker-version split, data-gap accounting, corpus coverage)
and writes a JSON sidecar for diffing across runs.

Usage: venv/bin/python3 scripts/experiments/frame_distribution_report.py
"""
import os
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path

import psycopg2
from dotenv import load_dotenv

load_dotenv()

OUT_PATH = Path.home() / "aaronai" / "experiments" / f"frame_distribution_{datetime.now().strftime('%Y-%m-%d')}.json"
TOP_K = 20  # for co-occurrence; revisit after seeing the long tail
def normalize(label):
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


def folder_bin(source):
    """Classify source by type. stage_3_queue stores bare filenames, so we
    bin by what kind of file it is, not where it lives in the tree."""
    if not source:
        return "unknown"
    if re.match(r"^(Claude|ChatGPT|Aaron AI):", source):
        return "conversation"  # bypasses Stage 2/3, will not appear here
    s = source.lower()
    if re.search(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-voice\.md$", s):
        return "voice_note"
    if re.search(r"\d{4}-\d{2}-\d{2}-(nrem|early-rem|late-rem|synthesis|lucid)", s):
        return "dream_output"
    if s.endswith(".md"):
        return "markdown"
    if s.endswith(".pdf"):
        return "pdf"
    if s.endswith(".docx") or s.endswith(".doc"):
        return "docx"
    if s.endswith(".pptx") or s.endswith(".ppt"):
        return "pptx"
    if s.endswith(".txt"):
        return "txt"
    return "other"


def fetch_rows(cur):
    cur.execute("""
        SELECT source, char_length, active_frames, worker_version, raw_metadata
        FROM stage2_frames_v
    """)
    rows = []
    for source, char_length, frames, worker_version, raw in cur.fetchall():
        if not isinstance(frames, list):
            continue
        rows.append({
            "source": source,
            "char_length": char_length,
            "frames": [str(f) for f in frames if f],
            "worker_version": worker_version,
            "raw_keys": sorted(raw.keys()) if isinstance(raw, dict) else [],
        })
    return rows


def section_frequency(rows):
    counter = Counter()
    for r in rows:
        for f in r["frames"]:
            counter[f] += 1
    return counter


def section_hygiene(frequency):
    """Group raw labels by normalized form; flag collisions."""
    groups = defaultdict(list)
    for raw, count in frequency.items():
        groups[normalize(raw)].append((raw, count))
    collisions = {k: v for k, v in groups.items() if len(v) > 1}
    return collisions


def section_per_doc_count(rows):
    counts = Counter(len(r["frames"]) for r in rows)
    return counts


def section_cooccurrence(rows, top_frames):
    top_set = set(top_frames)
    pair_counts = Counter()
    for r in rows:
        present = [f for f in r["frames"] if f in top_set]
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                a, b = sorted([present[i], present[j]])
                pair_counts[(a, b)] += 1
    return pair_counts


def section_folder_crosstab(rows, top_frames):
    top_set = set(top_frames)
    table = defaultdict(Counter)  # frame -> bin -> count
    bin_totals = Counter()
    for r in rows:
        b = folder_bin(r["source"])
        bin_totals[b] += 1
        for f in r["frames"]:
            if f in top_set:
                table[f][b] += 1
    return table, bin_totals


def section_worker_versions(rows):
    counter = Counter(r["worker_version"] or "unknown" for r in rows)
    raw_keys_by_version = defaultdict(Counter)
    for r in rows:
        v = r["worker_version"] or "unknown"
        raw_keys_by_version[v][tuple(r["raw_keys"])] += 1
    return counter, raw_keys_by_version
def section_data_gap(cur):
    """Docs that completed Stage 2 but never had frames extracted (<2000 chars)."""
    cur.execute("""
        SELECT source, char_length
        FROM stage_2_queue
        WHERE completed_at IS NOT NULL AND char_length < 2000
    """)
    missing = cur.fetchall()
    by_bin = Counter(folder_bin(s) for s, _ in missing)
    char_lengths = [c for _, c in missing]
    return {
        "count": len(missing),
        "by_type_bin": dict(by_bin),
        "char_length": {
            "min": min(char_lengths) if char_lengths else None,
            "max": max(char_lengths) if char_lengths else None,
            "median": sorted(char_lengths)[len(char_lengths) // 2] if char_lengths else None,
        },
        "sample_sources": [s for s, _ in missing[:10]],
    }


def section_corpus_coverage(cur):
    """How much of the embeddings corpus has frame coverage?"""
    cur.execute("SELECT count(DISTINCT source) FROM embeddings")
    total = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM embeddings
        WHERE source LIKE 'Claude:%' OR source LIKE 'ChatGPT:%'
        OR source LIKE 'Aaron AI:%' OR type='aaronai_conversation'
    """)
    conversations = cur.fetchone()[0]
    cur.execute("SELECT count(DISTINCT source) FROM stage_3_queue WHERE stage2_metadata IS NOT NULL")
    with_frames = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM stage_2_queue
        WHERE completed_at IS NOT NULL AND char_length < 2000
    """)
    short_no_frames = cur.fetchone()[0]
    cur.execute("""
        SELECT count(DISTINCT source) FROM stage_2_queue
        WHERE failed_at IS NOT NULL
    """)
    failed = cur.fetchone()[0]
    return {
        "total_distinct_sources_in_embeddings": total,
        "conversations_no_frames_by_design": conversations,
        "files_with_frames": with_frames,
        "files_short_no_frames": short_no_frames,
        "files_stage2_failed": failed,
        "frame_coverage_pct": round(100.0 * with_frames / max(total, 1), 1),
    }
def main():
    conn = psycopg2.connect(os.environ["PG_DSN"])
    cur = conn.cursor()
    rows = fetch_rows(cur)
    n_docs = len(rows)
    print(f"=== Stage 2 frame distribution report ({n_docs} docs) ===\n")

    # 1. Frequency
    freq = section_frequency(rows)
    print(f"--- 1. Frame frequency ({len(freq)} distinct labels) ---")
    for label, count in freq.most_common(30):
        print(f" {count:5d} {label}")
    print()

    # 2. Hygiene
    collisions = section_hygiene(freq)
    print(f"--- 2. Label hygiene (normalized collisions: {len(collisions)}) ---")
    for norm, variants in sorted(collisions.items(), key=lambda kv: -sum(c for _, c in kv[1])):
        variant_str = ", ".join(f"{r!r}:{c}" for r, c in sorted(variants, key=lambda x: -x[1]))
        print(f" '{norm}': {variant_str}")
    print()

    # 3. Per-doc frame count
    per_doc = section_per_doc_count(rows)
    print("--- 3. Per-doc frame count ---")
    for n in sorted(per_doc):
        print(f" {n} frames: {per_doc[n]} docs")
    print()

    # 4. Co-occurrence (top-K)
    top_frames = [f for f, _ in freq.most_common(TOP_K)]
    pairs = section_cooccurrence(rows, top_frames)
    print(f"--- 4. Co-occurrence (top-{TOP_K} frames, top-30 pairs) ---")
    for (a, b), count in pairs.most_common(30):
        print(f" {count:4d} {a} × {b}")
    print()

    # 5. Folder cross-tab
    crosstab, bin_totals = section_folder_crosstab(rows, top_frames)
    print(f"--- 5. Frame × folder cross-tab (top-{TOP_K} frames) ---")
    bins_sorted = [b for b, _ in bin_totals.most_common()]
    print(f" bins (with totals): " + ", ".join(f"{b}({n})" for b, n in bin_totals.most_common(10)))
    for f in top_frames:
        row_data = crosstab[f]
        if not row_data:
            continue
        cells = ", ".join(f"{b}={c}" for b, c in row_data.most_common(5))
        print(f" {f}: {cells}")
    print()

    # 6. Worker versions
    versions, keys_by_version = section_worker_versions(rows)
    print("--- 6. Worker version split ---")
    for v, count in versions.most_common():
        print(f" v{v}: {count} docs")
        top_shapes = keys_by_version[v].most_common(3)
        for keys, kcount in top_shapes:
            print(f" {kcount} docs with keys={list(keys)}")
    print()

    # 7. Data gap
    gap = section_data_gap(cur)
    print("--- 7. Data-gap accounting (Stage 2 docs <2000 chars; never frame-extracted) ---")
    print(f" count: {gap['count']}")
    print(f" char_length: min={gap['char_length']['min']}, median={gap['char_length']['median']}, max={gap['char_length']['max']}")
    print(f" by type bin: {gap['by_type_bin']}")
    print(f" sample sources: {gap['sample_sources']}")
    print()

    # 8. Corpus coverage
    coverage = section_corpus_coverage(cur)
    print("--- 8. Corpus-wide frame coverage ---")
    print(f" total distinct sources in embeddings: {coverage['total_distinct_sources_in_embeddings']}")
    print(f" conversations (no frames by design): {coverage['conversations_no_frames_by_design']}")
    print(f" files with frames: {coverage['files_with_frames']}")
    print(f" files short, no frames: {coverage['files_short_no_frames']}")
    print(f" files Stage 2 failed: {coverage['files_stage2_failed']}")
    print(f" frame coverage: {coverage['frame_coverage_pct']}% of corpus")
    print()

    # JSON sidecar
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    sidecar = {
        "generated_at": datetime.now().isoformat(),
        "n_docs_with_frames": n_docs,
        "n_distinct_labels": len(freq),
        "top_30_frames": freq.most_common(30),
        "label_collisions": {
            k: [(r, c) for r, c in v] for k, v in collisions.items()
        },
        "per_doc_frame_count": dict(per_doc),
        "top_30_pairs": [
            {"a": a, "b": b, "count": c}
            for (a, b), c in pairs.most_common(30)
        ],
        "folder_crosstab": {
            f: dict(crosstab[f]) for f in top_frames if crosstab[f]
        },
        "bin_totals": dict(bin_totals),
        "worker_versions": dict(versions),
        "data_gap": gap,
        "corpus_coverage": coverage,
    }
    OUT_PATH.write_text(json.dumps(sidecar, indent=2, default=str))
    print(f"JSON sidecar written: {OUT_PATH}")
    cur.close()
    conn.close()


if __name__ == "__main__":
    main()
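The label-hygiene step is easiest to see on a tiny input. Below is a minimal sketch that reuses the report's `normalize` logic on made-up frame labels; the labels and counts are invented for illustration and `find_collisions` is a hypothetical stand-in for `section_hygiene`.

```python
import re
from collections import Counter, defaultdict


def normalize(label):
    """Lowercase, turn underscores into spaces, collapse runs of whitespace."""
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


def find_collisions(frequency):
    """Group raw labels by normalized form; keep groups with more than one variant."""
    groups = defaultdict(list)
    for raw, count in frequency.items():
        groups[normalize(raw)].append((raw, count))
    return {k: v for k, v in groups.items() if len(v) > 1}


# Invented sample data: two spellings of one frame plus an unrelated frame.
freq = Counter({"Risk_Assessment": 4, "risk  assessment": 2, "Planning": 7})
collisions = find_collisions(freq)
```

Here both spellings normalize to `risk assessment`, so they are reported as one collision group, while `Planning` has a single variant and is left out.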
+30
@@ -0,0 +1,30 @@
"""
Aaron AI ingest_failures helpers — shared by watcher.py and ingest.py.

Both modules write structured failure rows so the SettingsPanel "Ingest Health"
view sees the same shape regardless of ingest path. Functions take an explicit
conn parameter; the caller decides transaction boundaries and exception
handling. Both current callers wrap with their own log-and-swallow shims.
"""


def record_ingest_failure(conn, source: str, filepath, error: str) -> None:
    """Insert or update an ingest_failures row. Commits."""
    cur = conn.cursor()
    cur.execute("""
        INSERT INTO ingest_failures (source, filepath, error, retry_count, first_failed_at, last_failed_at)
        VALUES (%s, %s, %s, 0, NOW(), NOW())
        ON CONFLICT (source) DO UPDATE SET
            error = EXCLUDED.error,
            retry_count = ingest_failures.retry_count + 1,
            last_failed_at = NOW(),
            resolved = FALSE
    """, (source, str(filepath), error[:1000]))
    conn.commit()


def resolve_ingest_failure(conn, source: str) -> None:
    """Mark a previously failed source as resolved. Commits."""
    cur = conn.cursor()
    cur.execute("UPDATE ingest_failures SET resolved = TRUE WHERE source = %s", (source,))
    conn.commit()
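The module docstring says each caller wraps these helpers in its own log-and-swallow shim. The following is a minimal sketch of what such a shim might look like, under stated assumptions: `call_with_connection` is a hypothetical name, `connect` is any zero-argument factory returning an object with `close()`, and `FakeConn` exists only so the example runs without a database.

```python
def call_with_connection(connect, helper, *args):
    """Open a connection, run helper(conn, *args), always close, never raise.

    Returns True if the helper completed, False if anything failed.
    """
    try:
        conn = connect()
        try:
            helper(conn, *args)
        finally:
            conn.close()
        return True
    except Exception as exc:
        print(f"failure bookkeeping skipped (non-fatal): {exc}")
        return False


class FakeConn:
    """Stand-in for a database connection; records whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def failing_helper(conn, source):
    # Simulates a helper whose SQL call raises.
    raise RuntimeError("database unavailable")


conn = FakeConn()
ok = call_with_connection(lambda: conn, failing_helper, "report.pdf")
```

Keeping the swallow logic in the caller rather than in the helpers means the helpers stay testable and a caller that does want the exception can still get it.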
+11
@@ -75,6 +75,17 @@ async def lifespan(app: FastAPI):
         max_coroutines=2,
     )
     await graphiti_instance.build_indices_and_constraints()
+    # Bridge driver._search_ops to driver.search_interface — graphiti-core 0.29.0
+    # builds FalkorSearchOperations as driver._search_ops in FalkorDriver.__init__
+    # but never assigns it to driver.search_interface. search_utils.py dispatches
+    # on driver.search_interface; without this assignment it falls back to
+    # interpreted-Cypher cosine math (full table scans). Together with the
+    # vendored patches in graphiti_patches/, this activates FalkorDB's native
+    # vector index for entity dedup similarity search.
+    if (hasattr(graphiti_instance.driver, "_search_ops")
+            and graphiti_instance.driver.search_interface is None):
+        graphiti_instance.driver.search_interface = graphiti_instance.driver._search_ops
+        log.info("Wired driver.search_interface = driver._search_ops (vector index path active)")
     log.info(f"Graphiti ready — provider: {LLM_PROVIDER}, group: {GROUP_ID}")
     yield
     await graphiti_instance.close()
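The wiring in this hunk is a guarded attribute assignment, which can be exercised without graphiti-core or FalkorDB installed. This is a minimal sketch, assuming only what the comment in the hunk states: `bridge_search_interface` is a hypothetical helper name, and `SimpleNamespace` stands in for the real driver object.

```python
from types import SimpleNamespace


def bridge_search_interface(driver):
    """Point driver.search_interface at driver._search_ops if it is unset.

    Returns True when the assignment was made, False otherwise.
    """
    ops = getattr(driver, "_search_ops", None)
    if ops is not None and getattr(driver, "search_interface", None) is None:
        driver.search_interface = ops
        return True
    return False


# Stand-in driver: the real driver class is deliberately not imported here.
fake_ops = object()
driver = SimpleNamespace(_search_ops=fake_ops, search_interface=None)
wired = bridge_search_interface(driver)
again = bridge_search_interface(driver)  # second call is a no-op
```

Guarding on `search_interface is None` keeps the bridge idempotent and makes it a no-op if a later library release starts assigning the attribute itself.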
+131 -131
@@ -1,70 +1,37 @@
+"""
+Aaron AI bulk ingester. Two entry points:
+  - ingest_directory(folder, embedder=None) — programmatic; called from
+    api.py /api/reindex with the api process's shared embedder
+  - python3 scripts/ingest.py <folder> — CLI back-compat; loads its own embedder
+Stage 1 helpers (extract / chunk / embed / write) live in scripts/encoding.py.
+Failure tracking SQL lives in scripts/failures.py.
+"""
 import os
 import sys
-import hashlib
 from pathlib import Path
 from dotenv import load_dotenv
 import psycopg2
-import psycopg2.extras
-import json
 from sentence_transformers import SentenceTransformer
-from docx import Document
-from pypdf import PdfReader
-from pptx import Presentation
+from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
+from failures import (
+    record_ingest_failure as _record_failure_sql,
+    resolve_ingest_failure as _resolve_failure_sql,
+)
 load_dotenv(Path.home() / "aaronai" / ".env", override=True)
-print("Loading embedding model...")
-embedder = SentenceTransformer("all-MiniLM-L6-v2")
 PG_DSN = os.getenv("PG_DSN")
 def get_pg():
     return psycopg2.connect(PG_DSN)
-def extract_text_from_docx(path):
-    doc = Document(path)
-    return "\n".join([para.text for para in doc.paragraphs if para.text.strip()])
-def extract_text_from_pdf(path):
-    reader = PdfReader(path)
-    text = ""
-    for page in reader.pages:
-        extracted = page.extract_text()
-        if extracted:
-            text += extracted + "\n"
-    return text
-def extract_text_from_pptx(path):
-    prs = Presentation(path)
-    text = ""
-    for slide in prs.slides:
-        for shape in slide.shapes:
-            if hasattr(shape, "text") and shape.text.strip():
-                text += shape.text + "\n"
-    return text
-def extract_text_from_txt(path):
-    with open(path, "r", encoding="utf-8", errors="ignore") as f:
-        return f.read()
-def chunk_text(text, chunk_size=500, overlap=50):
-    words = text.split()
-    chunks = []
-    start = 0
-    while start < len(words):
-        end = start + chunk_size
-        chunk = " ".join(words[start:end])
-        if chunk.strip():
-            chunks.append(chunk)
-        start += chunk_size - overlap
-    return chunks
-def make_id(filepath, chunk_index):
-    path_hash = hashlib.md5(str(filepath).encode()).hexdigest()[:8]
-    return f"{path_hash}_{chunk_index}"
 def enqueue_stage2(source, full_text):
-    """Enqueue document for Stage 2 (Mistral orientation) Stage 3 (Graphiti ingest).
+    """Enqueue document for Stage 2 (Mistral orientation) -> Stage 3 (Graphiti ingest).
     TEMPORARY: this queue feed will be removed when pgvector is decommissioned
     and the watcher calls Stage 2 directly.
     """
@@ -87,94 +54,127 @@ def enqueue_stage2(source, full_text):
     except Exception as e:
         print(f" Stage 2 queue insert failed (non-fatal): {e}")
-def ingest_file(filepath):
-    path = Path(filepath)
-    suffix = path.suffix.lower()
-    if path.name.startswith("~$") or path.name.startswith("."):
-        return 0
+def _record_failure(filepath: Path, error: str) -> None:
     try:
-        if suffix == ".docx":
-            text = extract_text_from_docx(path)
-        elif suffix == ".pdf":
-            text = extract_text_from_pdf(path)
-        elif suffix == ".pptx":
-            text = extract_text_from_pptx(path)
-        elif suffix in [".txt", ".md"]:
-            text = extract_text_from_txt(path)
-        else:
-            return 0
-        if not text.strip():
-            return 0
-        chunks = chunk_text(text)
-        if not chunks:
-            return 0
-        embeddings = embedder.encode(chunks).tolist()
-        ids = [make_id(path, i) for i in range(len(chunks))]
-        metadatas = [{
-            "source": path.name,
-            "filepath": str(path),
-            "folder": str(path.parent.relative_to(Path(sys.argv[1]) if len(sys.argv) > 1 else path.parent))
-        } for _ in chunks]
-        # STAGE 1: Write to pgvector (TEMPORARY — remove when chat agent migrates to Graphiti)
         pg = get_pg()
-        cur = pg.cursor()
-        for chunk_id, chunk, embedding, meta in zip(ids, chunks, embeddings, metadatas):
-            cur.execute("""
-                INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
-                VALUES (%s, %s, %s::vector, %s, %s, %s, %s)
-                ON CONFLICT (id) DO UPDATE SET
-                    document = EXCLUDED.document,
-                    embedding = EXCLUDED.embedding,
-                    source = EXCLUDED.source,
-                    metadata = EXCLUDED.metadata
-            """, (
-                chunk_id, chunk, embedding,
-                meta.get("source"), "document", None,
-                json.dumps(meta)
-            ))
-        pg.commit()
-        pg.close()
-        print(f" Indexed {len(chunks)} chunks: {path.name}")
-        # Enqueue for Stage 2 → Stage 3 (Graphiti pipeline)
-        # SKIP_STAGE2_ENQUEUE env var set by migration scripts to prevent bulk enqueue
-        if not os.getenv("SKIP_STAGE2_ENQUEUE"):
-            enqueue_stage2(path.name, text)
-        return len(chunks)
+        try:
+            _record_failure_sql(pg, filepath.name, filepath, error)
+        finally:
+            pg.close()
     except Exception as e:
-        print(f" Error: {path.name}: {e}")
+        print(f" Could not record ingest failure (non-fatal): {e}")
+def _resolve_failure(source: str) -> None:
+    try:
+        pg = get_pg()
+        try:
+            _resolve_failure_sql(pg, source)
+        finally:
+            pg.close()
+    except Exception as e:
+        print(f" Could not resolve ingest failure record (non-fatal): {e}")
+IGNORED_TOP_FOLDERS = {"Drafts"}
+def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
+    """Ingest a single file. Returns chunk count, 0 on skip/failure."""
+    # "~" catches Office lock files (~$) including the case where Nextcloud
+    # filesystem encoding has mangled the "$" to a unicode replacement char.
+    if filepath.name.startswith(("~", ".")):
         return 0
+    if filepath.suffix.lower() not in SUPPORTED:
+        return 0
+    if root is not None:
+        try:
+            rel = filepath.parent.relative_to(root)
+            if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
+                return 0
+        except ValueError:
+            pass
+    blocks = extract_blocks(filepath)
+    if not blocks or not any(
+        (b.get("text") or "").strip() or (b.get("heading") or "").strip()
+        for b in blocks
+    ):
+        _record_failure(filepath, "Text extraction failed or empty")
+        return 0
+    folder_rel = None
+    if root is not None:
+        try:
+            folder_rel = str(filepath.parent.relative_to(root))
+        except ValueError:
+            pass
+    try:
+        rows = chunk_and_embed(blocks, filepath.name, embedder,
+                               filepath=filepath, folder=folder_rel)
+    except Exception as e:
+        _record_failure(filepath, f"Embedding failed: {e}")
+        return 0
+    if not rows:
+        return 0
+    try:
+        pg = get_pg()
+        try:
+            write_embeddings_batch(pg, rows)
+        finally:
+            pg.close()
+    except Exception as e:
+        _record_failure(filepath, f"pgvector write failed: {e}")
+        return 0
+    print(f" Indexed {len(rows)} chunks: {filepath.name}")
+    _resolve_failure(filepath.name)
+    if not os.getenv("SKIP_STAGE2_ENQUEUE"):
+        full_text = "\n".join(
+            f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
+            for b in blocks
+        )
+        enqueue_stage2(filepath.name, full_text)
+    return len(rows)
+def ingest_directory(folder, embedder=None) -> dict:
+    """Programmatic entry point. Returns {scanned, ingested, failed, total_chunks}.
+    If embedder is None, loads its own SentenceTransformer (CLI back-compat path).
+    Caller (e.g. api.py /api/reindex) should pass its module-level embedder so
+    the ~200MB model isn't reloaded per call.
+    """
+    folder = Path(folder)
+    if not folder.exists():
+        return {"scanned": 0, "ingested": 0, "failed": 0, "total_chunks": 0,
+                "error": f"folder not found: {folder}"}
+    if embedder is None:
+        print("Loading embedding model...")
+        embedder = SentenceTransformer("all-MiniLM-L6-v2")
+    files = [f for f in folder.rglob("*")
+             if f.suffix.lower() in SUPPORTED
+             and not f.name.startswith(("~$", "."))]
+    print(f"Found {len(files)} files to process")
+    ingested = failed = total_chunks = 0
+    for f in files:
+        n = _ingest_one(f, embedder, root=folder)
+        if n > 0:
+            ingested += 1
+            total_chunks += n
+        else:
+            failed += 1
+    return {"scanned": len(files), "ingested": ingested, "failed": failed,
+            "total_chunks": total_chunks}
def ingest_folder(folder_path): def ingest_folder(folder_path):
folder = Path(folder_path) """CLI back-compat wrapper. Loads its own embedder."""
if not folder.exists(): result = ingest_directory(Path(folder_path))
print(f"Folder not found: {folder_path}") print(f"\nDone. {result['ingested']} files / {result['total_chunks']} chunks indexed; "
sys.exit(1) f"{result['failed']} failed.")
supported = [".docx", ".pdf", ".pptx", ".txt", ".md"]
files = [f for f in folder.rglob("*")
if f.suffix.lower() in supported
and not f.name.startswith("~$")
and not f.name.startswith(".")]
if not files:
print("No supported files found.")
sys.exit(1)
print(f"Found {len(files)} files to process\n")
total_chunks = 0
for f in files:
total_chunks += ingest_file(f)
print(f"\nDone. Total chunks indexed: {total_chunks}")
if __name__ == "__main__": if __name__ == "__main__":
target = sys.argv[1] if len(sys.argv) > 1 else str(Path.home() / "aaronai" / "docs") target = sys.argv[1] if len(sys.argv) > 1 else str(Path.home() / "aaronai" / "docs")
+18 -3
@@ -18,8 +18,14 @@ CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
PG_DSN = os.getenv("PG_DSN") PG_DSN = os.getenv("PG_DSN")
MIN_EXCHANGES = 3 MIN_EXCHANGES = 3
print("Loading embedding model...") _embedder = None
embedder = SentenceTransformer("all-MiniLM-L6-v2")
def get_embedder():
global _embedder
if _embedder is None:
print("Loading embedding model...")
_embedder = SentenceTransformer("all-MiniLM-L6-v2")
return _embedder
def get_conversations(): def get_conversations():
conn = sqlite3.connect(CONVERSATIONS_DB) conn = sqlite3.connect(CONVERSATIONS_DB)
@@ -123,9 +129,18 @@ def run():
# Embed and insert # Embed and insert
texts = [c[1] for c in new_chunks] texts = [c[1] for c in new_chunks]
embeddings = embedder.encode(texts, show_progress_bar=False).tolist() embeddings = get_embedder().encode(texts, show_progress_bar=False).tolist()
for (chunk_id, chunk_text, meta), embedding in zip(new_chunks, embeddings): for (chunk_id, chunk_text, meta), embedding in zip(new_chunks, embeddings):
if not meta.get("type"):
raise ValueError(
f"chunk {chunk_id!r} missing 'type'; writers must supply it "
f"(see Improvement #2 in docs/birdai-component-inventory)"
)
# ON CONFLICT below intentionally overwrites created_at (unlike encoding.py's
# COALESCE): an Aaron-AI conversation's created_at tracks convo.updated_at,
# which advances on activity. Re-running this script on an active conversation
# should refresh the timestamp, not preserve the first-seen one.
cur.execute(""" cur.execute("""
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata) INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
VALUES (%s, %s, %s::vector, %s, %s, %s, %s) VALUES (%s, %s, %s::vector, %s, %s, %s, %s)
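The hunk above replaces the import-time model load with a lazy `get_embedder()` accessor. A minimal sketch of that pattern follows. `FakeModel`, `_load_model`, and `LOADS` are hypothetical stand-ins for `SentenceTransformer("all-MiniLM-L6-v2")` so the sketch runs without the library installed; they are not part of the repository.

```python
"""Lazy module-level singleton, mirroring the get_embedder() change."""

# Hypothetical counter: records how often the expensive load actually runs.
LOADS = {"count": 0}


class FakeModel:
    """Hypothetical stand-in for the real embedding model."""

    def encode(self, texts):
        # Return one dummy one-dimensional vector per input text.
        return [[float(len(t))] for t in texts]


def _load_model():
    """Hypothetical loader; the real code constructs a SentenceTransformer."""
    LOADS["count"] += 1
    return FakeModel()


_embedder = None


def get_embedder():
    """Load the model on first use, then reuse the cached instance."""
    global _embedder
    if _embedder is None:
        _embedder = _load_model()
    return _embedder
```

With this shape, importing the module costs nothing; only the first call that needs embeddings pays for the model load.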
+136
@@ -0,0 +1,136 @@
"""
Orientation Indexer — feeds Stage 2's document-level orientations into pgvector
so they're searchable alongside chunk text by the retrieve_documents tool.
Each completed row in stage_3_queue has an `orientation` string (active_frames
+ frame_relationships + extraction_orientation + one_sentence_summary) that
describes the document at a conceptual level. Indexing it as its own row in
the embeddings table gives the cross-encoder a second surface to rank against
"what is this document about" rather than just "what does this chunk say."
This worker is part of the "read-only Graphiti + orientation-into-pgvector"
plan B that replaced the Stage 3 → Graphiti write path. The graph layer is
queried directly via the search_facts chat tool; orientations land here.
State tracking: a row is considered indexed if the embeddings table already
holds a row with source=<source> and metadata->>'kind'='orientation'. The
worker is idempotent — restart-safe, resumable.
Runs as systemd: aaronai-orientation-indexer.service
"""
import logging
import os
import sys
import time
from pathlib import Path
from dotenv import load_dotenv
import psycopg2
from sentence_transformers import SentenceTransformer
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
sys.path.insert(0, str(Path(__file__).parent))
from encoding import write_embeddings_batch
PG_DSN = os.getenv("PG_DSN")
EMBED_MODEL = "all-MiniLM-L6-v2"
BATCH_SIZE = 25
POLL_INTERVAL_SECS = 30
LOG_FILE = "/var/log/aaronai/orientation-indexer.log"
HEARTBEAT_FILE = "/var/log/aaronai/orientation-indexer-heartbeat"
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [orientation-indexer] %(levelname)s %(message)s",
handlers=[logging.FileHandler(LOG_FILE, mode="a")],
)
log = logging.getLogger("orientation-indexer")
def get_pg():
return psycopg2.connect(PG_DSN)
def fetch_unindexed(cur, limit):
"""Pull stage_3_queue rows with a non-null orientation whose orientation
hasn't been written to the embeddings table yet."""
cur.execute(
"""
SELECT s.source, s.orientation
FROM stage_3_queue s
WHERE s.orientation IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM embeddings e
WHERE e.source = s.source
AND e.metadata->>'kind' = 'orientation'
)
ORDER BY s.enqueued_at
LIMIT %s
""",
(limit,),
)
return cur.fetchall()
def _row_for(source: str, orientation: str, embedding) -> dict:
"""Build an embeddings row for the orientation. id is deterministic so
re-runs don't create duplicates if the unique check above ever races."""
import hashlib
chunk_id = hashlib.md5(f"orientation:{source}".encode()).hexdigest()[:8] + "_orient"
return {
"id": chunk_id,
"document": orientation,
"embedding": embedding,
"source": source,
"type": "document",
"metadata": {
"source": source,
"kind": "orientation",
},
}
def write_heartbeat():
try:
Path(HEARTBEAT_FILE).write_text(str(time.time()))
except Exception:
pass
def main():
log.info("Orientation indexer starting...")
log.info(f"Loading embedding model: {EMBED_MODEL}")
embedder = SentenceTransformer(EMBED_MODEL)
log.info("Embedding model ready.")
while True:
write_heartbeat()
try:
pg = get_pg()
try:
cur = pg.cursor()
rows = fetch_unindexed(cur, BATCH_SIZE)
if not rows:
pg.close()
time.sleep(POLL_INTERVAL_SECS)
continue
orientations = [r[1] for r in rows]
embeddings = embedder.encode(orientations).tolist()
batch = [
_row_for(source, orient, emb)
for (source, orient), emb in zip(rows, embeddings)
]
write_embeddings_batch(pg, batch)
log.info(f"Indexed {len(batch)} orientation(s)")
finally:
pg.close()
except Exception as e:
log.error(f"Indexing loop iteration failed: {e}")
time.sleep(POLL_INTERVAL_SECS)
if __name__ == "__main__":
main()
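The indexer's idempotence rests on the deterministic row id built in `_row_for`. The sketch below isolates that derivation. The `upsert` helper is a hypothetical in-memory stand-in for `write_embeddings_batch`, under the assumption that the real writer overwrites on id conflict; that behavior lives in `encoding.py` and is not shown in this diff.

```python
import hashlib


def orientation_row_id(source: str) -> str:
    """Derive a stable id from the source name, the same way _row_for does."""
    digest = hashlib.md5(f"orientation:{source}".encode()).hexdigest()
    return digest[:8] + "_orient"


def upsert(table: dict, source: str, orientation: str) -> None:
    """Hypothetical stand-in for the real writer: same id means overwrite."""
    table[orientation_row_id(source)] = {
        "source": source,
        "document": orientation,
    }
```

Because the id keeps only 8 hex characters (32 bits) of the digest, two different sources could in principle map to the same id; the `NOT EXISTS` check in `fetch_unindexed` keys on `source`, not on the id.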
+146
@@ -0,0 +1,146 @@
"""One-off: re-ingest docx+pptx after the 2026-05-04 extractor upgrade (commit 93c0d89).
Pre-upgrade extraction missed tables, headers/footers, text boxes, group shapes,
and pptx notes — leaving CVs/dossiers as section-header skeletons in the index.
Steps when run with --apply:
1. DELETE all embeddings rows where source ends in .docx or .pptx
2. Walk NEXTCLOUD_PATH and re-ingest every .docx/.pptx via _ingest_one
3. Stage 2 enqueue is suppressed (SKIP_STAGE2_ENQUEUE=1)
Without --apply: dry-run. Counts files and chunks, prints a sample, writes nothing.
"""
import os
import re
import sys
import time
from pathlib import Path
os.environ["SKIP_STAGE2_ENQUEUE"] = "1"
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
import psycopg2
from sentence_transformers import SentenceTransformer
sys.path.insert(0, str(Path(__file__).parent))
from ingest import _ingest_one, get_pg
NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
APPLY = "--apply" in sys.argv
_ext_args = [a for a in sys.argv[1:] if a.startswith("--ext=")]
if _ext_args:
TARGET_EXTS = {("." + e.lstrip(".")) for arg in _ext_args
for e in arg.split("=", 1)[1].split(",")}
else:
TARGET_EXTS = {".docx", ".pptx"}
def _ext_regex():
inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(TARGET_EXTS))
return f"\\.({inner})$"
def count_stale():
pg = get_pg()
cur = pg.cursor()
cur.execute(
f"SELECT lower(substring(source from '\\.[^.]+$')) AS ext, "
f"COUNT(DISTINCT source) AS files, COUNT(*) AS chunks "
f"FROM embeddings WHERE lower(source) ~ '{_ext_regex()}' "
f"GROUP BY 1 ORDER BY 1"
)
rows = cur.fetchall()
pg.close()
return rows
def delete_stale():
pg = get_pg()
cur = pg.cursor()
cur.execute(f"DELETE FROM embeddings WHERE lower(source) ~ '{_ext_regex()}'")
deleted = cur.rowcount
pg.commit()
pg.close()
return deleted
def find_files():
files = []
for f in NEXTCLOUD_PATH.rglob("*"):
if not f.is_file():
continue
if f.suffix.lower() not in TARGET_EXTS:
continue
if f.name.startswith(("~$", ".")):
continue
files.append(f)
return files
def main():
print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
print(f"Target: {NEXTCLOUD_PATH}")
print(f"Extensions: {sorted(TARGET_EXTS)}")
print(f"SKIP_STAGE2_ENQUEUE={os.environ.get('SKIP_STAGE2_ENQUEUE')}")
print()
print("Stale chunks currently in DB:")
for ext, files, chunks in count_stale():
print(f" {ext}: {files} files, {chunks} chunks")
print()
files = find_files()
by_ext = {}
for f in files:
by_ext.setdefault(f.suffix.lower(), []).append(f)
print("Files on disk to re-ingest:")
for ext, lst in sorted(by_ext.items()):
print(f" {ext}: {len(lst)} files")
print(f" total: {len(files)}")
print()
print("Sample (5 random):")
import random
for f in random.sample(files, min(5, len(files))):
print(f" {f}")
print()
if not APPLY:
print("Dry-run only. Re-run with --apply to delete + re-ingest.")
return
print("Deleting stale chunks...")
n = delete_stale()
print(f" deleted {n} rows")
print()
print("Loading embedder...")
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print()
print(f"Re-ingesting {len(files)} files...")
started = time.time()
ingested = failed = total_chunks = 0
for i, f in enumerate(files, 1):
n = _ingest_one(f, embedder, root=NEXTCLOUD_PATH)
if n > 0:
ingested += 1
total_chunks += n
else:
failed += 1
if i % 25 == 0 or i == len(files):
elapsed = time.time() - started
rate = i / elapsed if elapsed else 0
print(f" [{i}/{len(files)}] ingested={ingested} failed={failed} "
f"chunks={total_chunks} ({rate:.1f} files/s)")
elapsed = time.time() - started
print()
print(f"Done in {elapsed:.0f}s: {ingested} ingested, {failed} failed, "
f"{total_chunks} chunks written.")
if __name__ == "__main__":
main()
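The `--ext=` handling and `_ext_regex()` in the script above can be exercised without a database. This sketch restates them as pure functions; the names `parse_exts` and `ext_regex` are illustrative, not from the repository. Python's `re` stands in for Postgres's `~` operator here, which is a different regex engine, so treat the match checks as an approximation.

```python
import re


def parse_exts(argv, default=(".docx", ".pptx")):
    """Collect extensions from --ext=a,b flags; fall back to the defaults."""
    args = [a for a in argv if a.startswith("--ext=")]
    if not args:
        return set(default)
    return {
        "." + ext.lstrip(".")
        for arg in args
        for ext in arg.split("=", 1)[1].split(",")
    }


def ext_regex(exts):
    """Build an end-anchored alternation over the extensions."""
    inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(exts))
    return f"\\.({inner})$"
```

A usage such as `parse_exts(["--apply", "--ext=pdf,.docx"])` accepts extensions with or without the leading dot, which matches how the script normalizes them.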
+123
@@ -0,0 +1,123 @@
"""One-off: remove embeddings rows that no longer correspond to a file on disk.
Two passes:
1. Modern rows (metadata.filepath set): check each filepath, delete if missing.
2. Legacy rows (metadata.filepath null): build a set of all basenames present
anywhere under NEXTCLOUD_PATH, then delete rows whose `source` basename
isn't in that set.
Default mode is a dry-run (counts + sample paths, no writes). Pass --apply to
actually delete.
"""
import os
import sys
from pathlib import Path
from collections import defaultdict
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
import psycopg2
NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
APPLY = "--apply" in sys.argv
def get_pg():
return psycopg2.connect(os.environ["PG_DSN"])
def scan_modern_orphans():
"""Rows with metadata.filepath whose file doesn't exist on disk."""
pg = get_pg()
cur = pg.cursor()
cur.execute(
"SELECT id, source, metadata->>'filepath' AS filepath "
"FROM embeddings WHERE metadata->>'filepath' IS NOT NULL"
)
orphans = []
by_source = defaultdict(int)
for row in cur.fetchall():
fp = row[2]
if fp and not Path(fp).exists():
orphans.append(row)
by_source[row[1]] += 1
pg.close()
return orphans, by_source
def scan_legacy_orphans():
"""Rows without metadata.filepath whose basename isn't anywhere under
NEXTCLOUD_PATH. Restricted to type='document' so conversations and memory
snapshots (which are synthetic sources, not files on disk) aren't flagged
as orphans. Walks the filesystem once to build the basename set."""
print(f" walking {NEXTCLOUD_PATH} to build basename index...")
on_disk = set()
for p in NEXTCLOUD_PATH.rglob("*"):
if p.is_file():
on_disk.add(p.name)
print(f" {len(on_disk):,} files on disk")
pg = get_pg()
cur = pg.cursor()
cur.execute(
"SELECT id, source FROM embeddings "
"WHERE metadata->>'filepath' IS NULL AND type = 'document'"
)
orphans = []
by_source = defaultdict(int)
for row in cur.fetchall():
if row[1] not in on_disk:
orphans.append(row)
by_source[row[1]] += 1
pg.close()
return orphans, by_source
def delete_rows(ids):
pg = get_pg()
cur = pg.cursor()
cur.execute("DELETE FROM embeddings WHERE id = ANY(%s)", (list(ids),))
deleted = cur.rowcount
pg.commit()
pg.close()
return deleted
def main():
print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
print(f"Target: {NEXTCLOUD_PATH}")
print()
print("Pass 1 — modern rows (metadata.filepath set):")
modern, modern_by_src = scan_modern_orphans()
print(f" {len(modern):,} orphan rows across {len(modern_by_src):,} files")
for src, n in sorted(modern_by_src.items(), key=lambda kv: -kv[1])[:10]:
print(f" {n:>4} chunks — {src}")
print()
print("Pass 2 — legacy rows (no metadata.filepath):")
legacy, legacy_by_src = scan_legacy_orphans()
print(f" {len(legacy):,} orphan rows across {len(legacy_by_src):,} files")
for src, n in sorted(legacy_by_src.items(), key=lambda kv: -kv[1])[:10]:
print(f" {n:>4} chunks — {src}")
print()
total = len(modern) + len(legacy)
if total == 0:
print("Nothing to delete.")
return
if not APPLY:
print(f"Dry-run only. Re-run with --apply to delete {total:,} rows.")
return
print(f"Deleting {total:,} orphan rows...")
n1 = delete_rows([r[0] for r in modern]) if modern else 0
n2 = delete_rows([r[0] for r in legacy]) if legacy else 0
print(f" modern: {n1:,} legacy: {n2:,} total: {n1 + n2:,}")
if __name__ == "__main__":
main()
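Pass 2 of the sweep compares each legacy row's `source` against a set of basenames found on disk. A small sketch of that comparison, run against a temporary directory instead of `NEXTCLOUD_PATH`; the function names and sample rows are made up for illustration.

```python
import tempfile
from pathlib import Path


def basenames_on_disk(root: Path) -> set:
    """Walk the tree once and collect every file's basename."""
    return {p.name for p in root.rglob("*") if p.is_file()}


def find_legacy_orphans(rows, on_disk):
    """rows are (id, source) tuples; keep those whose source is not on disk."""
    return [row for row in rows if row[1] not in on_disk]


def demo():
    """Build a tiny tree with one surviving file and check two sample rows."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "Teaching").mkdir()
        (root / "Teaching" / "syllabus.docx").write_text("x")
        rows = [("a1_0", "syllabus.docx"), ("b2_0", "deleted.pdf")]
        return find_legacy_orphans(rows, basenames_on_disk(root))
```

Matching on basename alone is deliberately coarse: a legacy row survives as long as any file with that name exists anywhere under the root, even if the original copy was removed.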
+53
@@ -0,0 +1,53 @@
"""End-to-end test of retrieve_context with intent routing + reranking.
Avoids loading the full FastAPI app; replicates the chat-handler retrieval
call shape and prints classifier output + final ranked sources for each query.
"""
import os
import sys
from pathlib import Path
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
sys.path.insert(0, str(Path(__file__).parent))
# Stub anthropic so api.py import doesn't fail without the SDK loaded.
# We only need retrieve_context.
import types
sys.modules.setdefault("anthropic", types.ModuleType("anthropic"))
sys.modules["anthropic"].Anthropic = lambda **kw: None
# Same for faster_whisper, unless it is already imported
if "faster_whisper" not in sys.modules:
sys.modules["faster_whisper"] = types.ModuleType("faster_whisper")
import importlib.util
spec = importlib.util.spec_from_file_location("api", Path(__file__).parent / "api.py")
api = importlib.util.module_from_spec(spec)
# exec_module runs api.py top to bottom, import-time side effects included.
# Failures are tolerated on the assumption that retrieve_context is already
# defined by the time one occurs; otherwise the lookup below raises AttributeError.
try:
spec.loader.exec_module(api)
except Exception as e:
print(f"(continuing despite api.py side-effect error: {e})")
retrieve_context = api.retrieve_context
QUERIES = [
"write me a bio",
"my professional bio",
"Aaron Nelson CV consulting and design work",
"FWN3D consulting",
"syllabi I have taught",
"philosophy of teaching",
"Hudson Valley Additive Manufacturing Center",
"Aaron Nelson is an artist and educator working in additive manufacturing",
]
for q in QUERIES:
pieces, sources = retrieve_context(q)
print(f"\n=== {q!r} ===")
for i, src in enumerate(sources, 1):
print(f" {i}. {src}")
+149 -98
@@ -19,7 +19,6 @@ Architecture: Stage 1 (watcher) -> stage_2_queue -> Stage 2 (Mistral) -> stage_3
import os import os
import time import time
import json import json
import hashlib
import logging import logging
import threading import threading
from pathlib import Path from pathlib import Path
@@ -30,9 +29,11 @@ from sentence_transformers import SentenceTransformer
from watchdog.observers import Observer from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler from watchdog.events import FileSystemEventHandler
from docx import Document as DocxDocument from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
from pypdf import PdfReader from failures import (
from pptx import Presentation record_ingest_failure as _record_failure_sql,
resolve_ingest_failure as _resolve_failure_sql,
)
load_dotenv(Path.home() / "aaronai" / ".env", override=True) load_dotenv(Path.home() / "aaronai" / ".env", override=True)
@@ -42,10 +43,7 @@ STATE_FILE = "/home/aaron/aaronai/watcher_state.json"
STATUS_FILE = "/home/aaron/aaronai/watcher_status.json" STATUS_FILE = "/home/aaron/aaronai/watcher_status.json"
HEARTBEAT_FILE = "/home/aaron/aaronai/watcher_heartbeat" HEARTBEAT_FILE = "/home/aaron/aaronai/watcher_heartbeat"
SUPPORTED = {".pdf", ".docx", ".pptx", ".txt", ".md"}
DEBOUNCE_SECONDS = 120 DEBOUNCE_SECONDS = 120
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBED_MODEL = "all-MiniLM-L6-v2" EMBED_MODEL = "all-MiniLM-L6-v2"
PG_DSN = os.getenv("PG_DSN") PG_DSN = os.getenv("PG_DSN")
@@ -76,49 +74,6 @@ def get_pg():
return psycopg2.connect(PG_DSN) return psycopg2.connect(PG_DSN)
def extract_text(path: Path) -> str:
suffix = path.suffix.lower()
try:
if suffix == ".docx":
doc = DocxDocument(path)
return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
elif suffix == ".pdf":
reader = PdfReader(path)
return "".join(
page.extract_text() + "\n"
for page in reader.pages if page.extract_text()
)
elif suffix == ".pptx":
prs = Presentation(path)
return "\n".join(
shape.text for slide in prs.slides
for shape in slide.shapes
if hasattr(shape, "text") and shape.text.strip()
)
elif suffix in {".txt", ".md"}:
return path.read_text(encoding="utf-8", errors="ignore")
except Exception as e:
log.warning(f"Text extraction failed for {path.name}: {e}")
record_ingest_failure(path, f"Text extraction failed: {e}")
return ""
def chunk_text(text: str) -> list:
words = text.split()
chunks = []
start = 0
while start < len(words):
chunk = " ".join(words[start:start + CHUNK_SIZE])
if chunk.strip():
chunks.append(chunk)
start += CHUNK_SIZE - CHUNK_OVERLAP
return chunks
def make_chunk_id(filepath: Path, chunk_index: int) -> str:
return hashlib.md5(str(filepath).encode()).hexdigest()[:8] + f"_{chunk_index}"
def enqueue_stage2(source: str, full_text: str): def enqueue_stage2(source: str, full_text: str):
if os.getenv("SKIP_STAGE2_ENQUEUE"): if os.getenv("SKIP_STAGE2_ENQUEUE"):
return return
@@ -143,21 +98,15 @@ def enqueue_stage2(source: str, full_text: str):
def record_ingest_failure(filepath: Path, error: str): def record_ingest_failure(filepath: Path, error: str):
"""Write extraction or ingest failure to ingest_failures table for UI visibility.""" """Write extraction or ingest failure to ingest_failures table for UI visibility.
Local wrapper around failures.record_ingest_failure — opens conn, delegates,
logs non-fatal errors so the caller never has to handle them."""
try: try:
pg = get_pg() pg = get_pg()
cur = pg.cursor() try:
cur.execute(""" _record_failure_sql(pg, filepath.name, filepath, error)
INSERT INTO ingest_failures (source, filepath, error, retry_count, first_failed_at, last_failed_at) finally:
VALUES (%s, %s, %s, 0, NOW(), NOW()) pg.close()
ON CONFLICT (source) DO UPDATE SET
error = EXCLUDED.error,
retry_count = ingest_failures.retry_count + 1,
last_failed_at = NOW(),
resolved = FALSE
""", (filepath.name, str(filepath), error[:1000]))
pg.commit()
pg.close()
except Exception as e: except Exception as e:
log.warning(f"Could not record ingest failure (non-fatal): {e}") log.warning(f"Could not record ingest failure (non-fatal): {e}")
@@ -166,57 +115,104 @@ def resolve_ingest_failure(source: str):
"""Mark a previously failed file as resolved after successful ingest.""" """Mark a previously failed file as resolved after successful ingest."""
try: try:
pg = get_pg() pg = get_pg()
cur = pg.cursor() try:
cur.execute("UPDATE ingest_failures SET resolved = TRUE WHERE source = %s", (source,)) _resolve_failure_sql(pg, source)
pg.commit() finally:
pg.close() pg.close()
except Exception as e: except Exception as e:
log.warning(f"Could not resolve ingest failure record (non-fatal): {e}") log.warning(f"Could not resolve ingest failure record (non-fatal): {e}")
def delete_embeddings_for_path(filepath: Path):
"""Remove embeddings rows for a file that no longer exists. Matches by
metadata.filepath so multi-folder same-basename files don't collide.
Legacy rows without filepath metadata are left alone — they get cleaned
by sweep_orphans.py."""
try:
pg = get_pg()
try:
cur = pg.cursor()
cur.execute(
"DELETE FROM embeddings WHERE metadata->>'filepath' = %s",
(str(filepath),),
)
deleted = cur.rowcount
pg.commit()
if deleted:
log.info(f"Deleted {deleted} chunks for removed file: {filepath}")
finally:
pg.close()
except Exception as e:
log.warning(f"Could not delete embeddings for {filepath} (non-fatal): {e}")
def remove_from_state(filepath: Path):
"""Drop a deleted file from watcher_state.json so it isn't carried as
'known mtime' indefinitely."""
try:
state = load_state()
key = str(filepath)
if key in state:
del state[key]
save_state(state)
except Exception as e:
log.warning(f"Could not update state for deleted {filepath} (non-fatal): {e}")
IGNORED_TOP_FOLDERS = {"Drafts"}
def ingest_file(filepath: Path, embedder) -> int: def ingest_file(filepath: Path, embedder) -> int:
if filepath.name.startswith(("~$", ".")): if filepath.name.startswith(("~$", "~", ".")):
return 0 return 0
if filepath.suffix.lower() not in SUPPORTED: if filepath.suffix.lower() not in SUPPORTED:
return 0 return 0
text = extract_text(filepath)
if not text.strip():
return 0
chunks = chunk_text(text)
if not chunks:
return 0
try: try:
embeddings = embedder.encode(chunks).tolist() rel = filepath.parent.relative_to(NEXTCLOUD_PATH)
if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
return 0
except ValueError:
pass
blocks = extract_blocks(filepath)
if not blocks or not any(
(b.get("text") or "").strip() or (b.get("heading") or "").strip()
for b in blocks
):
record_ingest_failure(filepath, "Text extraction failed or empty")
return 0
folder_rel = None
try:
folder_rel = str(filepath.parent.relative_to(NEXTCLOUD_PATH))
except ValueError:
pass
try:
rows = chunk_and_embed(blocks, filepath.name, embedder,
filepath=filepath, folder=folder_rel)
except Exception as e: except Exception as e:
log.error(f"Embedding failed for {filepath.name}: {e}") log.error(f"Embedding failed for {filepath.name}: {e}")
record_ingest_failure(filepath, f"Embedding failed: {e}") record_ingest_failure(filepath, f"Embedding failed: {e}")
return 0 return 0
if not rows:
return 0
source = filepath.name source = filepath.name
try: try:
pg = get_pg() pg = get_pg()
cur = pg.cursor() try:
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)): write_embeddings_batch(pg, rows)
chunk_id = make_chunk_id(filepath, i) finally:
cur.execute(""" pg.close()
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
ON CONFLICT (id) DO UPDATE SET
document = EXCLUDED.document,
embedding = EXCLUDED.embedding,
source = EXCLUDED.source,
metadata = EXCLUDED.metadata
""", (chunk_id, chunk, embedding, source, "document",
json.dumps({"source": source, "filepath": str(filepath)})))
pg.commit()
pg.close()
except Exception as e: except Exception as e:
log.error(f"pgvector write failed for {filepath.name}: {e}") log.error(f"pgvector write failed for {filepath.name}: {e}")
record_ingest_failure(filepath, f"pgvector write failed: {e}") record_ingest_failure(filepath, f"pgvector write failed: {e}")
return 0 return 0
log.info(f"Indexed {len(chunks)} chunks: {filepath.name}") log.info(f"Indexed {len(rows)} chunks: {filepath.name}")
resolve_ingest_failure(source) resolve_ingest_failure(source)
enqueue_stage2(source, text) full_text = "\n".join(
return len(chunks) f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
for b in blocks
)
enqueue_stage2(source, full_text)
return len(rows)
def ingest_files(paths: list, embedder, state: dict) -> dict: def ingest_files(paths: list, embedder, state: dict) -> dict:
@@ -224,7 +220,8 @@ def ingest_files(paths: list, embedder, state: dict) -> dict:
for path in paths: for path in paths:
count = ingest_file(path, embedder) count = ingest_file(path, embedder)
total += count total += count
state[str(path)] = str(path.stat().st_mtime) if count > 0:
state[str(path)] = str(path.stat().st_mtime)
log.info(f"Ingestion complete. {total} chunks across {len(paths)} files.") log.info(f"Ingestion complete. {total} chunks across {len(paths)} files.")
return state return state
@@ -252,12 +249,24 @@ def get_changed_files(state: dict) -> list:
continue continue
if path.suffix.lower() not in SUPPORTED: if path.suffix.lower() not in SUPPORTED:
continue continue
if path.name.startswith((".", "~$")): if path.name.startswith((".", "~$", "~")):
continue continue
if "Admin/Backups" in str(path) or "Backups" in path.parts: if "Admin/Backups" in str(path) or "Backups" in path.parts:
continue continue
if "Journal/Media" in str(path): if "Journal/Media" in str(path):
continue continue
if "Generative Design" in path.parts and "Processing" in path.parts:
continue
if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
continue
if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
and "Presentations" in path.parts:
continue
if path.name == "GH Slicer Notes [Autosaved].pptx" \
and "DDF555 3D Computational" in path.parts:
continue
if path.stat().st_size == 0:
continue
if state.get(str(path)) != str(path.stat().st_mtime): if state.get(str(path)) != str(path.stat().st_mtime):
changed.append(path) changed.append(path)
return changed return changed
@@ -336,12 +345,22 @@ class IngestHandler(FileSystemEventHandler):
self.last_event = 0 self.last_event = 0
def _should_ignore(self, path: Path) -> bool: def _should_ignore(self, path: Path) -> bool:
if path.name.startswith((".", "~$")): if path.name.startswith((".", "~$", "~")):
return True return True
if "Admin/Backups" in str(path) or "Backups" in path.parts: if "Admin/Backups" in str(path) or "Backups" in path.parts:
return True return True
if "Journal/Media" in str(path): if "Journal/Media" in str(path):
return True return True
if "Generative Design" in path.parts and "Processing" in path.parts:
return True
if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
return True
if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
and "Presentations" in path.parts:
return True
if path.name == "GH Slicer Notes [Autosaved].pptx" \
and "DDF555 3D Computational" in path.parts:
return True
return False return False
def on_created(self, event): def on_created(self, event):
@@ -367,15 +386,47 @@ class IngestHandler(FileSystemEventHandler):
def on_moved(self, event): def on_moved(self, event):
if event.is_directory: if event.is_directory:
return return
src = Path(event.src_path)
dest = Path(event.dest_path)
# If destination is outside NEXTCLOUD_PATH (e.g., Nextcloud trashbin at
# /home/aaron/nextcloud/data/data/aaron/files_trashbin/), treat as a
# delete — the file is no longer in the watched corpus.
try:
dest.relative_to(NEXTCLOUD_PATH)
except ValueError:
if src.suffix.lower() in SUPPORTED:
log.info(f"Event: moved out of tree {src} -> {dest}")
threading.Thread(
target=lambda: (
delete_embeddings_for_path(src),
remove_from_state(src),
),
daemon=True,
).start()
return
# Nextcloud WebDAV writes .part temp files then renames to final path. # Nextcloud WebDAV writes .part temp files then renames to final path.
# src_path is the .part file; dest_path is the final filename. # src_path is the .part file; dest_path is the final filename.
dest = Path(event.dest_path)
if dest.suffix.lower() not in SUPPORTED or self._should_ignore(dest): if dest.suffix.lower() not in SUPPORTED or self._should_ignore(dest):
return return
log.info(f"Event: moved -> {dest}") log.info(f"Event: moved -> {dest}")
self.pending = True self.pending = True
self.last_event = time.time() self.last_event = time.time()
def on_deleted(self, event):
if event.is_directory:
return
path = Path(event.src_path)
if path.suffix.lower() not in SUPPORTED:
return
log.info(f"Event: deleted {path}")
threading.Thread(
target=lambda: (
delete_embeddings_for_path(path),
remove_from_state(path),
),
daemon=True,
).start()
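The `on_moved` change treats a move whose destination lies outside the watched tree as a delete, using `relative_to` and its `ValueError`. The sketch below shows why a parts-based check is a reasonable fit for the trashbin case, where a plain string prefix test would be fooled. `is_inside`, `classify_move`, and the sample paths are illustrative, not from the repository.

```python
from pathlib import PurePosixPath


def is_inside(root: PurePosixPath, candidate: PurePosixPath) -> bool:
    """True when candidate sits under root, compared component by component."""
    try:
        candidate.relative_to(root)
        return True
    except ValueError:
        return False


def classify_move(root: PurePosixPath, dest: PurePosixPath) -> str:
    """Label a move the way on_moved branches: keep watching, or drop."""
    return "ingest" if is_inside(root, dest) else "delete"


ROOT = PurePosixPath("/data/aaron/files")
TRASH = PurePosixPath("/data/aaron/files_trashbin/cv.docx")
KEPT = PurePosixPath("/data/aaron/files/Teaching/cv.docx")
```

Note that `str(TRASH).startswith(str(ROOT))` is true, since `files_trashbin` begins with `files`, while the component-wise check correctly reports the trashbin path as outside the root.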
def on_closed(self, event): def on_closed(self, event):
# FileClosedEvent fires on the final file after Nextcloud completes write. # FileClosedEvent fires on the final file after Nextcloud completes write.
# Belt-and-suspenders catch for any write pattern not caught by on_moved. # Belt-and-suspenders catch for any write pattern not caught by on_moved.