Implements `dreamer-design-spec.md` lines 27-74: observe_corpus() returns a
signal vector (new_chunks delta, new_journal_entries, recent_questions over
14-day window, days_since_dream, underprocessed_count derived from the new
consolidation cursor); select_mode() returns one of {nrem, early-rem,
late-rem, lucid} or None per the spec's rules. The None return is the spec's
canonical answer to the repetition problem (line 67) — "dreamer goes quiet
rather than manufacturing novelty."
Standalone for now. Not wired into dream_pipeline yet — that happens in the
retrieve() refactor (task #46). dream.py is unchanged in this commit.
Grounded sources cited in module docstring: Friston Active Inference, sleep
research (Stickgold/Walker/Diekelberg & Born), sharp-wave ripples (Buzsáki).
All three appear in BirdAI-Bibliography.md.
Migration prerequisite (already shipped in the prior commit): consolidation
cursor columns last_consolidated_at + consolidation_count added to
embeddings. Backfill from dream-manifest history is task #49.
After chat() returns, fire-and-forget background thread POSTs the (user
message + assistant response) as one episode to /episodes. Default extraction
(Sonnet). Errors logged, never raised — chat is not gated on the write.
Wall-clock cost in the background is ~20 min per episode against the
current ~4,300-entity graph. The chat experience is unaffected; the graph
catches up with a delay. Search_facts queries reflect new turns once the
sidecar has finished processing them.
Kill-switch: SKIP_GRAPHITI_CHAT_PUSH=1 in the api service environment
disables the push without code changes. Useful if dedup contention surfaces
under sustained load.
Companions to this commit: search_facts tool (e96bf40), orientation indexer
worker (e96bf40), FalkorDB vector index patches (d2ec20e, 313c0f0).
After establishing that single-episode Graphiti writes take ~20 min against
the existing graph (the dedup loop is structurally slow regardless of the
patches, the bridge, or the LLM model), the salvage plan is to stop trying
to write to Graphiti and instead:
1. Use the existing 4,300-entity graph as a read-only fact layer at chat
time via a new search_facts tool. Graphiti's /search endpoint is fast
(~15ms direct, ~400ms over HTTP); the graph is stale-as-of-early-May
but covers most biographical / relational content that "write me a bio"
and similar queries care about.
2. Pipe Stage 2's document-level orientations into pgvector via a new
orientation_indexer worker. Stage 2 already runs and writes orientation
text to stage_3_queue for every Mistral-processed document; the worker
reads those, embeds them, and writes one row per source to embeddings
with metadata->>'kind'='orientation'. retrieve_documents now ranks
against both chunk text and document-level concept summaries.
Idempotent: the indexer's "is this already indexed" check is an EXISTS
subquery against embeddings, so restarts and partial runs are safe.
Out of scope (deliberately): no Graphiti writes from chat, no Stage 2 ->
Graphiti bridge, no draining the 711-item stage_3_queue backlog into
Graphiti. Rich-extraction posture stays a BirdAI concern.
graphiti-core 0.29.0 builds FalkorSearchOperations as driver._search_ops in
FalkorDriver.__init__ but never assigns it to driver.search_interface.
search_utils.py dispatches on search_interface; without this one-line bridge
it falls back to interpreted-Cypher cosine math doing full table scans for
every entity dedup similarity check.
Combined with the vendored patches in graphiti_patches/ (restored in the
previous commit d2ec20e), this activates FalkorDB's native vector index for
the dedup similarity path. Empirical impact (per the original f645b74 commit
message): single-episode add_episode against a ~4,277-entity graph went from
indefinite hang to ~8.2 seconds.
Surgical restore: cherry-picks only the bridge code from f645b74 — not the
Pattern 1 async job model, not the v2.4 extraction instructions, neither of
which we want. Default extraction posture (taxonomy-naïve) stays the
operating mode. Rich-extraction story remains a BirdAI concern.
watcher.py now listens for on_deleted events and treats on_moved
destinations that fall outside NEXTCLOUD_PATH (Nextcloud trashbin, moves
to other volumes) as deletes. Both cases call delete_embeddings_for_path
(DELETE WHERE metadata.filepath = ...) and remove_from_state to drop the
file from watcher_state.json so it isn't carried as known-mtime.
Match is by metadata.filepath, not source basename, so files that share a
name across folders don't collide.
scripts/sweep_orphans.py is the one-time cleanup for chunks the watcher
missed before this fix:
- Modern pass: rows with metadata.filepath whose file no longer exists.
- Legacy pass: rows with NULL filepath and type='document' whose basename
isn't anywhere on disk. type='document' restriction skips conversations
and memory snapshots (synthetic sources, not files on disk).
First run cleaned 629 rows: 628 from moved-file duplicates (e.g., BirdAI
docs that traveled across Journal/, Library/, Journal/Projects/BirdAI/)
plus the AARON_NELSON_BIO.pdf phantom Aaron flagged.
- MAX_RETRIEVALS_PER_TURN (5): after five retrieve_documents calls in a single
turn, further calls return a budget-exhausted message instead of executing.
Caps cost on runaway multi-query loops without forbidding compound questions.
- MAX_CITED_SOURCES (5): accumulated_sources was growing to 14+ entries across
multiple tool calls and showing chunks Claude never actually used. Cap the
list returned to the UI at 5, preserving insertion order so the
highest-relevance early-call results survive. Proper fix (Claude-driven
inline citations) is bigger work, noted for later.
- ingest.py lock-file skip: changed prefix tuple from ("~$", ".") to ("~", ".")
so it catches Office lock files even when Nextcloud's filesystem encoding has
mangled the "$" into a unicode replacement char. Matches what watcher.py
already does.
Previous prompt let Aaron skip the preview if he asked up front. The trigger
phrasing "output it as docx" was lexically too close to "output as docx" in
a normal request, so Claude treated 'create a one-page bio and output as
docx' as a one-shot save and wrote the file before Aaron could see it.
Removed the escape hatch. Draft-then-commit is now the only flow.
The previous system prompt instructed Claude to skip duplicating document
content in chat and write the file directly. That produced no-preview UX:
the user asked for a bio and the docx appeared in Drafts/ before they had
a chance to read or refine it. Reversed: Claude now drafts in chat first,
waits for an explicit save signal, and only then calls save_document. The
explicit "skip preview" escape hatch is preserved for one-shot flows.
The systemd unit pins PATH to the venv only, so subprocess.run(['pandoc', ...])
raised FileNotFoundError even though pandoc was installed at /usr/bin/pandoc.
The handler's "pandoc not installed" message was misleading — pandoc was
reachable from a login shell but not from the service. Rephrased to point at
the actual cause: the service's PATH. The systemd drop-in to extend PATH is
not committed here (lives at /etc/systemd/system/aaronai.service.d/path.conf
on the host).
Claude can now write docx or pdf files to Aaron's Nextcloud Drafts/ when he
asks for a document (bio, cover letter, statement, CV section) rather than
chat text. Pandoc handles markdown -> docx and markdown -> pdf with the
xelatex engine. Upload is a WebDAV PUT against the same Nextcloud instance
dream.py already uses; NEXTCLOUD_URL / NEXTCLOUD_USER / NEXTCLOUD_PASSWORD
in .env are reused. MKCOL ensures Drafts/ exists; PROPFIND-based collision
check appends _2, _3, ... until unique. Filename sanitization strips path
components and unsafe characters.
System prompt instructs Claude to call save_document when the user wants a
file (not chat text) and not to duplicate the file contents in the chat
response — just write the file and tell Aaron where it landed.
ingest.py and watcher.py now skip files under Drafts/ at ingest time so
generated drafts don't pollute future retrieval. Drafts can still be opened,
edited, and shipped; they just don't become part of the searchable corpus
unless Aaron explicitly moves them out of Drafts/.
Move persistent memory from the user message into system blocks with
cache_control: ephemeral on the last block. The static prefix (system prompt +
memory, ~3-5K tokens typically) is identical between the two LLM calls of a
tool_use round-trip and stable across turns within the 5-minute cache TTL.
Without this, the tool-call retrieval architecture roughly doubled input
token cost on retrieval-needed turns (full context billed twice). With cache
reads at ~10% of standard input, the duplication cost drops by ~90% — the
"twice as expensive" hit becomes "slightly more expensive plus tool overhead."
client_time stays in the user message (per-turn dynamic, should not be in the
cached prefix).
Removes classify_retrieval_intent and the type/folder filter parameters on
retrieve_context. The keyword classifier was the same anti-pattern as the
formatting-driven docx chunker: a heuristic that locks the user into specific
phrasings and fails silently on anything novel. A scope enum (personal /
library / conversations / memory) would have been the same heuristic in a
fancier wrapper — the categories themselves are mine, not Aaron's.
New shape: a retrieve_documents tool exposed to Claude. Tool takes a single
query argument; the model decides when to call it, what to search for, and
how many times per turn (multi-query falls out naturally for compound asks).
Pre-LLM retrieval is gone — memory still rides as ground truth in the prompt,
but corpus content is fetched on demand by the model with concrete queries
it crafts itself, not the user's raw phrasing.
retrieve_context is now pure: hybrid retrieval + cross-encoder rerank + dedup,
no filters. The reranker ranks, the model judges relevance. When ranking
fails (e.g. abstract instructional queries pulling philosophy books), the
right fix is a better reranker, not another query-time taxonomy. That work
is acknowledged but deferred.
System prompt updated to teach the model about the tool and to prefer
concrete tokens (named entities, project names, course codes) over abstract
phrasing when constructing search queries.
extract_blocks(filepath) is the new structured-extraction entry point, returning
list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk
back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading
prepended for retrieval context and stored in metadata).
- pptx: one block per slide. Slide title becomes block heading; speaker notes
fold into the body. Image-only decks with title-only slides now produce
heading-only chunks instead of being recorded as extraction failures.
- docx: deliberately single-block (back-compat). Heading-style section detection
was implemented and rolled back: hand-formatted CVs are Normal-styled with
bold-as-heading, and tying chunk boundaries to formatting choices would lock
future-user into preserving those choices forever. Lexical + cross-encoder
retrieval already handles substring matching inside blind-chunked CVs.
- pdf/txt/md: unchanged (single block, blind chunking).
Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it
as secondary sort key in _rerank so memory/journal snapshots prefer the latest
copy among near-duplicate content.
reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a
subset; previous hardcoded delete regex would have wiped both even with a
single-ext target.
Three refinements to retrieve_context, all keyed off observed failures from
test_retrieval.py:
- Library/personal split. classify_retrieval_intent now returns
(type_filter, folder_exclude_prefixes). Biographical document intent excludes
Library/* so philosophy/cognition books stop crowding out CVs and dossiers
for queries like "write me a bio".
- Near-duplicate collapse. Multi-folder copies of the same file (e.g., several
Teaching Philosophy.pdf in different application folders) used to fill the
top-N with the same content. Dedup by first-300-chars hash after rerank.
- Folder in source citations. Surface metadata.folder alongside basename so
the LLM can disambiguate among 21 CV.docx variants and the user can see
which copy a citation refers to.
Also: bump hnsw.ef_search to 500 when a WHERE filter is present.
pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a
restrictive filter that excludes the nearest neighbors otherwise returns
empty.
Replaces pure-dense top-8 retrieval with a three-stage pipeline:
- BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel,
fused with Reciprocal Rank Fusion
- Optional type filter driven by classify_retrieval_intent() so questions
about prior conversations don't pull documents and vice versa
- Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before
taking final top-N
Also adds scripts/reindex_docx_pptx.py — one-off re-ingest used to recover
table/header/text-box content in docx and pptx after the 93c0d89 extractor
upgrade — and scripts/test_retrieval.py to exercise the new pipeline against
representative queries.
Schema: requires GIN index on to_tsvector('english', document) (already
created out-of-band via psql since Apache AGE in shared_preload_libraries
blocks ALTER TABLE on this database).
Empty transcripts and transcription failures previously
deleted the temp audio and returned without writing any
record to disk — violating parity-at-encode (raw content
is episodic context, not noise).
- Preserve audio in Journal/Media/YYYY-MM/ on all paths
(success, empty, failure) instead of unlinking.
- Write a markdown entry to Journal/Captures/ on failure
paths with status, audio_path, and error fields.
- Add status: saved to successful captures so frontmatter
is uniform across success and failure.
- Fire SSE capture_saved events on all terminal paths,
with status included.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-user personal app threat model is theft-of-device, not
stolen-cookie. 30-day idle re-prompts created friction without
proportional security benefit. Server TTL and client max-age
remain in sync via shared constant.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- session_exists() now rejects rows older than 30 days,
matching the client cookie max-age.
- Opportunistic cleanup of expired rows on session_exists()
calls, preventing unbounded growth of sessions.db from
orphaned tokens (PWA reinstalls, manual cookie clears).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional commit=True parameter to write_embeddings_batch. When True
(default, matching prior behavior), the function commits the connection
after the per-row UPSERT loop. When False, the caller manages the
transaction.
This unblocks fix#1 (pgvector-bypass paths) and fix#2 (watcher
two-transaction pattern), both of which need to compose embeddings writes
with other database writes in the same transaction. Without this lever,
either fix would require duplicating the UPSERT logic outside this helper
or introducing a second commit boundary inside an otherwise atomic
operation.
No behavior change for existing callers — they all use the default
commit=True and continue working unchanged.
The capture endpoint (api.py:702, 833) writes Journal/Captures/*.md
files with a markdown-bold-style header block (`**type:** voice`,
`**modality:** audio`, `**status:** unprocessed`, optional `**media:**`
and `**project:**`) followed by a `---` separator. extract_text for .md
was a bare filepath.read_text, so every capture-derived chunk in
pgvector embedded the frontmatter as raw text, polluting retrieval.
Fix adds _strip_md_frontmatter, called only for the .md branch:
- Capture-style: optional leading H1 (preserved), then consecutive
`**key:** value` lines (and blanks), terminated by `---`. The H1 is
retained; the key/value block + separator are removed.
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
Only triggered when no heading precedes — guards against the common
`# Title` + `---` (horizontal rule under heading) pattern seen in
Journal/aaronai-architecture.md and four other Journal/*.md files.
Body `**bold:**` lines (e.g. `**Visual description:**` in image
captures) and body `---` horizontal rules are never touched: the scan
aborts as soon as a non-frontmatter line appears in the leading block.
briefing_generator_v2.py's split("---", 1) heuristic was reviewed and
not reused — fragile on substring matches and on documents with
multiple `---` rules.
Verified against:
- 2026-04-26-22-44-voice.md: frontmatter stripped, body retained, H1
retained.
- 2026-04-27-04-34-image.md: frontmatter stripped, `**Visual
description:**` and `**Voice annotation:**` body bold-headers
retained, trailing `---` not consumed.
- Journal/aaronai-architecture.md (5 body `---` rules): output
byte-identical to read_text (96101 chars).
- Synthetic YAML doc: stripped correctly when no leading heading.
- Synthetic plain markdown with body `---` rules: untouched.
- Empty input + heading-only file: untouched.
Existing capture chunks in pgvector retain polluted text; the fix only
affects future extractions. Backfill decision deferred — the cleanest
path is `touch -h Journal/Captures/*.md` to bump mtime and let the
watcher re-ingest naturally on the next cycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three files in the original ingest_failures cohort have been
characterized via direct OCR and confirmed to lack ingestible text:
- Presentations/Renders.pptx — 35 PICTURE-shape renders, 33/35 zero-char
on OCR, 2 with noise (20 and 29 chars).
- Presentations/Ribbon Cutting Slideshow.pptx — 10-slide event photo
deck, 9/10 zero-char, 1 with 17 chars of noise.
- Academic/DDF555 3D Computational/GH Slicer Notes [Autosaved].pptx —
Office autosave duplicate of GH Slicer Notes.pptx; first 9 images
byte-identical (sha256) to the canonical file. 2 net-new images
contribute 36 noisy chars. Excluding to prevent double-embedding the
same content under two source filenames.
Pattern matches f18fb64 (path.parts membership). Folder-level globs
were considered and rejected: /Presentations/ contains successfully
embedded text-bearing decks (aaronnelson_3D 4D.pptx,
aaronnelson_slideslam.pptx). Exact-name + parent-folder membership
applied in both watcher filter sites (get_changed_files and
IngestHandler._should_ignore).
The fourth file in the cohort, GH Slicer Notes.pptx (the canonical
non-autosave version), was confirmed to carry 379 chars of real text
(Grasshopper UI / code samples) across 6/9 images. It remains in
ingest_failures unresolved, awaiting the eventual ocrmypdf backlog
pass.
Cleanup: 3 ingest_failures rows resolved (the excluded files).
Unresolved count: 94 → 91.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The messages table declares FOREIGN KEY (conversation_id) REFERENCES
conversations(id), but PRAGMA foreign_keys was never enabled — SQLite
defaults it to OFF per connection, and _connect() did not set it. Two
orphan rows existed in messages (conversation_id='test123' pointing at
a never-existing conversation; both rows from one ~11-second test event
on 2026-04-26).
Audit before changing the PRAGMA:
- All FOREIGN KEY declarations across both DBs (conversations.db,
sessions.db) accounted for via PRAGMA foreign_key_list on each
table. Only one FK exists: messages.conversation_id ->
conversations.id, ON DELETE NO ACTION.
- All tables enumerated via sqlite_master. Two tables in
conversations.db (conversations, messages); one in sessions.db
(sessions). No surprises.
- PRAGMA foreign_key_check confirmed exactly the 2 known orphans and
zero violations elsewhere.
Both delete paths in api.py (delete_conversation at :471, and
clear_all_conversations at :986) already delete from messages BEFORE
conversations, so cascade behavior was correct in code. The orphan
state was caused by a direct INSERT against a non-existent
conversation_id at chat-test time, which an unenforced FK silently
accepted. Turning the PRAGMA on prevents this class of bug at insert
time, not delete time — no delete-path code changes were needed.
Order of operations followed the constraint that orphan cleanup must
precede PRAGMA-on (SQLite would not retroactively delete orphans, but
foreign_key_check would surface them confusingly on any future
operation that touched the messages table):
1. DELETE FROM messages WHERE conversation_id NOT IN (SELECT id FROM
conversations) — removed the 2 known orphans.
2. Added PRAGMA foreign_keys=ON to _connect() so every connection
from _connect_conversations() and _connect_sessions() gets FK
enforcement (SQLite requires per-connection setting).
3. Restarted aaronai.service.
Verification:
- Smoke: GET /api/conversations and /api/conversations/{id}/messages
both return 200 with expected payloads against the live api.
- E2E single-delete: synthetic conversation + 2 messages inserted via
the api's _connect helper (FK on); DELETE /api/conversations/{id}
via the live endpoint removed both rows from both tables.
- Clear-all e2e: skipped on live DB (destructive) — code shape is
structurally identical to single-delete, no FK-relevant logic
difference.
- Load-bearing negative test: INSERT into messages with a
non-existent conversation_id via _connect_conversations() raised
sqlite3.IntegrityError("FOREIGN KEY constraint failed"). This is
what proves the PRAGMA actually took effect, not just that we set
it.
Final counts: 7 conversations, 290 messages (down from 292 by the 2
orphans cleaned up).
Note: an explicit BEGIN/COMMIT around the two-execute delete paths
was considered and skipped. SQLite's implicit-transactional default
already gives the atomicity needed; explicit transactions would be
clarity-only and belong in a separate commit.
Two correctness bugs in dream_pipeline manifest assembly.
write_manifest at lines 487-491 swallowed HTTP 4xx/5xx responses
silently. requests.put() only raises on transport-level errors (DNS,
connection refused, timeout); 401/403/500/507 come back as Response
objects and never trigger the except. The code printed "Manifest
written" while the manifest never persisted. The same file's deliver()
function at line 434 already used response.raise_for_status() — the
pattern was already established, write_manifest just skipped it.
Fix: bind the response and call raise_for_status() before the success
print. The except message changes from "(non-critical)" to "manifest
not persisted" because HTTP failure now means manifest data was lost,
which is critical, not quiet.
corpus_data["total_chunks"] at lines 621-622 stored
delta["new_chunks"], duplicating the sibling field
new_chunks_since_last_dream. The field name claimed absolute corpus
size; the value was a delta of recently-touched files. Verified in
live manifests: total_chunks: 0 while pgvector held 11,379+ document
embeddings.
Fix: query SELECT COUNT(*) FROM embeddings inside dream_pipeline,
store as total_chunks. Tightly-scoped one-shot connect via the
existing get_pg() helper. Telemetry query failure is treated as
non-critical and falls back to 0 — pgvector hiccup should not crash
an otherwise successful dream pipeline.
Bonus finding (not fixed in this commit): new_chunks_since_last_dream
is itself misnamed. observe_corpus() reads the watcher's mtime cache
and counts files (not chunks) whose mtime is newer than last_dream.
Both fields were "files touched since last dream" duplicated under
two different names; this commit fixes only the total_chunks
semantics. Renaming new_chunks_since_last_dream is out of scope —
manifests are write-only telemetry today, no consumer reads either
field, and the rename is a separate decision.
Verification: real pipeline run produced manifest with total_chunks
matching SELECT COUNT(*) directly; doubled as a smoke test for the
embedder cache (single Loading weights line), type_distribution
propagation, and the manifest write success path.
Three rows in ingest_failures were Office lockfile leftovers whose
filename starts with ~� (~ followed by the UTF-8 replacement
character) instead of ~$. Somewhere in the Nextcloud sync chain the $
byte was lost or replaced; the file now lives on disk as a real file
with this corrupted name. The watcher's ("~$", ".") prefix filter
didn't match, so each cycle tried to ingest these as pptx, hit
BadZipFile inside python-pptx (lockfiles aren't real Office documents),
and they ended up permanently in ingest_failures.
Three filter sites in watcher.py applied the lockfile prefix check:
- ingest_file() at :127
- get_changed_files() at :200
- IngestHandler._should_ignore() at :290
All three now match ("~$", "~", ".") — broadened to catch any tilde
prefix, not just ~$. The cross-check against pgvector embeddings and
disk found zero legitimate tilde-prefixed files in the corpus, so the
broader filter has no false-positive risk in this corpus.
Cleanup: 3 ingest_failures rows resolved (filepath LIKE '%/~%').
Unresolved count drops 97 → 94.
If a fourth filter site is ever added, the right shape is consolidating
the lockfile prefix check to a shared function or constant. Three
parallel sites with three different tuple orderings is acceptable for
now but worth normalizing if the surface grows.
The previous extractors walked only top-level body paragraphs (docx) and
top-level shape.text (pptx). Diagnostic on the 17 non-PDF "no_text"
ingest failures revealed that 13 docx files in the failure cohort have
100% of their content in tables (paras_with_text=0, table_cells=6-108).
These are syllabi, rosters, rubrics, and homework worksheets structured
as a single document-wide table — high-value academic content the corpus
was silently missing.
docx walker now covers:
- body paragraphs (existing)
- tables, including nested tables in cells (recursive helper)
- header and footer paragraphs per section
- text-box content via XPath against w:txbxContent (no first-class API
in python-docx; future-proofing — none of the current failure cohort
has text-boxes)
pptx walker now covers:
- top-level shape text (existing)
- recursive descent into group shapes
- table cell text via shape.has_table / shape.table.iter_cells()
- speaker notes via slide.notes_slide.notes_text_frame.text
Out of scope: SmartArt diagrams, chart titles/labels, OLE objects,
content controls. None of the current failure cohort has these.
Recovery: 13 of 17 failures now ingest successfully. The 4 remaining are
image-only pptx files (Renders.pptx, Ribbon Cutting Slideshow.pptx, two
GH Slicer Notes variants — all PICTURE-shape decks with no text in any
walkable structure). They stay in ingest_failures unresolved, awaiting
OCR or path exclusion.
Side effect worth noting: the regression check on 4 known-good files
that were already producing embeddings showed all four gained content
under the new walker — a Mod03 pptx grew from 23,993 to 57,462 chars
(+33,469), Braskem Report docx grew 33,050 to 38,977 (+5,927), DDF MA
program docx grew 37,210 to 47,603 (+10,393), SUNY PIF GRANT pptx grew
22,259 to 23,546 (+1,287). These files have been in the corpus all
along with table or notes content silently dropped. They will surface
the additional content on next re-ingest, improving retrieval quality
for any future query that touches them.
Cleanup: ingest_file already calls resolve_ingest_failure on successful
ingest, so the 13 recovered files were marked resolved=TRUE during the
retry pass. No separate cleanup SQL was needed.
Two-sample diagnostic of the 128 ingest_failures rows surfaced two
folders whose contents are exclusively non-text PDFs (iText-produced
generative graphics from Processing sketches and computational design
sketches) and three zero-byte test artifacts. None of these have ever
produced an embedding chunk, and they have nothing extractable to
contribute. Excluding them removes 19 / 128 (15%) of the locked-out
failures from the cohort and prevents future versions of the same
patterns from re-failing.
Folder exclusions use path.parts membership rather than substring
matching — eliminates false-match risk if similarly-named folders
appear elsewhere in the corpus (e.g. an unrelated "Generative Design"
or "Computational Design 2017" directory created later). The existing
"Admin/Backups" / "Journal/Media" substring checks are looser, but
new exclusions take the tighter pattern.
Zero-byte filter goes in get_changed_files() only — the actual
ingestion gate. Adding stat() to _should_ignore() (the FS-event noise
filter) would introduce a race where the file is gone between event
fire and stat call. Empty files briefly trigger pending=True but
produce no work after debounce; cosmetic only.
Cleanup applied separately via UPDATE: 19 ingest_failures rows for
these paths marked resolved=TRUE. Unresolved-failure count: 129 -> 110.
Verified: get_changed_files() with empty state returns 1418 changed
files; all 5 excluded probes (2 folder-matched + 3 zero-byte) absent
from the result, control file present. Watcher service restarted
clean; startup scan reports no missed files.
ingest_files() updated state[path] = mtime unconditionally after every
ingest_file() call. ingest_file() returns 0 when text extraction fails,
embedding fails, no chunks are produced, or the pgvector write fails —
in every one of those cases, the path was still recorded as ingested
at the current mtime. On the next pass, get_changed_files() saw the
mtime match and skipped the file, locking it out of the corpus until
something modified it on disk.
record_ingest_failure() writes to a UI-visible failures table, but
nothing reads that table to retry. So failures accumulated silently:
the file was simultaneously logged as failed AND tracked in
watcher_state as up-to-date, and the second condition won.
Fix: only update watcher_state when ingest_file returns count > 0.
Failed ingests will be retried on the next watcher cycle until they
succeed or are explicitly excluded.
Diagnostic at fix time: 129 rows in ingest_failures, 128 currently
locked out of the corpus (filepath in watcher_state with mtime matching
current disk). 128/129 are text_extraction failures, mostly scanned
PDFs (106 .pdf, 13 .docx, 7 .pptx, 2 .md, 1 .txt). 1 source no longer
exists on disk. 0 have had their disk mtime change since failing — i.e.
without this fix, none of them would ever retry. Cross-check shows
watcher_state has 1466 paths vs. 1061 distinct sources in pgvector
embeddings, leaving a residual silent-gap of ~276 files after
accounting for failures.
Historical cleanup of files already locked out by this bug is tracked
separately. New failures from this commit forward will retry naturally.
Followup to 4204806 (WAL + index + backup.sh). The previous commit
deferred synchronous=NORMAL because it's a per-connection PRAGMA and
api.py has 16 sqlite3.connect() call sites — setting it once at init
would have applied to nothing afterwards.
Adds three helpers near the *_DB constants:
- _connect(path): inner; sets PRAGMA synchronous=NORMAL and uses
timeout=5.0 (5000ms busy_timeout) on every new connection.
- _connect_conversations(), _connect_sessions(): named wrappers so call
sites read explicitly.
Mechanical replacement at all 16 call sites: 4 sessions, 12 conversations.
No semantic change beyond the PRAGMA + busy_timeout — every site still
opens-then-closes, no held-open connections.
busy_timeout=5000ms is cheap insurance: under WAL with api.py as sole
writer, contention should be near-zero, but the backup.sh online-backup
path briefly holds a read lock on the source, and any future second
writer would otherwise hit SQLITE_BUSY immediately on contention.
Combined effect with WAL: per-write fsync count drops from ~2 to ~1
(WAL alone) further reduced by synchronous=NORMAL deferring fsyncs to
checkpoint boundaries. No durability loss for the use case (single
host, app crash tolerated, OS crash gives at most one lost transaction).
Not included: foreign_keys=ON. Audit found 2 orphan rows in messages
(conversation_id pointing to deleted conversations) and untested write
paths that could begin raising IntegrityError. Tracked as separate
followup: inspect orphans, identify the delete path that didn't
cascade, clean up, then enable enforcement and test chat delete flow
end-to-end.
Both databases ran with journal_mode=delete — every write rewrote the
rollback journal per transaction. WAL eliminates the journal-rewrite and
lets readers run without blocking writers.
Index on messages(conversation_id, timestamp DESC) is preventive — only
280 rows today, but the access pattern (load conversation history in
order) is exactly what a composite index serves, and we don't want to
re-revisit this when the table grows.
backup.sh updated in the same commit because WAL changes the on-disk
layout: a bare `cp` of just the .db file can miss recently-committed
transactions that still live in the -wal sidecar, and can race with
concurrent writes to produce a torn file. Switched to the SQLite Online
Backup API via python3 -c "...src.backup(dst)..." — same mechanism as
the sqlite3 CLI's `.backup` (which isn't installed on this host),
handles WAL correctly without forcing a checkpoint, and is non-locking
from the writer's perspective. Verified backup integrity_check returns
ok and row counts match.
Note: synchronous=NORMAL was considered but deferred — it's a
per-connection PRAGMA, and applying it correctly requires a connect
helper that wraps every sqlite3.connect() call site in api.py (~14
sites). Out of scope for this commit; tracked as a follow-up. WAL alone
delivers the journal-rewrite elimination and reader/writer concurrency
improvements; the additional fsync reduction from synchronous=NORMAL is
a smaller marginal win on top.
Confirmed via concurrency audit that api.py is the sole writer to both
databases. ingest_conversations.py and dream.py are read-only consumers
of conversations.db; nothing else touches sessions.db.
Embedder was instantiated at module import (~30-60s, ~200MB) regardless
of whether new conversations existed. On nights with no new content
(most nights per the logs), the script paid the load cost and exited
immediately. ingest.py:134 already uses lazy loading; this brings the
two ingest scripts into a consistent shape.
Pipeline mode calls retrieve() three times (NREM, Early REM, Late REM).
Previously each call re-imported and re-instantiated SentenceTransformer
("all-MiniLM-L6-v2"), allocating ~200MB and spending 30-60s on disk->CPU
init three times sequentially. lru_cache(maxsize=1) makes the load happen
once per process.
Expected: pipeline runtime drops ~100-180s, removes 2x redundant 200MB
allocations, and reduces transient memory pressure during the same window
when other nightly jobs may run.
Three changes to reduce voice-note transcription latency on the VPS:
- Model: large-v3 -> distil-large-v3 (~6x faster, near-identical English
accuracy; language is already hardcoded "en").
- beam_size: 5 (default) -> 1 (~3-4x faster on clean audio).
- cpu_threads: 8 -> 4 (the box has 8 cores running api, dreamer, watcher,
nextcloud concurrently; ctranslate2's inter-op pool plus context switching
makes 4 effectively faster than 8 here).
Combined effect expected ~10-15x over prior config. No accuracy regression
expected for the voice-note use case (English, clean audio, domain terms
already supplied via initial_prompt).
Writers now enforce type and created_at:
- encoding.py: ValueError raised at write_embeddings_batch if row dict lacks
'type'. created_at remains SQL-supplied (NOW() server-side). ON CONFLICT
DO UPDATE now also rewrites type=EXCLUDED.type and preserves the original
created_at via COALESCE(embeddings.created_at, EXCLUDED.created_at) — a
re-ingest re-classifies type but does not overwrite a backfilled mtime.
- ingest_conversations.py: same assertion. ON CONFLICT intentionally keeps
EXCLUDED.created_at semantics (Aaron-AI conversation created_at tracks
convo.updated_at; re-runs should refresh).
- Column-level NOT NULL is not added; application-layer raise gives a
faster, more debuggable failure than a Postgres constraint error.
Retrieval propagates type into chunks:
- retrieve() SELECT now includes type; chunk dicts carry "type": etype.
- WHERE clause built dynamically from excluded_sources and the new
--type-filter CLI arg (experimental, default None, pgvector retrieval
only — Graphiti chunks have no embeddings.type to filter on).
- retrieve_graphiti unchanged; its chunks lack the type field.
Manifests carry type_distribution per stage:
- dream_pipeline writes stage_data[<stage>]["type_distribution"] for nrem,
early_rem, late_rem — a Counter over chunk types, filtering None so
Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute the
distribution. Pgvector chunks always carry type post-backfill; if None
appears, the backfill or writer enforcement has regressed.
Verification:
B1 force re-ingest of "Finite and infinite games -- James Carse.pdf":
all 84 chunks preserved created_at=2026-04-27T06:11:55Z
B2 missing-type assertion raises ValueError, no row leaked to embeddings
B3 ast.parse(*) clean; EXPLAIN renders for {no excl/no filter,
type_filter only, excl 2 elems, excl 1 elem edge case, both};
all five plans use HNSW index scan with correct Filter clauses
C1 retrieve("nrem") returns 8 chunks each carrying "type" key
C2 type_distribution = {'document': 5, 'chatgpt_conversation': 3} —
2 distinct types, 62.5/37.5 split (looser bar: >=2 types,
no single type >=90%)
The type and created_at fields are now load-bearing: every dream manifest
emits type_distribution per stage. Reverting the backfill makes the
distribution show NULLs at every dream run.
Backfills 9,815 type-NULL rows to 'document' (extension classifier, 100% hit)
and 12,109 created_at-NULL rows via five batches:
C1 filepath_stat: 9,649 filesystem mtime via metadata.filepath
C2 watcher_state_unique: 676 unique source-name lookup in watcher_state
C3 watcher_state_collision_pick_latest_of_N:
234 collision; most-recent watcher mtime
C4 chatgpt_export: 1,548 convo create_time from export JSONs
(168/168 distinct convo_ids resolved)
C5 sentinel: 2 2026-04-26T00:00:00Z (pgvector migration date)
Provenance written to metadata.type_source and metadata.created_at_source
on every row changed by this run. type_source is empty on rows where the
type field was already populated pre-run; in those cases the snapshot
table is the source of truth for what changed.
Snapshot: embeddings_backup_2026_05_03 (CREATE TABLE AS SELECT id, type,
created_at, metadata FROM embeddings; 14,069 rows; revertable via id-join).
Verification:
V1 live counts: type_null=0 ca_null=0
V2 spot-check 11 rows across cohorts: provenance correct
V3 snapshot intact: 14,069 rows, pre-backfill NULL counts preserved
V4 cross-check vs snapshot: reconciles per-provenance to dry-run
Read-side use (B + C: writer enforcement + minimal retrieval read) deferred
to a separate session. The backfill is complete and verified, but the type
and created_at fields are not yet load-bearing — every current reader still
ignores them. Without B+C this lands as data prep, not behavior change.
Read-only inspection of the frame data Mistral produces in Stage 2, in
service of Track 2 substrate design (Step 2.4 operation set spec).
Artifacts:
- New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata`
(CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured
fields so worker-version drift is inspectable).
- Analysis script: frequency, label-hygiene collisions, per-doc count,
co-occurrence (top-K), file-type \u00d7 frame cross-tab, worker-version split,
data-gap accounting, corpus-wide coverage.
- JSON sidecar for diff-across-runs reproducibility.
- Markdown report with explicit Track 2 viability section.
Headline findings:
- Frames cluster meaningfully on the framed-doc subset (subject to
validation on larger samples for the file-type cross-tab).
- Only 56% of corpus has frame coverage. 198 conversation sources bypass
Stage 2 by design (`ingest_conversations.py` writes directly to
embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate;
12 Stage 2 failures.
- All 14 voice notes and all 39 dream outputs are in the data gap.
Primary capture and self-reflection channels are silent to the frame
system. Dreamer cannot frame-condition on its own output.
- 54 normalized label collisions (`Professional Experience` vs
`Professional_Experience`, etc.) — any router must normalize first.
- "Education" is a near-universal frame (36% of frame-extracted docs);
cheap 20-doc hand-inspection diagnostic in report \u00a78 to distinguish
prompt artifact from corpus shape.
- File-type \u00d7 frame stratification is concrete signal that ties to
Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of
rows.
No production code touched. View is droppable; script is read-only.
The cumulative `retrieved_sources` list (capped at 500, trimmed to 400 on
overflow) was hiding ~40% of the corpus from Early REM and Late REM after the
cap filled. The architecture and reframe both specify session-scoped novelty,
not corpus-lifetime exclusion. Same NREM-shape divergence as the 2026-05-02
NREM exclusion fix.
Changes:
- Drop `previously_retrieved` load; pop the legacy `retrieved_sources` key
from `dreamer_state.json` at pipeline start.
- Early REM excludes only the current session's NREM high-scorers.
- Late REM excludes only the current session's NREM \u222a Early REM.
- Remove the across-night accumulation block at the end of the pipeline; reuse
the in-scope state object for the post-pipeline metadata write (eliminates a
redundant disk re-read that was reintroducing the legacy key).
NREM exclusion fix from 2026-05-02 preserved (`nrem_chunks = retrieve("nrem",
excluded_sources=None)`).
Verification: post-fix dream-manifest source count rose to 24 (NREM 8 + Early
REM 8 + Late REM 8) vs. 13 / 16 on the two prior comparable runs. Legacy key
absent from `dreamer_state.json` post-run.
Consolidates four extract paths and two extract-chunk-embed-write pipelines
into a single shared encoding module. Fixes the embedder lifecycle
divergence between watcher and /api/reindex (no more 200MB reload per
reindex click) and unifies failure tracking so /api/reindex failures now
surface in SettingsPanel "Ingest Health".
New files:
- scripts/encoding.py — extract_text, chunk_text, chunk_and_embed,
write_embeddings_batch
- scripts/failures.py — record_ingest_failure, resolve_ingest_failure
(shared by watcher.py and ingest.py)
Refactored:
- scripts/watcher.py — drops local extract/chunk/embed implementations
and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding
and failures. Now writes ingest_failures row on empty-text-extract
(was silent return 0).
- scripts/ingest.py — substantial rewrite. Exposes ingest_directory(folder,
embedder=None) for in-process invocation; CLI back-compat preserved via
ingest_folder wrapper. Module-level SentenceTransformer load removed.
- scripts/corpus_integrity.py — imports extract_text from encoding;
extract_text_for_retry function removed.
- scripts/api.py — /api/reindex rewritten with BackgroundTasks (uses
module-level embedder; no subprocess); new /api/reindex/status endpoint
reading ~/aaronai/reindex_status.json; /api/corpus/retry imports
extract_text from encoding; INGEST_SCRIPT constant removed (dead after
this refactor); 409 reentrance guard prevents double-click stomping.
Behavior changes:
- /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks
threadpool, doesn't block API thread.
- /api/reindex no longer reloads SentenceTransformer on each click.
- /api/reindex failures newly write to ingest_failures (visible in
SettingsPanel "Ingest Health" — badge will jump on first reindex).
- New embeddings rows always have created_at = NOW() (canonical, server-side).
- New embeddings rows always include metadata.folder field (None when not
derivable).
- /api/reindex returns 409 on second click while a job is running.
- New /api/reindex/status endpoint for polling.
Existing 9,815 NULL created_at rows remain unchanged; backfill is a
separate decision if desired.
199 insertions, 256 deletions across 6 files (codebase shrinks net).
Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11).
Pre-commit verification: BackgroundTasks already imported, sys.path
resolves correctly via script-path semantics, static import clean.
prompt_hash() in dream.py was hashing function __doc__ strings, but the
synth functions don't have docstrings, so the hash was always MD5("") =
d41d8cd9 for every dream. The manifest field meant to detect undeclared
prompt drift carried no useful information.
Refactor:
- Each synth function's prompt template moved to a module-level constant
(NREM_PROMPT_TEMPLATE, EARLY_REM_PROMPT_TEMPLATE, LATE_REM_PROMPT_TEMPLATE,
SYNTHESIS_PROMPT_TEMPLATE, LUCID_PROMPT_TEMPLATE) using str.format()
placeholders instead of f-string interpolation.
- Synth functions call TEMPLATE.format(...) at use time. Output is byte-
identical to the previous f-string implementation.
- prompt_hash() now hashes the four pipeline template constants (lucid is
on-demand, not part of the nightly manifest — preserves prior scope).
- LUCID_DEFAULT_TASK extracted as a named constant from the lucid fallback
question (factoring only, no behavior change).
- PROMPT_VERSION_* constants and synth function signatures untouched.
- v1.1 register-shift comment in synthesize_early_rem preserved inline.
The post-fix hash will differ from d41d8cd9 (verified: b65695a1 in static
test). Historical manifests still carry d41d8cd9; the discontinuity is
intentional — pre-fix hashes were equally meaningless and faking continuity
would be worse than acknowledging the break.
Found by Track 1 inventory 2026-05-02 (Finding 11 / divergence #11).
Verified static import + hash determinism before commit.
- Fix /auth/check endpoint that referenced undefined SESSIONS
(Phase 1 finding — would NameError 500 on every call). Now uses
session_exists(token), the live session-validation mechanism
defined elsewhere in api.py.
- Remove unused DB_PATH ChromaDB-era constant (paired with the
ChromaDB directory deletion and aaronai-maintenance.service
removal earlier this session).
Found by Track 1 inventory 2026-05-02. Cross-repo verification of
share_time (third candidate from the original cleanup proposal)
revealed it is working stores-and-returns persistence rather than
dead code; share_time intentionally not modified.
Inventory document edits are committed separately under the docs/
tracking decision.
The dream_mode setting was defined in DEFAULT_SETTINGS and watched
by update_settings for reschedule, but run_dream_job never read it —
silently-ignored configuration.
Two changes:
1. DEFAULT_SETTINGS["dream_mode"] flipped from "nrem" to "pipeline".
The default was a latent regression vector: wiring up the setting
without changing the default would have silently switched all
default-config users from full-pipeline (current production
behavior) to NREM-only nightly runs.
2. run_dream_job reads dream_mode at fire-time, validates against
{"pipeline", "nrem", "early-rem", "late-rem"}, falls back to
pipeline with a warning on invalid values. Lucid intentionally
excluded — it is on-demand only by design and remains available
via CLI and /api/dreamer/run.
Nightly dream production behavior is unchanged for current users
(no settings.json key → default "pipeline" → no flag passed → same
as before). Users can now meaningfully change the nightly mode by
editing settings.json or via the SettingsPanel.
Found by Track 1 inventory 2026-05-02 (Finding 9 / divergence #9).
Moves 28 experiment scripts to scripts/experiments/ (E1, E1.4, E1.6, E2,
base_class, cascade, cost_test, briefing, consistency, token series).
Moves 2 dissolved-layer scripts to scripts/deprecated/ (consolidator_v0_1.py,
tier1_migration.py — under the bespoke decision both target retired
substrate work).
Removes 19 .bak* files from disk (gitignored, never tracked; git history
is the durable record of every prior version).
The 11 production scripts remain in scripts/. All systemd ExecStart paths,
api.py subprocess calls, and cron jobs continue to resolve correctly —
verified by grep against /etc/systemd/system/aaronai-*.service, scripts/
references in api.py, and the user crontab.
Track 1 inventory cross-cutting finding: scripts/ mixed 11 production
files with 32 experimental scripts and ~20 .bak files. After this commit
a clean-room reader can identify the live workers from a directory listing
alone.
Found by Track 1 inventory 2026-05-02. See
~/aaronai/docs/scripts-reorg-plan-2026-05-02.md for full reasoning.
After commit, run:
1. git log --oneline -3 — show the new commit on top
2. git status — confirm clean working tree (modulo the docs/ untracked files which are intentional)
The F14 fix on 2026-05-01 removed text[:50000] truncation from
watcher.py, ingest.py, and corpus_integrity.py. The retry endpoint
in api.py was missed — clicking 'Retry' on an ingest-failed file
in the SettingsPanel re-introduced the exact truncation pattern
F14 was meant to eliminate.
Found by Track 1 inventory 2026-05-02 (Finding 2 / divergence #2).
NREM in the reframe is replay-and-consolidation of recent encoded
content. Excluding previously_retrieved sources turns NREM into
novelty-finding, which is Late REM's job. NREM should re-traverse
already-encoded content; that's what consolidation is.
The May 2 abort surfaced this — 52 sources accumulated in the
exclusion list, all of them in NREM's similarity band for the
recurring research/fabrication/teaching query. The dreamer hit
zero retrievable chunks not because the corpus was empty, but
because everything semantically aligned was excluded.
Late REM and Early REM keep the exclusion mechanism — novelty is
their job. Session-scoped exclusion (nrem_high_sources flowing
into Early REM) also preserved.
The 500/400 trim on retrieved_sources is preserved for the
remaining stages that still use it.
Mirrors stage2_worker v2.1 (da98019) resilience fixes:
- Absolute paths for /usr/bin/sudo and /bin/systemctl
- Log stdout/stderr when sidecar restart fails
- Reset consecutive_failures even when wedge recovery fails (prevents
permanent stuck state if restart itself is broken)
Three classes of silent failure converted to clean terminal states:
- Mistral timeout: previously left rows in zombie state (started_at set,
failed_at null, attempts incremented past retry threshold, row invisible
to selection query). Now sets failed_at with reason
'mistral_timeout_after_300s'. Surfaced 2026-05-01 when 17 documents
accumulated in this state during the Stage 3 saga deadlock incident.
- Mistral parse failure: run_mistral returns {'error': 'parse_failed'} on
JSON decode failure but process_one wasn't checking, so empty orientation
('Active frames: . Frame relationships: ...') was shipped to Stage 3.
This is F22 from the 2026-04-30 code review. Now sets failed_at with
reason 'mistral_parse_failure'.
- Wedge recovery hammering: consecutive_failures was only reset on
successful Ollama restart. With the sudo path bug (also fixed here),
recovery always failed, so every subsequent failure re-attempted restart.
Now resets the counter regardless and logs the failure visibly.
Also: subprocess.run now uses absolute paths (/usr/bin/sudo,
/bin/systemctl) instead of relying on PATH, fixing the 'No such file or
directory: sudo' error that broke Stage 2's recover_wedge() since
deployment. F45-adjacent — sudoers entries were added 2026-05-01 but the
PATH issue was masking that fix.
Worker version bumped to 2.1 to match Stage 3's resilience patch level.
Production incident 2026-05-01: F14 re-cascade attempt surfaced three
compounding issues in cascade resilience.
stage3_worker.py changes:
- MAX_CHUNKS_PER_SAGA=10 — large documents split into multiple bulk
commits, all sharing the same saga tag for Graphiti document linking.
Original implementation sent all chunks as one saga; 17-19 chunk sagas
deadlocked sidecar's Python-side coordination.
- recover_wedge() function — restarts aaronai-graphiti.service when
consecutive_failures hits threshold. Mirrors Stage 2 pattern.
- run() loop adds consecutive_failures counter with threshold-2
escalation. Resolves F28 + F29 from code review.
- Worker version bumped 2.0 -> 2.1.
- post_bulk() helper extracts shared HTTP POST + error handling.
Outside-repo changes (system config, separately documented):
- WatchdogSec=600 commented in stage2 + stage3 systemd unit files.
Workers have no sd_notify support; per-request timeouts in code
handle the actual failure modes.
- /etc/sudoers.d/aaron-aaronai created with NOPASSWD entries for
systemctl restart ollama and restart aaronai-graphiti.service.
Stage 2's existing recover_wedge() was silently broken since
deployment due to this gap.
.gitignore — added rules for *.bak files, runtime artifacts
(watcher_heartbeat, dreamer_state.json, corpus_integrity_report.json,
watcher_state.json, watcher_status.json), Python cruft, virtual env,
.env, editor/OS files, and Aaron AI runtime data (conversations.db,
sessions.db, memory.md, settings.json).
Untracked 11 files that shouldn't have been committed in 465f2f7
(this morning): backup files and runtime artifacts.
Re-cascading Shop Class (414KB) and BirdAI-Experiments-Log.md (192KB)
through the patched worker after re-extracting full text from disk.
Cascade in progress at commit time.
- api.py: strip CV pinning workaround (parity violation, see architecture doc)
- dream.py: F1 — retrieve_graphiti() now accepts excluded_sources, over-fetches
3x and filters in-process. Was silently dropping the parameter; would have
confounded E3 with broken cross-stage exclusion in Graphiti arm.
- watcher.py + ingest.py: F14 — drop full_text[:50000] truncation. Was
propagating through entire cascade. Postgres TEXT can hold up to 1GB.
- corpus_integrity.py: F37 — same truncation, third path now clean.
Backups: api.py.bak.*, dream.py.bak.*, watcher.py.bak.*, ingest.py.bak.*,
corpus_integrity.py.bak.* timestamped pre-fix.
Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected
by F14, 414KB).