aaronAI

Author	SHA1	Message	Date
aaron	7b77794319	api.py: enable PRAGMA foreign_keys=ON in _connect helper; clean up 2 message orphans The messages table declares FOREIGN KEY (conversation_id) REFERENCES conversations(id), but PRAGMA foreign_keys was never enabled — SQLite defaults it to OFF per connection, and _connect() did not set it. Two orphan rows existed in messages (conversation_id='test123' pointing at a never-existing conversation; both rows from one ~11-second test event on 2026-04-26). Audit before changing the PRAGMA: - All FOREIGN KEY declarations across both DBs (conversations.db, sessions.db) accounted for via PRAGMA foreign_key_list on each table. Only one FK exists: messages.conversation_id -> conversations.id, ON DELETE NO ACTION. - All tables enumerated via sqlite_master. Two tables in conversations.db (conversations, messages); one in sessions.db (sessions). No surprises. - PRAGMA foreign_key_check confirmed exactly the 2 known orphans and zero violations elsewhere. Both delete paths in api.py (delete_conversation at :471, and clear_all_conversations at :986) already delete from messages BEFORE conversations, so cascade behavior was correct in code. The orphan state was caused by a direct INSERT against a non-existent conversation_id at chat-test time, which an unenforced FK silently accepted. Turning the PRAGMA on prevents this class of bug at insert time, not delete time — no delete-path code changes were needed. Order of operations followed the constraint that orphan cleanup must precede PRAGMA-on (SQLite would not retroactively delete orphans, but foreign_key_check would surface them confusingly on any future operation that touched the messages table): 1. DELETE FROM messages WHERE conversation_id NOT IN (SELECT id FROM conversations) — removed the 2 known orphans. 2. Added PRAGMA foreign_keys=ON to _connect() so every connection from _connect_conversations() and _connect_sessions() gets FK enforcement (SQLite requires per-connection setting). 3. Restarted aaronai.service. 
Verification: - Smoke: GET /api/conversations and /api/conversations/{id}/messages both return 200 with expected payloads against the live api. - E2E single-delete: synthetic conversation + 2 messages inserted via the api's _connect helper (FK on); DELETE /api/conversations/{id} via the live endpoint removed both rows from both tables. - Clear-all e2e: skipped on live DB (destructive) — code shape is structurally identical to single-delete, no FK-relevant logic difference. - Load-bearing negative test: INSERT into messages with a non-existent conversation_id via _connect_conversations() raised sqlite3.IntegrityError("FOREIGN KEY constraint failed"). This is what proves the PRAGMA actually took effect, not just that we set it. Final counts: 7 conversations, 290 messages (down from 292 by the 2 orphans cleaned up). Note: an explicit BEGIN/COMMIT around the two-execute delete paths was considered and skipped. SQLite's implicit-transactional default already gives the atomicity needed; explicit transactions would be clarity-only and belong in a separate commit.	2026-05-04 16:41:55 +00:00
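The per-connection PRAGMA and the negative test described in this commit can be sketched as follows. This is a minimal illustration against an in-memory database, not the actual api.py code: `connect` is a hypothetical stand-in for `_connect`, and the two-table schema is reduced to the columns the foreign key needs.

```python
import sqlite3


def connect(path: str) -> sqlite3.Connection:
    """Hypothetical stand-in for api.py's _connect helper."""
    conn = sqlite3.connect(path, timeout=5.0)
    # SQLite leaves foreign_keys OFF on every new connection unless told otherwise.
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


conn = connect(":memory:")
conn.execute("CREATE TABLE conversations (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE messages ("
    " id INTEGER PRIMARY KEY,"
    " conversation_id TEXT,"
    " FOREIGN KEY (conversation_id) REFERENCES conversations(id))"
)

# Read the PRAGMA back rather than trusting that the set succeeded.
fk_enabled = conn.execute("PRAGMA foreign_keys").fetchone()[0]

# Negative test: an insert pointing at a missing conversation must fail.
try:
    conn.execute("INSERT INTO messages (conversation_id) VALUES ('test123')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

conn.close()
```

Reading the PRAGMA back and attempting the orphan insert are two separate checks: the first shows the setting was accepted, the second shows it is actually enforced.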
aaron	d985f9e91e	dream.py: raise_for_status on manifest writes; total_chunks as actual corpus count Two correctness bugs in dream_pipeline manifest assembly. write_manifest at lines 487-491 swallowed HTTP 4xx/5xx responses silently. requests.put() only raises on transport-level errors (DNS, connection refused, timeout); 401/403/500/507 come back as Response objects and never trigger the except. The code printed "Manifest written" while the manifest never persisted. The same file's deliver() function at line 434 already used response.raise_for_status() — the pattern was already established, write_manifest just skipped it. Fix: bind the response and call raise_for_status() before the success print. The except message changes from "(non-critical)" to "manifest not persisted" because HTTP failure now means manifest data was lost, which is critical, not quiet. corpus_data["total_chunks"] at lines 621-622 stored delta["new_chunks"], duplicating the sibling field new_chunks_since_last_dream. The field name claimed absolute corpus size; the value was a delta of recently-touched files. Verified in live manifests: total_chunks: 0 while pgvector held 11,379+ document embeddings. Fix: query SELECT COUNT(*) FROM embeddings inside dream_pipeline, store as total_chunks. Tightly-scoped one-shot connect via the existing get_pg() helper. Telemetry query failure is treated as non-critical and falls back to 0 — pgvector hiccup should not crash an otherwise successful dream pipeline. Bonus finding (not fixed in this commit): new_chunks_since_last_dream is itself misnamed. observe_corpus() reads the watcher's mtime cache and counts files (not chunks) whose mtime is newer than last_dream. Both fields were "files touched since last dream" duplicated under two different names; this commit fixes only the total_chunks semantics.
Renaming new_chunks_since_last_dream is out of scope — manifests are write-only telemetry today, no consumer reads either field, and the rename is a separate decision. Verification: real pipeline run produced manifest with total_chunks matching SELECT COUNT(*) directly; doubled as a smoke test for the embedder cache (single Loading weights line), type_distribution propagation, and the manifest write success path.	2026-05-04 16:29:04 +00:00
aaron	b9eea6cb62	watcher.py: extend lockfile filter to catch UTF-8-mangled ~$ prefixes Three rows in ingest_failures were Office lockfile leftovers whose filename starts with ~� (~ followed by the UTF-8 replacement character) instead of ~$. Somewhere in the Nextcloud sync chain the $ byte was lost or replaced; the file now lives on disk as a real file with this corrupted name. The watcher's ("~$", ".") prefix filter didn't match, so each cycle tried to ingest these as pptx, hit BadZipFile inside python-pptx (lockfiles aren't real Office documents), and they ended up permanently in ingest_failures. Three filter sites in watcher.py applied the lockfile prefix check: - ingest_file() at :127 - get_changed_files() at :200 - IngestHandler._should_ignore() at :290 All three now match ("~$", "~", ".") — broadened to catch any tilde prefix, not just ~$. The cross-check against pgvector embeddings and disk found zero legitimate tilde-prefixed files in the corpus, so the broader filter has no false-positive risk in this corpus. Cleanup: 3 ingest_failures rows resolved (filepath LIKE '%/~%'). Unresolved count drops 97 → 94. If a fourth filter site is ever added, the right shape is consolidating the lockfile prefix check to a shared function or constant. Three parallel sites with three different tuple orderings is acceptable for now but worth normalizing if the surface grows.	2026-05-04 16:19:56 +00:00
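The consolidation suggested at the end of this commit could look roughly like the sketch below. The names `LOCKFILE_PREFIXES` and `is_ignored_name` are hypothetical and do not appear in watcher.py. The tuple omits "~$" because a bare "~" prefix already matches it.

```python
# Hypothetical shared constant: "~" subsumes "~$" and the mangled "~\ufffd" variant.
LOCKFILE_PREFIXES = ("~", ".")


def is_ignored_name(filename: str) -> bool:
    """Return True for lockfile leftovers and dotfiles, judged by basename prefix."""
    return filename.startswith(LOCKFILE_PREFIXES)


samples = {
    "~$Budget.pptx": is_ignored_name("~$Budget.pptx"),
    "~\ufffdBudget.pptx": is_ignored_name("~\ufffdBudget.pptx"),
    ".DS_Store": is_ignored_name(".DS_Store"),
    "Budget.pptx": is_ignored_name("Budget.pptx"),
}
```

With a single function, a fourth filter site would call `is_ignored_name` instead of carrying its own tuple.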
aaron	93c0d89308	encoding.py: extend docx and pptx extractors to walk tables, headers/footers, text-boxes, group shapes, and notes The previous extractors walked only top-level body paragraphs (docx) and top-level shape.text (pptx). Diagnostic on the 17 non-PDF "no_text" ingest failures revealed that 13 docx files in the failure cohort have 100% of their content in tables (paras_with_text=0, table_cells=6-108). These are syllabi, rosters, rubrics, and homework worksheets structured as a single document-wide table — high-value academic content the corpus was silently missing. docx walker now covers: - body paragraphs (existing) - tables, including nested tables in cells (recursive helper) - header and footer paragraphs per section - text-box content via XPath against w:txbxContent (no first-class API in python-docx; future-proofing — none of the current failure cohort has text-boxes) pptx walker now covers: - top-level shape text (existing) - recursive descent into group shapes - table cell text via shape.has_table / shape.table.iter_cells() - speaker notes via slide.notes_slide.notes_text_frame.text Out of scope: SmartArt diagrams, chart titles/labels, OLE objects, content controls. None of the current failure cohort has these. Recovery: 13 of 17 failures now ingest successfully. The 4 remaining are image-only pptx files (Renders.pptx, Ribbon Cutting Slideshow.pptx, two GH Slicer Notes variants — all PICTURE-shape decks with no text in any walkable structure). They stay in ingest_failures unresolved, awaiting OCR or path exclusion. Side effect worth noting: the regression check on 4 known-good files that were already producing embeddings showed all four gained content under the new walker — a Mod03 pptx grew from 23,993 to 57,462 chars (+33,469), Braskem Report docx grew 33,050 to 38,977 (+5,927), DDF MA program docx grew 37,210 to 47,603 (+10,393), SUNY PIF GRANT pptx grew 22,259 to 23,546 (+1,287). 
These files have been in the corpus all along with table or notes content silently dropped. They will surface the additional content on next re-ingest, improving retrieval quality for any future query that touches them. Cleanup: ingest_file already calls resolve_ingest_failure on successful ingest, so the 13 recovered files were marked resolved=TRUE during the retry pass. No separate cleanup SQL was needed.	2026-05-04 16:12:56 +00:00
aaron	f18fb64fe5	watcher.py: exclude generative-graphic folders and zero-byte files Two-sample diagnostic of the 128 ingest_failures rows surfaced two folders whose contents are exclusively non-text PDFs (iText-produced generative graphics from Processing sketches and computational design sketches) and three zero-byte test artifacts. None of these have ever produced an embedding chunk, and they have nothing extractable to contribute. Excluding them removes 19 / 128 (15%) of the locked-out failures from the cohort and prevents future versions of the same patterns from re-failing. Folder exclusions use path.parts membership rather than substring matching — eliminates false-match risk if similarly-named folders appear elsewhere in the corpus (e.g. an unrelated "Generative Design" or "Computational Design 2017" directory created later). The existing "Admin/Backups" / "Journal/Media" substring checks are looser, but new exclusions take the tighter pattern. Zero-byte filter goes in get_changed_files() only — the actual ingestion gate. Adding stat() to _should_ignore() (the FS-event noise filter) would introduce a race where the file is gone between event fire and stat call. Empty files briefly trigger pending=True but produce no work after debounce; cosmetic only. Cleanup applied separately via UPDATE: 19 ingest_failures rows for these paths marked resolved=TRUE. Unresolved-failure count: 129 -> 110. Verified: get_changed_files() with empty state returns 1418 changed files; all 5 excluded probes (2 folder-matched + 3 zero-byte) absent from the result, control file present. Watcher service restarted clean; startup scan reports no missed files.	2026-05-04 06:24:08 +00:00
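The difference between path-component membership and substring matching can be illustrated with a short sketch. The folder name "Generative" and both function names are hypothetical, chosen only to show the failure mode. The real exclusion list lives in watcher.py.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion set; the actual folder names are defined in watcher.py.
EXCLUDED_FOLDERS = {"Generative"}


def excluded_by_parts(path: str) -> bool:
    """Match only when a whole directory component equals an excluded name."""
    parent_parts = PurePosixPath(path).parent.parts
    return not EXCLUDED_FOLDERS.isdisjoint(parent_parts)


def excluded_by_substring(path: str) -> bool:
    """Looser check: matches any path that merely contains the name."""
    return any(name in path for name in EXCLUDED_FOLDERS)


target = "/corpus/Generative/sketch_01.pdf"
lookalike = "/corpus/Generative Design/syllabus.pdf"

parts_hits = (excluded_by_parts(target), excluded_by_parts(lookalike))
substring_hits = (excluded_by_substring(target), excluded_by_substring(lookalike))
```

The component check leaves the lookalike folder alone, while the substring check would silently exclude it.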
aaron	72e07afc03	watcher.py: do not mark failed ingests as successfully ingested ingest_files() updated state[path] = mtime unconditionally after every ingest_file() call. ingest_file() returns 0 when text extraction fails, embedding fails, no chunks are produced, or the pgvector write fails — in every one of those cases, the path was still recorded as ingested at the current mtime. On the next pass, get_changed_files() saw the mtime match and skipped the file, locking it out of the corpus until something modified it on disk. record_ingest_failure() writes to a UI-visible failures table, but nothing reads that table to retry. So failures accumulated silently: the file was simultaneously logged as failed AND tracked in watcher_state as up-to-date, and the second condition won. Fix: only update watcher_state when ingest_file returns count > 0. Failed ingests will be retried on the next watcher cycle until they succeed or are explicitly excluded. Diagnostic at fix time: 129 rows in ingest_failures, 128 currently locked out of the corpus (filepath in watcher_state with mtime matching current disk). 128/129 are text_extraction failures, mostly scanned PDFs (106 .pdf, 13 .docx, 7 .pptx, 2 .md, 1 .txt). 1 source no longer exists on disk. 0 have had their disk mtime change since failing — i.e. without this fix, none of them would ever retry. Cross-check shows watcher_state has 1466 paths vs. 1061 distinct sources in pgvector embeddings, leaving a residual silent-gap of ~276 files after accounting for failures. Historical cleanup of files already locked out by this bug is tracked separately. New failures from this commit forward will retry naturally.	2026-05-04 03:52:01 +00:00
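The shape of the fix, reduced to its control flow, might look like the sketch below. The signatures are assumptions made for illustration: the real `ingest_files()` does not take `ingest_file` and `get_mtime` as parameters, and they are injected here only so the example runs on its own.

```python
from typing import Callable, Dict, Iterable


def ingest_files(
    paths: Iterable[str],
    state: Dict[str, float],
    ingest_file: Callable[[str], int],
    get_mtime: Callable[[str], float],
) -> Dict[str, float]:
    """Record a path as ingested only when ingestion produced chunks."""
    for path in paths:
        count = ingest_file(path)
        if count > 0:
            # Success: remember the mtime so the next pass skips this file.
            state[path] = get_mtime(path)
        # Failure (count == 0): leave state untouched so the file is retried.
    return state


# Stand-in ingest function: the scanned PDF yields no chunks.
fake_counts = {"notes.md": 4, "scan.pdf": 0}
state = ingest_files(
    ["notes.md", "scan.pdf"],
    {},
    ingest_file=lambda p: fake_counts[p],
    get_mtime=lambda p: 1000.0,
)
```

After the call, only `notes.md` is present in `state`, so `scan.pdf` remains eligible on the next cycle.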
aaron	c3011c80a5	api.py: route all sqlite3.connect() through helpers; enable synchronous=NORMAL per-conn Followup to `4204806` (WAL + index + backup.sh). The previous commit deferred synchronous=NORMAL because it's a per-connection PRAGMA and api.py has 16 sqlite3.connect() call sites — setting it once at init would have applied to nothing afterwards. Adds three helpers near the *_DB constants: - _connect(path): inner; sets PRAGMA synchronous=NORMAL and uses timeout=5.0 (5000ms busy_timeout) on every new connection. - _connect_conversations(), _connect_sessions(): named wrappers so call sites read explicitly. Mechanical replacement at all 16 call sites: 4 sessions, 12 conversations. No semantic change beyond the PRAGMA + busy_timeout — every site still opens-then-closes, no held-open connections. busy_timeout=5000ms is cheap insurance: under WAL with api.py as sole writer, contention should be near-zero, but the backup.sh online-backup path briefly holds a read lock on the source, and any future second writer would otherwise hit SQLITE_BUSY immediately on contention. Combined effect with WAL: per-write fsync count drops from ~2 to ~1 (WAL alone) further reduced by synchronous=NORMAL deferring fsyncs to checkpoint boundaries. No durability loss for the use case (single host, app crash tolerated, OS crash gives at most one lost transaction). Not included: foreign_keys=ON. Audit found 2 orphan rows in messages (conversation_id pointing to deleted conversations) and untested write paths that could begin raising IntegrityError. Tracked as separate followup: inspect orphans, identify the delete path that didn't cascade, clean up, then enable enforcement and test chat delete flow end-to-end.	2026-05-04 03:39:13 +00:00
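A minimal sketch of the helper layering described above, under the assumption that the wrappers simply forward a path to `_connect`. The real wrappers read the *_DB constants and take no arguments. The `db_dir` parameter here is hypothetical and exists only so the example can run in a temporary directory.

```python
import os
import sqlite3
import tempfile


def _connect(path: str) -> sqlite3.Connection:
    """Inner helper: every connection gets the same timeout and PRAGMA."""
    # timeout=5.0 makes SQLite wait up to five seconds on a locked database.
    conn = sqlite3.connect(path, timeout=5.0)
    # synchronous is per-connection, so it has to be set on each open.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


def _connect_conversations(db_dir: str) -> sqlite3.Connection:
    """Named wrapper so call sites state which database they open."""
    return _connect(os.path.join(db_dir, "conversations.db"))


with tempfile.TemporaryDirectory() as tmp:
    conn = _connect_conversations(tmp)
    # SQLite reports NORMAL as the integer 1.
    synchronous_level = conn.execute("PRAGMA synchronous").fetchone()[0]
    conn.close()
```

Reading the PRAGMA back on a helper-produced connection is a cheap way to confirm that a call site really goes through the helper.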
aaron	4204806c80	conversations.db, sessions.db: enable WAL, add message index; update backup.sh Both databases ran with journal_mode=delete — every write rewrote the rollback journal per transaction. WAL eliminates the journal-rewrite and lets readers run without blocking writers. Index on messages(conversation_id, timestamp DESC) is preventive — only 280 rows today, but the access pattern (load conversation history in order) is exactly what a composite index serves, and we don't want to revisit this when the table grows. backup.sh updated in the same commit because WAL changes the on-disk layout: a bare `cp` of just the .db file can miss recently-committed transactions that still live in the -wal sidecar, and can race with concurrent writes to produce a torn file. Switched to the SQLite Online Backup API via python3 -c "...src.backup(dst)..." — same mechanism as the sqlite3 CLI's `.backup` (which isn't installed on this host), handles WAL correctly without forcing a checkpoint, and is non-locking from the writer's perspective. Verified backup integrity_check returns ok and row counts match. Note: synchronous=NORMAL was considered but deferred — it's a per-connection PRAGMA, and applying it correctly requires a connect helper that wraps every sqlite3.connect() call site in api.py (~14 sites). Out of scope for this commit; tracked as a follow-up. WAL alone delivers the journal-rewrite elimination and reader/writer concurrency improvements; the additional fsync reduction from synchronous=NORMAL is a smaller marginal win on top. Confirmed via concurrency audit that api.py is the sole writer to both databases. ingest_conversations.py and dream.py are read-only consumers of conversations.db; nothing else touches sessions.db.	2026-05-04 03:24:51 +00:00
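The backup path can be sketched with the standard library's `Connection.backup`, which wraps the SQLite Online Backup API. The function name `backup_database`, the file names, and the one-column table are assumptions for illustration. The one-liner in backup.sh is elided in the commit message and is not reproduced here.

```python
import os
import sqlite3
import tempfile


def backup_database(src_path: str, dst_path: str) -> None:
    """Copy a live SQLite database using the Online Backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)  # reads committed pages, including those still in the -wal file
    finally:
        dst.close()
        src.close()


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "conversations.db")
    dst_path = os.path.join(tmp, "conversations.backup.db")

    # Simulate the live writer: WAL mode, connection kept open during the backup.
    live = sqlite3.connect(src_path)
    live.execute("PRAGMA journal_mode=WAL")
    live.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    live.execute("INSERT INTO messages (body) VALUES ('hello')")
    live.commit()

    backup_database(src_path, dst_path)

    # Same verification the commit describes: integrity_check plus row count.
    check = sqlite3.connect(dst_path)
    integrity = check.execute("PRAGMA integrity_check").fetchone()[0]
    row_count = check.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    check.close()
    live.close()
```

A plain file copy taken at the same moment could miss the inserted row if it were still only in the -wal sidecar. The backup API reads through the WAL, so the copy includes it.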
aaron	c5fc517fef	ingest_conversations.py: lazy-load embedder to match ingest.py pattern Embedder was instantiated at module import (~30-60s, ~200MB) regardless of whether new conversations existed. On nights with no new content (most nights per the logs), the script paid the load cost and exited immediately. ingest.py:134 already uses lazy loading; this brings the two ingest scripts into a consistent shape.	2026-05-04 03:13:45 +00:00
aaron	b35d44ef58	dream.py: cache the SentenceTransformer embedder across retrieve() calls Pipeline mode calls retrieve() three times (NREM, Early REM, Late REM). Previously each call re-imported and re-instantiated SentenceTransformer ("all-MiniLM-L6-v2"), allocating ~200MB and spending 30-60s on disk->CPU init three times sequentially. lru_cache(maxsize=1) makes the load happen once per process. Expected: pipeline runtime drops ~100-180s, removes 2x redundant 200MB allocations, and reduces transient memory pressure during the same window when other nightly jobs may run.	2026-05-04 03:11:22 +00:00
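The caching pattern can be shown without loading a real model. In this sketch `get_embedder` and `load_count` are hypothetical names, and a plain `object()` stands in for the `SentenceTransformer("all-MiniLM-L6-v2")` instance so the example runs without the dependency.

```python
from functools import lru_cache

load_count = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=1)
def get_embedder() -> object:
    """Load the embedder once per process; later calls reuse the instance."""
    global load_count
    load_count += 1
    # Stand-in for the expensive SentenceTransformer("all-MiniLM-L6-v2") load.
    return object()


# Pipeline mode retrieves three times: NREM, Early REM, Late REM.
embedders = [get_embedder() for _ in range(3)]
```

Because the function takes no arguments, every call maps to the same cache key, so `maxsize=1` is enough to keep one instance alive for the life of the process.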
aaron	a27f22ceaf	api.py: switch whisper to distil-large-v3, beam_size=1, cpu_threads=4 Three changes to reduce voice-note transcription latency on the VPS: - Model: large-v3 -> distil-large-v3 (~6x faster, near-identical English accuracy; language is already hardcoded "en"). - beam_size: 5 (default) -> 1 (~3-4x faster on clean audio). - cpu_threads: 8 -> 4 (the box has 8 cores running api, dreamer, watcher, nextcloud concurrently; ctranslate2's inter-op pool plus context switching makes 4 effectively faster than 8 here). Combined effect expected ~10-15x over prior config. No accuracy regression expected for the voice-note use case (English, clean audio, domain terms already supplied via initial_prompt).	2026-05-04 01:00:32 +00:00
aaron	7c7b649775	embeddings: enforce type/created_at on writers; manifests carry type_distribution (Improvement #2 part B+C) Writers now enforce type and created_at: - encoding.py: ValueError raised at write_embeddings_batch if row dict lacks 'type'. created_at remains SQL-supplied (NOW() server-side). ON CONFLICT DO UPDATE now also rewrites type=EXCLUDED.type and preserves the original created_at via COALESCE(embeddings.created_at, EXCLUDED.created_at) — a re-ingest re-classifies type but does not overwrite a backfilled mtime. - ingest_conversations.py: same assertion. ON CONFLICT intentionally keeps EXCLUDED.created_at semantics (Aaron-AI conversation created_at tracks convo.updated_at; re-runs should refresh). - Column-level NOT NULL is not added; application-layer raise gives a faster, more debuggable failure than a Postgres constraint error. Retrieval propagates type into chunks: - retrieve() SELECT now includes type; chunk dicts carry "type": etype. - WHERE clause built dynamically from excluded_sources and the new --type-filter CLI arg (experimental, default None, pgvector retrieval only — Graphiti chunks have no embeddings.type to filter on). - retrieve_graphiti unchanged; its chunks lack the type field. Manifests carry type_distribution per stage: - dream_pipeline writes stage_data[<stage>]["type_distribution"] for nrem, early_rem, late_rem — a Counter over chunk types, filtering None so Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute the distribution. Pgvector chunks always carry type post-backfill; if None appears, the backfill or writer enforcement has regressed. 
Verification: B1 force re-ingest of "Finite and infinite games -- James Carse.pdf": all 84 chunks preserved created_at=2026-04-27T06:11:55Z B2 missing-type assertion raises ValueError, no row leaked to embeddings B3 ast.parse(*) clean; EXPLAIN renders for {no excl/no filter, type_filter only, excl 2 elems, excl 1 elem edge case, both}; all five plans use HNSW index scan with correct Filter clauses C1 retrieve("nrem") returns 8 chunks each carrying "type" key C2 type_distribution = {'document': 5, 'chatgpt_conversation': 3} — 2 distinct types, 62.5/37.5 split (looser bar: >=2 types, no single type >=90%) The type and created_at fields are now load-bearing: every dream manifest emits type_distribution per stage. Reverting the backfill makes the distribution show NULLs at every dream run.	2026-05-04 00:15:43 +00:00
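The per-stage distribution described above amounts to a Counter over chunk types with None filtered out. The sketch below assumes chunks are plain dicts with an optional "type" key. The function name `type_distribution` is hypothetical, and the sample data mirrors the C2 verification figures.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def type_distribution(chunks: Iterable[Dict[str, Optional[str]]]) -> Dict[str, int]:
    """Count chunk types, skipping chunks that carry no type (e.g. Graphiti chunks)."""
    counts = Counter(
        chunk.get("type") for chunk in chunks if chunk.get("type") is not None
    )
    return dict(counts)


# Five document chunks, three conversation chunks, and one untyped chunk.
chunks = (
    [{"type": "document"}] * 5
    + [{"type": "chatgpt_conversation"}] * 3
    + [{"source": "graphiti"}]
)
distribution = type_distribution(chunks)
```

The untyped chunk is dropped rather than counted under a None key, so a None appearing in a pgvector-only run would have to come from a regression upstream.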
aaron	3c7c228db0	embeddings: backfill type and created_at (Improvement #2 part A) Backfills 9,815 type-NULL rows to 'document' (extension classifier, 100% hit) and 12,109 created_at-NULL rows via five batches: C1 filepath_stat: 9,649 filesystem mtime via metadata.filepath C2 watcher_state_unique: 676 unique source-name lookup in watcher_state C3 watcher_state_collision_pick_latest_of_N: 234 collision; most-recent watcher mtime C4 chatgpt_export: 1,548 convo create_time from export JSONs (168/168 distinct convo_ids resolved) C5 sentinel: 2 2026-04-26T00:00:00Z (pgvector migration date) Provenance written to metadata.type_source and metadata.created_at_source on every row changed by this run. type_source is empty on rows where the type field was already populated pre-run; in those cases the snapshot table is the source of truth for what changed. Snapshot: embeddings_backup_2026_05_03 (CREATE TABLE AS SELECT id, type, created_at, metadata FROM embeddings; 14,069 rows; revertable via id-join). Verification: V1 live counts: type_null=0 ca_null=0 V2 spot-check 11 rows across cohorts: provenance correct V3 snapshot intact: 14,069 rows, pre-backfill NULL counts preserved V4 cross-check vs snapshot: reconciles per-provenance to dry-run Read-side use (B + C: writer enforcement + minimal retrieval read) deferred to a separate session. The backfill is complete and verified, but the type and created_at fields are not yet load-bearing — every current reader still ignores them. Without B+C this lands as data prep, not behavior change.	2026-05-03 23:58:53 +00:00
aaron	2df1a2fe01	docs/inventory: layer 2026-05-03 updates (resolutions, corrections, new findings) Inventory dated 2026-05-02 is preserved as a point-in-time snapshot. Today's updates are layered on top in a dated addendum section after "Findings summary" and before "Phase 1 — Scripts" so the original snapshot reads as written and readers can see what changed and when. Resolved: - NREM-shape divergence #1 (`dream.py` cumulative cross-night exclusion 500-cap) — replaced with session-scoped novelty. Corrections to existing findings: - `stage2_metadata` lives on `stage_3_queue`, not `stage_2_queue` (the 2026-05-02 entry implied otherwise). Verified by direct schema read. - Stage 2 char_length gate runs before the Mistral call. For sub-2000-char docs, Mistral is never invoked — frames are not extracted then discarded, they are simply not extracted. Reframes the architecture's "Stage 2 produces orientation for everything" commitment. New findings (from the 2026-05-03 frame analysis): - `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Combined with the char-gate exclusion and Stage 2 failures, only 56% of corpus has any frame data. - All 14 voice notes and all 39 dream outputs are in the 339-doc gap. Primary capture and self-reflection channels are silent to the frame system; dreamer cannot frame-condition on its own output. - File-type × frame stratification provides discriminating signal that cross-links Improvement #3 to the existing `embeddings.type` NULL-rate finding. Same NREM shape as the original cumulative-exclusion bug — the architecture's stated commitment and what the code actually does diverge silently. This is exactly what the inventory exists to surface.	2026-05-03 20:32:55 +00:00
aaron	ed2d090afc	experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3) Read-only inspection of the frame data Mistral produces in Stage 2, in service of Track 2 substrate design (Step 2.4 operation set spec). Artifacts: - New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata` (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured fields so worker-version drift is inspectable). - Analysis script: frequency, label-hygiene collisions, per-doc count, co-occurrence (top-K), file-type × frame cross-tab, worker-version split, data-gap accounting, corpus-wide coverage. - JSON sidecar for diff-across-runs reproducibility. - Markdown report with explicit Track 2 viability section. Headline findings: - Frames cluster meaningfully on the framed-doc subset (subject to validation on larger samples for the file-type cross-tab). - Only 56% of corpus has frame coverage. 198 conversation sources bypass Stage 2 by design (`ingest_conversations.py` writes directly to embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate; 12 Stage 2 failures. - All 14 voice notes and all 39 dream outputs are in the data gap. Primary capture and self-reflection channels are silent to the frame system. Dreamer cannot frame-condition on its own output. - 54 normalized label collisions (`Professional Experience` vs `Professional_Experience`, etc.) — any router must normalize first. - "Education" is a near-universal frame (36% of frame-extracted docs); cheap 20-doc hand-inspection diagnostic in report §8 to distinguish prompt artifact from corpus shape. - File-type × frame stratification is concrete signal that ties to Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of rows. No production code touched. View is droppable; script is read-only.	2026-05-03 20:32:37 +00:00
aaron	e5898f3019	dream.py: replace cumulative cross-night exclusion with session-scoped novelty (Track 1 Finding 1) The cumulative `retrieved_sources` list (capped at 500, trimmed to 400 on overflow) was hiding ~40% of the corpus from Early REM and Late REM after the cap filled. The architecture and reframe both specify session-scoped novelty, not corpus-lifetime exclusion. Same NREM-shape divergence as the 2026-05-02 NREM exclusion fix. Changes: - Drop `previously_retrieved` load; pop the legacy `retrieved_sources` key from `dreamer_state.json` at pipeline start. - Early REM excludes only the current session's NREM high-scorers. - Late REM excludes only the current session's NREM ∪ Early REM. - Remove the across-night accumulation block at the end of the pipeline; reuse the in-scope state object for the post-pipeline metadata write (eliminates a redundant disk re-read that was reintroducing the legacy key). NREM exclusion fix from 2026-05-02 preserved (`nrem_chunks = retrieve("nrem", excluded_sources=None)`). Verification: post-fix dream-manifest source count rose to 24 (NREM 8 + Early REM 8 + Late REM 8) vs. 13 / 16 on the two prior comparable runs. Legacy key absent from `dreamer_state.json` post-run.	2026-05-03 20:32:15 +00:00
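The session-scoped exclusion rule can be written down in a few lines. This is a sketch under the assumption that each stage yields a list of source names. The function name `session_exclusions` and the sample sources are hypothetical, and the state dict stands in for the parsed contents of `dreamer_state.json`.

```python
from typing import Iterable, Set, Tuple


def session_exclusions(
    nrem_sources: Iterable[str], early_rem_sources: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Build per-stage exclusion sets from the current session only."""
    early_rem_excluded = set(nrem_sources)
    late_rem_excluded = early_rem_excluded | set(early_rem_sources)
    return early_rem_excluded, late_rem_excluded


# Stand-in for the parsed state file, still carrying the legacy key.
state = {"last_dream": "2026-05-02", "retrieved_sources": ["old_a.pdf", "old_b.pdf"]}
state.pop("retrieved_sources", None)  # drop cross-night history at pipeline start

early_excl, late_excl = session_exclusions(
    nrem_sources=["a.pdf", "b.docx"],
    early_rem_sources=["c.md"],
)
```

Nothing from earlier nights enters either set, so the exclusion size is bounded by what the current session has already retrieved.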
aaron	1101bef226	scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11) Consolidates four extract paths and two extract-chunk-embed-write pipelines into a single shared encoding module. Fixes the embedder lifecycle divergence between watcher and /api/reindex (no more 200MB reload per reindex click) and unifies failure tracking so /api/reindex failures now surface in SettingsPanel "Ingest Health". New files: - scripts/encoding.py — extract_text, chunk_text, chunk_and_embed, write_embeddings_batch - scripts/failures.py — record_ingest_failure, resolve_ingest_failure (shared by watcher.py and ingest.py) Refactored: - scripts/watcher.py — drops local extract/chunk/embed implementations and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding and failures. Now writes ingest_failures row on empty-text-extract (was silent return 0). - scripts/ingest.py — substantial rewrite. Exposes ingest_directory(folder, embedder=None) for in-process invocation; CLI back-compat preserved via ingest_folder wrapper. Module-level SentenceTransformer load removed. - scripts/corpus_integrity.py — imports extract_text from encoding; extract_text_for_retry function removed. - scripts/api.py — /api/reindex rewritten with BackgroundTasks (uses module-level embedder; no subprocess); new /api/reindex/status endpoint reading ~/aaronai/reindex_status.json; /api/corpus/retry imports extract_text from encoding; INGEST_SCRIPT constant removed (dead after this refactor); 409 reentrance guard prevents double-click stomping. Behavior changes: - /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks threadpool, doesn't block API thread. - /api/reindex no longer reloads SentenceTransformer on each click. - /api/reindex failures newly write to ingest_failures (visible in SettingsPanel "Ingest Health" — badge will jump on first reindex). - New embeddings rows always have created_at = NOW() (canonical, server-side). 
- New embeddings rows always include metadata.folder field (None when not derivable). - /api/reindex returns 409 on second click while a job is running. - New /api/reindex/status endpoint for polling. Existing 9,815 NULL created_at rows remain unchanged; backfill is a separate decision if desired. 199 insertions, 256 deletions across 6 files (codebase shrinks net). Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11). Pre-commit verification: BackgroundTasks already imported, sys.path resolves correctly via script-path semantics, static import clean.	2026-05-03 01:40:47 +00:00
aaron	a317df66f8	dream: factor prompts into module-level templates, repair prompt_hash (Track 1 Finding 11) prompt_hash() in dream.py was hashing function __doc__ strings, but the synth functions don't have docstrings, so the hash was always MD5("") = d41d8cd9 for every dream. The manifest field meant to detect undeclared prompt drift carried no useful information. Refactor: - Each synth function's prompt template moved to a module-level constant (NREM_PROMPT_TEMPLATE, EARLY_REM_PROMPT_TEMPLATE, LATE_REM_PROMPT_TEMPLATE, SYNTHESIS_PROMPT_TEMPLATE, LUCID_PROMPT_TEMPLATE) using str.format() placeholders instead of f-string interpolation. - Synth functions call TEMPLATE.format(...) at use time. Output is byte- identical to the previous f-string implementation. - prompt_hash() now hashes the four pipeline template constants (lucid is on-demand, not part of the nightly manifest — preserves prior scope). - LUCID_DEFAULT_TASK extracted as a named constant from the lucid fallback question (factoring only, no behavior change). - PROMPT_VERSION_* constants and synth function signatures untouched. - v1.1 register-shift comment in synthesize_early_rem preserved inline. The post-fix hash will differ from d41d8cd9 (verified: b65695a1 in static test). Historical manifests still carry d41d8cd9; the discontinuity is intentional — pre-fix hashes were equally meaningless and faking continuity would be worse than acknowledging the break. Found by Track 1 inventory 2026-05-02 (Finding 11 / divergence #11). Verified static import + hash determinism before commit.	2026-05-03 00:24:21 +00:00
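The failure mode is easy to reproduce in isolation. In the sketch below the function bodies, the template text, and the way inputs are joined before hashing are all assumptions made for illustration. Only the symptom, an eight-character prefix of MD5 over the empty string, comes from the commit message.

```python
import hashlib


def synthesize_nrem(chunks):
    # Deliberately no docstring, so __doc__ is None.
    return f"Summarize: {chunks}"


def doc_hash(*funcs) -> str:
    """Pre-fix shape: hash the docstrings, which are all missing."""
    joined = "".join(func.__doc__ or "" for func in funcs)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()[:8]


# Hypothetical template text; the real constants live in dream.py.
NREM_PROMPT_TEMPLATE = "Summarize the following chunks:\n{chunks}"


def template_hash(*templates: str) -> str:
    """Post-fix shape: hash the module-level template constants."""
    joined = "".join(templates)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()[:8]


broken = doc_hash(synthesize_nrem)
fixed = template_hash(NREM_PROMPT_TEMPLATE)
```

Hashing the template constants ties the manifest field to the text that is actually sent, so any edit to a template changes the hash.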
aaron	ec67e19b4f	docs/: track Track 1 inventory and reorg plan These are working artifacts of the 2026-05-02 Track 1 stabilization work. Versioning them alongside the code keeps the operational narrative coherent and gives future sessions clear reference docs. The inventory document includes the cross-repo verification finding on share_time — captured at the document level so future sessions don't repeat the same dead-code mischaracterization.	2026-05-03 00:00:16 +00:00
aaron	4b520b2bc2	api.py: minor cleanups (Track 1 inventory findings) - Fix /auth/check endpoint that referenced undefined SESSIONS (Phase 1 finding — would NameError 500 on every call). Now uses session_exists(token), the live session-validation mechanism defined elsewhere in api.py. - Remove unused DB_PATH ChromaDB-era constant (paired with the ChromaDB directory deletion and aaronai-maintenance.service removal earlier this session). Found by Track 1 inventory 2026-05-02. Cross-repo verification of share_time (third candidate from the original cleanup proposal) revealed it is working stores-and-returns persistence rather than dead code; share_time intentionally not modified. Inventory document edits are committed separately under the docs/ tracking decision.	2026-05-02 23:59:20 +00:00
aaron	7bebd8ae50	api.py: wire up dream_mode setting (Track 1 Finding 9) The dream_mode setting was defined in DEFAULT_SETTINGS and watched by update_settings for reschedule, but run_dream_job never read it — silently-ignored configuration. Two changes: 1. DEFAULT_SETTINGS["dream_mode"] flipped from "nrem" to "pipeline". The default was a latent regression vector: wiring up the setting without changing the default would have silently switched all default-config users from full-pipeline (current production behavior) to NREM-only nightly runs. 2. run_dream_job reads dream_mode at fire-time, validates against {"pipeline", "nrem", "early-rem", "late-rem"}, falls back to pipeline with a warning on invalid values. Lucid intentionally excluded — it is on-demand only by design and remains available via CLI and /api/dreamer/run. Nightly dream production behavior is unchanged for current users (no settings.json key → default "pipeline" → no flag passed → same as before). Users can now meaningfully change the nightly mode by editing settings.json or via the SettingsPanel. Found by Track 1 inventory 2026-05-02 (Finding 9 / divergence #9).	2026-05-02 23:38:29 +00:00
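The fire-time validation described in 7bebd8ae50 might look something like the sketch below. The helper names resolve_dream_mode and dream_cli_args, the logger name, and the "--mode" flag are hypothetical; the set of valid modes, the "pipeline" default, and the no-flag behavior for pipeline come from the commit message.

```python
import logging

logger = logging.getLogger("dream_mode_sketch")  # hypothetical logger name

# Lucid is intentionally absent: per the commit it is on-demand only.
VALID_DREAM_MODES = {"pipeline", "nrem", "early-rem", "late-rem"}
DEFAULT_SETTINGS = {"dream_mode": "pipeline"}


def resolve_dream_mode(settings):
    """Read dream_mode when the job fires; fall back to pipeline on bad values."""
    mode = settings.get("dream_mode", DEFAULT_SETTINGS["dream_mode"])
    if not isinstance(mode, str) or mode not in VALID_DREAM_MODES:
        logger.warning("Invalid dream_mode %r; falling back to 'pipeline'", mode)
        return "pipeline"
    return mode


def dream_cli_args(mode):
    # Hypothetical flag name. Pipeline passes no flag, which matches the
    # pre-change invocation and keeps default behavior unchanged.
    return [] if mode == "pipeline" else ["--mode", mode]
```

Reading the setting at fire-time rather than at schedule-time means an edit to settings.json takes effect on the next nightly run without a restart.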
aaron	3f7fba7e0e	scripts/: separate production from experimental and deprecated Moves 28 experiment scripts to scripts/experiments/ (E1, E1.4, E1.6, E2, base_class, cascade, cost_test, briefing, consistency, token series). Moves 2 dissolved-layer scripts to scripts/deprecated/ (consolidator_v0_1.py, tier1_migration.py — under the bespoke decision both target retired substrate work). Removes 19 .bak* files from disk (gitignored, never tracked; git history is the durable record of every prior version). The 11 production scripts remain in scripts/. All systemd ExecStart paths, api.py subprocess calls, and cron jobs continue to resolve correctly — verified by grep against /etc/systemd/system/aaronai-*.service, scripts/ references in api.py, and the user crontab. Track 1 inventory cross-cutting finding: scripts/ mixed 11 production files with 32 experimental scripts and ~20 .bak files. After this commit a clean-room reader can identify the live workers from a directory listing alone. Found by Track 1 inventory 2026-05-02. See ~/aaronai/docs/scripts-reorg-plan-2026-05-02.md for full reasoning. After commit, run: 1. git log --oneline -3 — show the new commit on top 2. git status — confirm clean working tree (modulo the docs/ untracked files which are intentional)	2026-05-02 23:28:24 +00:00
aaron	6f2d274d5d	api.py: remove 50KB truncation from /api/corpus/retry (completes F14) The F14 fix on 2026-05-01 removed text[:50000] truncation from watcher.py, ingest.py, and corpus_integrity.py. The retry endpoint in api.py was missed — clicking 'Retry' on an ingest-failed file in the SettingsPanel re-introduced the exact truncation pattern F14 was meant to eliminate. Found by Track 1 inventory 2026-05-02 (Finding 2 / divergence #2).	2026-05-02 22:56:33 +00:00
aaron	7615dedf9e	dream: NREM does not exclude prior traces NREM in the reframe is replay-and-consolidation of recent encoded content. Excluding previously_retrieved sources turns NREM into novelty-finding, which is Late REM's job. NREM should re-traverse already-encoded content; that's what consolidation is. The May 2 abort surfaced this — 52 sources accumulated in the exclusion list, all of them in NREM's similarity band for the recurring research/fabrication/teaching query. The dreamer hit zero retrievable chunks not because the corpus was empty, but because everything semantically aligned was excluded. Late REM and Early REM keep the exclusion mechanism — novelty is their job. Session-scoped exclusion (nrem_high_sources flowing into Early REM) also preserved. The 500/400 trim on retrieved_sources is preserved for the remaining stages that still use it.	2026-05-02 21:33:49 +00:00
aaron	1a8e0353f5	stage3_worker: v2.2 — absolute sudo/systemctl paths, error logging, reset failure counter on recovery failure Mirrors stage2_worker v2.1 (`da98019`) resilience fixes: - Absolute paths for /usr/bin/sudo and /bin/systemctl - Log stdout/stderr when sidecar restart fails - Reset consecutive_failures even when wedge recovery fails (prevents permanent stuck state if restart itself is broken)	2026-05-01 18:40:25 +00:00
aaron	da980193dd	stage2_worker: v2.1 — terminal failure states + sudo path fix Three classes of silent failure converted to clean terminal states: - Mistral timeout: previously left rows in zombie state (started_at set, failed_at null, attempts incremented past retry threshold, row invisible to selection query). Now sets failed_at with reason 'mistral_timeout_after_300s'. Surfaced 2026-05-01 when 17 documents accumulated in this state during the Stage 3 saga deadlock incident. - Mistral parse failure: run_mistral returns {'error': 'parse_failed'} on JSON decode failure but process_one wasn't checking, so empty orientation ('Active frames: . Frame relationships: ...') was shipped to Stage 3. This is F22 from the 2026-04-30 code review. Now sets failed_at with reason 'mistral_parse_failure'. - Wedge recovery hammering: consecutive_failures was only reset on successful Ollama restart. With the sudo path bug (also fixed here), recovery always failed, so every subsequent failure re-attempted restart. Now resets the counter regardless and logs the failure visibly. Also: subprocess.run now uses absolute paths (/usr/bin/sudo, /bin/systemctl) instead of relying on PATH, fixing the 'No such file or directory: sudo' error that broke Stage 2's recover_wedge() since deployment. F45-adjacent — sudoers entries were added 2026-05-01 but the PATH issue was masking that fix. Worker version bumped to 2.1 to match Stage 3's resilience patch level.	2026-05-01 17:28:53 +00:00
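The wedge-recovery change in da980193dd combines two ideas: call sudo and systemctl by absolute path, and reset the failure counter whether or not the restart worked. A minimal sketch under stated assumptions follows. The note_failure helper, the injectable run parameter, the 60-second timeout, and the threshold value are illustrative and not taken from stage2_worker.py.

```python
import logging
import subprocess

logger = logging.getLogger("wedge_sketch")  # hypothetical logger name

SUDO = "/usr/bin/sudo"        # absolute paths, so a minimal service PATH cannot break the call
SYSTEMCTL = "/bin/systemctl"
FAILURE_THRESHOLD = 2         # assumption; the commit does not state Stage 2's threshold


def recover_wedge(service, run=subprocess.run):
    """Try to restart a service; log visibly and return False on any failure."""
    try:
        result = run(
            [SUDO, SYSTEMCTL, "restart", service],
            capture_output=True, text=True, timeout=60,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        logger.error("restart of %s could not be run: %s", service, exc)
        return False
    if result.returncode != 0:
        logger.error("restart of %s failed: stdout=%r stderr=%r",
                     service, result.stdout, result.stderr)
        return False
    return True


def note_failure(consecutive_failures, service, run=subprocess.run):
    """Count one failure; at the threshold, attempt recovery and reset the counter."""
    consecutive_failures += 1
    if consecutive_failures >= FAILURE_THRESHOLD:
        recover_wedge(service, run=run)
        return 0  # reset even if recovery failed, so restarts are not hammered
    return consecutive_failures
```

Resetting unconditionally trades a possibly delayed second recovery attempt for the guarantee that a broken restart path cannot trigger a restart on every subsequent failure.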
aaron	b936931668	Stage 3 worker v2.1 — saga-size limit + wedge detection + sudoers fixes Production incident 2026-05-01: F14 re-cascade attempt surfaced three compounding issues in cascade resilience. stage3_worker.py changes: - MAX_CHUNKS_PER_SAGA=10 — large documents split into multiple bulk commits, all sharing the same saga tag for Graphiti document linking. Original implementation sent all chunks as one saga; 17-19 chunk sagas deadlocked sidecar's Python-side coordination. - recover_wedge() function — restarts aaronai-graphiti.service when consecutive_failures hits threshold. Mirrors Stage 2 pattern. - run() loop adds consecutive_failures counter with threshold-2 escalation. Resolves F28 + F29 from code review. - Worker version bumped 2.0 -> 2.1. - post_bulk() helper extracts shared HTTP POST + error handling. Outside-repo changes (system config, separately documented): - WatchdogSec=600 commented in stage2 + stage3 systemd unit files. Workers have no sd_notify support; per-request timeouts in code handle the actual failure modes. - /etc/sudoers.d/aaron-aaronai created with NOPASSWD entries for systemctl restart ollama and restart aaronai-graphiti.service. Stage 2's existing recover_wedge() was silently broken since deployment due to this gap. .gitignore — added rules for *.bak files, runtime artifacts (watcher_heartbeat, dreamer_state.json, corpus_integrity_report.json, watcher_state.json, watcher_status.json), Python cruft, virtual env, .env, editor/OS files, and Aaron AI runtime data (conversations.db, sessions.db, memory.md, settings.json). Untracked 11 files that shouldn't have been committed in `465f2f7` (this morning): backup files and runtime artifacts. Re-cascading Shop Class (414KB) and BirdAI-Experiments-Log.md (192KB) through the patched worker after re-extracting full text from disk. Cascade in progress at commit time.	2026-05-01 05:18:09 +00:00
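The saga-size limit from b936931668 amounts to slicing a document's chunks into batches of at most MAX_CHUNKS_PER_SAGA that all carry the same saga tag. The sketch below shows only that slicing. The split_saga name and the payload shape are hypothetical, and the real worker posts each batch to the sidecar through post_bulk(), which is not reproduced here.

```python
MAX_CHUNKS_PER_SAGA = 10  # value from the commit message


def split_saga(chunks, saga_tag, max_chunks=MAX_CHUNKS_PER_SAGA):
    """Split one document's chunks into bulk commits that share a saga tag."""
    return [
        # Hypothetical payload shape; the sidecar's real schema is not shown in the log.
        {"saga": saga_tag, "chunks": chunks[start:start + max_chunks]}
        for start in range(0, len(chunks), max_chunks)
    ]
```

With this limit, a 19-chunk document like those in the incident becomes two commits of 10 and 9 chunks, and the shared tag still lets them be linked as one document.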
aaron	465f2f725b	Code review fixes: CV pinning, F1 (excluded_sources), F14 (50KB truncation), F37 - api.py: strip CV pinning workaround (parity violation, see architecture doc) - dream.py: F1 — retrieve_graphiti() now accepts excluded_sources, over-fetches 3x and filters in-process. Was silently dropping the parameter; would have confounded E3 with broken cross-stage exclusion in Graphiti arm. - watcher.py + ingest.py: F14 — drop full_text[:50000] truncation. Was propagating through entire cascade. Postgres TEXT can hold up to 1GB. - corpus_integrity.py: F37 — same truncation, third path now clean. Backups: api.py.bak.*, dream.py.bak.*, watcher.py.bak.*, ingest.py.bak.*, corpus_integrity.py.bak.* timestamped pre-fix. Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected by F14, 414KB).	2026-05-01 02:26:37 +00:00
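The F1 fix in 465f2f725b, over-fetching 3x and filtering in-process, can be illustrated with a generic helper. This is a sketch and not retrieve_graphiti() itself: the search callable, the hit dictionaries with a "source" key, and the function name are assumptions made for the example.

```python
def retrieve_filtered(search, query, k, excluded_sources=None, overfetch=3):
    """Fetch overfetch*k candidates, drop excluded sources, return at most k hits."""
    excluded = set(excluded_sources or ())
    # `search` is a hypothetical callable: search(query, limit) -> ranked list of hits.
    candidates = search(query, k * overfetch)
    kept = [hit for hit in candidates if hit["source"] not in excluded]
    return kept[:k]
```

One consequence of filtering after retrieval is that the result can hold fewer than k hits when more than (overfetch - 1) * k of the candidates come from excluded sources.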
aaron	25e42c0231	corpus_integrity.py: write unreadables with retry_count=0 so OCR can retry when it ships	2026-04-30 22:03:48 +00:00
aaron	7822fb1cc1	corpus_integrity.py: write unreadable files to ingest_failures for UI visibility	2026-04-30 21:59:06 +00:00
aaron	74e2c34f43	corpus integrity: ingest_failures tracking in watcher, reconciliation script, corpus status/retry/reconcile endpoints	2026-04-30 21:54:39 +00:00
aaron	655dea6ae5	add remaining experiment result files	2026-04-30 18:06:52 +00:00
aaron	f11cacd9c9	add experiment scripts and results; watcher.py latest changes	2026-04-30 18:06:03 +00:00
aaron	1cf26df450	api.py: return error_type=transcription_failed on Whisper crash, frontend retry logic can now distinguish from network failures	2026-04-30 17:45:47 +00:00
aaron	7cd765146a	stage3_worker.py: log sidecar response body on non-200	2026-04-30 17:37:28 +00:00
aaron	58515ebec0	graphiti_service.py: add traceback logging, log file handler for all endpoints	2026-04-30 17:36:19 +00:00
aaron	91166367fa	E3: add Graphiti retrieval branch to dream.py, E3 experiment script with blinding	2026-04-30 17:17:28 +00:00
aaron	2b3c2380a0	watcher.py: in-process ingest, embedder loaded once at startup, startup recovery, heartbeat, no duplicate logging	2026-04-30 16:42:44 +00:00
aaron	2fb50cce71	ingest.py: guard Stage 2 enqueue behind SKIP_STAGE2_ENQUEUE env var for migration runs	2026-04-30 16:20:11 +00:00
aaron	c08f57a6f2	stage2/3 workers: remove duplicate StreamHandler, stdout captured by systemd	2026-04-30 16:12:51 +00:00
aaron	cae7fb8775	dream.py v1.1: score-band exclusion for Early REM, DREAMER_VERSION constant, manifest versioning	2026-04-30 15:51:11 +00:00
aaron	b53717af5b	dream.py: enrich manifest with retrieval breadth metrics	2026-04-30 06:14:55 +00:00
aaron	2b9a1782c1	feat: stage2/3 pipeline, taxonomy-free cascade, E1.8/E4 experiments, corpus migration state	2026-04-30 04:04:31 +00:00
aaron	62b5b5453a	fix: max_coroutines=2, saga support in sidecar; stage3 chunking; TIMEOUT_MAX 0 persistent in falkordb compose	2026-04-30 04:01:02 +00:00
aaron	95d022ec64	fix: FalkorDriver database=aaron, build indices on correct graph	2026-04-29 21:34:20 +00:00
aaron	d91a5675ff	capture: public SSE endpoint for transcription completion events	2026-04-29 18:00:54 +00:00
aaron	c42d898504	emit capture_saved SSE event when async transcription completes	2026-04-29 17:58:01 +00:00
aaron	a05fcec882	async voice transcription — return immediately, whisper runs in background	2026-04-29 17:48:22 +00:00
aaron	eb7cf3be10	upgrade whisper small -> large-v3, bump cpu_threads to 8	2026-04-29 17:35:03 +00:00
aaron	3f6c435be4	add client_time to chat context — user-supplied, not logged	2026-04-29 17:26:03 +00:00

87 Commits