aaronAI

Author	SHA1	Message	Date
aaron	f682d8c6a0	dream_observation.py: Stage 1 + 2 of the design spec — observe and select Implements `dreamer-design-spec.md` lines 27-74: observe_corpus() returns a signal vector (new_chunks delta, new_journal_entries, recent_questions over 14-day window, days_since_dream, underprocessed_count derived from the new consolidation cursor); select_mode() returns one of {nrem, early-rem, late-rem, lucid} or None per the spec's rules. The None return is the spec's canonical answer to the repetition problem (line 67) — "dreamer goes quiet rather than manufacturing novelty." Standalone for now. Not wired into dream_pipeline yet — that happens in the retrieve() refactor (task #46). dream.py is unchanged in this commit. Grounded sources cited in module docstring: Friston Active Inference, sleep research (Stickgold/Walker/Diekelberg & Born), sharp-wave ripples (Buzsáki). All three appear in BirdAI-Bibliography.md. Migration prerequisite (already shipped in the prior commit): consolidation cursor columns last_consolidated_at + consolidation_count added to embeddings. Backfill from dream-manifest history is task #49.	2026-05-20 17:57:38 +00:00
aaron	151c756b89	api.py: async chat-turn push to Graphiti After chat() returns, fire-and-forget background thread POSTs the (user message + assistant response) as one episode to /episodes. Default extraction (Sonnet). Errors logged, never raised — chat is not gated on the write. Wall-clock cost in the background is ~20 min per episode against the current ~4,300-entity graph. The chat experience is unaffected; the graph catches up with a delay. Search_facts queries reflect new turns once the sidecar has finished processing them. Kill-switch: SKIP_GRAPHITI_CHAT_PUSH=1 in the api service environment disables the push without code changes. Useful if dedup contention surfaces under sustained load. Companions to this commit: search_facts tool (`e96bf40`), orientation indexer worker (`e96bf40`), FalkorDB vector index patches (`d2ec20e`, `313c0f0`).	2026-05-20 05:08:07 +00:00
aaron	e96bf40b2f	plan B: search_facts chat tool + orientation indexer (read-only Graphiti) After establishing that single-episode Graphiti writes take ~20 min against the existing graph (the dedup loop is structurally slow regardless of the patches, the bridge, or the LLM model), the salvage plan is to stop trying to write to Graphiti and instead: 1. Use the existing 4,300-entity graph as a read-only fact layer at chat time via a new search_facts tool. Graphiti's /search endpoint is fast (~15ms direct, ~400ms over HTTP); the graph is stale-as-of-early-May but covers most biographical / relational content that "write me a bio" and similar queries care about. 2. Pipe Stage 2's document-level orientations into pgvector via a new orientation_indexer worker. Stage 2 already runs and writes orientation text to stage_3_queue for every Mistral-processed document; the worker reads those, embeds them, and writes one row per source to embeddings with metadata->>'kind'='orientation'. retrieve_documents now ranks against both chunk text and document-level concept summaries. Idempotent: the indexer's "is this already indexed" check is an EXISTS subquery against embeddings, so restarts and partial runs are safe. Out of scope (deliberately): no Graphiti writes from chat, no Stage 2 -> Graphiti bridge, no draining the 711-item stage_3_queue backlog into Graphiti. Rich-extraction posture stays a BirdAI concern.	2026-05-20 05:00:03 +00:00
aaron	313c0f0341	graphiti_service.py: bridge driver._search_ops to driver.search_interface graphiti-core 0.29.0 builds FalkorSearchOperations as driver._search_ops in FalkorDriver.__init__ but never assigns it to driver.search_interface. search_utils.py dispatches on search_interface; without this one-line bridge it falls back to interpreted-Cypher cosine math doing full table scans for every entity dedup similarity check. Combined with the vendored patches in graphiti_patches/ (restored in the previous commit `d2ec20e`), this activates FalkorDB's native vector index for the dedup similarity path. Empirical impact (per the original `f645b74` commit message): single-episode add_episode against a ~4,277-entity graph went from indefinite hang to ~8.2 seconds. Surgical restore: cherry-picks only the bridge code from `f645b74` — not the Pattern 1 async job model, not the v2.4 extraction instructions, neither of which we want. Default extraction posture (taxonomy-naïve) stays the operating mode. Rich-extraction story remains a BirdAI concern.	2026-05-20 04:06:46 +00:00
aaron	10bb29290a	watcher: handle deletes; sweep_orphans cleans existing phantom chunks watcher.py now listens for on_deleted events and treats on_moved destinations that fall outside NEXTCLOUD_PATH (Nextcloud trashbin, moves to other volumes) as deletes. Both cases call delete_embeddings_for_path (DELETE WHERE metadata.filepath = ...) and remove_from_state to drop the file from watcher_state.json so it isn't carried as known-mtime. Match is by metadata.filepath, not source basename, so files that share a name across folders don't collide. scripts/sweep_orphans.py is the one-time cleanup for chunks the watcher missed before this fix: - Modern pass: rows with metadata.filepath whose file no longer exists. - Legacy pass: rows with NULL filepath and type='document' whose basename isn't anywhere on disk. type='document' restriction skips conversations and memory snapshots (synthetic sources, not files on disk). First run cleaned 629 rows: 628 from moved-file duplicates (e.g., BirdAI docs that traveled across Journal/, Library/, Journal/Projects/BirdAI/) plus the AARON_NELSON_BIO.pdf phantom Aaron flagged.	2026-05-20 02:52:00 +00:00
aaron	9bb083f065	chat: cap retrieve_documents per turn, truncate displayed citations, broaden lock-file skip - MAX_RETRIEVALS_PER_TURN (5): after five retrieve_documents calls in a single turn, further calls return a budget-exhausted message instead of executing. Caps cost on runaway multi-query loops without forbidding compound questions. - MAX_CITED_SOURCES (5): accumulated_sources was growing to 14+ entries across multiple tool calls and showing chunks Claude never actually used. Cap the list returned to the UI at 5, preserving insertion order so the highest-relevance early-call results survive. Proper fix (Claude-driven inline citations) is bigger work, noted for later. - ingest.py lock-file skip: changed prefix tuple from ("~$", ".") to ("~", ".") so it catches Office lock files even when Nextcloud's filesystem encoding has mangled the "$" into a unicode replacement char. Matches what watcher.py already does.	2026-05-20 02:22:54 +00:00
aaron	430ea239dd	api.py: drop save_document preview escape hatch — two-turn separation now unconditional Previous prompt let Aaron skip the preview if he asked up front. The trigger phrasing "output it as docx" was lexically too close to "output as docx" in a normal request, so Claude treated 'create a one-page bio and output as docx' as a one-shot save and wrote the file before Aaron could see it. Removed the escape hatch. Draft-then-commit is now the only flow.	2026-05-20 01:06:40 +00:00
aaron	0a1e2b4f61	api.py: preview-then-commit flow for save_document The previous system prompt instructed Claude to skip duplicating document content in chat and write the file directly. That produced no-preview UX: the user asked for a bio and the docx appeared in Drafts/ before they had a chance to read or refine it. Reversed: Claude now drafts in chat first, waits for an explicit save signal, and only then calls save_document. The explicit "skip preview" escape hatch is preserved for one-shot flows.	2026-05-20 01:01:45 +00:00
aaron	8c2c597687	api.py: save_document — distinguish PATH miss from missing install in error The systemd unit pins PATH to the venv only, so subprocess.run(['pandoc', ...]) raised FileNotFoundError even though pandoc was installed at /usr/bin/pandoc. The handler's "pandoc not installed" message was misleading — pandoc was reachable from a login shell but not from the service. Rephrased to point at the actual cause: the service's PATH. The systemd drop-in to extend PATH is not committed here (lives at /etc/systemd/system/aaronai.service.d/path.conf on the host).	2026-05-20 00:51:41 +00:00
aaron	fda61ad622	api.py: save_document tool — pandoc render to Nextcloud Drafts/ via WebDAV Claude can now write docx or pdf files to Aaron's Nextcloud Drafts/ when he asks for a document (bio, cover letter, statement, CV section) rather than chat text. Pandoc handles markdown -> docx and markdown -> pdf with the xelatex engine. Upload is a WebDAV PUT against the same Nextcloud instance dream.py already uses; NEXTCLOUD_URL / NEXTCLOUD_USER / NEXTCLOUD_PASSWORD in .env are reused. MKCOL ensures Drafts/ exists; PROPFIND-based collision check appends _2, _3, ... until unique. Filename sanitization strips path components and unsafe characters. System prompt instructs Claude to call save_document when the user wants a file (not chat text) and not to duplicate the file contents in the chat response — just write the file and tell Aaron where it landed. ingest.py and watcher.py now skip files under Drafts/ at ingest time so generated drafts don't pollute future retrieval. Drafts can still be opened, edited, and shipped; they just don't become part of the searchable corpus unless Aaron explicitly moves them out of Drafts/.	2026-05-20 00:41:26 +00:00
aaron	84994f9282	api.py: prompt-cache system prompt and memory across tool_use round-trip Move persistent memory from the user message into system blocks with cache_control: ephemeral on the last block. The static prefix (system prompt + memory, ~3-5K tokens typically) is identical between the two LLM calls of a tool_use round-trip and stable across turns within the 5-minute cache TTL. Without this, the tool-call retrieval architecture roughly doubled input token cost on retrieval-needed turns (full context billed twice). With cache reads at ~10% of standard input, the duplication cost drops by ~90% — the "twice as expensive" hit becomes "slightly more expensive plus tool overhead." client_time stays in the user message (per-turn dynamic, should not be in the cached prefix).	2026-05-19 23:13:43 +00:00
aaron	9e86297e2a	api.py: tool-call retrieval, drop the keyword intent classifier Removes classify_retrieval_intent and the type/folder filter parameters on retrieve_context. The keyword classifier was the same anti-pattern as the formatting-driven docx chunker: a heuristic that locks the user into specific phrasings and fails silently on anything novel. A scope enum (personal / library / conversations / memory) would have been the same heuristic in a fancier wrapper — the categories themselves are mine, not Aaron's. New shape: a retrieve_documents tool exposed to Claude. Tool takes a single query argument; the model decides when to call it, what to search for, and how many times per turn (multi-query falls out naturally for compound asks). Pre-LLM retrieval is gone — memory still rides as ground truth in the prompt, but corpus content is fetched on demand by the model with concrete queries it crafts itself, not the user's raw phrasing. retrieve_context is now pure: hybrid retrieval + cross-encoder rerank + dedup, no filters. The reranker ranks, the model judges relevance. When ranking fails (e.g. abstract instructional queries pulling philosophy books), the right fix is a better reranker, not another query-time taxonomy. That work is acknowledged but deferred. System prompt updated to teach the model about the tool and to prefer concrete tokens (named entities, project names, course codes) over abstract phrasing when constructing search queries.	2026-05-19 23:05:25 +00:00
aaron	9955c7e383	encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak extract_blocks(filepath) is the new structured-extraction entry point, returning list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading prepended for retrieval context and stored in metadata). - pptx: one block per slide. Slide title becomes block heading; speaker notes fold into the body. Image-only decks with title-only slides now produce heading-only chunks instead of being recorded as extraction failures. - docx: deliberately single-block (back-compat). Heading-style section detection was implemented and rolled back: hand-formatted CVs are Normal-styled with bold-as-heading, and tying chunk boundaries to formatting choices would lock future-user into preserving those choices forever. Lexical + cross-encoder retrieval already handles substring matching inside blind-chunked CVs. - pdf/txt/md: unchanged (single block, blind chunking). Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it as secondary sort key in _rerank so memory/journal snapshots prefer the latest copy among near-duplicate content. reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a subset; previous hardcoded delete regex would have wiped both even with a single-ext target.	2026-05-19 21:58:25 +00:00
aaron	50b97e2998	api.py: folder-aware retrieval, near-duplicate dedup, folder in citations Three refinements to retrieve_context, all keyed off observed failures from test_retrieval.py: - Library/personal split. classify_retrieval_intent now returns (type_filter, folder_exclude_prefixes). Biographical document intent excludes Library/* so philosophy/cognition books stop crowding out CVs and dossiers for queries like "write me a bio". - Near-duplicate collapse. Multi-folder copies of the same file (e.g., several Teaching Philosophy.pdf in different application folders) used to fill the top-N with the same content. Dedup by first-300-chars hash after rerank. - Folder in source citations. Surface metadata.folder alongside basename so the LLM can disambiguate among 21 CV.docx variants and the user can see which copy a citation refers to. Also: bump hnsw.ef_search to 500 when a WHERE filter is present. pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a restrictive filter that excludes the nearest neighbors otherwise returns empty.	2026-05-19 21:35:28 +00:00
aaron	8d560f9f5e	api.py: hybrid retrieval with intent routing and cross-encoder rerank Replaces pure-dense top-8 retrieval with a three-stage pipeline: - BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel, fused with Reciprocal Rank Fusion - Optional type filter driven by classify_retrieval_intent() so questions about prior conversations don't pull documents and vice versa - Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before taking final top-N Also adds scripts/reindex_docx_pptx.py — one-off re-ingest used to recover table/header/text-box content in docx and pptx after the `93c0d89` extractor upgrade — and scripts/test_retrieval.py to exercise the new pipeline against representative queries. Schema: requires GIN index on to_tsvector('english', document) (already created out-of-band via psql since Apache AGE in shared_preload_libraries blocks ALTER TABLE on this database).	2026-05-19 21:11:15 +00:00
aaron	732e450d21	Stop silent data loss in voice capture pipeline Empty transcripts and transcription failures previously deleted the temp audio and returned without writing any record to disk — violating parity-at-encode (raw content is episodic context, not noise). - Preserve audio in Journal/Media/YYYY-MM/ on all paths (success, empty, failure) instead of unlinking. - Write a markdown entry to Journal/Captures/ on failure paths with status, audio_path, and error fields. - Add status: saved to successful captures so frontmatter is uniform across success and failure. - Fire SSE capture_saved events on all terminal paths, with status included. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 23:41:51 +00:00
aaron	63c58b5bb3	Extend session lifetime to 365 days Single-user personal app threat model is theft-of-device, not stolen-cookie. 30-day idle re-prompts created friction without proportional security benefit. Server TTL and client max-age remain in sync via shared constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 23:29:38 +00:00
aaron	6c2af55e7e	Server-side session TTL enforcement - session_exists() now rejects rows older than 30 days, matching the client cookie max-age. - Opportunistic cleanup of expired rows on session_exists() calls, preventing unbounded growth of sessions.db from orphaned tokens (PWA reinstalls, manual cookie clears). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 23:28:39 +00:00
aaron	5b4a299414	encoding.py: write_embeddings_batch accepts commit parameter for transactional composition Adds an optional commit=True parameter to write_embeddings_batch. When True (default, matching prior behavior), the function commits the connection after the per-row UPSERT loop. When False, the caller manages the transaction. This unblocks fix #1 (pgvector-bypass paths) and fix #2 (watcher two-transaction pattern), both of which need to compose embeddings writes with other database writes in the same transaction. Without this lever, either fix would require duplicating the UPSERT logic outside this helper or introducing a second commit boundary inside an otherwise atomic operation. No behavior change for existing callers — they all use the default commit=True and continue working unchanged.	2026-05-05 02:52:33 +00:00
aaron	b09e35892c	encoding.py: strip frontmatter from .md at extraction time The capture endpoint (api.py:702, 833) writes Journal/Captures/.md files with a markdown-bold-style header block (`type:* voice`, `modality: audio`, `status: unprocessed`, optional `media:` and `project:`) followed by a `---` separator. extract_text for .md was a bare filepath.read_text, so every capture-derived chunk in pgvector embedded the frontmatter as raw text, polluting retrieval. Fix adds _strip_md_frontmatter, called only for the .md branch: - Capture-style: optional leading H1 (preserved), then consecutive `key: value` lines (and blanks), terminated by `---`. The H1 is retained; the key/value block + separator are removed. - YAML-style: file's first non-empty line is `---`, terminated by `---`. Only triggered when no heading precedes — guards against the common `# Title` + `---` (horizontal rule under heading) pattern seen in Journal/aaronai-architecture.md and four other Journal/.md files. Body `bold:` lines (e.g. `Visual description:` in image captures) and body `---` horizontal rules are never touched: the scan aborts as soon as a non-frontmatter line appears in the leading block. briefing_generator_v2.py's split("---", 1) heuristic was reviewed and not reused — fragile on substring matches and on documents with multiple `---` rules. Verified against: - 2026-04-26-22-44-voice.md: frontmatter stripped, body retained, H1 retained. - 2026-04-27-04-34-image.md: frontmatter stripped, `Visual description:` and `Voice annotation:` body bold-headers retained, trailing `---` not consumed. - Journal/aaronai-architecture.md (5 body `---` rules): output byte-identical to read_text (96101 chars). - Synthetic YAML doc: stripped correctly when no leading heading. - Synthetic plain markdown with body `---` rules: untouched. - Empty input + heading-only file: untouched. Existing capture chunks in pgvector retain polluted text; the fix only affects future extractions. Backfill decision deferred — the cleanest path is `touch -h Journal/Captures/.md` to bump mtime and let the watcher re-ingest naturally on the next cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 02:20:55 +00:00
aaron	e38d283e59	watcher.py: exclude 3 image-only pptx files from ingestion Three files in the original ingest_failures cohort have been characterized via direct OCR and confirmed to lack ingestible text: - Presentations/Renders.pptx — 35 PICTURE-shape renders, 33/35 zero-char on OCR, 2 with noise (20 and 29 chars). - Presentations/Ribbon Cutting Slideshow.pptx — 10-slide event photo deck, 9/10 zero-char, 1 with 17 chars of noise. - Academic/DDF555 3D Computational/GH Slicer Notes [Autosaved].pptx — Office autosave duplicate of GH Slicer Notes.pptx; first 9 images byte-identical (sha256) to the canonical file. 2 net-new images contribute 36 noisy chars. Excluding to prevent double-embedding the same content under two source filenames. Pattern matches `f18fb64` (path.parts membership). Folder-level globs were considered and rejected: /Presentations/ contains successfully embedded text-bearing decks (aaronnelson_3D 4D.pptx, aaronnelson_slideslam.pptx). Exact-name + parent-folder membership applied in both watcher filter sites (get_changed_files and IngestHandler._should_ignore). The fourth file in the cohort, GH Slicer Notes.pptx (the canonical non-autosave version), was confirmed to carry 379 chars of real text (Grasshopper UI / code samples) across 6/9 images. It remains in ingest_failures unresolved, awaiting the eventual ocrmypdf backlog pass. Cleanup: 3 ingest_failures rows resolved (the excluded files). Unresolved count: 94 → 91. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 01:42:40 +00:00
aaron	7b77794319	api.py: enable PRAGMA foreign_keys=ON in _connect helper; clean up 2 message orphans The messages table declares FOREIGN KEY (conversation_id) REFERENCES conversations(id), but PRAGMA foreign_keys was never enabled — SQLite defaults it to OFF per connection, and _connect() did not set it. Two orphan rows existed in messages (conversation_id='test123' pointing at a never-existing conversation; both rows from one ~11-second test event on 2026-04-26). Audit before changing the PRAGMA: - All FOREIGN KEY declarations across both DBs (conversations.db, sessions.db) accounted for via PRAGMA foreign_key_list on each table. Only one FK exists: messages.conversation_id -> conversations.id, ON DELETE NO ACTION. - All tables enumerated via sqlite_master. Two tables in conversations.db (conversations, messages); one in sessions.db (sessions). No surprises. - PRAGMA foreign_key_check confirmed exactly the 2 known orphans and zero violations elsewhere. Both delete paths in api.py (delete_conversation at :471, and clear_all_conversations at :986) already delete from messages BEFORE conversations, so cascade behavior was correct in code. The orphan state was caused by a direct INSERT against a non-existent conversation_id at chat-test time, which an unenforced FK silently accepted. Turning the PRAGMA on prevents this class of bug at insert time, not delete time — no delete-path code changes were needed. Order of operations followed the constraint that orphan cleanup must precede PRAGMA-on (SQLite would not retroactively delete orphans, but foreign_key_check would surface them confusingly on any future operation that touched the messages table): 1. DELETE FROM messages WHERE conversation_id NOT IN (SELECT id FROM conversations) — removed the 2 known orphans. 2. Added PRAGMA foreign_keys=ON to _connect() so every connection from _connect_conversations() and _connect_sessions() gets FK enforcement (SQLite requires per-connection setting). 3. Restarted aaronai.service. Verification: - Smoke: GET /api/conversations and /api/conversations/{id}/messages both return 200 with expected payloads against the live api. - E2E single-delete: synthetic conversation + 2 messages inserted via the api's _connect helper (FK on); DELETE /api/conversations/{id} via the live endpoint removed both rows from both tables. - Clear-all e2e: skipped on live DB (destructive) — code shape is structurally identical to single-delete, no FK-relevant logic difference. - Load-bearing negative test: INSERT into messages with a non-existent conversation_id via _connect_conversations() raised sqlite3.IntegrityError("FOREIGN KEY constraint failed"). This is what proves the PRAGMA actually took effect, not just that we set it. Final counts: 7 conversations, 290 messages (down from 292 by the 2 orphans cleaned up). Note: an explicit BEGIN/COMMIT around the two-execute delete paths was considered and skipped. SQLite's implicit-transactional default already gives the atomicity needed; explicit transactions would be clarity-only and belong in a separate commit.	2026-05-04 16:41:55 +00:00
aaron	d985f9e91e	dream.py: raise_for_status on manifest writes; total_chunks as actual corpus count Two correctness bugs in dream_pipeline manifest assembly. write_manifest at lines 487-491 swallowed HTTP 4xx/5xx responses silently. requests.put() only raises on transport-level errors (DNS, connection refused, timeout); 401/403/500/507 come back as Response objects and never trigger the except. The code printed "Manifest written" while the manifest never persisted. The same file's deliver() function at line 434 already used response.raise_for_status() — the pattern was already established, write_manifest just skipped it. Fix: bind the response and call raise_for_status() before the success print. The except message changes from "(non-critical)" to "manifest not persisted" because HTTP failure now means manifest data was lost, which is critical, not quiet. corpus_data["total_chunks"] at lines 621-622 stored delta["new_chunks"], duplicating the sibling field new_chunks_since_last_dream. The field name claimed absolute corpus size; the value was a delta of recently-touched files. Verified in live manifests: total_chunks: 0 while pgvector held 11,379+ document embeddings. Fix: query SELECT COUNT() FROM embeddings inside dream_pipeline, store as total_chunks. Tightly-scoped one-shot connect via the existing get_pg() helper. Telemetry query failure is treated as non-critical and falls back to 0 — pgvector hiccup should not crash an otherwise successful dream pipeline. Bonus finding (not fixed in this commit): new_chunks_since_last_dream is itself misnamed. observe_corpus() reads the watcher's mtime cache and counts files (not chunks) whose mtime is newer than last_dream. Both fields were "files touched since last dream" duplicated under two different names; this commit fixes only the total_chunks semantics. Renaming new_chunks_since_last_dream is out of scope — manifests are write-only telemetry today, no consumer reads either field, and the rename is a separate decision. Verification: real pipeline run produced manifest with total_chunks matching SELECT COUNT() directly; doubled as a smoke test for the embedder cache (single Loading weights line), type_distribution propagation, and the manifest write success path.	2026-05-04 16:29:04 +00:00
aaron	b9eea6cb62	watcher.py: extend lockfile filter to catch UTF-8-mangled ~$ prefixes Three rows in ingest_failures were Office lockfile leftovers whose filename starts with ~� (~ followed by the UTF-8 replacement character) instead of ~$. Somewhere in the Nextcloud sync chain the $ byte was lost or replaced; the file now lives on disk as a real file with this corrupted name. The watcher's ("~$", ".") prefix filter didn't match, so each cycle tried to ingest these as pptx, hit BadZipFile inside python-pptx (lockfiles aren't real Office documents), and they ended up permanently in ingest_failures. Three filter sites in watcher.py applied the lockfile prefix check: - ingest_file() at :127 - get_changed_files() at :200 - IngestHandler._should_ignore() at :290 All three now match ("~$", "~", ".") — broadened to catch any tilde prefix, not just ~$. The cross-check against pgvector embeddings and disk found zero legitimate tilde-prefixed files in the corpus, so the broader filter has no false-positive risk in this corpus. Cleanup: 3 ingest_failures rows resolved (filepath LIKE '%/~%'). Unresolved count drops 97 → 94. If a fourth filter site is ever added, the right shape is consolidating the lockfile prefix check to a shared function or constant. Three parallel sites with three different tuple orderings is acceptable for now but worth normalizing if the surface grows.	2026-05-04 16:19:56 +00:00
aaron	93c0d89308	encoding.py: extend docx and pptx extractors to walk tables, headers/footers, text-boxes, group shapes, and notes The previous extractors walked only top-level body paragraphs (docx) and top-level shape.text (pptx). Diagnostic on the 17 non-PDF "no_text" ingest failures revealed that 13 docx files in the failure cohort have 100% of their content in tables (paras_with_text=0, table_cells=6-108). These are syllabi, rosters, rubrics, and homework worksheets structured as a single document-wide table — high-value academic content the corpus was silently missing. docx walker now covers: - body paragraphs (existing) - tables, including nested tables in cells (recursive helper) - header and footer paragraphs per section - text-box content via XPath against w:txbxContent (no first-class API in python-docx; future-proofing — none of the current failure cohort has text-boxes) pptx walker now covers: - top-level shape text (existing) - recursive descent into group shapes - table cell text via shape.has_table / shape.table.iter_cells() - speaker notes via slide.notes_slide.notes_text_frame.text Out of scope: SmartArt diagrams, chart titles/labels, OLE objects, content controls. None of the current failure cohort has these. Recovery: 13 of 17 failures now ingest successfully. The 4 remaining are image-only pptx files (Renders.pptx, Ribbon Cutting Slideshow.pptx, two GH Slicer Notes variants — all PICTURE-shape decks with no text in any walkable structure). They stay in ingest_failures unresolved, awaiting OCR or path exclusion. Side effect worth noting: the regression check on 4 known-good files that were already producing embeddings showed all four gained content under the new walker — a Mod03 pptx grew from 23,993 to 57,462 chars (+33,469), Braskem Report docx grew 33,050 to 38,977 (+5,927), DDF MA program docx grew 37,210 to 47,603 (+10,393), SUNY PIF GRANT pptx grew 22,259 to 23,546 (+1,287). These files have been in the corpus all along with table or notes content silently dropped. They will surface the additional content on next re-ingest, improving retrieval quality for any future query that touches them. Cleanup: ingest_file already calls resolve_ingest_failure on successful ingest, so the 13 recovered files were marked resolved=TRUE during the retry pass. No separate cleanup SQL was needed.	2026-05-04 16:12:56 +00:00
aaron	f18fb64fe5	watcher.py: exclude generative-graphic folders and zero-byte files Two-sample diagnostic of the 128 ingest_failures rows surfaced two folders whose contents are exclusively non-text PDFs (iText-produced generative graphics from Processing sketches and computational design sketches) and three zero-byte test artifacts. None of these have ever produced an embedding chunk, and they have nothing extractable to contribute. Excluding them removes 19 / 128 (15%) of the locked-out failures from the cohort and prevents future versions of the same patterns from re-failing. Folder exclusions use path.parts membership rather than substring matching — eliminates false-match risk if similarly-named folders appear elsewhere in the corpus (e.g. an unrelated "Generative Design" or "Computational Design 2017" directory created later). The existing "Admin/Backups" / "Journal/Media" substring checks are looser, but new exclusions take the tighter pattern. Zero-byte filter goes in get_changed_files() only — the actual ingestion gate. Adding stat() to _should_ignore() (the FS-event noise filter) would introduce a race where the file is gone between event fire and stat call. Empty files briefly trigger pending=True but produce no work after debounce; cosmetic only. Cleanup applied separately via UPDATE: 19 ingest_failures rows for these paths marked resolved=TRUE. Unresolved-failure count: 129 -> 110. Verified: get_changed_files() with empty state returns 1418 changed files; all 5 excluded probes (2 folder-matched + 3 zero-byte) absent from the result, control file present. Watcher service restarted clean; startup scan reports no missed files.	2026-05-04 06:24:08 +00:00
aaron	72e07afc03	watcher.py: do not mark failed ingests as successfully ingested ingest_files() updated state[path] = mtime unconditionally after every ingest_file() call. ingest_file() returns 0 when text extraction fails, embedding fails, no chunks are produced, or the pgvector write fails — in every one of those cases, the path was still recorded as ingested at the current mtime. On the next pass, get_changed_files() saw the mtime match and skipped the file, locking it out of the corpus until something modified it on disk. record_ingest_failure() writes to a UI-visible failures table, but nothing reads that table to retry. So failures accumulated silently: the file was simultaneously logged as failed AND tracked in watcher_state as up-to-date, and the second condition won. Fix: only update watcher_state when ingest_file returns count > 0. Failed ingests will be retried on the next watcher cycle until they succeed or are explicitly excluded. Diagnostic at fix time: 129 rows in ingest_failures, 128 currently locked out of the corpus (filepath in watcher_state with mtime matching current disk). 128/129 are text_extraction failures, mostly scanned PDFs (106 .pdf, 13 .docx, 7 .pptx, 2 .md, 1 .txt). 1 source no longer exists on disk. 0 have had their disk mtime change since failing — i.e. without this fix, none of them would ever retry. Cross-check shows watcher_state has 1466 paths vs. 1061 distinct sources in pgvector embeddings, leaving a residual silent-gap of ~276 files after accounting for failures. Historical cleanup of files already locked out by this bug is tracked separately. New failures from this commit forward will retry naturally.	2026-05-04 03:52:01 +00:00
aaron	c3011c80a5	api.py: route all sqlite3.connect() through helpers; enable synchronous=NORMAL per-conn Followup to `4204806` (WAL + index + backup.sh). The previous commit deferred synchronous=NORMAL because it's a per-connection PRAGMA and api.py has 16 sqlite3.connect() call sites — setting it once at init would have applied to nothing afterwards. Adds three helpers near the *_DB constants: - _connect(path): inner; sets PRAGMA synchronous=NORMAL and uses timeout=5.0 (5000ms busy_timeout) on every new connection. - _connect_conversations(), _connect_sessions(): named wrappers so call sites read explicitly. Mechanical replacement at all 16 call sites: 4 sessions, 12 conversations. No semantic change beyond the PRAGMA + busy_timeout — every site still opens-then-closes, no held-open connections. busy_timeout=5000ms is cheap insurance: under WAL with api.py as sole writer, contention should be near-zero, but the backup.sh online-backup path briefly holds a read lock on the source, and any future second writer would otherwise hit SQLITE_BUSY immediately on contention. Combined effect with WAL: per-write fsync count drops from ~2 to ~1 (WAL alone) further reduced by synchronous=NORMAL deferring fsyncs to checkpoint boundaries. No durability loss for the use case (single host, app crash tolerated, OS crash gives at most one lost transaction). Not included: foreign_keys=ON. Audit found 2 orphan rows in messages (conversation_id pointing to deleted conversations) and untested write paths that could begin raising IntegrityError. Tracked as separate followup: inspect orphans, identify the delete path that didn't cascade, clean up, then enable enforcement and test chat delete flow end-to-end.	2026-05-04 03:39:13 +00:00
aaron	4204806c80	conversations.db, sessions.db: enable WAL, add message index; update backup.sh Both databases ran with journal_mode=delete — every write rewrote the rollback journal per transaction. WAL eliminates the journal-rewrite and lets readers run without blocking writers. Index on messages(conversation_id, timestamp DESC) is preventive — only 280 rows today, but the access pattern (load conversation history in order) is exactly what a composite index serves, and we don't want to re-revisit this when the table grows. backup.sh updated in the same commit because WAL changes the on-disk layout: a bare `cp` of just the .db file can miss recently-committed transactions that still live in the -wal sidecar, and can race with concurrent writes to produce a torn file. Switched to the SQLite Online Backup API via python3 -c "...src.backup(dst)..." — same mechanism as the sqlite3 CLI's `.backup` (which isn't installed on this host), handles WAL correctly without forcing a checkpoint, and is non-locking from the writer's perspective. Verified backup integrity_check returns ok and row counts match. Note: synchronous=NORMAL was considered but deferred — it's a per-connection PRAGMA, and applying it correctly requires a connect helper that wraps every sqlite3.connect() call site in api.py (~14 sites). Out of scope for this commit; tracked as a follow-up. WAL alone delivers the journal-rewrite elimination and reader/writer concurrency improvements; the additional fsync reduction from synchronous=NORMAL is a smaller marginal win on top. Confirmed via concurrency audit that api.py is the sole writer to both databases. ingest_conversations.py and dream.py are read-only consumers of conversations.db; nothing else touches sessions.db.	2026-05-04 03:24:51 +00:00
aaron	c5fc517fef	ingest_conversations.py: lazy-load embedder to match ingest.py pattern Embedder was instantiated at module import (~30-60s, ~200MB) regardless of whether new conversations existed. On nights with no new content (most nights per the logs), the script paid the load cost and exited immediately. ingest.py:134 already uses lazy loading; this brings the two ingest scripts into a consistent shape.	2026-05-04 03:13:45 +00:00
aaron	b35d44ef58	dream.py: cache the SentenceTransformer embedder across retrieve() calls Pipeline mode calls retrieve() three times (NREM, Early REM, Late REM). Previously each call re-imported and re-instantiated SentenceTransformer ("all-MiniLM-L6-v2"), allocating ~200MB and spending 30-60s on disk->CPU init three times sequentially. lru_cache(maxsize=1) makes the load happen once per process. Expected: pipeline runtime drops ~100-180s, removes 2x redundant 200MB allocations, and reduces transient memory pressure during the same window when other nightly jobs may run.	2026-05-04 03:11:22 +00:00
aaron	a27f22ceaf	api.py: switch whisper to distil-large-v3, beam_size=1, cpu_threads=4 Three changes to reduce voice-note transcription latency on the VPS: - Model: large-v3 -> distil-large-v3 (~6x faster, near-identical English accuracy; language is already hardcoded "en"). - beam_size: 5 (default) -> 1 (~3-4x faster on clean audio). - cpu_threads: 8 -> 4 (the box has 8 cores running api, dreamer, watcher, nextcloud concurrently; ctranslate2's inter-op pool plus context switching makes 4 effectively faster than 8 here). Combined effect expected ~10-15x over prior config. No accuracy regression expected for the voice-note use case (English, clean audio, domain terms already supplied via initial_prompt).	2026-05-04 01:00:32 +00:00
aaron	7c7b649775	embeddings: enforce type/created_at on writers; manifests carry type_distribution (Improvement #2 part B+C) Writers now enforce type and created_at: - encoding.py: ValueError raised at write_embeddings_batch if row dict lacks 'type'. created_at remains SQL-supplied (NOW() server-side). ON CONFLICT DO UPDATE now also rewrites type=EXCLUDED.type and preserves the original created_at via COALESCE(embeddings.created_at, EXCLUDED.created_at) — a re-ingest re-classifies type but does not overwrite a backfilled mtime. - ingest_conversations.py: same assertion. ON CONFLICT intentionally keeps EXCLUDED.created_at semantics (Aaron-AI conversation created_at tracks convo.updated_at; re-runs should refresh). - Column-level NOT NULL is not added; application-layer raise gives a faster, more debuggable failure than a Postgres constraint error. Retrieval propagates type into chunks: - retrieve() SELECT now includes type; chunk dicts carry "type": etype. - WHERE clause built dynamically from excluded_sources and the new --type-filter CLI arg (experimental, default None, pgvector retrieval only — Graphiti chunks have no embeddings.type to filter on). - retrieve_graphiti unchanged; its chunks lack the type field. Manifests carry type_distribution per stage: - dream_pipeline writes stage_data[<stage>]["type_distribution"] for nrem, early_rem, late_rem — a Counter over chunk types, filtering None so Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute the distribution. Pgvector chunks always carry type post-backfill; if None appears, the backfill or writer enforcement has regressed. Verification: B1 force re-ingest of "Finite and infinite games -- James Carse.pdf": all 84 chunks preserved created_at=2026-04-27T06:11:55Z B2 missing-type assertion raises ValueError, no row leaked to embeddings B3 ast.parse(*) clean; EXPLAIN renders for {no excl/no filter, type_filter only, excl 2 elems, excl 1 elem edge case, both}; all five plans use HNSW index scan with correct Filter clauses C1 retrieve("nrem") returns 8 chunks each carrying "type" key C2 type_distribution = {'document': 5, 'chatgpt_conversation': 3} — 2 distinct types, 62.5/37.5 split (looser bar: >=2 types, no single type >=90%) The type and created_at fields are now load-bearing: every dream manifest emits type_distribution per stage. Reverting the backfill makes the distribution show NULLs at every dream run.	2026-05-04 00:15:43 +00:00
aaron	3c7c228db0	embeddings: backfill type and created_at (Improvement #2 part A) Backfills 9,815 type-NULL rows to 'document' (extension classifier, 100% hit) and 12,109 created_at-NULL rows via five batches: C1 filepath_stat: 9,649 filesystem mtime via metadata.filepath C2 watcher_state_unique: 676 unique source-name lookup in watcher_state C3 watcher_state_collision_pick_latest_of_N: 234 collision; most-recent watcher mtime C4 chatgpt_export: 1,548 convo create_time from export JSONs (168/168 distinct convo_ids resolved) C5 sentinel: 2 2026-04-26T00:00:00Z (pgvector migration date) Provenance written to metadata.type_source and metadata.created_at_source on every row changed by this run. type_source is empty on rows where the type field was already populated pre-run; in those cases the snapshot table is the source of truth for what changed. Snapshot: embeddings_backup_2026_05_03 (CREATE TABLE AS SELECT id, type, created_at, metadata FROM embeddings; 14,069 rows; revertable via id-join). Verification: V1 live counts: type_null=0 ca_null=0 V2 spot-check 11 rows across cohorts: provenance correct V3 snapshot intact: 14,069 rows, pre-backfill NULL counts preserved V4 cross-check vs snapshot: reconciles per-provenance to dry-run Read-side use (B + C: writer enforcement + minimal retrieval read) deferred to a separate session. The backfill is complete and verified, but the type and created_at fields are not yet load-bearing — every current reader still ignores them. Without B+C this lands as data prep, not behavior change.	2026-05-03 23:58:53 +00:00
aaron	ed2d090afc	experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3 ) Read-only inspection of the frame data Mistral produces in Stage 2, in service of Track 2 substrate design (Step 2.4 operation set spec). Artifacts: - New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata` (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured fields so worker-version drift is inspectable). - Analysis script: frequency, label-hygiene collisions, per-doc count, co-occurrence (top-K), file-type \u00d7 frame cross-tab, worker-version split, data-gap accounting, corpus-wide coverage. - JSON sidecar for diff-across-runs reproducibility. - Markdown report with explicit Track 2 viability section. Headline findings: - Frames cluster meaningfully on the framed-doc subset (subject to validation on larger samples for the file-type cross-tab). - Only 56% of corpus has frame coverage. 198 conversation sources bypass Stage 2 by design (`ingest_conversations.py` writes directly to embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate; 12 Stage 2 failures. - All 14 voice notes and all 39 dream outputs are in the data gap. Primary capture and self-reflection channels are silent to the frame system. Dreamer cannot frame-condition on its own output. - 54 normalized label collisions (`Professional Experience` vs `Professional_Experience`, etc.) — any router must normalize first. - "Education" is a near-universal frame (36% of frame-extracted docs); cheap 20-doc hand-inspection diagnostic in report \u00a78 to distinguish prompt artifact from corpus shape. - File-type \u00d7 frame stratification is concrete signal that ties to Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of rows. No production code touched. View is droppable; script is read-only.	2026-05-03 20:32:37 +00:00
aaron	e5898f3019	dream.py: replace cumulative cross-night exclusion with session-scoped novelty (Track 1 Finding 1) The cumulative `retrieved_sources` list (capped at 500, trimmed to 400 on overflow) was hiding ~40% of the corpus from Early REM and Late REM after the cap filled. The architecture and reframe both specify session-scoped novelty, not corpus-lifetime exclusion. Same NREM-shape divergence as the 2026-05-02 NREM exclusion fix. Changes: - Drop `previously_retrieved` load; pop the legacy `retrieved_sources` key from `dreamer_state.json` at pipeline start. - Early REM excludes only the current session's NREM high-scorers. - Late REM excludes only the current session's NREM \u222a Early REM. - Remove the across-night accumulation block at the end of the pipeline; reuse the in-scope state object for the post-pipeline metadata write (eliminates a redundant disk re-read that was reintroducing the legacy key). NREM exclusion fix from 2026-05-02 preserved (`nrem_chunks = retrieve("nrem", excluded_sources=None)`). Verification: post-fix dream-manifest source count rose to 24 (NREM 8 + Early REM 8 + Late REM 8) vs. 13 / 16 on the two prior comparable runs. Legacy key absent from `dreamer_state.json` post-run.	2026-05-03 20:32:15 +00:00
aaron	1101bef226	scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11) Consolidates four extract paths and two extract-chunk-embed-write pipelines into a single shared encoding module. Fixes the embedder lifecycle divergence between watcher and /api/reindex (no more 200MB reload per reindex click) and unifies failure tracking so /api/reindex failures now surface in SettingsPanel "Ingest Health". New files: - scripts/encoding.py — extract_text, chunk_text, chunk_and_embed, write_embeddings_batch - scripts/failures.py — record_ingest_failure, resolve_ingest_failure (shared by watcher.py and ingest.py) Refactored: - scripts/watcher.py — drops local extract/chunk/embed implementations and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding and failures. Now writes ingest_failures row on empty-text-extract (was silent return 0). - scripts/ingest.py — substantial rewrite. Exposes ingest_directory(folder, embedder=None) for in-process invocation; CLI back-compat preserved via ingest_folder wrapper. Module-level SentenceTransformer load removed. - scripts/corpus_integrity.py — imports extract_text from encoding; extract_text_for_retry function removed. - scripts/api.py — /api/reindex rewritten with BackgroundTasks (uses module-level embedder; no subprocess); new /api/reindex/status endpoint reading ~/aaronai/reindex_status.json; /api/corpus/retry imports extract_text from encoding; INGEST_SCRIPT constant removed (dead after this refactor); 409 reentrance guard prevents double-click stomping. Behavior changes: - /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks threadpool, doesn't block API thread. - /api/reindex no longer reloads SentenceTransformer on each click. - /api/reindex failures newly write to ingest_failures (visible in SettingsPanel "Ingest Health" — badge will jump on first reindex). - New embeddings rows always have created_at = NOW() (canonical, server-side). - New embeddings rows always include metadata.folder field (None when not derivable). - /api/reindex returns 409 on second click while a job is running. - New /api/reindex/status endpoint for polling. Existing 9,815 NULL created_at rows remain unchanged; backfill is a separate decision if desired. 199 insertions, 256 deletions across 6 files (codebase shrinks net). Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11). Pre-commit verification: BackgroundTasks already imported, sys.path resolves correctly via script-path semantics, static import clean.	2026-05-03 01:40:47 +00:00
aaron	a317df66f8	dream: factor prompts into module-level templates, repair prompt_hash (Track 1 Finding 11) prompt_hash() in dream.py was hashing function __doc__ strings, but the synth functions don't have docstrings, so the hash was always MD5("") = d41d8cd9 for every dream. The manifest field meant to detect undeclared prompt drift carried no useful information. Refactor: - Each synth function's prompt template moved to a module-level constant (NREM_PROMPT_TEMPLATE, EARLY_REM_PROMPT_TEMPLATE, LATE_REM_PROMPT_TEMPLATE, SYNTHESIS_PROMPT_TEMPLATE, LUCID_PROMPT_TEMPLATE) using str.format() placeholders instead of f-string interpolation. - Synth functions call TEMPLATE.format(...) at use time. Output is byte- identical to the previous f-string implementation. - prompt_hash() now hashes the four pipeline template constants (lucid is on-demand, not part of the nightly manifest — preserves prior scope). - LUCID_DEFAULT_TASK extracted as a named constant from the lucid fallback question (factoring only, no behavior change). - PROMPT_VERSION_* constants and synth function signatures untouched. - v1.1 register-shift comment in synthesize_early_rem preserved inline. The post-fix hash will differ from d41d8cd9 (verified: b65695a1 in static test). Historical manifests still carry d41d8cd9; the discontinuity is intentional — pre-fix hashes were equally meaningless and faking continuity would be worse than acknowledging the break. Found by Track 1 inventory 2026-05-02 (Finding 11 / divergence #11). Verified static import + hash determinism before commit.	2026-05-03 00:24:21 +00:00
aaron	4b520b2bc2	api.py: minor cleanups (Track 1 inventory findings) - Fix /auth/check endpoint that referenced undefined SESSIONS (Phase 1 finding — would NameError 500 on every call). Now uses session_exists(token), the live session-validation mechanism defined elsewhere in api.py. - Remove unused DB_PATH ChromaDB-era constant (paired with the ChromaDB directory deletion and aaronai-maintenance.service removal earlier this session). Found by Track 1 inventory 2026-05-02. Cross-repo verification of share_time (third candidate from the original cleanup proposal) revealed it is working stores-and-returns persistence rather than dead code; share_time intentionally not modified. Inventory document edits are committed separately under the docs/ tracking decision.	2026-05-02 23:59:20 +00:00
aaron	7bebd8ae50	api.py: wire up dream_mode setting (Track 1 Finding 9) The dream_mode setting was defined in DEFAULT_SETTINGS and watched by update_settings for reschedule, but run_dream_job never read it — silently-ignored configuration. Two changes: 1. DEFAULT_SETTINGS["dream_mode"] flipped from "nrem" to "pipeline". The default was a latent regression vector: wiring up the setting without changing the default would have silently switched all default-config users from full-pipeline (current production behavior) to NREM-only nightly runs. 2. run_dream_job reads dream_mode at fire-time, validates against {"pipeline", "nrem", "early-rem", "late-rem"}, falls back to pipeline with a warning on invalid values. Lucid intentionally excluded — it is on-demand only by design and remains available via CLI and /api/dreamer/run. Nightly dream production behavior is unchanged for current users (no settings.json key → default "pipeline" → no flag passed → same as before). Users can now meaningfully change the nightly mode by editing settings.json or via the SettingsPanel. Found by Track 1 inventory 2026-05-02 (Finding 9 / divergence #9).	2026-05-02 23:38:29 +00:00
aaron	3f7fba7e0e	scripts/: separate production from experimental and deprecated Moves 28 experiment scripts to scripts/experiments/ (E1, E1.4, E1.6, E2, base_class, cascade, cost_test, briefing, consistency, token series). Moves 2 dissolved-layer scripts to scripts/deprecated/ (consolidator_v0_1.py, tier1_migration.py — under the bespoke decision both target retired substrate work). Removes 19 .bak* files from disk (gitignored, never tracked; git history is the durable record of every prior version). The 11 production scripts remain in scripts/. All systemd ExecStart paths, api.py subprocess calls, and cron jobs continue to resolve correctly — verified by grep against /etc/systemd/system/aaronai-*.service, scripts/ references in api.py, and the user crontab. Track 1 inventory cross-cutting finding: scripts/ mixed 11 production files with 32 experimental scripts and ~20 .bak files. After this commit a clean-room reader can identify the live workers from a directory listing alone. Found by Track 1 inventory 2026-05-02. See ~/aaronai/docs/scripts-reorg-plan-2026-05-02.md for full reasoning. After commit, run: 1. git log --oneline -3 — show the new commit on top 2. git status — confirm clean working tree (modulo the docs/ untracked files which are intentional)	2026-05-02 23:28:24 +00:00
aaron	6f2d274d5d	api.py: remove 50KB truncation from /api/corpus/retry (completes F14) The F14 fix on 2026-05-01 removed text[:50000] truncation from watcher.py, ingest.py, and corpus_integrity.py. The retry endpoint in api.py was missed — clicking 'Retry' on an ingest-failed file in the SettingsPanel re-introduced the exact truncation pattern F14 was meant to eliminate. Found by Track 1 inventory 2026-05-02 (Finding 2 / divergence #2).	2026-05-02 22:56:33 +00:00
aaron	7615dedf9e	dream: NREM does not exclude prior traces NREM in the reframe is replay-and-consolidation of recent encoded content. Excluding previously_retrieved sources turns NREM into novelty-finding, which is Late REM's job. NREM should re-traverse already-encoded content; that's what consolidation is. The May 2 abort surfaced this — 52 sources accumulated in the exclusion list, all of them in NREM's similarity band for the recurring research/fabrication/teaching query. The dreamer hit zero retrievable chunks not because the corpus was empty, but because everything semantically aligned was excluded. Late REM and Early REM keep the exclusion mechanism — novelty is their job. Session-scoped exclusion (nrem_high_sources flowing into Early REM) also preserved. The 500/400 trim on retrieved_sources is preserved for the remaining stages that still use it.	2026-05-02 21:33:49 +00:00
aaron	1a8e0353f5	stage3_worker: v2.2 — absolute sudo/systemctl paths, error logging, reset failure counter on recovery failure Mirrors stage2_worker v2.1 (`da98019`) resilience fixes: - Absolute paths for /usr/bin/sudo and /bin/systemctl - Log stdout/stderr when sidecar restart fails - Reset consecutive_failures even when wedge recovery fails (prevents permanent stuck state if restart itself is broken)	2026-05-01 18:40:25 +00:00
aaron	da980193dd	stage2_worker: v2.1 — terminal failure states + sudo path fix Three classes of silent failure converted to clean terminal states: - Mistral timeout: previously left rows in zombie state (started_at set, failed_at null, attempts incremented past retry threshold, row invisible to selection query). Now sets failed_at with reason 'mistral_timeout_after_300s'. Surfaced 2026-05-01 when 17 documents accumulated in this state during the Stage 3 saga deadlock incident. - Mistral parse failure: run_mistral returns {'error': 'parse_failed'} on JSON decode failure but process_one wasn't checking, so empty orientation ('Active frames: . Frame relationships: ...') was shipped to Stage 3. This is F22 from the 2026-04-30 code review. Now sets failed_at with reason 'mistral_parse_failure'. - Wedge recovery hammering: consecutive_failures was only reset on successful Ollama restart. With the sudo path bug (also fixed here), recovery always failed, so every subsequent failure re-attempted restart. Now resets the counter regardless and logs the failure visibly. Also: subprocess.run now uses absolute paths (/usr/bin/sudo, /bin/systemctl) instead of relying on PATH, fixing the 'No such file or directory: sudo' error that broke Stage 2's recover_wedge() since deployment. F45-adjacent — sudoers entries were added 2026-05-01 but the PATH issue was masking that fix. Worker version bumped to 2.1 to match Stage 3's resilience patch level.	2026-05-01 17:28:53 +00:00
aaron	b936931668	Stage 3 worker v2.1 — saga-size limit + wedge detection + sudoers fixes Production incident 2026-05-01: F14 re-cascade attempt surfaced three compounding issues in cascade resilience. stage3_worker.py changes: - MAX_CHUNKS_PER_SAGA=10 — large documents split into multiple bulk commits, all sharing the same saga tag for Graphiti document linking. Original implementation sent all chunks as one saga; 17-19 chunk sagas deadlocked sidecar's Python-side coordination. - recover_wedge() function — restarts aaronai-graphiti.service when consecutive_failures hits threshold. Mirrors Stage 2 pattern. - run() loop adds consecutive_failures counter with threshold-2 escalation. Resolves F28 + F29 from code review. - Worker version bumped 2.0 -> 2.1. - post_bulk() helper extracts shared HTTP POST + error handling. Outside-repo changes (system config, separately documented): - WatchdogSec=600 commented in stage2 + stage3 systemd unit files. Workers have no sd_notify support; per-request timeouts in code handle the actual failure modes. - /etc/sudoers.d/aaron-aaronai created with NOPASSWD entries for systemctl restart ollama and restart aaronai-graphiti.service. Stage 2's existing recover_wedge() was silently broken since deployment due to this gap. .gitignore — added rules for *.bak files, runtime artifacts (watcher_heartbeat, dreamer_state.json, corpus_integrity_report.json, watcher_state.json, watcher_status.json), Python cruft, virtual env, .env, editor/OS files, and Aaron AI runtime data (conversations.db, sessions.db, memory.md, settings.json). Untracked 11 files that shouldn't have been committed in `465f2f7` (this morning): backup files and runtime artifacts. Re-cascading Shop Class (414KB) and BirdAI-Experiments-Log.md (192KB) through the patched worker after re-extracting full text from disk. Cascade in progress at commit time.	2026-05-01 05:18:09 +00:00
aaron	465f2f725b	Code review fixes: CV pinning, F1 (excluded_sources), F14 (50KB truncation), F37 - api.py: strip CV pinning workaround (parity violation, see architecture doc) - dream.py: F1 — retrieve_graphiti() now accepts excluded_sources, over-fetches 3x and filters in-process. Was silently dropping the parameter; would have confounded E3 with broken cross-stage exclusion in Graphiti arm. - watcher.py + ingest.py: F14 — drop full_text[:50000] truncation. Was propagating through entire cascade. Postgres TEXT can hold up to 1GB. - corpus_integrity.py: F37 — same truncation, third path now clean. Backups: api.py.bak., dream.py.bak., watcher.py.bak., ingest.py.bak., corpus_integrity.py.bak.* timestamped pre-fix. Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected by F14, 414KB).	2026-05-01 02:26:37 +00:00
aaron	25e42c0231	corpus_integrity.py: write unreadables with retry_count=0 so OCR can retry when it ships	2026-04-30 22:03:48 +00:00
aaron	7822fb1cc1	corpus_integrity.py: write unreadable files to ingest_failures for UI visibility	2026-04-30 21:59:06 +00:00
aaron	74e2c34f43	corpus integrity: ingest_failures tracking in watcher, reconciliation script, corpus status/retry/reconcile endpoints	2026-04-30 21:54:39 +00:00

1 2 3

102 Commits