aaronAI

Author	SHA1	Message	Date
aaron	10bb29290a	watcher: handle deletes; sweep_orphans cleans existing phantom chunks watcher.py now listens for on_deleted events and treats on_moved destinations that fall outside NEXTCLOUD_PATH (Nextcloud trashbin, moves to other volumes) as deletes. Both cases call delete_embeddings_for_path (DELETE WHERE metadata.filepath = ...) and remove_from_state to drop the file from watcher_state.json so it isn't carried as known-mtime. Match is by metadata.filepath, not source basename, so files that share a name across folders don't collide. scripts/sweep_orphans.py is the one-time cleanup for chunks the watcher missed before this fix: - Modern pass: rows with metadata.filepath whose file no longer exists. - Legacy pass: rows with NULL filepath and type='document' whose basename isn't anywhere on disk. type='document' restriction skips conversations and memory snapshots (synthetic sources, not files on disk). First run cleaned 629 rows: 628 from moved-file duplicates (e.g., BirdAI docs that traveled across Journal/, Library/, Journal/Projects/BirdAI/) plus the AARON_NELSON_BIO.pdf phantom Aaron flagged.	2026-05-20 02:52:00 +00:00
aaron	fda61ad622	api.py: save_document tool — pandoc render to Nextcloud Drafts/ via WebDAV Claude can now write docx or pdf files to Aaron's Nextcloud Drafts/ when he asks for a document (bio, cover letter, statement, CV section) rather than chat text. Pandoc handles markdown -> docx and markdown -> pdf with the xelatex engine. Upload is a WebDAV PUT against the same Nextcloud instance dream.py already uses; NEXTCLOUD_URL / NEXTCLOUD_USER / NEXTCLOUD_PASSWORD in .env are reused. MKCOL ensures Drafts/ exists; PROPFIND-based collision check appends _2, _3, ... until unique. Filename sanitization strips path components and unsafe characters. System prompt instructs Claude to call save_document when the user wants a file (not chat text) and not to duplicate the file contents in the chat response — just write the file and tell Aaron where it landed. ingest.py and watcher.py now skip files under Drafts/ at ingest time so generated drafts don't pollute future retrieval. Drafts can still be opened, edited, and shipped; they just don't become part of the searchable corpus unless Aaron explicitly moves them out of Drafts/.	2026-05-20 00:41:26 +00:00
aaron	9955c7e383	encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak extract_blocks(filepath) is the new structured-extraction entry point, returning list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading prepended for retrieval context and stored in metadata). - pptx: one block per slide. Slide title becomes block heading; speaker notes fold into the body. Image-only decks with title-only slides now produce heading-only chunks instead of being recorded as extraction failures. - docx: deliberately single-block (back-compat). Heading-style section detection was implemented and rolled back: hand-formatted CVs are Normal-styled with bold-as-heading, and tying chunk boundaries to formatting choices would lock future-user into preserving those choices forever. Lexical + cross-encoder retrieval already handles substring matching inside blind-chunked CVs. - pdf/txt/md: unchanged (single block, blind chunking). Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it as secondary sort key in _rerank so memory/journal snapshots prefer the latest copy among near-duplicate content. reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a subset; previous hardcoded delete regex would have wiped both even with a single-ext target.	2026-05-19 21:58:25 +00:00
aaron	e38d283e59	watcher.py: exclude 3 image-only pptx files from ingestion Three files in the original ingest_failures cohort have been characterized via direct OCR and confirmed to lack ingestible text: - Presentations/Renders.pptx — 35 PICTURE-shape renders, 33/35 zero-char on OCR, 2 with noise (20 and 29 chars). - Presentations/Ribbon Cutting Slideshow.pptx — 10-slide event photo deck, 9/10 zero-char, 1 with 17 chars of noise. - Academic/DDF555 3D Computational/GH Slicer Notes [Autosaved].pptx — Office autosave duplicate of GH Slicer Notes.pptx; first 9 images byte-identical (sha256) to the canonical file. 2 net-new images contribute 36 noisy chars. Excluding to prevent double-embedding the same content under two source filenames. Pattern matches `f18fb64` (path.parts membership). Folder-level globs were considered and rejected: /Presentations/ contains successfully embedded text-bearing decks (aaronnelson_3D 4D.pptx, aaronnelson_slideslam.pptx). Exact-name + parent-folder membership applied in both watcher filter sites (get_changed_files and IngestHandler._should_ignore). The fourth file in the cohort, GH Slicer Notes.pptx (the canonical non-autosave version), was confirmed to carry 379 chars of real text (Grasshopper UI / code samples) across 6/9 images. It remains in ingest_failures unresolved, awaiting the eventual ocrmypdf backlog pass. Cleanup: 3 ingest_failures rows resolved (the excluded files). Unresolved count: 94 → 91. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 01:42:40 +00:00
aaron	b9eea6cb62	watcher.py: extend lockfile filter to catch UTF-8-mangled ~$ prefixes Three rows in ingest_failures were Office lockfile leftovers whose filename starts with ~� (~ followed by the UTF-8 replacement character) instead of ~$. Somewhere in the Nextcloud sync chain the $ byte was lost or replaced; the file now lives on disk as a real file with this corrupted name. The watcher's ("~$", ".") prefix filter didn't match, so each cycle tried to ingest these as pptx, hit BadZipFile inside python-pptx (lockfiles aren't real Office documents), and they ended up permanently in ingest_failures. Three filter sites in watcher.py applied the lockfile prefix check: - ingest_file() at :127 - get_changed_files() at :200 - IngestHandler._should_ignore() at :290 All three now match ("~$", "~", ".") — broadened to catch any tilde prefix, not just ~$. The cross-check against pgvector embeddings and disk found zero legitimate tilde-prefixed files in the corpus, so the broader filter has no false-positive risk in this corpus. Cleanup: 3 ingest_failures rows resolved (filepath LIKE '%/~%'). Unresolved count drops 97 → 94. If a fourth filter site is ever added, the right shape is consolidating the lockfile prefix check to a shared function or constant. Three parallel sites with three different tuple orderings is acceptable for now but worth normalizing if the surface grows.	2026-05-04 16:19:56 +00:00
aaron	f18fb64fe5	watcher.py: exclude generative-graphic folders and zero-byte files Two-sample diagnostic of the 128 ingest_failures rows surfaced two folders whose contents are exclusively non-text PDFs (iText-produced generative graphics from Processing sketches and computational design sketches) and three zero-byte test artifacts. None of these have ever produced an embedding chunk, and they have nothing extractable to contribute. Excluding them removes 19 / 128 (15%) of the locked-out failures from the cohort and prevents future versions of the same patterns from re-failing. Folder exclusions use path.parts membership rather than substring matching — eliminates false-match risk if similarly-named folders appear elsewhere in the corpus (e.g. an unrelated "Generative Design" or "Computational Design 2017" directory created later). The existing "Admin/Backups" / "Journal/Media" substring checks are looser, but new exclusions take the tighter pattern. Zero-byte filter goes in get_changed_files() only — the actual ingestion gate. Adding stat() to _should_ignore() (the FS-event noise filter) would introduce a race where the file is gone between event fire and stat call. Empty files briefly trigger pending=True but produce no work after debounce; cosmetic only. Cleanup applied separately via UPDATE: 19 ingest_failures rows for these paths marked resolved=TRUE. Unresolved-failure count: 129 -> 110. Verified: get_changed_files() with empty state returns 1418 changed files; all 5 excluded probes (2 folder-matched + 3 zero-byte) absent from the result, control file present. Watcher service restarted clean; startup scan reports no missed files.	2026-05-04 06:24:08 +00:00
aaron	72e07afc03	watcher.py: do not mark failed ingests as successfully ingested ingest_files() updated state[path] = mtime unconditionally after every ingest_file() call. ingest_file() returns 0 when text extraction fails, embedding fails, no chunks are produced, or the pgvector write fails — in every one of those cases, the path was still recorded as ingested at the current mtime. On the next pass, get_changed_files() saw the mtime match and skipped the file, locking it out of the corpus until something modified it on disk. record_ingest_failure() writes to a UI-visible failures table, but nothing reads that table to retry. So failures accumulated silently: the file was simultaneously logged as failed AND tracked in watcher_state as up-to-date, and the second condition won. Fix: only update watcher_state when ingest_file returns count > 0. Failed ingests will be retried on the next watcher cycle until they succeed or are explicitly excluded. Diagnostic at fix time: 129 rows in ingest_failures, 128 currently locked out of the corpus (filepath in watcher_state with mtime matching current disk). 128/129 are text_extraction failures, mostly scanned PDFs (106 .pdf, 13 .docx, 7 .pptx, 2 .md, 1 .txt). 1 source no longer exists on disk. 0 have had their disk mtime change since failing — i.e. without this fix, none of them would ever retry. Cross-check shows watcher_state has 1466 paths vs. 1061 distinct sources in pgvector embeddings, leaving a residual silent-gap of ~276 files after accounting for failures. Historical cleanup of files already locked out by this bug is tracked separately. New failures from this commit forward will retry naturally.	2026-05-04 03:52:01 +00:00
aaron	1101bef226	scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11) Consolidates four extract paths and two extract-chunk-embed-write pipelines into a single shared encoding module. Fixes the embedder lifecycle divergence between watcher and /api/reindex (no more 200MB reload per reindex click) and unifies failure tracking so /api/reindex failures now surface in SettingsPanel "Ingest Health". New files: - scripts/encoding.py — extract_text, chunk_text, chunk_and_embed, write_embeddings_batch - scripts/failures.py — record_ingest_failure, resolve_ingest_failure (shared by watcher.py and ingest.py) Refactored: - scripts/watcher.py — drops local extract/chunk/embed implementations and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding and failures. Now writes ingest_failures row on empty-text-extract (was silent return 0). - scripts/ingest.py — substantial rewrite. Exposes ingest_directory(folder, embedder=None) for in-process invocation; CLI back-compat preserved via ingest_folder wrapper. Module-level SentenceTransformer load removed. - scripts/corpus_integrity.py — imports extract_text from encoding; extract_text_for_retry function removed. - scripts/api.py — /api/reindex rewritten with BackgroundTasks (uses module-level embedder; no subprocess); new /api/reindex/status endpoint reading ~/aaronai/reindex_status.json; /api/corpus/retry imports extract_text from encoding; INGEST_SCRIPT constant removed (dead after this refactor); 409 reentrance guard prevents double-click stomping. Behavior changes: - /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks threadpool, doesn't block API thread. - /api/reindex no longer reloads SentenceTransformer on each click. - /api/reindex failures newly write to ingest_failures (visible in SettingsPanel "Ingest Health" — badge will jump on first reindex). - New embeddings rows always have created_at = NOW() (canonical, server-side). - New embeddings rows always include metadata.folder field (None when not derivable). - /api/reindex returns 409 on second click while a job is running. - New /api/reindex/status endpoint for polling. Existing 9,815 NULL created_at rows remain unchanged; backfill is a separate decision if desired. 199 insertions, 256 deletions across 6 files (codebase shrinks net). Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11). Pre-commit verification: BackgroundTasks already imported, sys.path resolves correctly via script-path semantics, static import clean.	2026-05-03 01:40:47 +00:00
aaron	465f2f725b	Code review fixes: CV pinning, F1 (excluded_sources), F14 (50KB truncation), F37 - api.py: strip CV pinning workaround (parity violation, see architecture doc) - dream.py: F1 — retrieve_graphiti() now accepts excluded_sources, over-fetches 3x and filters in-process. Was silently dropping the parameter; would have confounded E3 with broken cross-stage exclusion in Graphiti arm. - watcher.py + ingest.py: F14 — drop full_text[:50000] truncation. Was propagating through entire cascade. Postgres TEXT can hold up to 1GB. - corpus_integrity.py: F37 — same truncation, third path now clean. Backups: api.py.bak.*, dream.py.bak.*, watcher.py.bak.*, ingest.py.bak.*, corpus_integrity.py.bak.* timestamped pre-fix. Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected by F14, 414KB).	2026-05-01 02:26:37 +00:00
aaron	74e2c34f43	corpus integrity: ingest_failures tracking in watcher, reconciliation script, corpus status/retry/reconcile endpoints	2026-04-30 21:54:39 +00:00
aaron	f11cacd9c9	add experiment scripts and results; watcher.py latest changes	2026-04-30 18:06:03 +00:00
aaron	2b3c2380a0	watcher.py: in-process ingest, embedder loaded once at startup, startup recovery, heartbeat, no duplicate logging	2026-04-30 16:42:44 +00:00
aaron	037d747573	chore: archive deprecated chromadb and migration scripts	2026-04-28 00:15:46 +00:00
aaron	d3239aba17	Image capture — extend /api/capture for image+voice, Claude vision description, Media/ WebDAV, watcher excludes Media/	2026-04-27 04:28:31 +00:00
aaron	187d31eaff	Fix watcher status indicator — write status file every 5s, API reads it directly	2026-04-25 16:58:19 +00:00
aaron	d765f9398b	Fix watcher timeout loop — exclude Backups folder, increase timeout to 30min	2026-04-25 16:44:13 +00:00
aaron	22ef40bbaa	Initial commit - Aaron AI v1	2026-04-25 02:05:42 +00:00

17 Commits
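
The watcher rules described in commits f18fb64fe5, b9eea6cb62, and 72e07afc03 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in watcher.py: the names should_ignore, EXCLUDED_FOLDERS, IGNORED_PREFIXES, and the get_mtime parameter are hypothetical, and "Generative Design" is only a placeholder folder name. The ingest_file callable stands in for the real function, which per the log returns the number of chunks written (0 on failure).

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Placeholder values for illustration only; the real lists live in watcher.py.
EXCLUDED_FOLDERS = {"Generative Design"}
IGNORED_PREFIXES = ("~$", "~", ".")  # lockfiles, mangled lockfiles, dotfiles


def should_ignore(path: Path) -> bool:
    """Return True for lockfile/dotfile names and files under an excluded folder."""
    if path.name.startswith(IGNORED_PREFIXES):
        return True
    # Compare whole path components rather than substrings, so a folder such
    # as "My Generative Design Notes" is not excluded by accident.
    return any(part in EXCLUDED_FOLDERS for part in path.parts[:-1])


def ingest_files(
    paths: Iterable[Path],
    state: Dict[str, float],
    ingest_file: Callable[[Path], int],
    get_mtime: Callable[[Path], float] = lambda p: p.stat().st_mtime,
) -> List[Path]:
    """Ingest each path; record its mtime only if at least one chunk was written.

    Returns the paths that produced no chunks. They stay out of `state`, so
    the next cycle sees them as changed and retries them.
    """
    failed: List[Path] = []
    for path in paths:
        if should_ignore(path):
            continue
        count = ingest_file(path)
        if count > 0:
            state[str(path)] = get_mtime(path)
        else:
            failed.append(path)
    return failed
```

With this shape, a file that fails extraction is never recorded as up to date, so it keeps being retried until it succeeds or is added to an exclusion list.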