aaronAI

Author	SHA1	Message	Date
aaron	9bb083f065	chat: cap retrieve_documents per turn, truncate displayed citations, broaden lock-file skip - MAX_RETRIEVALS_PER_TURN (5): after five retrieve_documents calls in a single turn, further calls return a budget-exhausted message instead of executing. Caps cost on runaway multi-query loops without forbidding compound questions. - MAX_CITED_SOURCES (5): accumulated_sources was growing to 14+ entries across multiple tool calls and showing chunks Claude never actually used. Cap the list returned to the UI at 5, preserving insertion order so the highest-relevance early-call results survive. Proper fix (Claude-driven inline citations) is bigger work, noted for later. - ingest.py lock-file skip: changed prefix tuple from ("~$", ".") to ("~", ".") so it catches Office lock files even when Nextcloud's filesystem encoding has mangled the "$" into a unicode replacement char. Matches what watcher.py already does.	2026-05-20 02:22:54 +00:00
aaron	fda61ad622	api.py: save_document tool — pandoc render to Nextcloud Drafts/ via WebDAV Claude can now write docx or pdf files to Aaron's Nextcloud Drafts/ when he asks for a document (bio, cover letter, statement, CV section) rather than chat text. Pandoc handles markdown -> docx and markdown -> pdf with the xelatex engine. Upload is a WebDAV PUT against the same Nextcloud instance dream.py already uses; NEXTCLOUD_URL / NEXTCLOUD_USER / NEXTCLOUD_PASSWORD in .env are reused. MKCOL ensures Drafts/ exists; PROPFIND-based collision check appends _2, _3, ... until unique. Filename sanitization strips path components and unsafe characters. System prompt instructs Claude to call save_document when the user wants a file (not chat text) and not to duplicate the file contents in the chat response — just write the file and tell Aaron where it landed. ingest.py and watcher.py now skip files under Drafts/ at ingest time so generated drafts don't pollute future retrieval. Drafts can still be opened, edited, and shipped; they just don't become part of the searchable corpus unless Aaron explicitly moves them out of Drafts/.	2026-05-20 00:41:26 +00:00
aaron	9955c7e383	encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak extract_blocks(filepath) is the new structured-extraction entry point, returning list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading prepended for retrieval context and stored in metadata). - pptx: one block per slide. Slide title becomes block heading; speaker notes fold into the body. Image-only decks with title-only slides now produce heading-only chunks instead of being recorded as extraction failures. - docx: deliberately single-block (back-compat). Heading-style section detection was implemented and rolled back: hand-formatted CVs are Normal-styled with bold-as-heading, and tying chunk boundaries to formatting choices would lock future-user into preserving those choices forever. Lexical + cross-encoder retrieval already handles substring matching inside blind-chunked CVs. - pdf/txt/md: unchanged (single block, blind chunking). Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it as secondary sort key in _rerank so memory/journal snapshots prefer the latest copy among near-duplicate content. reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a subset; previous hardcoded delete regex would have wiped both even with a single-ext target.	2026-05-19 21:58:25 +00:00
aaron	1101bef226	scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11) Consolidates four extract paths and two extract-chunk-embed-write pipelines into a single shared encoding module. Fixes the embedder lifecycle divergence between watcher and /api/reindex (no more 200MB reload per reindex click) and unifies failure tracking so /api/reindex failures now surface in SettingsPanel "Ingest Health". New files: - scripts/encoding.py: extract_text, chunk_text, chunk_and_embed, write_embeddings_batch - scripts/failures.py: record_ingest_failure, resolve_ingest_failure (shared by watcher.py and ingest.py) Refactored: - scripts/watcher.py: drops local extract/chunk/embed implementations and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding and failures. Now writes ingest_failures row on empty-text-extract (was silent return 0). - scripts/ingest.py: substantial rewrite. Exposes ingest_directory(folder, embedder=None) for in-process invocation; CLI back-compat preserved via ingest_folder wrapper. Module-level SentenceTransformer load removed. - scripts/corpus_integrity.py: imports extract_text from encoding; extract_text_for_retry function removed. - scripts/api.py: /api/reindex rewritten with BackgroundTasks (uses module-level embedder; no subprocess); new /api/reindex/status endpoint reading ~/aaronai/reindex_status.json; /api/corpus/retry imports extract_text from encoding; INGEST_SCRIPT constant removed (dead after this refactor); 409 reentrance guard prevents double-click stomping. Behavior changes: - /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks threadpool, doesn't block API thread. - /api/reindex no longer reloads SentenceTransformer on each click. - /api/reindex failures newly write to ingest_failures (visible in SettingsPanel "Ingest Health"; badge will jump on first reindex). - New embeddings rows always have created_at = NOW() (canonical, server-side). - New embeddings rows always include metadata.folder field (None when not derivable). - /api/reindex returns 409 on second click while a job is running. - New /api/reindex/status endpoint for polling. Existing 9,815 NULL created_at rows remain unchanged; backfill is a separate decision if desired. 199 insertions, 256 deletions across 6 files (codebase shrinks net). Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11). Pre-commit verification: BackgroundTasks already imported, sys.path resolves correctly via script-path semantics, static import clean.	2026-05-03 01:40:47 +00:00
aaron	465f2f725b	Code review fixes: CV pinning, F1 (excluded_sources), F14 (50KB truncation), F37 - api.py: strip CV pinning workaround (parity violation, see architecture doc) - dream.py (F1): retrieve_graphiti() now accepts excluded_sources, over-fetches 3x and filters in-process. Was silently dropping the parameter; would have confounded E3 with broken cross-stage exclusion in Graphiti arm. - watcher.py + ingest.py (F14): drop full_text[:50000] truncation. Was propagating through entire cascade. Postgres TEXT can hold up to 1GB. - corpus_integrity.py (F37): same truncation, third path now clean. Backups: api.py.bak.*, dream.py.bak.*, watcher.py.bak.*, ingest.py.bak.*, corpus_integrity.py.bak.* (timestamped pre-fix). Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected by F14, 414KB).	2026-05-01 02:26:37 +00:00
aaron	2fb50cce71	ingest.py: guard Stage 2 enqueue behind SKIP_STAGE2_ENQUEUE env var for migration runs	2026-04-30 16:20:11 +00:00
aaron	2b9a1782c1	feat: stage2/3 pipeline, taxonomy-free cascade, E1.8/E4 experiments, corpus migration state	2026-04-30 04:04:31 +00:00
aaron	037d747573	chore: archive deprecated chromadb and migration scripts	2026-04-28 00:15:46 +00:00
aaron	6776637178	Remove hardcoded PG password fallbacks — require PG_DSN env var in all scripts	2026-04-27 05:16:37 +00:00
aaron	f78b83042b	Migrate to pgvector — remove ChromaDB from api.py, ingest scripts, dream.py	2026-04-26 21:16:04 +00:00
aaron	d2eed98906	Pre-pgvector migration checkpoint — upsert, allow_replace_deleted, maintenance timer	2026-04-26 20:19:49 +00:00
aaron	22ef40bbaa	Initial commit - Aaron AI v1	2026-04-25 02:05:42 +00:00
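
Commit 9bb083f065 describes two per-turn caps: a retrieval budget that answers with a budget-exhausted message after five retrieve_documents calls, and a citation cap that keeps only the first five accumulated sources in insertion order. The sketch below is one way such caps could look. Only MAX_RETRIEVALS_PER_TURN and MAX_CITED_SOURCES come from the commit message; TurnBudget, run_retrieval, cap_sources, and the shape of the returned dict are hypothetical names chosen for illustration, not the code in api.py.

```python
from typing import Callable

MAX_RETRIEVALS_PER_TURN = 5  # from the commit message
MAX_CITED_SOURCES = 5        # from the commit message


class TurnBudget:
    """Hypothetical per-turn counter for retrieve_documents calls."""

    def __init__(self, limit: int = MAX_RETRIEVALS_PER_TURN) -> None:
        self.limit = limit
        self.used = 0

    def try_consume(self) -> bool:
        # Refuse once the limit is reached; otherwise count the call.
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def run_retrieval(budget: TurnBudget, query: str,
                  retrieve: Callable[[str], list]) -> dict:
    """Run one retrieval, or return a budget-exhausted message instead."""
    if not budget.try_consume():
        return {"ok": False,
                "message": "retrieval budget exhausted for this turn"}
    return {"ok": True, "results": retrieve(query)}


def cap_sources(accumulated_sources: list,
                limit: int = MAX_CITED_SOURCES) -> list:
    """Keep the earliest entries so early-call results survive the cap."""
    return list(accumulated_sources)[:limit]


if __name__ == "__main__":
    budget = TurnBudget()
    for i in range(7):
        outcome = run_retrieval(budget, f"query {i}", lambda q: [q])
        print(i, outcome["ok"])
    print(cap_sources([f"source-{n}" for n in range(14)]))
```

Returning a message rather than raising lets the model see that the budget is spent and answer from what it already has, which fits the stated goal of capping cost without forbidding compound questions.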

12 Commits
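
Commit fda61ad622 mentions two filename rules for save_document: sanitization that strips path components and unsafe characters, and a collision check that appends _2, _3, ... until the name is unique. A minimal sketch of both follows, under stated assumptions. The function names, the allowed-character set, and the "untitled" fallback are hypothetical, and the exists callback stands in for the PROPFIND lookup against Nextcloud, so no network call is made here.

```python
import re
from pathlib import PurePosixPath
from typing import Callable


def sanitize_filename(name: str) -> str:
    """Drop directory components, then replace characters outside a safe set.

    The safe set (letters, digits, dot, underscore, space, hyphen) is an
    assumption; the commit only says unsafe characters are stripped.
    """
    base = PurePosixPath(name.replace("\\", "/")).name
    cleaned = re.sub(r"[^A-Za-z0-9._ -]", "_", base).strip(" .")
    return cleaned or "untitled"


def unique_name(name: str, exists: Callable[[str], bool]) -> str:
    """Append _2, _3, ... before the extension until exists() is False."""
    if not exists(name):
        return name
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension: rpartition puts the whole name in ext
        stem, ext = name, ""
    n = 2
    while True:
        candidate = f"{stem}_{n}.{ext}" if dot else f"{stem}_{n}"
        if not exists(candidate):
            return candidate
        n += 1


if __name__ == "__main__":
    taken = {"bio.docx", "bio_2.docx"}  # pretend Drafts/ already holds these
    safe = sanitize_filename("../../Drafts/bio.docx")
    print(unique_name(safe, lambda candidate: candidate in taken))
```

In a real upload the check and the PUT are separate requests, so two concurrent saves could still pick the same name; the sketch ignores that race.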