From 72e07afc03ec2439fde47e72d17afea358bb69b4 Mon Sep 17 00:00:00 2001
From: Aaron Nelson
Date: Mon, 4 May 2026 03:52:01 +0000
Subject: [PATCH] watcher.py: do not mark failed ingests as successfully ingested
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ingest_files() updated state[path] = mtime unconditionally after every
ingest_file() call. ingest_file() returns 0 when text extraction fails,
embedding fails, no chunks are produced, or the pgvector write fails. In
every one of those cases, the path was still recorded as ingested at the
current mtime. On the next pass, get_changed_files() saw the mtime match
and skipped the file, locking it out of the corpus until something
modified it on disk.

record_ingest_failure() writes to a UI-visible failures table, but
nothing reads that table to retry. So failures accumulated silently: the
file was simultaneously logged as failed AND tracked in watcher_state as
up-to-date, and the second condition won.

Fix: only update watcher_state when ingest_file() returns count > 0.
Failed ingests will be retried on the next watcher cycle until they
succeed or are explicitly excluded.

Diagnostic at fix time: 129 rows in ingest_failures, 128 currently
locked out of the corpus (filepath in watcher_state with mtime matching
current disk). 128/129 are text_extraction failures, mostly scanned PDFs
(106 .pdf, 13 .docx, 7 .pptx, 2 .md, 1 .txt). 1 source no longer exists
on disk. 0 have had their disk mtime change since failing; i.e., without
this fix, none of them would ever retry.

Cross-check shows watcher_state has 1466 paths vs. 1061 distinct sources
in pgvector embeddings, leaving a residual silent gap of ~276 files
after accounting for failures.

Historical cleanup of files already locked out by this bug is tracked
separately. New failures from this commit forward will retry naturally.
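
Illustration only, not part of the change: a minimal, self-contained
sketch of the retry behaviour described above. The names
ingest_files_sketch, get_changed_files_sketch, and the stub ingest
function are hypothetical stand-ins; the real get_changed_files() and
ingest_file() in scripts/watcher.py may differ in signature and logic.

```python
import tempfile
from pathlib import Path


def ingest_files_sketch(paths, ingest_file, state):
    """Record a path's mtime only when ingest_file reports at least one chunk."""
    for path in paths:
        count = ingest_file(path)
        if count > 0:
            state[str(path)] = str(path.stat().st_mtime)
    return state


def get_changed_files_sketch(paths, state):
    """Assumed behaviour: a path needs ingesting if its mtime differs from state."""
    return [p for p in paths if state.get(str(p)) != str(p.stat().st_mtime)]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "notes.md"
        bad = Path(tmp) / "scan.pdf"
        good.write_text("some text")
        bad.write_bytes(b"%PDF-1.4")

        # Stub ingest: pretend text extraction fails for PDFs (0 chunks).
        def stub_ingest(path):
            return 0 if path.suffix == ".pdf" else 3

        state = ingest_files_sketch([good, bad], stub_ingest, {})
        # Only the failed file should still be pending on the next pass.
        return [p.name for p in get_changed_files_sketch([good, bad], state)]
```

With the guard in place, the failed file stays pending on the next
pass. With the old unconditional update, the list would be empty and the
file would never be retried.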
---
 scripts/watcher.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/watcher.py b/scripts/watcher.py
index 506a560..01adb4b 100644
--- a/scripts/watcher.py
+++ b/scripts/watcher.py
@@ -168,7 +168,8 @@ def ingest_files(paths: list, embedder, state: dict) -> dict:
     for path in paths:
         count = ingest_file(path, embedder)
         total += count
-        state[str(path)] = str(path.stat().st_mtime)
+        if count > 0:
+            state[str(path)] = str(path.stat().st_mtime)
 
     log.info(f"Ingestion complete. {total} chunks across {len(paths)} files.")
     return state