From 72e07afc03ec2439fde47e72d17afea358bb69b4 Mon Sep 17 00:00:00 2001
From: Aaron Nelson <aaron@aaronnelson.studio>
Date: Mon, 4 May 2026 03:52:01 +0000
Subject: [PATCH] watcher.py: do not mark failed ingests as successfully
 ingested
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ingest_files() updated state[path] = mtime unconditionally after every
ingest_file() call. ingest_file() returns 0 when text extraction fails,
embedding fails, no chunks are produced, or the pgvector write fails.
In every one of those cases, the path was still recorded as ingested
at the current mtime. On the next pass, get_changed_files() saw the
mtime match and skipped the file, locking it out of the corpus until
something modified it on disk.

record_ingest_failure() writes to a UI-visible failures table, but
nothing reads that table to retry. So failures accumulated silently:
the file was simultaneously logged as failed AND tracked in
watcher_state as up-to-date, and the second condition won.

Fix: only update watcher_state when ingest_file() returns a count > 0.
Failed ingests will be retried on the next watcher cycle until they
succeed or are explicitly excluded.
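
The effect of the guard can be sketched with a toy watcher loop. This
is an illustration only: run_pass() and fake_ingest() are hypothetical
stand-ins, not the real get_changed_files() or ingest_file(), and the
only_on_success flag exists purely to compare old and new behavior.

```python
def run_pass(paths, mtimes, ingest_file, state, only_on_success):
    """One simulated watcher pass; returns the paths it attempted."""
    # Stand-in for get_changed_files(): pick paths whose recorded
    # mtime differs from the (simulated) mtime on disk.
    changed = [p for p in paths if state.get(p) != mtimes[p]]
    for p in changed:
        count = ingest_file(p)
        # Old behavior: always record. New behavior: record on success.
        if count > 0 or not only_on_success:
            state[p] = mtimes[p]
    return changed


def fake_ingest(path):
    # Hypothetical: pretend every .pdf fails text extraction.
    return 0 if path.endswith(".pdf") else 3


paths = ["scan.pdf", "notes.md"]
mtimes = {"scan.pdf": "100.0", "notes.md": "200.0"}

old_state, new_state = {}, {}
run_pass(paths, mtimes, fake_ingest, old_state, only_on_success=False)
run_pass(paths, mtimes, fake_ingest, new_state, only_on_success=True)

# Second pass, nothing changed on disk.
old_retry = run_pass(paths, mtimes, fake_ingest, old_state, False)
new_retry = run_pass(paths, mtimes, fake_ingest, new_state, True)
print(old_retry)  # []
print(new_retry)  # ['scan.pdf']
```

With the unconditional update the failed file is never attempted again;
with the guard it stays out of the state dict and is picked up on every
pass until it succeeds.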

Diagnostic at fix time: 129 rows in ingest_failures, 128 currently
locked out of the corpus (filepath in watcher_state with mtime matching
current disk). 128/129 are text_extraction failures, mostly scanned
PDFs (106 .pdf, 13 .docx, 7 .pptx, 2 .md, 1 .txt). 1 source no longer
exists on disk. 0 have had their disk mtime change since failing, i.e.
without this fix, none of them would ever retry. Cross-check shows
watcher_state has 1466 paths vs. 1061 distinct sources in pgvector
embeddings, a difference of 405, leaving a residual silent gap of
~276 files after subtracting the 129 recorded failures.
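
A minimal sketch of the lock-out check described above, assuming the
failed paths and watcher_state have already been loaded into memory.
classify_failures() is a hypothetical helper, not part of watcher.py;
it relies only on state values being str(path.stat().st_mtime), as in
the diff below.

```python
import tempfile
from pathlib import Path


def classify_failures(failed_paths, state):
    """Group failed paths by whether the watcher would ever retry them."""
    groups = {"locked_out": [], "missing": [], "will_retry": []}
    for p in failed_paths:
        path = Path(p)
        if not path.exists():
            groups["missing"].append(p)
        elif state.get(p) == str(path.stat().st_mtime):
            # Recorded mtime matches disk: the watcher skips this file.
            groups["locked_out"].append(p)
        else:
            groups["will_retry"].append(p)
    return groups


# Small demo on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    stuck = Path(tmp) / "scan.pdf"
    stuck.write_text("x")
    edited = Path(tmp) / "notes.md"
    edited.write_text("y")
    gone = Path(tmp) / "deleted.txt"  # never created

    state = {
        str(stuck): str(stuck.stat().st_mtime),  # matches disk
        str(edited): "0.0",                      # stale entry
    }
    groups = classify_failures(
        [str(stuck), str(edited), str(gone)], state
    )
    counts = {name: len(items) for name, items in groups.items()}
```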

Historical cleanup of files already locked out by this bug is tracked
separately. New failures from this commit forward will retry naturally.
---
 scripts/watcher.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/watcher.py b/scripts/watcher.py
index 506a560..01adb4b 100644
--- a/scripts/watcher.py
+++ b/scripts/watcher.py
@@ -168,7 +168,8 @@ def ingest_files(paths: list, embedder, state: dict) -> dict:
     for path in paths:
         count = ingest_file(path, embedder)
         total += count
-        state[str(path)] = str(path.stat().st_mtime)
+        if count > 0:
+            state[str(path)] = str(path.stat().st_mtime)
     log.info(f"Ingestion complete. {total} chunks across {len(paths)} files.")
     return state