watcher.py: do not mark failed ingests as successfully ingested

ingest_files() updated state[path] = mtime unconditionally after every
ingest_file() call. ingest_file() returns 0 when text extraction fails,
embedding fails, no chunks are produced, or the pgvector write fails —
in every one of those cases, the path was still recorded as ingested
at the current mtime. On the next pass, get_changed_files() saw the
mtime match and skipped the file, locking it out of the corpus until
something modified it on disk.
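
The lockout can be reproduced with a minimal sketch. The real
get_changed_files() is not shown in this commit, so the version below is
a hypothetical stand-in that assumes a plain mtime comparison against
state; ingest_file() is a stub that always fails, and
ingest_files_buggy() is an illustrative name for the pre-fix loop.

```python
import tempfile
from pathlib import Path


def get_changed_files(paths, state):
    """Hypothetical stand-in: a path counts as changed when its current
    mtime differs from the one recorded in state."""
    return [p for p in paths if state.get(str(p)) != str(p.stat().st_mtime)]


def ingest_file(path, embedder=None):
    """Stub that always fails, e.g. a scanned PDF with no extractable text."""
    return 0


def ingest_files_buggy(paths, state):
    """Pre-fix behaviour: state is updated regardless of the chunk count."""
    for path in paths:
        ingest_file(path)
        state[str(path)] = str(path.stat().st_mtime)
    return state


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "scan.pdf"
    doc.write_bytes(b"%PDF-1.4")
    state = {}

    # First pass: the file is new, so it is picked up and "ingested".
    first_pass = get_changed_files([doc], state)
    ingest_files_buggy(first_pass, state)

    # Second pass: the mtime matches, so the failed file is never retried.
    second_pass = get_changed_files([doc], state)
```

Under these assumptions, second_pass is empty even though nothing was
ever written to the corpus for that file.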

record_ingest_failure() writes to a UI-visible failures table, but
nothing reads that table to retry. So failures accumulated silently:
the file was simultaneously logged as failed AND tracked in
watcher_state as up-to-date, and the second condition won.

Fix: only update watcher_state when ingest_file returns count > 0.
Failed ingests will be retried on the next watcher cycle until they
succeed or are explicitly excluded.
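
A sketch of the retry behaviour after the fix, under stated assumptions:
get_changed_files() is again a hypothetical mtime comparison, and
ingest_file() is a stub that fails twice before producing chunks. Only
the `if count > 0` guard mirrors the actual change.

```python
import tempfile
from pathlib import Path

attempts = {"n": 0}


def ingest_file(path, embedder=None):
    """Hypothetical stub: fails on the first two calls, then yields 3 chunks."""
    attempts["n"] += 1
    return 0 if attempts["n"] < 3 else 3


def get_changed_files(paths, state):
    """Hypothetical stand-in: changed means mtime differs from state."""
    return [p for p in paths if state.get(str(p)) != str(p.stat().st_mtime)]


def ingest_files(paths, embedder, state):
    """Post-fix loop: a path is recorded only when chunks were produced."""
    total = 0
    for path in paths:
        count = ingest_file(path, embedder)
        total += count
        if count > 0:
            state[str(path)] = str(path.stat().st_mtime)
    return state


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "notes.txt"
    doc.write_text("hello")
    state = {}
    cycles = 0

    # Each iteration stands in for one watcher cycle.
    while get_changed_files([doc], state):
        ingest_files(get_changed_files([doc], state), None, state)
        cycles += 1

    tracked = str(doc) in state
```

The file stays in the changed set across the failing cycles and is
recorded in state only on the cycle where ingest_file() returns a
positive count.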

Diagnostic at fix time: 129 rows in ingest_failures, 128 currently
locked out of the corpus (filepath in watcher_state with mtime matching
current disk). 128/129 are text_extraction failures, mostly scanned
PDFs. By extension, the 129 rows are 106 .pdf, 13 .docx, 7 .pptx,
2 .md, 1 .txt. 1 source no longer exists on disk. 0 have had their disk
mtime change since failing, i.e. without this fix, none of them would
ever retry. Cross-check shows watcher_state has 1466 paths vs. 1061
distinct sources in pgvector embeddings, a difference of 405.
Accounting for the 129 failure rows leaves a residual silent gap of
~276 files with no recorded failure.
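
The figures above can be checked with a few lines of arithmetic. This is
only a sanity check on the numbers quoted in this message; it assumes
the ~276 residual is obtained by subtracting all 129 failure rows from
the watcher_state/embeddings difference. The variable names are
illustrative.

```python
# Counts quoted in the diagnostic above.
watcher_state_paths = 1466
embedded_sources = 1061
failure_rows = 129
by_extension = {".pdf": 106, ".docx": 13, ".pptx": 7, ".md": 2, ".txt": 1}

# Paths tracked as ingested that have no embeddings in pgvector.
untracked_gap = watcher_state_paths - embedded_sources

# Assumption: every failure row accounts for one of those paths.
residual_gap = untracked_gap - failure_rows

# The per-extension breakdown should cover every failure row.
extension_total = sum(by_extension.values())
```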

Historical cleanup of files already locked out by this bug is tracked
separately. New failures from this commit forward will retry naturally.
2026-05-04 03:52:01 +00:00
parent c3011c80a5
commit 72e07afc03
+2 -1
@@ -168,7 +168,8 @@ def ingest_files(paths: list, embedder, state: dict) -> dict:
     for path in paths:
         count = ingest_file(path, embedder)
         total += count
-        state[str(path)] = str(path.stat().st_mtime)
+        if count > 0:
+            state[str(path)] = str(path.stat().st_mtime)
     log.info(f"Ingestion complete. {total} chunks across {len(paths)} files.")
     return state