watcher.py: exclude generative-graphic folders and zero-byte files

Two-sample diagnostic of the 128 ingest_failures rows surfaced two
folders whose contents are exclusively non-text PDFs (iText-produced
generative graphics from Processing sketches and computational design
sketches) and three zero-byte test artifacts. None of these have ever
produced an embedding chunk, and they have nothing extractable to
contribute. Excluding them removes 19 / 128 (15%) of the locked-out
failures from the cohort and prevents future files that match the
same patterns from re-failing.

Folder exclusions use path.parts membership rather than substring
matching. Both folder names must appear as whole path components,
which reduces false-match risk if similarly named folders appear
elsewhere in the corpus (e.g. an unrelated "Generative Design" or
"Computational Design 2017" directory created later). The existing
"Admin/Backups" / "Journal/Media" substring checks are looser, but
new exclusions take the tighter pattern.
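
The difference between the two matching styles can be seen in a minimal
sketch. The helper names and sample paths below are hypothetical and
chosen only for illustration; the folder names and the path.parts test
are the ones used in this commit.

```python
from pathlib import PurePosixPath


def excluded_by_parts(path: PurePosixPath) -> bool:
    # Whole-component match: both names must be exact path components.
    return "Generative Design" in path.parts and "Processing" in path.parts


def excluded_by_substring(path: PurePosixPath) -> bool:
    # Looser substring style, in the spirit of the older "Journal/Media" check.
    return "Generative Design" in str(path)


# Hypothetical paths: one real target and one similarly named folder.
target = PurePosixPath("Archive/Generative Design/Processing/sketch_01.pdf")
lookalike = PurePosixPath("Notes/Generative Design Reading List/essay.pdf")

print(excluded_by_parts(target), excluded_by_substring(target))        # True True
print(excluded_by_parts(lookalike), excluded_by_substring(lookalike))  # False True
```

The lookalike folder is caught by the substring test but passes the
path.parts test, because "Generative Design Reading List" is a different
component from "Generative Design".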

Zero-byte filter goes in get_changed_files() only, since that is the
actual ingestion gate. Adding stat() to _should_ignore() (the FS-event
noise filter) would introduce a race where the file is gone between
event fire and stat call. Empty files still briefly trigger
pending=True but produce no work after debounce, so the effect is
cosmetic only.
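
A minimal sketch of a size check at a scan gate follows. It is not the
code in this commit: nonempty_files() and demo() are hypothetical names,
and the FileNotFoundError handling is an assumption about how a scan
could tolerate a file that disappears between listing and stat().

```python
import tempfile
from pathlib import Path


def nonempty_files(root: Path) -> list:
    """Return regular files under root that exist and are not zero-byte."""
    kept = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories (and paths that already vanished)
        try:
            size = path.stat().st_size
        except FileNotFoundError:
            continue  # file removed between listing and stat(); skip it
        if size == 0:
            continue  # zero-byte file: nothing to extract
        kept.append(path)
    return kept


def demo() -> list:
    # Build a throwaway tree with one empty and one non-empty file.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "empty.pdf").touch()
        (root / "notes.txt").write_text("hello")
        return [p.name for p in nonempty_files(root)]


print(demo())  # ['notes.txt']
```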

Cleanup applied separately via UPDATE: 19 ingest_failures rows for
these paths marked resolved=TRUE. Unresolved-failure count: 129 -> 110.
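
A cleanup of this kind could look like the sketch below. Everything
beyond the table name and the resolved flag is an assumption: the commit
does not show the schema, the database engine, or the statement that was
run, so this uses an in-memory SQLite table with a hypothetical path
column and made-up paths.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; SQLite stores booleans as integers (0 / 1).
conn.execute(
    "CREATE TABLE ingest_failures ("
    "path TEXT PRIMARY KEY, resolved INTEGER NOT NULL DEFAULT 0)"
)

# Made-up rows: two belong to an excluded folder, two do not.
paths = [
    "Generative Design/Processing/a.pdf",
    "Generative Design/Processing/b.pdf",
    "Reports/q1.pdf",
    "Reports/q2.pdf",
]
conn.executemany(
    "INSERT INTO ingest_failures (path) VALUES (?)", [(p,) for p in paths]
)

# Mark only the rows for the now-excluded paths as resolved.
excluded = paths[:2]
placeholders = ",".join("?" for _ in excluded)
cur = conn.execute(
    "UPDATE ingest_failures SET resolved = 1 "
    f"WHERE resolved = 0 AND path IN ({placeholders})",
    excluded,
)
conn.commit()

unresolved = conn.execute(
    "SELECT COUNT(*) FROM ingest_failures WHERE resolved = 0"
).fetchone()[0]
print(cur.rowcount, unresolved)  # 2 2
```

Comparing the number of updated rows with the drop in the unresolved
count gives a quick sanity check that only the intended rows changed.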

Verified: get_changed_files() with empty state returns 1418 changed
files; all 5 excluded probes (2 folder-matched + 3 zero-byte) absent
from the result, control file present. Watcher service restarted
clean; startup scan reports no missed files.
2026-05-04 06:24:08 +00:00
parent 72e07afc03
commit f18fb64fe5
+10
@@ -203,6 +203,12 @@ def get_changed_files(state: dict) -> list:
             continue
         if "Journal/Media" in str(path):
             continue
+        if "Generative Design" in path.parts and "Processing" in path.parts:
+            continue
+        if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
+            continue
+        if path.stat().st_size == 0:
+            continue
         if state.get(str(path)) != str(path.stat().st_mtime):
             changed.append(path)
     return changed
@@ -287,6 +293,10 @@ class IngestHandler(FileSystemEventHandler):
             return True
         if "Journal/Media" in str(path):
             return True
+        if "Generative Design" in path.parts and "Processing" in path.parts:
+            return True
+        if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
+            return True
         return False

     def on_created(self, event):