watcher.py: exclude generative-graphic folders and zero-byte files
A two-sample diagnostic of the 128 ingest_failures rows surfaced two folders whose contents are exclusively non-text PDFs (iText-produced generative graphics from Processing sketches and computational design sketches) and three zero-byte test artifacts. None of these has ever produced an embedding chunk, and none has anything extractable to contribute. Excluding them removes 19 / 128 (15%) of the locked-out failures from the cohort and prevents future files matching the same patterns from re-failing.

Folder exclusions use path.parts membership rather than substring matching. Matching whole path components avoids false matches if similarly named folders appear elsewhere in the corpus (e.g. an unrelated "Generative Design" or "Computational Design 2017" directory created later). The existing "Admin/Backups" / "Journal/Media" substring checks are looser, but new exclusions take the tighter pattern.

The zero-byte filter goes in get_changed_files() only, since that is the actual ingestion gate. Adding stat() to _should_ignore() (the FS-event noise filter) would introduce a race where the file is gone between the event firing and the stat call. Empty files briefly trigger pending=True but produce no work after debounce; this is cosmetic only.

Cleanup applied separately via UPDATE: 19 ingest_failures rows for these paths marked resolved=TRUE. Unresolved-failure count: 129 -> 110.

Verified: get_changed_files() with empty state returns 1418 changed files; all 5 excluded probes (2 folder-matched + 3 zero-byte) are absent from the result and the control file is present. Watcher service restarted clean; startup scan reports no missed files.
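The difference between component membership and substring matching can be illustrated with a minimal sketch. The helper names and the example paths below are hypothetical and are not taken from watcher.py; only the two folder-name pairs come from this commit.

```python
from pathlib import PurePosixPath


def is_excluded_by_parts(path: PurePosixPath) -> bool:
    """Match only when both names appear as whole path components."""
    parts = path.parts
    return ("Generative Design" in parts and "Processing" in parts) or (
        "Computational Design 2017" in parts and "Student Work" in parts
    )


def is_excluded_by_substring(path: PurePosixPath) -> bool:
    """Looser variant for comparison: match anywhere in the path string."""
    text = str(path)
    return "Generative Design" in text and "Processing" in text


# Hypothetical paths, for illustration only.
target = PurePosixPath("Archive/Generative Design/Processing/sketch_01.pdf")
lookalike = PurePosixPath("Notes/Generative Design Notes/Image Processing/summary.pdf")

print(is_excluded_by_parts(target))         # True
print(is_excluded_by_parts(lookalike))      # False
print(is_excluded_by_substring(lookalike))  # True
```

The lookalike path contains both strings but neither as an exact component, so only the substring variant would exclude it.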
@@ -203,6 +203,12 @@ def get_changed_files(state: dict) -> list:
             continue
         if "Journal/Media" in str(path):
             continue
+        if "Generative Design" in path.parts and "Processing" in path.parts:
+            continue
+        if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
+            continue
+        if path.stat().st_size == 0:
+            continue
         if state.get(str(path)) != str(path.stat().st_mtime):
             changed.append(path)
     return changed
@@ -287,6 +293,10 @@ class IngestHandler(FileSystemEventHandler):
             return True
         if "Journal/Media" in str(path):
             return True
+        if "Generative Design" in path.parts and "Processing" in path.parts:
+            return True
+        if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
+            return True
         return False
 
     def on_created(self, event):
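For reference, the zero-byte gate in the scan path can be sketched in isolation as follows. This is an illustrative variant, not the code in watcher.py: the function name `changed_nonempty_files` is hypothetical, and the single `stat()` call with a `FileNotFoundError` guard is an added assumption that the commit itself does not make.

```python
import tempfile
from pathlib import Path


def changed_nonempty_files(root: Path, state: dict) -> list:
    """Return files under root that are non-empty and whose mtime differs from state."""
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            info = path.stat()  # one stat call serves both the size and mtime checks
        except FileNotFoundError:
            continue  # assumption: a file that vanished mid-scan is simply skipped
        if info.st_size == 0:
            continue  # zero-byte files have nothing to ingest
        if state.get(str(path)) != str(info.st_mtime):
            changed.append(path)
    return changed


def demo() -> list:
    """Scan a temporary directory holding one empty and one non-empty file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "empty.pdf").touch()
        (root / "notes.txt").write_text("hello")
        return [p.name for p in changed_nonempty_files(root, state={})]


print(demo())  # ['notes.txt']
```

Keeping the size check in the scan rather than in the event filter means it runs at the point where a file would actually be queued for ingestion.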