watcher.py: extend lockfile filter to catch UTF-8-mangled ~$ prefixes

Three rows in ingest_failures were Office lockfile leftovers whose
filenames start with ~� (~ followed by U+FFFD, the Unicode replacement
character) instead of ~$. Somewhere in the Nextcloud sync chain the $
byte was lost or replaced, and each file now lives on disk as a real
file with the corrupted name. The watcher's ("~$", ".") prefix filter
didn't match, so each cycle tried to ingest them as pptx, hit
BadZipFile inside python-pptx (lockfiles aren't real Office documents),
and left them permanently in ingest_failures.
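
The failure mode above can be reproduced in isolation. This is a
minimal sketch, not code from watcher.py: the paths and the lockfile
payload are made up, and the standard-library zipfile module stands in
for python-pptx on the assumption that the BadZipFile comes from opening
the .pptx as a ZIP archive.

```python
import io
import zipfile
from pathlib import Path

OLD_PREFIXES = ("~$", ".")  # the filter tuple before this commit

# Hypothetical paths: the second has U+FFFD where "$" should be.
intact = Path("/srv/sync/~$budget.pptx")
mangled = Path("/srv/sync/~\ufffdbudget.pptx")

old_skips_intact = intact.name.startswith(OLD_PREFIXES)    # filtered out
old_skips_mangled = mangled.name.startswith(OLD_PREFIXES)  # slips through


def opens_as_zip(data: bytes) -> bool:
    """Return True if the bytes open as a ZIP archive, as a real .pptx would."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)):
            return True
    except zipfile.BadZipFile:
        return False


# Stand-in lockfile payload (assumption): a short non-ZIP byte string.
lockfile_bytes = b"\x05owner" + b"\x00" * 50

print(old_skips_intact, old_skips_mangled, opens_as_zip(lockfile_bytes))
```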

Three filter sites in watcher.py applied the lockfile prefix check:
  - ingest_file() at :127
  - get_changed_files() at :200
  - IngestHandler._should_ignore() at :290

All three now match ("~$", "~", "."), broadened to catch any tilde
prefix, not just ~$. A cross-check against the pgvector embeddings and
the files on disk found zero legitimate tilde-prefixed files, so the
broader filter carries no false-positive risk in this corpus.
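
The disk half of that cross-check could look roughly like the sketch
below. It is an illustration, not the query that was actually run: the
function name is hypothetical, and it only lists candidates for manual
review.

```python
from pathlib import Path

NEW_PREFIXES = ("~$", "~", ".")  # "~" alone already covers "~$"


def tilde_prefixed_non_lockfiles(root: Path) -> list:
    """List files under root whose name starts with "~" but not "~$".

    These are the names the broadened filter newly excludes, so each one
    is either a mangled lockfile or a potential false positive.
    """
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and p.name.startswith("~")
        and not p.name.startswith("~$")
    )
```

An empty result, apart from the known mangled lockfiles, is what
justifies the broader prefix for a given tree.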

Cleanup: 3 ingest_failures rows resolved (filepath LIKE '%/~%').
Unresolved count drops 97 → 94.

If a fourth filter site is ever added, the right shape is to consolidate
the lockfile prefix check into a shared function or constant. Three
parallel sites with two different tuple orderings are acceptable for
now but worth normalizing if the surface grows.
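
One possible shape for that consolidation, as a sketch only: the names
IGNORED_NAME_PREFIXES and has_ignored_prefix are hypothetical and do not
exist in watcher.py today.

```python
from pathlib import Path

# Hypothetical shared constant. "~" already matches every "~$" name;
# "~$" is kept only to document the original intent.
IGNORED_NAME_PREFIXES = (".", "~$", "~")


def has_ignored_prefix(path: Path) -> bool:
    """Return True for dotfiles and tilde-prefixed (lockfile-like) names."""
    return path.name.startswith(IGNORED_NAME_PREFIXES)
```

ingest_file(), get_changed_files() and IngestHandler._should_ignore()
would then each call has_ignored_prefix(path) instead of carrying their
own tuple, so the next prefix change touches one line.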
2026-05-04 16:19:56 +00:00
parent 93c0d89308
commit b9eea6cb62
+3 -3
@@ -124,7 +124,7 @@ def resolve_ingest_failure(source: str):


 def ingest_file(filepath: Path, embedder) -> int:
-    if filepath.name.startswith(("~$", ".")):
+    if filepath.name.startswith(("~$", "~", ".")):
         return 0
     if filepath.suffix.lower() not in SUPPORTED:
         return 0
@@ -197,7 +197,7 @@ def get_changed_files(state: dict) -> list:
             continue
         if path.suffix.lower() not in SUPPORTED:
             continue
-        if path.name.startswith((".", "~$")):
+        if path.name.startswith((".", "~$", "~")):
             continue
         if "Admin/Backups" in str(path) or "Backups" in path.parts:
             continue
@@ -287,7 +287,7 @@ class IngestHandler(FileSystemEventHandler):
         self.last_event = 0

     def _should_ignore(self, path: Path) -> bool:
-        if path.name.startswith((".", "~$")):
+        if path.name.startswith((".", "~$", "~")):
             return True
         if "Admin/Backups" in str(path) or "Backups" in path.parts:
             return True