Files
aaronAI/scripts/failures.py
T
aaron 1101bef226 scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11)
Consolidates four extract paths and two extract-chunk-embed-write pipelines
into a single shared encoding module. Fixes the embedder lifecycle
divergence between watcher and /api/reindex (no more 200MB reload per
reindex click) and unifies failure tracking so /api/reindex failures now
surface in SettingsPanel "Ingest Health".

New files:
- scripts/encoding.py — extract_text, chunk_text, chunk_and_embed,
  write_embeddings_batch
- scripts/failures.py — record_ingest_failure, resolve_ingest_failure
  (shared by watcher.py and ingest.py)

Refactored:
- scripts/watcher.py — drops local extract/chunk/embed implementations
  and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding
  and failures. Now writes ingest_failures row on empty-text-extract
  (was silent return 0).
- scripts/ingest.py — substantial rewrite. Exposes ingest_directory(folder,
  embedder=None) for in-process invocation; CLI back-compat preserved via
  ingest_folder wrapper. Module-level SentenceTransformer load removed.
- scripts/corpus_integrity.py — imports extract_text from encoding;
  extract_text_for_retry function removed.
- scripts/api.py — /api/reindex rewritten with BackgroundTasks (uses
  module-level embedder; no subprocess); new /api/reindex/status endpoint
  reading ~/aaronai/reindex_status.json; /api/corpus/retry imports
  extract_text from encoding; INGEST_SCRIPT constant removed (dead after
  this refactor); 409 reentrance guard prevents double-click stomping.

Behavior changes:
- /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks
  threadpool, doesn't block API thread.
- /api/reindex no longer reloads SentenceTransformer on each click.
- /api/reindex failures newly write to ingest_failures (visible in
  SettingsPanel "Ingest Health" — badge will jump on first reindex).
- New embeddings rows always have created_at = NOW() (canonical, server-side).
- New embeddings rows always include metadata.folder field (None when not
  derivable).
- /api/reindex returns 409 on second click while a job is running.
- New /api/reindex/status endpoint for polling.

Existing 9,815 NULL created_at rows remain unchanged; backfill is a
separate decision if desired.

199 insertions, 256 deletions across 6 files (codebase shrinks net).

Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11).
Pre-commit verification: BackgroundTasks already imported, sys.path
resolves correctly via script-path semantics, static import clean.
2026-05-03 01:40:47 +00:00

31 lines
1.2 KiB
Python

"""
Aaron AI ingest_failures helpers — shared by watcher.py and ingest.py.
Both modules write structured failure rows so the SettingsPanel "Ingest Health"
view sees the same shape regardless of ingest path. Functions take an explicit
conn parameter; the caller decides transaction boundaries and exception
handling. Both current callers wrap with their own log-and-swallow shims.
"""
def record_ingest_failure(conn, source: str, filepath, error: str) -> None:
"""Insert or update an ingest_failures row. Commits."""
cur = conn.cursor()
cur.execute("""
INSERT INTO ingest_failures (source, filepath, error, retry_count, first_failed_at, last_failed_at)
VALUES (%s, %s, %s, 0, NOW(), NOW())
ON CONFLICT (source) DO UPDATE SET
error = EXCLUDED.error,
retry_count = ingest_failures.retry_count + 1,
last_failed_at = NOW(),
resolved = FALSE
""", (source, str(filepath), error[:1000]))
conn.commit()
def resolve_ingest_failure(conn, source: str) -> None:
"""Mark a previously failed source as resolved. Commits."""
cur = conn.cursor()
cur.execute("UPDATE ingest_failures SET resolved = TRUE WHERE source = %s", (source,))
conn.commit()