scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11)

Consolidates four extract paths and two extract-chunk-embed-write pipelines into a single shared encoding module. Fixes the embedder lifecycle divergence between watcher and /api/reindex (no more 200MB reload per reindex click) and unifies failure tracking so /api/reindex failures now surface in SettingsPanel "Ingest Health".

New files:
- scripts/encoding.py: extract_text, chunk_text, chunk_and_embed, write_embeddings_batch
- scripts/failures.py: record_ingest_failure, resolve_ingest_failure (shared by watcher.py and ingest.py)

Refactored:
- scripts/watcher.py: drops local extract/chunk/embed implementations and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding and failures. Now writes ingest_failures row on empty-text-extract (was silent return 0).
- scripts/ingest.py: substantial rewrite. Exposes ingest_directory(folder, embedder=None) for in-process invocation; CLI back-compat preserved via ingest_folder wrapper. Module-level SentenceTransformer load removed.
- scripts/corpus_integrity.py: imports extract_text from encoding; extract_text_for_retry function removed.
- scripts/api.py: /api/reindex rewritten with BackgroundTasks (uses module-level embedder; no subprocess); new /api/reindex/status endpoint reading ~/aaronai/reindex_status.json; /api/corpus/retry imports extract_text from encoding; INGEST_SCRIPT constant removed (dead after this refactor); 409 reentrance guard prevents double-click stomping.

Behavior changes:
- /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks threadpool, doesn't block API thread.
- /api/reindex no longer reloads SentenceTransformer on each click.
- /api/reindex failures newly write to ingest_failures (visible in SettingsPanel "Ingest Health"; badge will jump on first reindex).
- New embeddings rows always have created_at = NOW() (canonical, server-side).
- New embeddings rows always include metadata.folder field (None when not derivable).
- /api/reindex returns 409 on second click while a job is running.
- New /api/reindex/status endpoint for polling.

Existing 9,815 NULL created_at rows remain unchanged; backfill is a separate decision if desired.

199 insertions, 256 deletions across 6 files (codebase shrinks net).

Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11).

Pre-commit verification: BackgroundTasks already imported, sys.path resolves correctly via script-path semantics, static import clean.
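
The 409 reentrance guard and the polled status file described in this message could be sketched roughly as follows. This is a framework-free illustration under stated assumptions, not the code in scripts/api.py: the names try_start_reindex, run_reindex, write_status and _reindex_lock, the status fields, and the 202 success code are all hypothetical, and the status file is written to the temp directory instead of ~/aaronai/reindex_status.json.

```python
import json
import tempfile
import threading
from pathlib import Path

# Hypothetical stand-in for ~/aaronai/reindex_status.json.
STATUS_PATH = Path(tempfile.gettempdir()) / "reindex_status_sketch.json"

# One lock per process, held for the whole lifetime of a reindex job.
# A plain Lock (not RLock) may be released by a different thread than the
# one that acquired it, which suits a request thread handing off to a worker.
_reindex_lock = threading.Lock()


def write_status(state: str, **extra) -> None:
    """Overwrite the status file that a polling endpoint would read."""
    STATUS_PATH.write_text(json.dumps({"state": state, **extra}))


def try_start_reindex() -> int:
    """Return 202 if the job slot was claimed, 409 if a job is already running."""
    # Non-blocking acquire: a second click fails fast instead of queueing.
    if not _reindex_lock.acquire(blocking=False):
        return 409
    write_status("running")
    return 202


def run_reindex(job) -> None:
    """Run the job (in the real API, on a background worker), then free the slot."""
    try:
        write_status("done", chunks=job())
    except Exception as exc:
        write_status("failed", error=str(exc))
    finally:
        _reindex_lock.release()


first = try_start_reindex()     # claims the slot
second = try_start_reindex()    # rejected while the first job is pending
run_reindex(lambda: 3)          # hypothetical job returning a chunk count
status_after_job = json.loads(STATUS_PATH.read_text())
third = try_start_reindex()     # slot is free again after the job finished
_reindex_lock.release()         # tidy up the slot claimed by `third`
```

Releasing the lock in a finally block matters here: if a failed job left the slot claimed, every later click would get 409 until the process restarted.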
2026-05-03 01:40:47 +00:00
parent a317df66f8
commit 1101bef226
6 changed files with 357 additions and 264 deletions
@@ -0,0 +1,120 @@
+"""
+Aaron AI Stage 1 encoding helpers — single canonical implementation of:
+  - extract_text(filepath): text extraction for .docx, .pdf, .pptx, .txt, and .md
+  - chunk_text(text, chunk_size, overlap): word-based chunking
+  - chunk_and_embed(text, source, embedder, filepath, folder): produce ready-to-write rows
+  - write_embeddings_batch(conn, batch): canonical INSERT with server-side NOW() for created_at
+
+Used by watcher.py, ingest.py, corpus_integrity.py, and api.py /api/corpus/retry.
+Replaces four separate extract reimplementations and two extract-chunk-embed paths.
+"""
+
+import hashlib
+import json
+import logging
+from pathlib import Path
+
+from docx import Document as DocxDocument
+from pypdf import PdfReader
+from pptx import Presentation
+
+log = logging.getLogger("encoding")
+
+SUPPORTED = {".docx", ".pdf", ".pptx", ".txt", ".md"}
+DEFAULT_CHUNK_SIZE = 500
+DEFAULT_CHUNK_OVERLAP = 50
+
+
+def extract_text(filepath: Path) -> str:
+    """Return the text of a supported file. Returns "" on any failure or
+    unsupported extension. Does not write to ingest_failures — caller decides."""
+    suffix = filepath.suffix.lower()
+    try:
+        if suffix == ".docx":
+            doc = DocxDocument(filepath)
+            return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
+        elif suffix == ".pdf":
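+            reader = PdfReader(filepath)
+            # Run extraction once per page; skip pages that yield no text.
+            page_texts = (page.extract_text() for page in reader.pages)
+            return "".join(
+                page_text + "\n" for page_text in page_texts if page_text)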
+        elif suffix == ".pptx":
+            prs = Presentation(filepath)
+            return "\n".join(
+                shape.text for slide in prs.slides
+                for shape in slide.shapes
+                if hasattr(shape, "text") and shape.text.strip()
+            )
+        elif suffix in {".txt", ".md"}:
+            return filepath.read_text(encoding="utf-8", errors="ignore")
+    except Exception as e:
+        log.warning(f"Text extraction failed for {filepath.name}: {e}")
+    return ""
+
+
+def chunk_text(text: str,
+               chunk_size: int = DEFAULT_CHUNK_SIZE,
+               overlap: int = DEFAULT_CHUNK_OVERLAP) -> list[str]:
+    """Word-based chunking. Empty chunks filtered."""
+    words = text.split()
+    # Guard: overlap >= chunk_size would give a non-positive stride and never finish.
+    step = max(1, chunk_size - overlap)
+    chunks = []
+    for start in range(0, len(words), step):
+        chunk = " ".join(words[start:start + chunk_size])
+        if chunk.strip():
+            chunks.append(chunk)
+    return chunks
+
+
+def _chunk_id(filepath, source: str, index: int) -> str:
+    basis = str(filepath) if filepath else source
+    return f"{hashlib.md5(basis.encode()).hexdigest()[:8]}_{index}"
+
+
+def chunk_and_embed(text: str,
+                    source: str,
+                    embedder,
+                    filepath=None,
+                    folder=None) -> list[dict]:
+    """Chunk text, embed each chunk, return rows ready for write_embeddings_batch."""
+    chunks = chunk_text(text)
+    if not chunks:
+        return []
+    embeddings = embedder.encode(chunks).tolist()
+    rows = []
+    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
+        rows.append({
+            "id": _chunk_id(filepath, source, i),
+            "document": chunk,
+            "embedding": emb,
+            "source": source,
+            "type": "document",
+            "metadata": {
+                "source": source,
+                "filepath": str(filepath) if filepath else source,
+                "folder": folder,
+            },
+        })
+    return rows
+
+
+def write_embeddings_batch(conn, batch: list[dict]) -> int:
+    """Canonical upsert by id. created_at = NOW() is set server-side on insert only (kept on conflict). Commits."""
+    if not batch:
+        return 0
+    cur = conn.cursor()
+    for row in batch:
+        cur.execute("""
+            INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
+            VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
+            ON CONFLICT (id) DO UPDATE SET
+                document   = EXCLUDED.document,
+                embedding  = EXCLUDED.embedding,
+                source     = EXCLUDED.source,
+                metadata   = EXCLUDED.metadata
+        """, (row["id"], row["document"], row["embedding"],
+              row["source"], row["type"], json.dumps(row["metadata"])))
+    conn.commit()
+    return len(batch)
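
To make the windowing in chunk_text concrete, here is a small standalone sketch. The function mirrors the logic of chunk_text in the diff above; the sample inputs (1,000 and 950 placeholder words) are made up for illustration.

```python
DEFAULT_CHUNK_SIZE = 500
DEFAULT_CHUNK_OVERLAP = 50


def chunk_text(text, chunk_size=DEFAULT_CHUNK_SIZE, overlap=DEFAULT_CHUNK_OVERLAP):
    """Word-based chunking with a stride of chunk_size - overlap words."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
        start += chunk_size - overlap
    return chunks


# 1,000 numbered words: windows start every 450 words (500 - 50).
sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(sample)
sizes = [len(c.split()) for c in chunks]
first_words = [c.split()[0] for c in chunks]
print(sizes)        # [500, 500, 100]
print(first_words)  # ['w0', 'w450', 'w900']

# 950 words: the final window starts at word 900 and holds only 50 words,
# all of which already appear at the end of the previous chunk.
tail_sizes = [len(c.split()) for c in chunk_text(" ".join(["x"] * 950))]
print(tail_sizes)   # [500, 500, 50]
```

With the defaults, each chunk shares its last 50 words with the start of the next one. The second case shows that a short trailing chunk can be fully redundant with its predecessor, which may be worth knowing when interpreting chunk counts per document.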