scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11)
Consolidates four extract paths and two extract-chunk-embed-write pipelines into a single shared encoding module. Fixes the embedder lifecycle divergence between watcher and /api/reindex (no more 200MB reload per reindex click) and unifies failure tracking so /api/reindex failures now surface in SettingsPanel "Ingest Health".

New files:
- scripts/encoding.py: extract_text, chunk_text, chunk_and_embed, write_embeddings_batch
- scripts/failures.py: record_ingest_failure, resolve_ingest_failure (shared by watcher.py and ingest.py)

Refactored:
- scripts/watcher.py: drops local extract/chunk/embed implementations and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding and failures. Now writes an ingest_failures row on empty-text-extract (was silent return 0).
- scripts/ingest.py: substantial rewrite. Exposes ingest_directory(folder, embedder=None) for in-process invocation; CLI back-compat preserved via ingest_folder wrapper. Module-level SentenceTransformer load removed.
- scripts/corpus_integrity.py: imports extract_text from encoding; extract_text_for_retry function removed.
- scripts/api.py: /api/reindex rewritten with BackgroundTasks (uses module-level embedder; no subprocess); new /api/reindex/status endpoint reading ~/aaronai/reindex_status.json; /api/corpus/retry imports extract_text from encoding; INGEST_SCRIPT constant removed (dead after this refactor); 409 reentrance guard prevents double-click stomping.

Behavior changes:
- /api/reindex no longer calls subprocess.Popen; it runs in the FastAPI BackgroundTasks threadpool and doesn't block the API thread.
- /api/reindex no longer reloads SentenceTransformer on each click.
- /api/reindex failures newly write to ingest_failures (visible in SettingsPanel "Ingest Health"; the badge will jump on first reindex).
- New embeddings rows always have created_at = NOW() (canonical, server-side).
- New embeddings rows always include a metadata.folder field (None when not derivable).
- /api/reindex returns 409 on a second click while a job is running.
- New /api/reindex/status endpoint for polling.

Existing 9,815 NULL created_at rows remain unchanged; backfill is a separate decision if desired.

199 insertions, 256 deletions across 6 files (codebase shrinks net).

Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11).

Pre-commit verification: BackgroundTasks already imported, sys.path resolves correctly via script-path semantics, static import clean.
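
The 409 reentrance guard and the status file described in this message can be illustrated with a stdlib-only sketch. This is not the code in scripts/api.py: it uses a plain thread in place of FastAPI BackgroundTasks, and the names start_reindex, write_status, read_status, and STATUS_PATH, as well as the 202 code returned on acceptance, are assumptions made for illustration.

```python
import json
import tempfile
import threading
from pathlib import Path

# Hypothetical location for this sketch; the commit itself reads
# ~/aaronai/reindex_status.json.
STATUS_PATH = Path(tempfile.gettempdir()) / "reindex_status_sketch.json"

# One process-wide lock marks "a reindex job is currently running".
_reindex_lock = threading.Lock()


def write_status(state, detail=""):
    """Persist the job state so a status endpoint could poll it."""
    STATUS_PATH.write_text(json.dumps({"state": state, "detail": detail}))


def read_status():
    """Return the last persisted state, or 'idle' if nothing was written."""
    if not STATUS_PATH.exists():
        return {"state": "idle", "detail": ""}
    return json.loads(STATUS_PATH.read_text())


def start_reindex(job):
    """Try to start job in the background.

    Returns (status_code, worker): 202 and the worker thread when accepted
    (the 202 is an assumption), or 409 and None when a job is still running.
    """
    # Non-blocking acquire: a second caller is refused instead of queued.
    if not _reindex_lock.acquire(blocking=False):
        return 409, None

    def run():
        try:
            write_status("running")
            job()
            write_status("done")
        except Exception as exc:
            write_status("failed", str(exc))
        finally:
            # Release even on failure so later requests are accepted again.
            _reindex_lock.release()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return 202, worker
```

Releasing the lock in a finally block is what keeps a crashed job from leaving the endpoint permanently stuck on 409.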
@@ -23,6 +23,9 @@ from datetime import datetime
 import psycopg2
 from dotenv import load_dotenv
 
+sys.path.insert(0, str(Path(__file__).parent))
+from encoding import extract_text
+
 load_dotenv(Path.home() / "aaronai" / ".env", override=True)
 
 NEXTCLOUD_PATH = "/home/aaron/nextcloud/data/data/aaron/files"
@@ -103,28 +106,6 @@ def get_ingest_failures():
     return failures
 
 
-def extract_text_for_retry(filepath):
-    path = Path(filepath)
-    suffix = path.suffix.lower()
-    try:
-        if suffix == ".docx":
-            from docx import Document as D
-            return "\n".join(p.text for p in D(path).paragraphs if p.text.strip())
-        elif suffix == ".pdf":
-            from pypdf import PdfReader
-            return "".join(p.extract_text() + "\n" for p in PdfReader(path).pages if p.extract_text())
-        elif suffix == ".pptx":
-            from pptx import Presentation
-            prs = Presentation(path)
-            return "\n".join(shape.text for slide in prs.slides for shape in slide.shapes
-                             if hasattr(shape, "text") and shape.text.strip())
-        elif suffix in {".txt", ".md"}:
-            return path.read_text(encoding="utf-8", errors="ignore")
-    except Exception as e:
-        print(f"WARNING: extraction failed {path.name}: {e}", file=sys.stderr)
-    return ""
-
-
 def queue_for_retry(source, full_text, filepath):
     try:
         pg = get_pg()
@@ -188,7 +169,7 @@ def run_reconciliation(fix=False):
     if fix and neither:
         print(f"Auto-queuing {len(neither)} gap files...")
         for finfo in neither:
-            text = extract_text_for_retry(finfo["filepath"])
+            text = extract_text(Path(finfo["filepath"]))
             if text.strip():
                 if queue_for_retry(finfo["source"], text, finfo["filepath"]):
                     auto_queued.append(finfo["source"])
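
The call site in the last hunk passes a Path and immediately calls text.strip(), so the shared extractor presumably returns a string for every input. The body of scripts/encoding.py is not part of this diff; what follows is only a minimal sketch of one way such a function could dispatch on file suffix. The names extract_text_sketch, EXTRACTORS, and _read_plain are hypothetical, and only the plain-text formats are wired in.

```python
import sys
from pathlib import Path


def _read_plain(path: Path) -> str:
    """Read .txt/.md files, ignoring undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="ignore")


# Hypothetical registry: readers for .docx, .pdf, and .pptx would be
# registered here in the same way.
EXTRACTORS = {".txt": _read_plain, ".md": _read_plain}


def extract_text_sketch(path: Path) -> str:
    """Return the extracted text, or "" when unsupported or on failure."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        # Unsupported suffix: return an empty string rather than None so
        # callers can safely call .strip() on the result.
        return ""
    try:
        return extractor(path)
    except Exception as exc:
        print(f"WARNING: extraction failed {path.name}: {exc}", file=sys.stderr)
        return ""
```

Always returning a string keeps callers such as run_reconciliation free of None checks; an empty result then simply means there is nothing to queue.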