encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak

extract_blocks(filepath) is the new structured-extraction entry point, returning list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading prepended for retrieval context and stored in metadata).

- pptx: one block per slide. Slide title becomes block heading; speaker notes fold into the body. Image-only decks with title-only slides now produce heading-only chunks instead of being recorded as extraction failures.
- docx: deliberately single-block (back-compat). Heading-style section detection was implemented and rolled back: hand-formatted CVs are Normal-styled with bold-as-heading, and tying chunk boundaries to formatting choices would lock future-user into preserving those choices forever. Lexical + cross-encoder retrieval already handles substring matching inside blind-chunked CVs.
- pdf/txt/md: unchanged (single block, blind chunking).

Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it as secondary sort key in _rerank so memory/journal snapshots prefer the latest copy among near-duplicate content.

reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a subset; previous hardcoded delete regex would have wiped both even with a single-ext target.
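
The recency tiebreak can be illustrated with a minimal sketch. This is not the actual _rerank code, which is not part of the hunks shown here. It assumes each candidate is a dict with a float "score" and a timezone-aware "created_at"; the function name rerank_with_recency and the "id" field are hypothetical. Note that a plain secondary sort key only separates candidates whose scores are exactly equal, as is likely for identical copies of a snapshot; whether the real implementation also buckets near-equal scores is not stated.

```python
from datetime import datetime, timezone


def rerank_with_recency(candidates, top_k=5):
    """Sort by score (descending), then by created_at (descending).

    Hypothetical helper: 'score' and 'created_at' are assumed field names.
    """
    ordered = sorted(
        candidates,
        # Tuples compare element by element, so created_at is only
        # consulted when two scores are exactly equal.
        key=lambda c: (c["score"], c["created_at"]),
        reverse=True,
    )
    return ordered[:top_k]


old = {"id": "old", "score": 0.90,
       "created_at": datetime(2026, 1, 1, tzinfo=timezone.utc)}
new = {"id": "new", "score": 0.90,
       "created_at": datetime(2026, 5, 1, tzinfo=timezone.utc)}
best = {"id": "best", "score": 0.95,
        "created_at": datetime(2025, 1, 1, tzinfo=timezone.utc)}

ranked = rerank_with_recency([old, new, best])
```

With this input the higher-scoring candidate stays first regardless of age, and the newer of the two equal-score copies is placed ahead of the older one.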
2026-05-19 21:58:25 +00:00
parent 50b97e2998
commit 9955c7e383
5 changed files with 187 additions and 69 deletions
@@ -15,7 +15,7 @@ from dotenv import load_dotenv
 import psycopg2
 from sentence_transformers import SentenceTransformer

-from encoding import extract_text, chunk_and_embed, write_embeddings_batch, SUPPORTED
+from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
 from failures import (
    record_ingest_failure as _record_failure_sql,
    resolve_ingest_failure as _resolve_failure_sql,
@@ -83,8 +83,11 @@ def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
        return 0
    if filepath.suffix.lower() not in SUPPORTED:
        return 0
-    text = extract_text(filepath)
-    if not text.strip():
+    blocks = extract_blocks(filepath)
+    if not blocks or not any(
+        (b.get("text") or "").strip() or (b.get("heading") or "").strip()
+        for b in blocks
+    ):
        _record_failure(filepath, "Text extraction failed or empty")
        return 0
    folder_rel = None
@@ -94,7 +97,7 @@ def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
        except ValueError:
            pass
    try:
-        rows = chunk_and_embed(text, filepath.name, embedder,
+        rows = chunk_and_embed(blocks, filepath.name, embedder,
                               filepath=filepath, folder=folder_rel)
    except Exception as e:
        _record_failure(filepath, f"Embedding failed: {e}")
@@ -113,7 +116,11 @@ def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
    print(f"  Indexed {len(rows)} chunks: {filepath.name}")
    _resolve_failure(filepath.name)
    if not os.getenv("SKIP_STAGE2_ENQUEUE"):
-        enqueue_stage2(filepath.name, text)
+        full_text = "\n".join(
+            f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
+            for b in blocks
+        )
+        enqueue_stage2(filepath.name, full_text)
    return len(rows)
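
The two block-handling steps in the hunks above (the emptiness check and the flattening for enqueue_stage2) can be sketched as standalone functions. This is an illustrative variant, not the committed code: the helper names has_content and flatten_blocks are hypothetical, the "slide" value for kind is an assumed example, and the flattening here is slightly more defensive than the diff in that it tolerates a missing or None text on a block that has a heading.

```python
def has_content(blocks):
    """Return True if any block has non-blank text or a non-blank heading."""
    return any(
        (b.get("text") or "").strip() or (b.get("heading") or "").strip()
        for b in blocks or []
    )


def flatten_blocks(blocks):
    """Join blocks into one string, placing each heading above its body."""
    parts = []
    for b in blocks:
        heading = (b.get("heading") or "").strip()
        text = b.get("text") or ""  # treat None or missing text as empty
        parts.append(f"{heading}\n{text}" if heading else text)
    return "\n".join(parts)


# Hypothetical deck: a normal slide, a title-only slide, and an untitled slide.
blocks = [
    {"heading": "Intro", "text": "Hello", "kind": "slide"},
    {"heading": "Title only", "text": None, "kind": "slide"},
    {"heading": None, "text": "Body", "kind": "slide"},
]
full_text = flatten_blocks(blocks)
```

A deck made up only of title-only slides passes has_content, which matches the commit's intent that such decks yield heading-only chunks rather than extraction failures.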