aaronAI

T

aaron 9955c7e383 encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak

extract_blocks(filepath) is the new structured-extraction entry point, returning
list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk
back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading
prepended for retrieval context and stored in metadata).

- pptx: one block per slide. Slide title becomes block heading; speaker notes
  fold into the body. Image-only decks with title-only slides now produce
  heading-only chunks instead of being recorded as extraction failures.
- docx: deliberately single-block (back-compat). Heading-style section detection
  was implemented and rolled back: hand-formatted CVs are Normal-styled with
  bold-as-heading, and tying chunk boundaries to formatting choices would lock
  future-user into preserving those choices forever. Lexical + cross-encoder
  retrieval already handles substring matching inside blind-chunked CVs.
- pdf/txt/md: unchanged (single block, blind chunking).

Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it
as secondary sort key in _rerank so memory/journal snapshots prefer the latest
copy among near-duplicate content.

reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a
subset; previous hardcoded delete regex would have wiped both even with a
single-ext target.

2026-05-19 21:58:25 +00:00

deprecated

chore: archive deprecated chromadb and migration scripts

2026-04-28 00:15:46 +00:00

docs

docs: OCR install record for 2026-05-04

2026-05-04 16:58:30 +00:00

experiments

embeddings: backfill type and created_at (Improvement #2 part A)

2026-05-03 23:58:53 +00:00

scripts

encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak