313c0f0341c2c3e0a025cd9850111ec045a810a9
6 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9955c7e383 |
encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak
extract_blocks(filepath) is the new structured-extraction entry point, returning
list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk
back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading
prepended for retrieval context and stored in metadata).
- pptx: one block per slide. Slide title becomes block heading; speaker notes
fold into the body. Image-only decks with title-only slides now produce
heading-only chunks instead of being recorded as extraction failures.
- docx: deliberately single-block (back-compat). Heading-style section detection
was implemented and rolled back: hand-formatted CVs are Normal-styled with
bold-as-heading, and tying chunk boundaries to formatting choices would lock
future users into preserving those choices forever. Lexical + cross-encoder
retrieval already handles substring matching inside blind-chunked CVs.
- pdf/txt/md: unchanged (single block, blind chunking).
Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it
as secondary sort key in _rerank so memory/journal snapshots prefer the latest
copy among near-duplicate content.
reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a
subset; previous hardcoded delete regex would have wiped both even with a
single-ext target.
|
||
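The str-or-blocks dispatch described in 9955c7e383 can be sketched roughly as follows. This is an illustration only, not the code in encoding.py: `plan_chunks`, `blind_split`, and the `CHUNK_SIZE`/`CHUNK_OVERLAP` values are hypothetical names and numbers, and the embedding step is left out entirely.

```python
# Hypothetical sizes; the real constants in encoding.py are not shown in the commit.
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 100


def blind_split(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += size - overlap
    return chunks


def plan_chunks(content):
    """Accept a plain str (blind chunking) or a list of {heading, text, kind} blocks."""
    if isinstance(content, str):
        return [{"text": piece, "metadata": {}} for piece in blind_split(content)]

    planned = []
    for block in content:
        heading = (block.get("heading") or "").strip()
        body = (block.get("text") or "").strip()
        # An oversize block is blind-split; a heading-only block still yields one chunk.
        pieces = blind_split(body) or [""]
        for piece in pieces:
            text = "\n".join(part for part in (heading, piece) if part)
            if not text:
                continue  # nothing to embed for a fully empty block
            planned.append({
                "text": text,  # heading prepended for retrieval context
                "metadata": {"heading": heading or None, "kind": block.get("kind")},
            })
    return planned
```

Under these assumptions a title-only slide produces a chunk whose text is just the heading, which matches the commit's note that such slides are no longer recorded as extraction failures.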
|
|
5b4a299414 |
encoding.py: write_embeddings_batch accepts commit parameter for transactional composition
Adds an optional commit=True parameter to write_embeddings_batch. When True
(default, matching prior behavior), the function commits the connection after
the per-row UPSERT loop. When False, the caller manages the transaction.
This unblocks fix #1 (pgvector-bypass paths) and fix #2 (watcher
two-transaction pattern), both of which need to compose embeddings writes with
other database writes in the same transaction. Without this lever, either fix
would require duplicating the UPSERT logic outside this helper or introducing a
second commit boundary inside an otherwise atomic operation.
No behavior change for existing callers: they all use the default commit=True
and continue working unchanged. |
||
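A minimal sketch of the commit=False composition pattern from 5b4a299414. It uses the standard-library sqlite3 module as a stand-in for Postgres so that it runs anywhere. The table names, the column layout, `ingest_atomically`, and the `INSERT OR REPLACE` statement are all assumptions made for illustration; the real helper issues a per-row UPSERT against pgvector.

```python
import sqlite3


def write_embeddings_batch(conn, rows, commit=True):
    """Write rows one by one; commit only if the caller has not taken over the transaction."""
    for row in rows:
        # INSERT OR REPLACE stands in for the real ON CONFLICT ... DO UPDATE upsert.
        conn.execute(
            "INSERT OR REPLACE INTO embeddings (id, text) VALUES (?, ?)",
            (row["id"], row["text"]),
        )
    if commit:
        conn.commit()
    return len(rows)


def ingest_atomically(conn, rows, source):
    """Hypothetical caller: embeddings plus a bookkeeping row succeed or fail together."""
    try:
        write_embeddings_batch(conn, rows, commit=False)
        conn.execute("INSERT INTO ingest_log (source) VALUES (?)", (source,))
        conn.commit()  # single commit boundary for both writes
    except Exception:
        conn.rollback()  # the embeddings written above are discarded as well
        raise
```

With commit=True inside the helper, a failure in the second write would leave the embeddings committed; passing commit=False keeps both writes inside one transaction.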
|
|
b09e35892c |
encoding.py: strip frontmatter from .md at extraction time
The capture endpoint (api.py:702, 833) writes Journal/Captures/*.md
files with a markdown-bold-style header block (`**type:** voice`,
`**modality:** audio`, `**status:** unprocessed`, optional `**media:**`
and `**project:**`) followed by a `---` separator. extract_text for .md
was a bare filepath.read_text, so every capture-derived chunk in
pgvector embedded the frontmatter as raw text, polluting retrieval.
Fix adds _strip_md_frontmatter, called only for the .md branch:
- Capture-style: optional leading H1 (preserved), then consecutive
`**key:** value` lines (and blanks), terminated by `---`. The H1 is
retained; the key/value block + separator are removed.
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
Only triggered when no heading precedes — guards against the common
`# Title` + `---` (horizontal rule under heading) pattern seen in
Journal/aaronai-architecture.md and four other Journal/*.md files.
Body `**bold:**` lines (e.g. `**Visual description:**` in image
captures) and body `---` horizontal rules are never touched: the scan
aborts as soon as a non-frontmatter line appears in the leading block.
briefing_generator_v2.py's split("---", 1) heuristic was reviewed and
not reused — fragile on substring matches and on documents with
multiple `---` rules.
Verified against:
- 2026-04-26-22-44-voice.md: frontmatter stripped, body retained, H1
retained.
- 2026-04-27-04-34-image.md: frontmatter stripped, `**Visual
description:**` and `**Voice annotation:**` body bold-headers
retained, trailing `---` not consumed.
- Journal/aaronai-architecture.md (5 body `---` rules): output
byte-identical to read_text (96101 chars).
- Synthetic YAML doc: stripped correctly when no leading heading.
- Synthetic plain markdown with body `---` rules: untouched.
- Empty input + heading-only file: untouched.
Existing capture chunks in pgvector retain polluted text; the fix only
affects future extractions. Backfill decision deferred — the cleanest
path is `touch -h Journal/Captures/*.md` to bump mtime and let the
watcher re-ingest naturally on the next cycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
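The scan rules in b09e35892c can be approximated as below. This is a reconstruction from the commit message, not the actual _strip_md_frontmatter: the function name, the key/value regex, and the handling of blank lines are assumptions.

```python
import re

# Matches a capture-style header line such as "**type:** voice" (assumed pattern).
_KV_LINE = re.compile(r"^\*\*[^*:]+:\*\*")


def strip_md_frontmatter_sketch(text):
    lines = text.splitlines(keepends=True)
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1  # skip leading blank lines
    if i == len(lines):
        return text  # empty or whitespace-only input

    # YAML-style: the first non-empty line is '---' and a later '---' closes it.
    if lines[i].strip() == "---":
        for j in range(i + 1, len(lines)):
            if lines[j].strip() == "---":
                return "".join(lines[j + 1:])
        return text  # unterminated block: leave the file alone

    # Capture-style: optional H1 (kept), then **key:** lines and blanks, then '---'.
    kept = lines[:i]
    if lines[i].startswith("# "):
        kept = lines[:i + 1]
        i += 1
    seen_kv = False
    for j in range(i, len(lines)):
        stripped = lines[j].strip()
        if not stripped:
            continue
        if _KV_LINE.match(stripped):
            seen_kv = True
            continue
        if stripped == "---" and seen_kv:
            return "".join(kept + lines[j + 1:])
        return text  # first non-frontmatter line: abort without changes
    return text
```

Requiring at least one key/value line before the separator is what leaves a `# Title` followed directly by a `---` rule untouched in this sketch.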
|
|
93c0d89308 |
encoding.py: extend docx and pptx extractors to walk tables, headers/footers, text-boxes, group shapes, and notes
The previous extractors walked only top-level body paragraphs (docx) and
top-level shape.text (pptx). Diagnostic on the 17 non-PDF "no_text" ingest
failures revealed that 13 docx files in the failure cohort have 100% of their
content in tables (paras_with_text=0, table_cells=6-108). These are syllabi,
rosters, rubrics, and homework worksheets structured as a single document-wide
table: high-value academic content the corpus was silently missing.
docx walker now covers:
- body paragraphs (existing)
- tables, including nested tables in cells (recursive helper)
- header and footer paragraphs per section
- text-box content via XPath against w:txbxContent (no first-class API in
python-docx; future-proofing, since none of the current failure cohort has
text-boxes)
pptx walker now covers:
- top-level shape text (existing)
- recursive descent into group shapes
- table cell text via shape.has_table / shape.table.iter_cells()
- speaker notes via slide.notes_slide.notes_text_frame.text
Out of scope: SmartArt diagrams, chart titles/labels, OLE objects, content
controls. None of the current failure cohort has these.
Recovery: 13 of 17 failures now ingest successfully. The 4 remaining are
image-only pptx files (Renders.pptx, Ribbon Cutting Slideshow.pptx, two GH
Slicer Notes variants; all PICTURE-shape decks with no text in any walkable
structure). They stay in ingest_failures unresolved, awaiting OCR or path
exclusion.
Side effect worth noting: the regression check on 4 known-good files that were
already producing embeddings showed all four gained content under the new
walker:
- Mod03 pptx: 23,993 to 57,462 chars (+33,469)
- Braskem Report docx: 33,050 to 38,977 chars (+5,927)
- DDF MA program docx: 37,210 to 47,603 chars (+10,393)
- SUNY PIF GRANT pptx: 22,259 to 23,546 chars (+1,287)
These files have been in the corpus all along with table or notes content
silently dropped. They will surface the additional content on next re-ingest,
improving retrieval quality for any future query that touches them.
Cleanup: ingest_file already calls resolve_ingest_failure on successful ingest,
so the 13 recovered files were marked resolved=TRUE during the retry pass. No
separate cleanup SQL was needed. |
||
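The recursive table descent mentioned in 93c0d89308 might look roughly like this. The walker relies only on the attribute names python-docx exposes (`paragraphs`, `tables`, `rows`, `cells`, `text`), so it is exercised here with SimpleNamespace stand-ins instead of a real document. `iter_table_text` and `extract_docx_body_text` are hypothetical names, and headers, footers, and text-boxes are omitted.

```python
from types import SimpleNamespace as NS


def iter_table_text(table):
    """Yield non-empty paragraph text from every cell, descending into nested tables."""
    for row in table.rows:
        for cell in row.cells:
            for para in cell.paragraphs:
                if para.text.strip():
                    yield para.text.strip()
            for nested in cell.tables:
                yield from iter_table_text(nested)


def extract_docx_body_text(doc):
    """Body paragraphs first, then all table content, joined by newlines."""
    parts = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        parts.extend(iter_table_text(table))
    return "\n".join(parts)


# Stand-in objects that mimic the python-docx attribute names used above.
def _cell(texts, tables=()):
    return NS(paragraphs=[NS(text=t) for t in texts], tables=list(tables))


def _table(rows):
    return NS(rows=[NS(cells=cells) for cells in rows])


inner = _table([[_cell(["Week 1"])]])
outer = _table([[_cell(["Syllabus"], [inner]), _cell(["Rubric"])]])
fake_doc = NS(paragraphs=[], tables=[outer])  # all content lives in a table
```

Here `fake_doc` mirrors the failure cohort described in the commit: zero body paragraphs with text, everything inside a document-wide table. With real python-docx objects, merged cells can show up more than once in `row.cells`, so a production walker may need de-duplication.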
|
|
7c7b649775 |
embeddings: enforce type/created_at on writers; manifests carry type_distribution (Improvement #2 part B+C)
Writers now enforce type and created_at:
- encoding.py: ValueError raised at write_embeddings_batch if row dict lacks
'type'. created_at remains SQL-supplied (NOW() server-side). ON CONFLICT
DO UPDATE now also rewrites type=EXCLUDED.type and preserves the original
created_at via COALESCE(embeddings.created_at, EXCLUDED.created_at) — a
re-ingest re-classifies type but does not overwrite a backfilled mtime.
- ingest_conversations.py: same assertion. ON CONFLICT intentionally keeps
EXCLUDED.created_at semantics (Aaron-AI conversation created_at tracks
convo.updated_at; re-runs should refresh).
- Column-level NOT NULL is not added; application-layer raise gives a
faster, more debuggable failure than a Postgres constraint error.
Retrieval propagates type into chunks:
- retrieve() SELECT now includes type; chunk dicts carry "type": etype.
- WHERE clause built dynamically from excluded_sources and the new
--type-filter CLI arg (experimental, default None, pgvector retrieval
only — Graphiti chunks have no embeddings.type to filter on).
- retrieve_graphiti unchanged; its chunks lack the type field.
Manifests carry type_distribution per stage:
- dream_pipeline writes stage_data[<stage>]["type_distribution"] for nrem,
early_rem, late_rem — a Counter over chunk types, filtering None so
Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute the
distribution. Pgvector chunks always carry type post-backfill; if None
appears, the backfill or writer enforcement has regressed.
Verification:
B1 force re-ingest of "Finite and infinite games -- James Carse.pdf":
all 84 chunks preserved created_at=2026-04-27T06:11:55Z
B2 missing-type assertion raises ValueError, no row leaked to embeddings
B3 ast.parse(*) clean; EXPLAIN renders for {no excl/no filter,
type_filter only, excl 2 elems, excl 1 elem edge case, both};
all five plans use HNSW index scan with correct Filter clauses
C1 retrieve("nrem") returns 8 chunks each carrying "type" key
C2 type_distribution = {'document': 5, 'chatgpt_conversation': 3} —
2 distinct types, 62.5/37.5 split (looser bar: >=2 types,
no single type >=90%)
The type and created_at fields are now load-bearing: every dream manifest
emits type_distribution per stage. Reverting the backfill makes the
distribution show NULLs at every dream run.
|
||
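A small sketch of two ideas from 7c7b649775: the application-layer type check and the per-stage type_distribution. The helper names (`require_type`, `type_distribution`, `passes_looser_bar`) and the exact error message are hypothetical. The thresholds come from the "looser bar" quoted in C2, and treating an empty or None type as missing is an assumption.

```python
from collections import Counter


def require_type(row):
    """Fail fast in the writer rather than waiting for a database constraint error."""
    if not row.get("type"):
        raise ValueError(f"embedding row is missing 'type': {row.get('source')!r}")
    return row


def type_distribution(chunks):
    """Count chunk types, skipping chunks without one (e.g. Graphiti chunks)."""
    return dict(Counter(c["type"] for c in chunks if c.get("type") is not None))


def passes_looser_bar(distribution, max_share=0.90, min_types=2):
    """At least `min_types` distinct types and no single type at or above `max_share`."""
    total = sum(distribution.values())
    if total == 0 or len(distribution) < min_types:
        return False
    return max(distribution.values()) / total < max_share
```

For the C2 case, five `document` chunks and three `chatgpt_conversation` chunks give a 62.5/37.5 split, which clears both conditions.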
|
|
1101bef226 |
scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11)
Consolidates four extract paths and two extract-chunk-embed-write pipelines into
a single shared encoding module. Fixes the embedder lifecycle divergence between
watcher and /api/reindex (no more 200MB reload per reindex click) and unifies
failure tracking so /api/reindex failures now surface in SettingsPanel "Ingest
Health".
New files:
- scripts/encoding.py: extract_text, chunk_text, chunk_and_embed,
write_embeddings_batch
- scripts/failures.py: record_ingest_failure, resolve_ingest_failure (shared by
watcher.py and ingest.py)
Refactored:
- scripts/watcher.py: drops local extract/chunk/embed implementations and
CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding and
failures. Now writes ingest_failures row on empty-text-extract (was silent
return 0).
- scripts/ingest.py: substantial rewrite. Exposes ingest_directory(folder,
embedder=None) for in-process invocation; CLI back-compat preserved via
ingest_folder wrapper. Module-level SentenceTransformer load removed.
- scripts/corpus_integrity.py: imports extract_text from encoding;
extract_text_for_retry function removed.
- scripts/api.py: /api/reindex rewritten with BackgroundTasks (uses module-level
embedder; no subprocess); new /api/reindex/status endpoint reading
~/aaronai/reindex_status.json; /api/corpus/retry imports extract_text from
encoding; INGEST_SCRIPT constant removed (dead after this refactor); 409
reentrance guard prevents double-click stomping.
Behavior changes:
- /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks
threadpool, doesn't block API thread.
- /api/reindex no longer reloads SentenceTransformer on each click.
- /api/reindex failures newly write to ingest_failures (visible in SettingsPanel
"Ingest Health"; badge will jump on first reindex).
- New embeddings rows always have created_at = NOW() (canonical, server-side).
- New embeddings rows always include metadata.folder field (None when not
derivable).
- /api/reindex returns 409 on second click while a job is running.
- New /api/reindex/status endpoint for polling.
Existing 9,815 NULL created_at rows remain unchanged; backfill is a separate
decision if desired.
199 insertions, 256 deletions across 6 files (codebase shrinks net).
Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11).
Pre-commit verification: BackgroundTasks already imported, sys.path resolves
correctly via script-path semantics, static import clean. |
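The 409 reentrance guard in 1101bef226 can be illustrated without FastAPI. Everything below is a hypothetical sketch: `ReindexGuard`, `handle_reindex_request`, the response bodies, and the 200 success code are assumptions, and `run_in_background` stands in for whatever scheduling call the endpoint uses (for example a BackgroundTasks `add_task`).

```python
import threading


class ReindexGuard:
    """Tracks whether a reindex job is in flight; safe to call from several threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = False

    def try_start(self):
        with self._lock:
            if self._running:
                return False
            self._running = True
            return True

    def finish(self):
        with self._lock:
            self._running = False


def handle_reindex_request(guard, run_in_background, job):
    """Return (status_code, body); reject a second click while a job is running."""
    if not guard.try_start():
        return 409, {"detail": "reindex already running"}

    def guarded_job():
        try:
            job()
        finally:
            guard.finish()  # release the guard even if the job raises

    run_in_background(guarded_job)
    return 200, {"status": "started"}
```

Releasing the guard in a `finally` block matters in this sketch: without it, a failed reindex would leave the endpoint answering 409 until the process restarts.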