aaron b09e35892c encoding.py: strip frontmatter from .md at extraction time
The capture endpoint (api.py:702, 833) writes Journal/Captures/*.md
files with a markdown-bold-style header block (`**type:** voice`,
`**modality:** audio`, `**status:** unprocessed`, optional `**media:**`
and `**project:**`) followed by a `---` separator. extract_text for .md
was a bare filepath.read_text, so every capture-derived chunk in
pgvector embedded the frontmatter as raw text, polluting retrieval.

Fix adds _strip_md_frontmatter, called only for the .md branch:

- Capture-style: optional leading H1 (preserved), then consecutive
  `**key:** value` lines (and blanks), terminated by `---`. The H1 is
  retained; the key/value block + separator are removed.
- YAML-style: the file's first non-empty line is `---`, and the block
  is terminated by the next `---`. Triggered only when no heading
  precedes the opening `---`, which guards against the common
  `# Title` + `---` (horizontal rule under heading) pattern seen in
  Journal/aaronai-architecture.md and four other Journal/*.md files.

Body `**bold:**` lines (e.g. `**Visual description:**` in image
captures) and body `---` horizontal rules are never touched: the scan
aborts as soon as a non-frontmatter line appears in the leading block.
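
For illustration only, the two rules and the abort condition can be
sketched roughly as below. This is not the code in encoding.py: the
function name, the regexes, and the handling of an unterminated YAML
block are assumptions made for the sketch.

```python
import re

# Hypothetical patterns; the real helper may match differently.
_KV_LINE = re.compile(r"^\*\*[^*:]+:\*\*")  # e.g. **type:** voice
_HEADING = re.compile(r"^#{1,6}\s")         # e.g. "# Title"


def strip_md_frontmatter_sketch(text: str) -> str:
    """Return text without a leading frontmatter block, if one is found."""
    lines = text.splitlines(keepends=True)
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1  # skip leading blank lines
    if i == len(lines):
        return text  # empty or whitespace-only input

    # YAML-style: only when the first non-empty line is `---`.
    if lines[i].strip() == "---":
        for j in range(i + 1, len(lines)):
            if lines[j].strip() == "---":
                return "".join(lines[j + 1:])
        return text  # assumption: leave unterminated blocks alone

    # Capture-style: optional H1, then key/value lines and blanks.
    head_end = i + 1 if _HEADING.match(lines[i]) else i
    seen_kv = False
    j = head_end
    while j < len(lines):
        stripped = lines[j].strip()
        if not stripped:
            j += 1
        elif _KV_LINE.match(stripped):
            seen_kv = True
            j += 1
        elif stripped == "---" and seen_kv:
            # Keep the heading, drop the key/value block and separator.
            return "".join(lines[:head_end]) + "".join(lines[j + 1:])
        else:
            break  # non-frontmatter line: abort, leave text untouched
    return text
```

Requiring at least one key/value line before accepting `---` as the
terminator is what leaves the `# Title` + `---` pattern untouched in
this sketch.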

briefing_generator_v2.py's split("---", 1) heuristic was reviewed and
not reused — fragile on substring matches and on documents with
multiple `---` rules.
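
A small demonstration of both failure modes, assuming the heuristic
keeps whatever follows the first `---` (the sample documents are made
up, and how briefing_generator_v2.py consumes the result is not shown
here):

```python
# Substring match: `---` inside a word splits the document mid-sentence.
inline = "# Title\n\nUses a state---machine design.\n\n---\n\nBody\n"
head, rest = inline.split("---", 1)
# head ends mid-sentence and rest starts with "machine design."

# Multiple rules: the first `---` is a horizontal rule under the
# heading, so the heading is treated as frontmatter and lost.
ruled = "# Title\n\n---\n\nIntro\n\n---\n\nMore\n"
fake_frontmatter, body = ruled.split("---", 1)
# fake_frontmatter is just the heading; body no longer contains it.
```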

Verified against:
- 2026-04-26-22-44-voice.md: frontmatter stripped, body retained, H1
  retained.
- 2026-04-27-04-34-image.md: frontmatter stripped, `**Visual
  description:**` and `**Voice annotation:**` body bold-headers
  retained, trailing `---` not consumed.
- Journal/aaronai-architecture.md (5 body `---` rules): output
  byte-identical to read_text (96101 chars).
- Synthetic YAML doc: stripped correctly when no leading heading.
- Synthetic plain markdown with body `---` rules: untouched.
- Empty input + heading-only file: untouched.

Existing capture chunks in pgvector retain polluted text; the fix only
affects future extractions. Backfill decision deferred — the cleanest
path is `touch -h Journal/Captures/*.md` to bump mtime and let the
watcher re-ingest naturally on the next cycle.
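
If a scripted backfill is preferred over the shell one-liner, a rough
Python equivalent might look like the sketch below. The function name
and the default pattern are hypothetical; it assumes, as stated above,
that the watcher keys on mtime.

```python
import os
from pathlib import Path


def bump_mtimes(directory: str, pattern: str = "*.md") -> int:
    """Set atime/mtime of matching files to now; return how many."""
    # Mirror `touch -h` where the platform allows it: update a symlink
    # itself rather than the file it points to.
    no_follow = os.utime in os.supports_follow_symlinks
    count = 0
    for path in sorted(Path(directory).glob(pattern)):
        if no_follow:
            os.utime(path, None, follow_symlinks=False)
        else:
            os.utime(path, None)
        count += 1
    return count
```

Called as `bump_mtimes("Journal/Captures")`, it touches only the files
directly inside that directory, not those in subdirectories.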

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:20:55 +00:00
2026-04-25 02:05:42 +00:00