aaronAI

aaron 8e61e4dedb docs: OCR install record for 2026-05-04

Tesseract OCR installed on the VPS (apt: tesseract-ocr, tesseract-ocr-eng).
Python wrappers added to venv (pip: pytesseract, ocrmypdf).

This commit is the install record only. No code changes: the async OCR
worker, capture-path integration, and backlog processing are separate
followups.

Smoke test results captured in the file:
- pytesseract on a slide image with text from GH Slicer Notes.pptx:
  126 chars in 0.22s. Renders.pptx, also in the 4-image-only-pptx
  cohort, was tried first but contains only rendered designs with no
  text; it is noted as a likely candidate for exclusion rather than
  OCR.
- ocrmypdf on a 4-page Lexmark CX510de scan from the Tenure/Dossier
  Scan 2022 set: 2270 non-whitespace chars in 3.72s (~0.93s/page).
  The output is readable English, and the timing is usable as the
  reference for the eventual async worker queue.
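
A minimal sketch of how a smoke test like the ones above could be
timed. The helper names (count_non_whitespace, timed_ocr,
pytesseract_ocr) are hypothetical and not taken from the repo, and
counting non-whitespace characters for the pytesseract case is an
assumption; the record only says "chars" there.

```python
import time
from typing import Callable, Tuple


def count_non_whitespace(text: str) -> int:
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def timed_ocr(ocr: Callable[[str], str], path: str) -> Tuple[int, float]:
    """Run ocr(path); return (non-whitespace char count, elapsed seconds)."""
    start = time.perf_counter()
    text = ocr(path)
    elapsed = time.perf_counter() - start
    return count_non_whitespace(text), elapsed


def pytesseract_ocr(image_path: str) -> str:
    """OCR a single image with pytesseract.

    Imports are done lazily so the timing helpers above can be used
    (and tested) on a machine without Tesseract installed.
    """
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang="eng")
```

Usage would look like `chars, seconds = timed_ocr(pytesseract_ocr,
"slide.png")`, where the image path is a placeholder. An ocrmypdf run
could be wrapped the same way by passing a callable that returns the
extracted text.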

Deferred decision: the project has no dependency manifest (no
requirements.txt, pyproject.toml, etc.). That is tracked as its own
followup rather than bolted onto this install. The capture-path
integration commit will be the natural point to address it if it
hasn't been resolved by then.
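
One possible starting point for that followup, sketched under
assumptions: the function name pin_lines is hypothetical, and the
requirements.txt-style "name==version" output is only one of the
manifest formats the project might choose. It reads the versions
actually installed in the venv rather than guessing them.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Iterable, List


def pin_lines(packages: Iterable[str]) -> List[str]:
    """Return 'name==version' lines for the given packages.

    Packages missing from the current environment are emitted as
    comment lines so the gap is visible instead of silently dropped.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"# {name} (not installed)")
    return lines


if __name__ == "__main__":
    # The two pip packages named in this install record.
    print("\n".join(pin_lines(["pytesseract", "ocrmypdf"])))
```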

2026-05-04 16:58:30 +00:00

deprecated

chore: archive deprecated chromadb and migration scripts

2026-04-28 00:15:46 +00:00

docs

docs: OCR install record for 2026-05-04

2026-05-04 16:58:30 +00:00

experiments

embeddings: backfill type and created_at (Improvement #2 part A)

2026-05-03 23:58:53 +00:00

scripts

api.py: enable PRAGMA foreign_keys=ON in _connect helper; clean up 2 message orphans