Tesseract OCR installed on the VPS (apt: tesseract-ocr, tesseract-ocr-eng).
Python wrappers added to venv (pip: pytesseract, ocrmypdf).
This commit is the install record only; it changes no code. The async
OCR worker, capture-path integration, and backlog processing are
separate followups.
Smoke test results captured in the file:
- pytesseract on a text-bearing slide image from GH Slicer Notes.pptx:
126 chars in 0.22s. Renders.pptx, also in the 4-image-only-pptx cohort,
was tried first but contains only rendered designs with no text, so it
is noted as a likely candidate for exclusion rather than OCR.
- ocrmypdf on a 4-page Lexmark CX510de scan from the Tenure/Dossier
Scan 2022 set: 2270 non-whitespace chars in 3.72s (~0.93s/page).
The output is real, readable English, and the timing is usable as the
reference for the eventual async worker queue.
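A minimal sketch of how the two metrics above (character count and
per-page time) could be reproduced. It is not the script used for this
smoke test. The helper names and the image path argument are
hypothetical, and counting non-whitespace characters with
str.isspace() is an assumption about how the figures were derived.

```python
import time
from typing import Tuple


def non_whitespace_chars(text: str) -> int:
    """Count characters that are not whitespace (assumed counting rule)."""
    return sum(1 for ch in text if not ch.isspace())


def seconds_per_page(elapsed_s: float, pages: int) -> float:
    """Average wall-clock seconds per page for a multi-page OCR run."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return elapsed_s / pages


def time_image_ocr(image_path: str) -> Tuple[str, float]:
    """OCR one image with pytesseract; return (text, elapsed seconds).

    image_path is a hypothetical path to an extracted slide image.
    """
    # Imported lazily so the pure helpers above work without the OCR stack.
    import pytesseract
    from PIL import Image

    start = time.perf_counter()
    with Image.open(image_path) as img:
        text = pytesseract.image_to_string(img)
    return text, time.perf_counter() - start
```

With the figures recorded above, seconds_per_page(3.72, 4) gives the
~0.93s/page reference value.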
Deferred decision: the project has no dependency manifest (no
requirements.txt, pyproject.toml, etc.). Tracking that as its own
followup rather than bolting it onto this install. The capture-path
integration commit will be the natural point to address it if it
hasn't been resolved by then.
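Until that followup lands, the installed wrapper versions could be
recorded with something like the sketch below. It is not part of this
commit, pinned_lines is a hypothetical helper, and the requirements-style
"name==version" output format is an assumption about what the eventual
manifest will look like.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Iterable, List


def pinned_lines(packages: Iterable[str]) -> List[str]:
    """Return 'name==version' lines for packages found in this environment."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Keep a visible marker instead of silently dropping the package.
            lines.append(f"# {name}: not installed in this environment")
    return lines


if __name__ == "__main__":
    # The two pip packages named in this install record.
    print("\n".join(pinned_lines(["pytesseract", "ocrmypdf"])))
```

Run inside the venv, this prints one line per package that can seed
whichever manifest format is chosen later.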