Tesseract OCR installed on the VPS (apt: tesseract-ocr, tesseract-ocr-eng). Python wrappers added to venv (pip: pytesseract, ocrmypdf).

This commit is the install record only. No code change: async OCR worker, capture path integration, and backlog processing are separate followups.

Smoke test results captured in the file:
- pytesseract on a textual GH Slicer Notes.pptx slide image: 126 chars in 0.22s (Renders.pptx, also in the 4-image-only-pptx cohort, was tried first but contains only rendered designs with no text; noted as a likely candidate for exclusion rather than OCR).
- ocrmypdf on a 4-page Lexmark CX510de scan from the Tenure/Dossier Scan 2022 set: 2270 non-whitespace chars in 3.72s (~0.93s/page). Real readable English; usable as the reference timing for the eventual async worker queue.

Deferred decision: project has no dependency manifest (no requirements.txt, pyproject.toml, etc). Tracking that as its own followup rather than bolting it onto this install. The capture-path integration commit will be the natural point to address it if it hasn't been resolved by then.
OCR install record — 2026-05-04
Machine
- Host: aaronai-01 (VPS)
- OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)
apt packages installed
| package | version | source |
|---|---|---|
| tesseract-ocr | 5.3.4-1build5 | noble |
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |
pip packages installed (into /home/aaron/aaronai/venv)
| package | version |
|---|---|
| pytesseract | 0.3.13 |
| ocrmypdf | 17.4.2 |
Dependencies pulled in by the two installs above, direct and transitive (also new in venv): pikepdf 10.5.1, pdfminer-six 20260107, pypdfium2 5.7.1, img2pdf 0.6.3, pi-heif 1.3.0, cryptography 47.0.0, cffi 2.0.0, pycparser 3.0, Deprecated 1.3.1, deprecation 2.1.0, defusedxml 0.7.1, fonttools 4.62.1, fpdf2 2.8.7, uharfbuzz 0.54.1, wrapt 2.1.2, pluggy 1.6.0. pillow was already at 12.2.0.
Smoke test 1 — tesseract --version
tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found AVX512BW
Found AVX512F
Smoke test 2 — tesseract --list-langs
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd
Smoke test 3 — pytesseract on a slide image
- Input pptx: /home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx
- Extracted image: ppt/media/image1.PNG (1768×504 PNG)
- Wall-clock: 0.220s
- Chars extracted: 126
- First 200 chars:
Generates the Bounding Box for NESS
round(x, 4), round(y, 4), round(z, 4), round(a, 4))
Format ("HSS5 X(0} ¥(1} W(2} H(3)",
Note: the first image in Renders.pptx (image1.jpg, 640×480) returned 0 chars on the first attempt. A sample of 15 images from Renders.pptx showed that all 15 are pure rendered designs or photographs with no text, so the test switched to GH Slicer Notes.pptx (from the original 4-image-only-pptx candidate list), where image1.PNG is a textual code screenshot. Tesseract behavior is correct in both cases; Renders.pptx is not a useful OCR test target because it contains no text.

The code screenshot shows some character-recognition noise (e.g. ¥(1} for Y(1), and misread parentheses/braces). That is acceptable for a baseline smoke test; production tuning is a worker-design concern.
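The procedure above (pull one embedded image out of the pptx, run pytesseract on it, time the call) can be sketched roughly as follows. This is not the script that produced the numbers in this record; `list_pptx_images`, `ocr_pptx_image`, and the injectable `ocr` parameter are hypothetical names chosen for illustration. It relies only on the fact that a .pptx is a zip archive with media under ppt/media/.

```python
import io
import time
import zipfile

# Extensions treated as images; an assumption, extend as needed.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff")


def list_pptx_images(pptx_path):
    """Return the sorted names of embedded images under ppt/media/ in a .pptx."""
    with zipfile.ZipFile(pptx_path) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if name.startswith("ppt/media/") and name.lower().endswith(IMAGE_SUFFIXES)
        )


def ocr_pptx_image(pptx_path, member, ocr=None):
    """OCR one embedded image and return (text, wall_clock_seconds).

    `ocr` is a hypothetical hook taking raw image bytes and returning text;
    when omitted, pytesseract.image_to_string is used.
    """
    if ocr is None:
        # Imported lazily so that listing images works without the OCR stack.
        import pytesseract
        from PIL import Image

        def ocr(data):
            return pytesseract.image_to_string(Image.open(io.BytesIO(data)), lang="eng")

    with zipfile.ZipFile(pptx_path) as zf:
        data = zf.read(member)

    start = time.perf_counter()
    text = ocr(data)
    return text, time.perf_counter() - start
```

Calling `ocr_pptx_image(path, "ppt/media/image1.PNG")` on the pptx listed above would correspond to this smoke test. An empty string back from an image-only deck such as Renders.pptx is expected behavior rather than a failure.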
Smoke test 4 — ocrmypdf on a Lexmark CX510de scan
- Input PDF: /home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
- Command: ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf
- Wall-clock: 3.72s (whole PDF, 4 pages)
- Exit: 0
- After OCR, pdftotext on the output produced 2347 chars (2270 non-whitespace).
- First 200 chars of OCR'd text:
nN New Paltz
STATE UNIVERSITY OF NEW YORK
The Honors Program
May 30, 2017
Dear Aaron,
Thank you for serving as a reader for Caryn Byllott’s thesis on "Recall/Reconstruct: The Exploration of
Memory
Real readable English. The leading "nN" is most likely the letterhead logo glyph misread as text; otherwise clean. ~0.93s/page on this scan, which is the reference number for sizing the async worker queue.
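For repeatability, the same measurement could be scripted along these lines. This is a sketch, not the procedure actually used. It shells out to the `ocrmypdf` command shown above and to `pdftotext` (assumed to be on PATH via poppler-utils); the helper names `build_ocrmypdf_cmd`, `count_chars`, and `run_smoke` are hypothetical.

```python
import subprocess
import time


def build_ocrmypdf_cmd(input_pdf, output_pdf, lang="eng"):
    """Build the same command line used in this smoke test."""
    return ["ocrmypdf", "--skip-text", "-l", lang, str(input_pdf), str(output_pdf)]


def count_chars(text):
    """Return (total_chars, non_whitespace_chars) for a string."""
    return len(text), sum(1 for ch in text if not ch.isspace())


def run_smoke(input_pdf, output_pdf):
    """Run ocrmypdf, then pdftotext; return (exit_code, seconds, char_counts)."""
    start = time.perf_counter()
    result = subprocess.run(build_ocrmypdf_cmd(input_pdf, output_pdf), check=False)
    elapsed = time.perf_counter() - start

    text = ""
    if result.returncode == 0:
        # pdftotext writes to stdout when the output file argument is "-".
        text = subprocess.run(
            ["pdftotext", str(output_pdf), "-"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    return result.returncode, elapsed, count_chars(text)
```

Only the ocrmypdf call is timed, so the elapsed value is comparable to the wall-clock figure reported here; the pdftotext pass is excluded.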
Reference timing
| operation | input size | wall-clock |
|---|---|---|
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |
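The per-page figure is plain arithmetic (3.72s over 4 pages), and the same rate can be pushed through a rough backlog estimate. The sketch below is illustrative only: the backlog count of 75 PDFs comes from the Followups list, but the average page count per PDF is not recorded anywhere in this file, so `avg_pages_per_pdf` is a placeholder assumption. Dividing by `workers` assumes ideal parallel scaling, which real hardware will not quite deliver.

```python
def seconds_per_page(wall_clock_s, pages):
    """Average OCR time per page for one measured run."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return wall_clock_s / pages


def estimate_backlog_seconds(pdf_count, avg_pages_per_pdf, sec_per_page, workers=1):
    """Rough wall-clock estimate for a backlog.

    ASSUMPTION: work divides evenly across `workers` with no contention.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    total_pages = pdf_count * avg_pages_per_pdf
    return total_pages * sec_per_page / workers


if __name__ == "__main__":
    rate = seconds_per_page(3.72, 4)  # measured run from this record
    # ASSUMPTION: 4 pages per PDF, borrowed from the single sample scan.
    estimate = estimate_backlog_seconds(75, 4, rate, workers=1)
    print(f"{rate:.2f} s/page, backlog estimate {estimate:.0f} s")
```

Once real page counts for the backlog are known, they should replace the placeholder before the estimate is used to size anything.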
Deferred — project dep-tracking
The project has no dependency manifest on disk: no requirements.txt, pyproject.toml, setup.py, Pipfile, or poetry.lock. Pip deps live only in venv/. The OCR install adds pytesseract and ocrmypdf (plus their transitive closure listed above) to that untracked venv state.
This commit does not introduce a manifest. Tracking the dep-manifest decision as its own followup; the natural deadline is the capture-path integration commit, where import pytesseract will become load-bearing in the repo. If the manifest question is unresolved by then, that integration commit is the right place to address it.
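If the followup settles on a pinned requirements-style manifest (only one of the options named above, and not decided here), the pins could be generated from the venv itself rather than typed by hand. A minimal sketch, assuming that choice; `pinned_lines` and its `version_of` parameter are hypothetical names, while `importlib.metadata.version` is the standard-library lookup.

```python
from importlib import metadata


def pinned_lines(packages, version_of=metadata.version):
    """Return 'name==version' lines for the given distribution names.

    `version_of` defaults to importlib.metadata.version, which reads the
    currently active environment (the venv, if it is activated).
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version_of(name)}")
        except metadata.PackageNotFoundError:
            # Leave a visible marker instead of silently dropping the package.
            lines.append(f"# {name}: not installed in this environment")
    return lines


if __name__ == "__main__":
    # The two top-level packages added by this install.
    print("\n".join(pinned_lines(["pytesseract", "ocrmypdf"])))
```

Listing only the top-level packages keeps the manifest short and leaves the rest of the dependency set to the resolver. A full `pip freeze` would pin everything. Which trade-off fits is part of the deferred decision.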
Followups
- Async OCR worker (separate session). Use the reference timing above to size the queue.
- Capture path integration: phone-camera images → pytesseract.image_to_string → existing chunk/embed pipeline.
- Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx (Renders.pptx, Ribbon Cutting Slideshow.pptx, two GH Slicer Notes variants). Per the smoke results, Renders.pptx is unlikely to yield useful OCR text (it is rendered-design content, not scanned documents) and may need exclusion rather than processing.
- Project dep-manifest decision (see Deferred section above).
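As a starting point for the async worker followup, one possible shape is an `asyncio.Queue` drained by a small pool of workers, with the blocking OCR call pushed to a thread. The worker design is explicitly out of scope for this commit, so everything here is a hypothetical sketch: `ocr_worker`, `run_ocr_queue`, and `tesseract_ocr` are illustrative names, and hand-off to the existing chunk/embed pipeline is left out because its interface is not described in this file.

```python
import asyncio


async def ocr_worker(queue, results, ocr):
    """Pull paths off the queue forever and store OCR text (or the error)."""
    while True:
        path = await queue.get()
        try:
            # OCR is blocking and CPU-heavy, so keep it off the event loop.
            results[path] = await asyncio.to_thread(ocr, path)
        except Exception as exc:
            # Record the failure and continue with the next item.
            results[path] = exc
        finally:
            queue.task_done()


async def run_ocr_queue(paths, ocr, workers=2):
    """OCR every path using `workers` concurrent workers; return {path: result}."""
    queue = asyncio.Queue()
    results = {}
    for path in paths:
        queue.put_nowait(path)

    tasks = [asyncio.create_task(ocr_worker(queue, results, ocr)) for _ in range(workers)]
    await queue.join()  # returns once every queued item has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


def tesseract_ocr(path):
    """Default OCR function for image files, using pytesseract."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang="eng")
```

Something like `asyncio.run(run_ocr_queue(image_paths, tesseract_ocr, workers=2))` would drive it. Passing `ocr` as a parameter means a PDF path could later use an ocrmypdf-based function without changing the queue logic.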