aaronAI/docs/ocr-install-2026-05-04.md
aaron 8e61e4dedb docs: OCR install record for 2026-05-04
Tesseract OCR installed on the VPS (apt: tesseract-ocr, tesseract-ocr-eng).
Python wrappers added to venv (pip: pytesseract, ocrmypdf).

This commit is the install record only. No code change — async OCR
worker, capture path integration, and backlog processing are separate
followups.

Smoke test results captured in the file:
- pytesseract on a textual GH Slicer Notes.pptx slide image: 126 chars
  in 0.22s (Renders.pptx, also in the 4-image-only-pptx cohort, was
  tried first but contains only rendered designs with no text — noted
  as a likely candidate for exclusion rather than OCR).
- ocrmypdf on a 4-page Lexmark CX510de scan from the Tenure/Dossier
  Scan 2022 set: 2270 non-whitespace chars in 3.72s (~0.93s/page).
  Real readable English; usable as the reference timing for the
  eventual async worker queue.

Deferred decision: project has no dependency manifest (no
requirements.txt, pyproject.toml, etc). Tracking that as its own
followup rather than bolting it onto this install. The capture-path
integration commit will be the natural point to address it if it
hasn't been resolved by then.
2026-05-04 16:58:30 +00:00


OCR install record — 2026-05-04

Machine

  • Host: aaronai-01 (VPS)
  • OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)

apt packages installed

| package | version | source |
|---|---|---|
| tesseract-ocr | 5.3.4-1build5 | noble |
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |

pip packages installed (into /home/aaron/aaronai/venv)

| package | version |
|---|---|
| pytesseract | 0.3.13 |
| ocrmypdf | 17.4.2 |

Dependencies pulled in, directly or transitively, by the two installs above (also new in venv): pikepdf 10.5.1, pdfminer-six 20260107, pypdfium2 5.7.1, img2pdf 0.6.3, pi-heif 1.3.0, cryptography 47.0.0, cffi 2.0.0, pycparser 3.0, Deprecated 1.3.1, deprecation 2.1.0, defusedxml 0.7.1, fonttools 4.62.1, fpdf2 2.8.7, uharfbuzz 0.54.1, wrapt 2.1.2, pluggy 1.6.0. pillow was already present at 12.2.0.

Smoke test 1 — tesseract --version

tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX512BW
 Found AVX512F

Smoke test 2 — tesseract --list-langs

List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd
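
The language check above could be scripted so a later worker fails fast when a traineddata file is missing. The following is a minimal sketch, not part of this install: the function names `parse_list_langs` and `missing_langs` are hypothetical, and the fallback to stderr is an assumption made because some older tesseract builds are reported to print the list there.

```python
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Return language codes from `tesseract --list-langs` output.

    The first non-empty line is the "List of available languages ..." header;
    every following non-empty line is one language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


def missing_langs(required, runner=subprocess.run):
    """Return the required language codes that tesseract does not report."""
    proc = runner(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: fall back to stderr if stdout is empty.
    available = set(parse_list_langs(proc.stdout or proc.stderr))
    return [lang for lang in required if lang not in available]
```

On this host, `missing_langs(["eng", "osd"])` would be expected to return an empty list, given the output recorded above.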

Smoke test 3 — pytesseract on a slide image

  • Input pptx: /home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx
  • Extracted image: ppt/media/image1.PNG (1768×504 PNG)
  • Wall-clock: 0.220s
  • Chars extracted: 126
  • First 200 chars (the whole output, since only 126 chars were extracted):
Generates the Bounding Box for NESS

round(x, 4), round(y, 4), round(z, 4), round(a, 4))

Format ("HSS5 X(0} ¥(1} W(2} H(3)",

Note: the first image in Renders.pptx (image1.jpg, 640×480) returned 0 chars on the first attempt. A sample of 15 images from Renders.pptx showed that all 15 are rendered designs or photographs with no text, so the test switched to GH Slicer Notes.pptx (from the original list of 4 image-only pptx candidates), where image1.PNG is a screenshot of code. Tesseract behaves correctly in both cases; Renders.pptx is simply not a useful OCR test target because it contains no text. The code screenshot shows some character-recognition noise (e.g. ¥(1} for Y(1), and parentheses and braces confused with each other), which is acceptable for a baseline smoke test; production tuning is a worker-design concern.
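
The smoke test above can be reproduced roughly as follows. This is a sketch under stated assumptions rather than the script that was actually run: the helper names (`list_media`, `ocr_pptx_image`, `is_exclusion_candidate`) are hypothetical, and the image is written to a temporary file so that `pytesseract.image_to_string` can be given a file path. A .pptx is a zip archive, so the embedded images can be read with the standard library.

```python
import tempfile
import time
import zipfile
from pathlib import Path


def list_media(pptx_path):
    """List the members under ppt/media/ in a .pptx archive."""
    with zipfile.ZipFile(pptx_path) as zf:
        return sorted(n for n in zf.namelist() if n.startswith("ppt/media/"))


def ocr_pptx_image(pptx_path, member, ocr=None):
    """OCR one embedded image and return (text, wall_clock_seconds)."""
    if ocr is None:
        # Imported lazily so list_media() works without pytesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    with zipfile.ZipFile(pptx_path) as zf, tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / Path(member).name
        target.write_bytes(zf.read(member))
        start = time.perf_counter()
        text = ocr(str(target))  # image_to_string accepts a file path
        elapsed = time.perf_counter() - start
    return text, elapsed


def is_exclusion_candidate(text):
    """True when OCR produced no non-whitespace text (as with Renders.pptx)."""
    return not text.strip()
```

Calling `ocr_pptx_image(path, "ppt/media/image1.PNG")` on the GH Slicer Notes deck would correspond to the measurement above; an image for which `is_exclusion_candidate` returns true is a candidate for exclusion rather than OCR.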

Smoke test 4 — ocrmypdf on a Lexmark CX510de scan

  • Input PDF: /home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
  • Command: ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf
  • Wall-clock: 3.72s (whole PDF, 4 pages)
  • Exit: 0
  • After OCR, pdftotext on the output produced 2347 chars (2270 non-whitespace).
  • First 200 chars of OCR'd text:
nN New Paltz
STATE UNIVERSITY OF NEW YORK

The Honors Program

May 30, 2017

Dear Aaron,

Thank you for serving as a reader for Caryn Byllotts thesis on "Recall/Reconstruct: The Exploration of
Memory

Real, readable English. The leading "nN" is a logo glyph in the page header misread as text; otherwise the output is clean. ~0.93s/page on this scan, which is the reference number for sizing the async worker queue.
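
For repeat measurements, the ocrmypdf run and the pdftotext character count above could be wrapped as below. This is a minimal sketch: it uses only the command line recorded in this test plus `pdftotext <file> -` to write text to stdout, and the function names (`ocrmypdf_cmd`, `char_counts`, `ocr_and_measure`) are hypothetical.

```python
import subprocess
import time


def ocrmypdf_cmd(src, dst, lang="eng"):
    """Build the same command line used in the smoke test."""
    return ["ocrmypdf", "--skip-text", "-l", lang, str(src), str(dst)]


def char_counts(text):
    """Return (total_chars, non_whitespace_chars)."""
    non_ws = sum(1 for ch in text if not ch.isspace())
    return len(text), non_ws


def ocr_and_measure(src, dst, pages, run=subprocess.run):
    """Run ocrmypdf, then pdftotext, and report timing and character counts."""
    start = time.perf_counter()
    run(ocrmypdf_cmd(src, dst), check=True)
    elapsed = time.perf_counter() - start
    # "-" as the output file makes pdftotext write to stdout.
    extracted = run(
        ["pdftotext", str(dst), "-"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    total, non_ws = char_counts(extracted)
    return {
        "seconds": elapsed,
        "seconds_per_page": elapsed / pages,
        "chars": total,
        "non_whitespace": non_ws,
    }
```

The page count is passed in by the caller here; reading it from the PDF is left out to keep the sketch free of extra dependencies.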

Reference timing

| operation | input size | wall-clock |
|---|---|---|
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |
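
As a rough illustration of how the per-page figure could feed queue sizing, the sketch below scales it linearly. It assumes every page costs about the same as this one scan and that workers do not contend for CPU, neither of which has been measured; `backlog_seconds` is a hypothetical helper name.

```python
import math

# Measured above: one 4-page scan in 3.72s wall-clock.
SECONDS_PER_PAGE = 3.72 / 4


def backlog_seconds(total_pages, workers=1, seconds_per_page=SECONDS_PER_PAGE):
    """Estimate wall-clock time for a backlog split evenly across workers.

    Assumes linear scaling and no contention between workers.
    """
    pages_per_worker = math.ceil(total_pages / workers)
    return pages_per_worker * seconds_per_page
```

For example, a hypothetical backlog of 300 pages (the real page count of the scanned PDFs is not recorded here) comes to roughly 279 s on one worker, or about 70 s across four under these assumptions.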

Deferred — project dep-tracking

The project has no dependency manifest on disk: no requirements.txt, pyproject.toml, setup.py, Pipfile, or poetry.lock. Pip deps live only in venv/. The OCR install adds pytesseract and ocrmypdf (plus their transitive closure listed above) to that untracked venv state.

This commit does not introduce a manifest. Tracking the dep-manifest decision as its own followup; the natural deadline is the capture-path integration commit, where import pytesseract will become load-bearing in the repo. If the manifest question is unresolved by then, that integration commit is the right place to address it.
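
If the eventual decision is a pinned requirements.txt (one option among several; nothing is decided here), the pins could be read from the venv instead of typed by hand. A minimal sketch using the standard library's `importlib.metadata`, with `pinned_lines` as a hypothetical helper name:

```python
from importlib import metadata


def pinned_lines(names, lookup=metadata.version):
    """Return `name==version` lines for the named installed distributions."""
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker rather than silently dropping the package.
            lines.append(f"# {name}: not installed in this environment")
    return lines
```

Run inside /home/aaron/aaronai/venv, `pinned_lines(["pytesseract", "ocrmypdf"])` would be expected to reflect the versions recorded above; it lists only the packages named, not their dependencies.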

Followups

  • Async OCR worker (separate session). Use the reference timing above to size the queue.
  • Capture path integration: phone-camera images → pytesseract.image_to_string → existing chunk/embed pipeline.
  • Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx (Renders.pptx, Ribbon Cutting Slideshow.pptx, two GH Slicer Notes variants). Per the smoke results, Renders.pptx is unlikely to yield useful OCR text (it is rendered-design content, not scanned documents) and may need to be excluded rather than processed.
  • Project dep-manifest decision (see Deferred section above).
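
One possible shape for the async worker followup is sketched below. It is an illustration only, not a design decision: the names `ocr_worker` and `run_ocr_queue` are hypothetical, the `ocr` callable stands in for something like `pytesseract.image_to_string`, and the hand-off to the existing chunk/embed pipeline is left out because its interface is not described here. The blocking OCR call is pushed to a thread with `asyncio.to_thread` so the event loop stays responsive.

```python
import asyncio


async def ocr_worker(queue, results, ocr):
    """Pull paths off the queue and OCR them until cancelled."""
    while True:
        path = await queue.get()
        try:
            # OCR blocks, so run it in a thread instead of on the event loop.
            results[path] = await asyncio.to_thread(ocr, path)
        except Exception as exc:
            # Record the failure and keep the worker alive for the next item.
            results[path] = exc
        finally:
            queue.task_done()


async def run_ocr_queue(paths, ocr, workers=2):
    """OCR all paths with a fixed number of workers; return {path: text}."""
    queue = asyncio.Queue()
    results = {}
    for path in paths:
        queue.put_nowait(path)
    tasks = [
        asyncio.create_task(ocr_worker(queue, results, ocr))
        for _ in range(workers)
    ]
    await queue.join()  # wait until every queued path has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

The worker count would come from the reference timing above and the acceptable backlog latency; a failed path is stored as its exception so the caller can decide whether to retry or exclude it.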