aaron 8e61e4dedb docs: OCR install record for 2026-05-04

Tesseract OCR installed on the VPS (apt: tesseract-ocr, tesseract-ocr-eng).
Python wrappers added to venv (pip: pytesseract, ocrmypdf).

This commit is the install record only. No code change — async OCR
worker, capture path integration, and backlog processing are separate
followups.

Smoke test results captured in the file:
- pytesseract on a textual GH Slicer Notes.pptx slide image: 126 chars
  in 0.22s (Renders.pptx, also in the 4-image-only-pptx cohort, was
  tried first but contains only rendered designs with no text — noted
  as a likely candidate for exclusion rather than OCR).
- ocrmypdf on a 4-page Lexmark CX510de scan from the Tenure/Dossier
  Scan 2022 set: 2270 non-whitespace chars in 3.72s (~0.93s/page).
  Real readable English; usable as the reference timing for the
  eventual async worker queue.

Deferred decision: project has no dependency manifest (no
requirements.txt, pyproject.toml, etc.). Tracking that as its own
followup rather than bolting it onto this install. The capture-path
integration commit will be the natural point to address it if it
hasn't been resolved by then.

2026-05-04 16:58:30 +00:00

OCR install record — 2026-05-04

Machine

Host: aaronai-01 (VPS)
OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)

apt packages installed

| package | version | source |
| --- | --- | --- |
| tesseract-ocr | 5.3.4-1build5 | noble |
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |

pip packages installed (into /home/aaron/aaronai/venv)

| package | version |
| --- | --- |
| pytesseract | 0.3.13 |
| ocrmypdf | 17.4.2 |

Dependencies (direct and transitive) pulled in by the two installs above, all new in the venv: pikepdf 10.5.1, pdfminer-six 20260107, pypdfium2 5.7.1, img2pdf 0.6.3, pi-heif 1.3.0, cryptography 47.0.0, cffi 2.0.0, pycparser 3.0, Deprecated 1.3.1, deprecation 2.1.0, defusedxml 0.7.1, fonttools 4.62.1, fpdf2 2.8.7, uharfbuzz 0.54.1, wrapt 2.1.2, pluggy 1.6.0. pillow was already present at 12.2.0.

Smoke test 1 — `tesseract --version`

tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX512BW
 Found AVX512F

Smoke test 2 — `tesseract --list-langs`

List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd

Smoke test 3 — pytesseract on a slide image

Input pptx: /home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx
Extracted image: ppt/media/image1.PNG (1768×504 PNG)
Wall-clock: 0.220s
Chars extracted: 126
First 200 chars:

Generates the Bounding Box for NESS

round(x, 4), round(y, 4), round(z, 4), round(a, 4))

Format ("HSS5 X(0} ¥(1} W(2} H(3)",

Note: the first image in Renders.pptx (image1.jpg, 640×480) returned 0 chars on the first attempt. A sample of 15 images from Renders.pptx showed that all 15 are rendered designs or photographs with no text, so the test switched to GH Slicer Notes.pptx (from the original 4-image-only-pptx candidate list), where image1.PNG is a screenshot of code. Tesseract's behavior is correct in both cases; Renders.pptx is simply not a useful OCR test target because it contains no text. The code screenshot shows some character-recognition noise (e.g. ¥ for Y, and braces misread as parentheses, as in ¥(1} where the source presumably reads Y{1}). That is acceptable for a baseline smoke test; production tuning is a worker-design concern.
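For anyone repeating this smoke test, the steps above can be sketched roughly as follows. This is a minimal sketch, not the script that was actually run: the helper names `list_pptx_images` and `ocr_pptx_image` are hypothetical, and it assumes the embedded images live under `ppt/media/` inside the pptx zip archive, as image1.PNG did here.

```python
import io
import time
import zipfile

# Extensions treated as images; an assumption, extend as needed.
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff")


def list_pptx_images(pptx_path):
    """Return sorted names of embedded images in a .pptx (which is a zip archive)."""
    with zipfile.ZipFile(pptx_path) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if name.startswith("ppt/media/") and name.lower().endswith(IMAGE_EXTS)
        )


def ocr_pptx_image(pptx_path, member):
    """OCR one embedded image and return (text, elapsed_seconds)."""
    # Imported lazily so the listing helper works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with zipfile.ZipFile(pptx_path) as zf:
        data = zf.read(member)
    start = time.perf_counter()
    text = pytesseract.image_to_string(Image.open(io.BytesIO(data)), lang="eng")
    return text, time.perf_counter() - start
```

An image that yields an empty string after `text.strip()`, as the Renders.pptx samples did, is a candidate for exclusion rather than a sign of an OCR failure.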

Smoke test 4 — ocrmypdf on a Lexmark CX510de scan

Input PDF: /home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
Command: ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf
Wall-clock: 3.72s (whole PDF, 4 pages)
Exit: 0
After OCR, pdftotext on the output produced 2347 chars (2270 non-whitespace).
First 200 chars of OCR'd text:

nN New Paltz
STATE UNIVERSITY OF NEW YORK

The Honors Program

May 30, 2017

Dear Aaron,

Thank you for serving as a reader for Caryn Byllott’s thesis on "Recall/Reconstruct: The Exploration of
Memory

Real readable English. The "nN" header is the Lexmark logo glyph; otherwise clean. ~0.93s/page on this scan, which is the reference number for sizing the async worker queue.
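The timing above could be reproduced from Python with a thin wrapper around the same command line. This is a sketch under assumptions: the function names are hypothetical, the page count is passed in by the caller rather than read from the PDF, and it shells out to the `ocrmypdf` CLI with the flags recorded above rather than using any Python API.

```python
import subprocess
import time


def build_ocrmypdf_cmd(input_pdf, output_pdf, lang="eng"):
    """Argument list matching the command recorded in this smoke test."""
    return ["ocrmypdf", "--skip-text", "-l", lang, str(input_pdf), str(output_pdf)]


def run_ocrmypdf(input_pdf, output_pdf, pages, lang="eng"):
    """Run ocrmypdf and return (exit_code, wall_seconds, seconds_per_page)."""
    cmd = build_ocrmypdf_cmd(input_pdf, output_pdf, lang)
    start = time.perf_counter()
    # check=False: the caller inspects the exit code instead of catching an exception.
    proc = subprocess.run(cmd, check=False)
    elapsed = time.perf_counter() - start
    return proc.returncode, elapsed, elapsed / pages
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, such as the Dossier Scan 2022 directory.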

Reference timing

| operation | input size | wall-clock |
| --- | --- | --- |
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |
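As a rough illustration of how these two numbers might feed a queue-sizing estimate, the sketch below scales them linearly. It assumes that per-page cost is constant across scans and that workers parallelise perfectly, neither of which this record has measured; the function name and the `workers` parameter are hypothetical.

```python
# Reference timings from the table above.
SECONDS_PER_PAGE = 3.72 / 4   # ocrmypdf, 4-page Lexmark scan (~0.93 s/page)
SECONDS_PER_IMAGE = 0.22      # pytesseract, single 1768x504 PNG


def estimate_backlog_seconds(total_pages, total_images=0, workers=1):
    """Naive wall-clock estimate: linear in input size, ideal speedup across workers."""
    serial = total_pages * SECONDS_PER_PAGE + total_images * SECONDS_PER_IMAGE
    return serial / workers
```

The total page count of the backlog is not recorded here, so the caller has to supply it. Treat the result as a lower bound until a larger sample has been timed.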

Deferred — project dep-tracking

The project has no dependency manifest on disk: no requirements.txt, pyproject.toml, setup.py, Pipfile, or poetry.lock. Pip deps live only in venv/. The OCR install adds pytesseract and ocrmypdf (plus their transitive closure listed above) to that untracked venv state.

This commit does not introduce a manifest. The dep-manifest decision is tracked as its own followup; the natural deadline is the capture-path integration commit, where `import pytesseract` will become load-bearing in the repo. If the manifest question is unresolved by then, that integration commit is the right place to address it.
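Until that decision is made, the versions in the venv can at least be read back programmatically. The sketch below is one possible approach and not a decision about the manifest format: `pinned_requirement_lines` is a hypothetical helper that uses the standard-library `importlib.metadata` to emit `name==version` lines for whatever is installed in the interpreter that runs it.

```python
from importlib import metadata


def pinned_requirement_lines(names):
    """Return requirements-style lines for the named packages in this environment."""
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Leave a visible marker rather than silently dropping the package.
            lines.append(f"# {name}: not installed in this environment")
    return lines


if __name__ == "__main__":
    # The two packages this install added directly.
    print("\n".join(pinned_requirement_lines(["pytesseract", "ocrmypdf"])))
```

Run with the venv's interpreter, this should echo the versions recorded in the pip table; `pip freeze` gives the full closure if that is what the manifest ends up needing.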

Followups

- Async OCR worker (separate session). Use the reference timing above to size the queue.
- Capture path integration: phone-camera images → pytesseract.image_to_string → existing chunk/embed pipeline.
- Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx (Renders.pptx, Ribbon Cutting Slideshow.pptx, two GH Slicer Notes variants). Per the smoke results, Renders.pptx is unlikely to yield useful OCR text (it is rendered-design content, not scanned documents) and may need exclusion rather than processing.
- Project dep-manifest decision (see Deferred section above).
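To make the async-worker followup a little more concrete, here is one possible shape for it, offered as a sketch and not as the planned design. The names `ocr_worker` and `process_all` are hypothetical, and `ocr_func` stands in for whatever blocking call the worker ends up wrapping (for example a function that opens an image and calls `pytesseract.image_to_string`). It relies on `asyncio.to_thread`, which needs Python 3.9 or later.

```python
import asyncio


async def ocr_worker(queue, ocr_func, results):
    """Pull paths off the queue and OCR each one in a thread."""
    while True:
        path = await queue.get()
        try:
            # ocr_func is blocking, so keep it off the event loop thread.
            results[path] = await asyncio.to_thread(ocr_func, path)
        except Exception as exc:
            # Record the failure so one bad file does not stop the worker.
            results[path] = exc
        finally:
            queue.task_done()


async def process_all(paths, ocr_func, workers=2):
    """OCR every path with a fixed number of workers; returns {path: text or exception}."""
    queue = asyncio.Queue()
    results = {}
    for path in paths:
        queue.put_nowait(path)
    tasks = [
        asyncio.create_task(ocr_worker(queue, ocr_func, results))
        for _ in range(workers)
    ]
    await queue.join()          # wait until every queued path is processed
    for task in tasks:
        task.cancel()           # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

A call such as `asyncio.run(process_all(paths, my_ocr_func))` would drive it, where `my_ocr_func` is supplied by the integration code. The worker count is the knob that the reference timing is meant to inform.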
