docs: OCR install record for 2026-05-04

Tesseract OCR installed on the VPS (apt: tesseract-ocr, tesseract-ocr-eng). Python wrappers added to venv (pip: pytesseract, ocrmypdf).

This commit is the install record only. No code change: the async OCR worker, capture path integration, and backlog processing are separate followups.

Smoke test results captured in the file:
- pytesseract on a textual GH Slicer Notes.pptx slide image: 126 chars in 0.22s. (Renders.pptx, also in the 4-image-only-pptx cohort, was tried first but contains only rendered designs with no text; noted as a likely candidate for exclusion rather than OCR.)
- ocrmypdf on a 4-page Lexmark CX510de scan from the Tenure/Dossier Scan 2022 set: 2270 non-whitespace chars in 3.72s (~0.93s/page). Real readable English; usable as the reference timing for the eventual async worker queue.

Deferred decision: project has no dependency manifest (no requirements.txt, pyproject.toml, etc). Tracking that as its own followup rather than bolting it onto this install. The capture-path integration commit will be the natural point to address it if it hasn't been resolved by then.

@@ -0,0 +1,105 @@
# OCR install record — 2026-05-04

## Machine

- Host: aaronai-01 (VPS)
- OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)

## apt packages installed

| package | version | source |
|---|---|---|
| tesseract-ocr | 5.3.4-1build5 | noble |
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |

## pip packages installed (into /home/aaron/aaronai/venv)

| package | version |
|---|---|
| pytesseract | 0.3.13 |
| ocrmypdf | 17.4.2 |

Dependencies pulled in by the two installs above, directly or transitively (also new in venv): `pikepdf 10.5.1`, `pdfminer-six 20260107`, `pypdfium2 5.7.1`, `img2pdf 0.6.3`, `pi-heif 1.3.0`, `cryptography 47.0.0`, `cffi 2.0.0`, `pycparser 3.0`, `Deprecated 1.3.1`, `deprecation 2.1.0`, `defusedxml 0.7.1`, `fonttools 4.62.1`, `fpdf2 2.8.7`, `uharfbuzz 0.54.1`, `wrapt 2.1.2`, `pluggy 1.6.0`. `pillow` was already at 12.2.0.

## Smoke test 1 — `tesseract --version`

```
tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found AVX512BW
Found AVX512F
```

## Smoke test 2 — `tesseract --list-langs`

```
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd
```

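A worker could run the same check at startup to confirm that the `eng` data is present before it accepts jobs. The following is a minimal sketch, assuming the output format shown above (one header line, then one language code per line). `parse_list_langs` and `installed_langs` are hypothetical helper names for illustration, not part of tesseract or pytesseract.

```python
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Parse `tesseract --list-langs` output into a list of language codes.

    Assumes the format shown above: one header line, then one code per line.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the "List of available languages ..." header line.
    return [line for line in lines if not line.startswith("List of available")]


def installed_langs() -> list[str]:
    """Run the real CLI (requires tesseract on PATH) and parse its output."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Defensive: fall back to stderr in case the list is not on stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

With the output recorded above, `parse_list_langs` would return the two codes `eng` and `osd`.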
## Smoke test 3 — pytesseract on a slide image

- Input pptx: `/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx`
- Extracted image: `ppt/media/image1.PNG` (1768×504 PNG)
- Wall-clock: 0.220s
- Chars extracted: 126
- First 200 chars:

```
Generates the Bounding Box for NESS

round(x, 4), round(y, 4), round(z, 4), round(a, 4))

Format ("HSS5 X(0} ¥(1} W(2} H(3)",
```

Note: the first image in `Renders.pptx` (image1.jpg, 640×480) returned 0 chars on the first attempt. A sample of 15 images from `Renders.pptx` showed that all 15 are pure rendered designs/photographs with no text, so the test switched to `GH Slicer Notes.pptx` (from the original list of 4 image-only pptx candidates), where image1.PNG is a screenshot of code. Tesseract behavior is correct in both cases; `Renders.pptx` is simply not a useful OCR test target because it contains no text. The code screenshot shows some character-recognition noise (e.g. `¥(1}` for `Y(1)`, with parentheses and braces confused). That is acceptable for a baseline smoke test; production tuning is a worker-design concern.

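The steps in this smoke test (open the pptx as a zip, read images under `ppt/media/`, time the OCR call) can be sketched roughly as below. This is not the script that produced the numbers above. `ocr_pptx_media` and `default_ocr` are hypothetical names, and the sketch assumes every member under `ppt/media/` is an image that Pillow can open.

```python
import io
import time
import zipfile
from typing import Callable, Iterator, Optional


def default_ocr(image_bytes: bytes) -> str:
    """OCR one image with pytesseract (imported lazily, so it is only needed when used)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


def ocr_pptx_media(pptx, ocr: Optional[Callable[[bytes], str]] = None) -> Iterator[tuple]:
    """Yield (member_name, text, seconds) for each file under ppt/media/.

    `pptx` may be a path or a file-like object; a .pptx is a zip archive.
    ASSUMPTION: every member under ppt/media/ is an image Pillow can read.
    """
    ocr = ocr or default_ocr
    with zipfile.ZipFile(pptx) as archive:
        for name in sorted(archive.namelist()):
            if not name.startswith("ppt/media/"):
                continue
            data = archive.read(name)
            start = time.perf_counter()
            text = ocr(data)
            yield name, text, time.perf_counter() - start
```

An image that yields empty text, as `Renders.pptx` did, could be logged as a candidate for exclusion instead of being treated as a failure.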
## Smoke test 4 — ocrmypdf on a Lexmark CX510de scan

- Input PDF: `/home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf` (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
- Command: `ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf`
- Wall-clock: 3.72s (whole PDF, 4 pages)
- Exit: 0
- After OCR, `pdftotext` on the output produced 2347 chars (2270 non-whitespace).
- First 200 chars of OCR'd text:

```
nN New Paltz
STATE UNIVERSITY OF NEW YORK

The Honors Program

May 30, 2017

Dear Aaron,

Thank you for serving as a reader for Caryn Byllott’s thesis on "Recall/Reconstruct: The Exploration of
Memory
```

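Real readable English. The "nN" at the start is a logo glyph misread as text (it sits on the letterhead line beside "New Paltz", so it is more likely the letterhead logo than a Lexmark mark); otherwise the text is clean. At ~0.93s/page on this scan, this is the reference number for sizing the async worker queue.
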
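For repeat runs, the command and measurements above could be wrapped in a small script. The sketch below mirrors the recorded `ocrmypdf --skip-text -l eng` invocation and counts non-whitespace characters from `pdftotext`. The function names are hypothetical, and the counting method is an assumption about how the 2270 figure was obtained.

```python
import subprocess
import time
from pathlib import Path


def build_ocrmypdf_cmd(src: Path, dst: Path, lang: str = "eng") -> list[str]:
    """Mirror the smoke-test command: skip pages that already carry text."""
    return ["ocrmypdf", "--skip-text", "-l", lang, str(src), str(dst)]


def count_non_whitespace(text: str) -> int:
    """Count characters that are not whitespace (assumed counting method)."""
    return sum(1 for ch in text if not ch.isspace())


def run_ocr(src: Path, dst: Path) -> tuple[float, int]:
    """Run ocrmypdf, then pdftotext; return (seconds, non-whitespace chars).

    Requires the ocrmypdf and pdftotext executables on PATH.
    """
    start = time.perf_counter()
    subprocess.run(build_ocrmypdf_cmd(src, dst), check=True)
    elapsed = time.perf_counter() - start
    # "-" as the output file sends the extracted text to stdout.
    extracted = subprocess.run(
        ["pdftotext", str(dst), "-"], capture_output=True, text=True, check=True
    )
    return elapsed, count_non_whitespace(extracted.stdout)
```

Dividing the returned seconds by the page count gives a per-page figure comparable to the ~0.93s/page recorded here.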
## Reference timing

| operation | input size | wall-clock |
|---|---|---|
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |

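The per-page figure in the table can be turned into a rough backlog estimate. The sketch below is illustrative arithmetic only: the 3.72s and 4-page inputs come from the table, while the assumed 4 pages per PDF for the 75-file backlog is a placeholder, since real page counts are not recorded in this file. It also assumes throughput scales linearly with worker count, which has not been measured.

```python
import math


def seconds_per_page(wall_clock_s: float, pages: int) -> float:
    """Average OCR time per page for one measured run."""
    return wall_clock_s / pages


def estimate_backlog_seconds(total_pages: int, per_page_s: float, workers: int = 1) -> float:
    """Rough wall-clock estimate with pages split evenly across workers.

    ASSUMPTION: linear scaling with worker count (not measured here).
    """
    pages_per_worker = math.ceil(total_pages / workers)
    return pages_per_worker * per_page_s


per_page = seconds_per_page(3.72, 4)  # reference scan from the table above
# ASSUMPTION: 75 PDFs at 4 pages each; real page counts are unknown.
assumed_pages = 75 * 4
estimate = estimate_backlog_seconds(assumed_pages, per_page)
print(f"{per_page:.2f} s/page, roughly {estimate:.0f} s with one worker")
```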
## Deferred — project dep-tracking

The project has no dependency manifest on disk: no `requirements.txt`, `pyproject.toml`, `setup.py`, `Pipfile`, or `poetry.lock`. Pip deps live only in `venv/`. The OCR install adds `pytesseract` and `ocrmypdf` (plus their transitive closure listed above) to that untracked venv state.

This commit does not introduce a manifest. Tracking the dep-manifest decision as its own followup; the natural deadline is the capture-path integration commit, where `import pytesseract` will become load-bearing in the repo. If the manifest question is unresolved by then, that integration commit is the right place to address it.

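Until that decision is made, the untracked venv state can at least be inspected from Python. The sketch below uses the standard library's `importlib.metadata` to print `name==version` lines for named packages. It does not pick a manifest format; `pinned_lines` is a hypothetical helper for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Iterable


def pinned_lines(names: Iterable[str], lookup: Callable[[str], str] = version) -> list[str]:
    """Return `name==version` lines for installed packages, skipping missing ones."""
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except PackageNotFoundError:
            # Not installed in this interpreter's environment; leave it out.
            continue
    return lines


if __name__ == "__main__":
    # Run with the venv's interpreter to report what is installed there.
    print("\n".join(pinned_lines(["pytesseract", "ocrmypdf"])))
```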
## Followups

- Async OCR worker (separate session). Use the reference timing above to size the queue.
- Capture path integration: phone-camera images → `pytesseract.image_to_string` → existing chunk/embed pipeline.
- Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx files (`Renders.pptx`, `Ribbon Cutting Slideshow.pptx`, two `GH Slicer Notes` variants). Per the smoke results, `Renders.pptx` is unlikely to yield useful OCR text (it is rendered-design content, not scanned documents) and may need to be excluded rather than processed.
- Project dep-manifest decision (see Deferred section above).
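
As a starting point for the async worker followup, one possible shape is an `asyncio.Queue` drained by a few workers that push the blocking OCR call onto a thread. This is a sketch under assumptions, not the planned design: `ocr_worker` and `process` are hypothetical names, and the `ocr` callable is injected so the same loop could wrap `pytesseract.image_to_string` or an `ocrmypdf` subprocess call.

```python
import asyncio
from typing import Callable


async def ocr_worker(queue: asyncio.Queue, ocr: Callable[[str], str], results: dict) -> None:
    """Pull paths off the queue and OCR them in a thread so the event loop stays free."""
    while True:
        path = await queue.get()
        try:
            results[path] = await asyncio.to_thread(ocr, path)
        except Exception as exc:
            # Keep the worker alive and record the failure for this path.
            results[path] = exc
        finally:
            queue.task_done()


async def process(paths: list[str], ocr: Callable[[str], str], workers: int = 2) -> dict:
    """OCR every path with a fixed number of concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    for path in paths:
        queue.put_nowait(path)
    tasks = [asyncio.create_task(ocr_worker(queue, ocr, results)) for _ in range(workers)]
    await queue.join()  # wait until every queued path has been handled
    for task in tasks:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

A call such as `asyncio.run(process(paths, ocr=my_ocr))` returns a dict mapping each path to its text, or to the exception raised for that path. The worker count is the knob to tune against the reference timing.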