How to OCR an entire folder on Linux using Tesseract

Submitted by ricardo on Sat, 01/03/2015 - 19:53

Here is a neat one-liner to OCR an entire folder on Linux using Tesseract.
It assumes your files are JPEGs (*.jpg), but you can adapt the one-liner for other image formats.

You can also specify a language with the -l option, for example "-l por" for Portuguese, instead of the default English.
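
As a minimal sketch of the language option, the helper below wraps a single Tesseract call. The name ocr_with_lang is hypothetical (it is not part of Tesseract), and it assumes the matching language data is installed; "tesseract --list-langs" shows what is available on your system.

```shell
# Hypothetical helper: OCR one image with an optional language code.
#   $1 = image file, $2 = output base name, $3 = language (defaults to eng)
ocr_with_lang() {
    # Tesseract appends ".txt" to the output base itself,
    # so pass "page_0" rather than "page_0.txt".
    tesseract "$1" "$2" -l "${3:-eng}"
}
```

For example, "ocr_with_lang capitulo11-0.jpg page_0 por" would write the Portuguese OCR result to page_0.txt.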


~/temp/pdf$ a=0; for i in $(ls -v *.jpg) ; do echo "$i page_${a}.txt" ; tesseract "$i" "page_${a}" ; let a=a+1 ; done
capitulo11-0.jpg page_0.txt
Tesseract Open Source OCR Engine v3.03 with Leptonica
capitulo11-1.jpg page_1.txt
Tesseract Open Source OCR Engine v3.03 with Leptonica
capitulo11-2.jpg page_2.txt
Tesseract Open Source OCR Engine v3.03 with Leptonica
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
capitulo11-3.jpg page_3.txt
Tesseract Open Source OCR Engine v3.03 with Leptonica
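
The one-liner splits the output of ls on whitespace, so it trips over file names that contain spaces. The function below is one possible variant that reads names line by line instead. It is only a sketch: the name ocr_folder is made up for illustration, and it assumes GNU sort (for the -V natural ordering, which plays the role of ls -v) and file names without embedded newlines.

```shell
# Hypothetical variant of the one-liner that tolerates spaces in file names.
#   $1 = language code (defaults to eng)
ocr_folder() {
    lang="${1:-eng}"
    # sort -V orders capitulo11-2.jpg before capitulo11-10.jpg, like ls -v.
    printf '%s\n' *.jpg | sort -V | {
        a=0
        while IFS= read -r f; do
            # If nothing matches, the glob stays as the literal "*.jpg"; skip it.
            [ -e "$f" ] || continue
            echo "$f page_${a}.txt"
            # Tesseract adds ".txt" to the output base on its own.
            # stdin is redirected so the call cannot swallow the file list.
            tesseract "$f" "page_${a}" -l "$lang" < /dev/null
            a=$((a + 1))
        done
    }
}
```

Run it from the folder that holds the images, for example "ocr_folder por".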

To finish up, concatenate all the txt files into a single text file:

for i in $(ls -v page_*.txt) ; do cat "$i" ; done > all_pages.txt

Then read the document:

vi all_pages.txt
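
Pages that trigger errors, like capitulo11-2.jpg in the run above, may come out empty. The sketch below repeats the concatenation step but also warns about empty page files, so you know which scans to check. The function name join_pages is hypothetical, and GNU sort is assumed for the -V natural ordering.

```shell
# Hypothetical helper: join page_*.txt in natural order and flag empty pages.
#   $1 = output file (defaults to all_pages.txt)
join_pages() {
    out="${1:-all_pages.txt}"
    : > "$out"    # start from an empty output file
    printf '%s\n' page_*.txt | sort -V | while IFS= read -r f; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        # -s is true only for files with a size greater than zero.
        if [ ! -s "$f" ]; then
            echo "warning: $f is empty" >&2
        fi
        cat "$f" >> "$out"
    done
}
```

Calling "join_pages" with no arguments writes all_pages.txt, the same file name used above.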
