Here is a neat one-liner to OCR an entire folder on Linux using Tesseract.
It assumes your files are JPEGs (*.jpg), but you can change the pattern in the one-liner.
You can also choose a language other than English by adding "-l por" (for Portuguese) or another language code to the tesseract command.
~/temp/pdf$ a=0; for i in $(ls -v *.jpg) ; do echo "$i page_${a}.txt" ; tesseract "$i" "page_${a}" ; let a=a+1 ; done
capitulo11-0.jpg page_0.txt
Tesseract Open Source OCR Engine v3.03 with Leptonica
capitulo11-1.jpg page_1.txt
Tesseract Open Source OCR Engine v3.03 with Leptonica
capitulo11-2.jpg page_2.txt
Tesseract Open Source OCR Engine v3.03 with Leptonica
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
capitulo11-3.jpg page_3.txt
Tesseract Open Source OCR Engine v3.03 with Leptonica
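If your file names contain spaces, or you want to pass the language as a parameter, the same loop can be wrapped in a small shell function. This is only a sketch: the name ocr_folder and its argument are made up for illustration, it assumes GNU sort (for the -V natural ordering, equivalent to ls -v), and it assumes the Tesseract language data for the chosen code is installed.

```shell
# ocr_folder [LANG] : hypothetical helper, not part of Tesseract.
# OCRs every *.jpg in the current directory in natural order and
# writes page_0.txt, page_1.txt, ... LANG defaults to eng.
# The body runs in a subshell ( ... ) so its variables do not leak.
ocr_folder() (
    lang="${1:-eng}"
    a=0
    # Reading one name per line keeps file names with spaces intact
    # (names containing newlines are not handled).
    printf '%s\n' *.jpg | sort -V | while IFS= read -r i; do
        [ -e "$i" ] || continue   # skip the literal pattern when nothing matches
        echo "$i page_${a}.txt"
        # Tesseract appends .txt to the output base by itself.
        tesseract "$i" "page_${a}" -l "$lang"
        a=$((a + 1))
    done
)
```

For example, "ocr_folder por" would OCR the folder in Portuguese, and plain "ocr_folder" falls back to English.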
To finish up, concatenate all the txt files into a single text file:
for i in $(ls -v page_*.txt) ; do cat $i ; done > all_pages.txt
and read the document:
vi all_pages.txt
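The concatenation step can be wrapped the same way. Again, this is a sketch: join_pages is an illustrative name rather than an existing tool, and it assumes GNU sort for the -V natural ordering.

```shell
# join_pages [OUTPUT] : hypothetical helper.
# Concatenates page_*.txt from the current directory in natural order.
# OUTPUT defaults to all_pages.txt; pick a name that does not itself
# match page_*.txt, or the output could be read back as an input.
join_pages() (
    out="${1:-all_pages.txt}"
    printf '%s\n' page_*.txt | sort -V | while IFS= read -r f; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        cat "$f"
    done > "$out"
)
```

Running "join_pages" with no argument produces the same all_pages.txt as the loop above.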