Apply OCR on scanned PDF files to extract text from the PDF images.
This version expects the text to be written in Brazilian Portuguese (pt-br).
To set up the environment on Ubuntu, run the following commands in a terminal:
chmod a+x setup.sh # run this line only the first time
./setup.sh
The commands above install Tesseract, the Brazilian Portuguese language data for Tesseract OCR, and ImageMagick, and configure ImageMagick's policy.xml file to allow converting PDF files.
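The policy.xml step is the least obvious part of the setup, so here is a minimal sketch of how that change could be made by hand. It is not the content of setup.sh: the function name `enable_pdf_policy` is hypothetical, and the default path `/etc/ImageMagick-6/policy.xml` assumes the ImageMagick 6 package shipped with Ubuntu.

```shell
# Hypothetical helper (not taken from setup.sh): allow ImageMagick to
# read and write PDF files by relaxing the PDF coder policy.
# Assumption: the policy file contains a line such as
#   <policy domain="coder" rights="none" pattern="PDF" />
enable_pdf_policy() {
  # Assumed default location on Ubuntu with ImageMagick 6.
  policy_file="${1:-/etc/ImageMagick-6/policy.xml}"
  # Replace rights="none" with rights="read|write" on the PDF line only.
  sed -i -E 's/(<policy domain="coder" rights=")none(" pattern="PDF")/\1read|write\2/' "$policy_file"
}
```

Since the system policy file is owned by root, the same `sed` command would typically need to be run with `sudo` when applied to the real file.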
Copy the script pdf_ocr.sh to the folder containing the scanned PDF files and execute it:
chmod a+x pdf_ocr.sh # run this line only the first time
./pdf_ocr.sh
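For readers who want to see the general idea, the following is a rough sketch of how each PDF in a folder could be OCRed with the tools installed above. It is not the actual pdf_ocr.sh: the function name `ocr_pdfs_in_dir`, the 300 DPI density, and the intermediate TIFF file are all assumptions made for illustration.

```shell
# Hypothetical sketch (not the actual pdf_ocr.sh): OCR every PDF in a folder.
ocr_pdfs_in_dir() {
  local dir="${1:-.}" pdf base
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue   # skip when the folder has no PDF files
    base="${pdf%.pdf}"
    # Rasterize the PDF pages into a multi-page TIFF (300 DPI is an assumed value).
    convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$base.tiff"
    # Tesseract writes the recognized text to "$base.txt";
    # "por" is Tesseract's language code for Portuguese.
    tesseract "$base.tiff" "$base" -l por
    rm -f "$base.tiff"          # remove the intermediate image
  done
}
```

Calling `ocr_pdfs_in_dir .` from the folder with the scanned files would, under these assumptions, leave one .txt file next to each PDF.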
The pdf_ocr.sh script outputs the following: