OCR CONVERSION TOOL

This tool converts image-only PDF files to searchable PDF/A format.


The tool is implemented in Python 3.6.3, with an HTML front end.
OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content.


It uses Ghostscript to rasterize the page, and then performs OCR on the rasterized image to create an OCR “layer”. The layer is then grafted back onto the original PDF.


OCRmyPDF can produce a minimally changed PDF as output. OCRmyPDF also offers some image processing options, such as deskew, which improve the appearance of files and the quality of OCR. When these are used, the OCR layer is grafted onto the processed image instead. By default, OCRmyPDF produces archival PDFs (PDF/A), which are a stricter subset of PDF features designed for long-term archives. If regular PDFs are desired, this can be disabled with --output-type pdf.
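
As an illustration of the options above, the following is a minimal sketch of how a Python back end might assemble and run an ocrmypdf command line. It is not taken from this tool's source: the helper names build_ocrmypdf_command and convert are hypothetical, and the sketch assumes the ocrmypdf executable is installed and on the PATH. For simplicity it accepts only the two output types mentioned above.

```python
import subprocess
from typing import List


def build_ocrmypdf_command(input_pdf: str, output_pdf: str,
                           deskew: bool = False,
                           output_type: str = "pdfa") -> List[str]:
    """Build the argument list for an ocrmypdf call (hypothetical helper)."""
    # This sketch restricts itself to the two output types discussed above.
    if output_type not in ("pdfa", "pdf"):
        raise ValueError("output_type must be 'pdfa' or 'pdf'")

    command = ["ocrmypdf"]
    if deskew:
        # Straighten skewed pages before OCR.
        command.append("--deskew")
    # "pdfa" is the OCRmyPDF default; "pdf" requests a regular PDF.
    command += ["--output-type", output_type]
    command += [input_pdf, output_pdf]
    return command


def convert(input_pdf: str, output_pdf: str, **options) -> None:
    """Run ocrmypdf; raises CalledProcessError if it exits with an error."""
    # Assumes the ocrmypdf executable is available on the PATH.
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, **options),
                   check=True)
```

For example, convert("scan.pdf", "scan_ocr.pdf", deskew=True, output_type="pdf") would deskew the pages and write a regular PDF instead of PDF/A. The file names here are placeholders.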