pdf2pdfocr

PDF extractor

A tool to extract text from PDFs and add a searchable layer to them

A free tool that runs OCR on a PDF and adds a text "layer" to the original file, making the PDF searchable. It uses only open source tools. Please tip!
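The idea of adding a text layer can be sketched with the Tesseract command-line tool, which the project's tags mention. This is a minimal illustration for a single page image, not pdf2pdfocr's own code: the helper names `build_tesseract_cmd` and `ocr_image_to_searchable_pdf` are hypothetical, and the sketch assumes a `tesseract` executable is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Build the Tesseract command that writes <output_base>.pdf (hypothetical helper)."""
    # The trailing "pdf" selects Tesseract's PDF output, which places the
    # recognized text invisibly over the page image so it becomes searchable.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_image_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the path of the resulting PDF."""
    # Assumption: tesseract is installed separately; fail early if it is not.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.pdf")
```

A full tool would also need to split the input PDF into page images and merge the per-page results back into one file; those steps are left out here.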

GitHub

279 stars

12 watching

35 forks

Language: Python

Last commit: over 2 years ago

Linked from 1 awesome list

docker, ocr, pdf, pdftk, python, tesseract

Backlinks from these awesome lists:

kba/awesome-ocr

Related projects:

Repository	Description	Stars
steelthread/mimeograph	A CoffeeScript library for extracting text from PDF files and creating searchable documents with OCR capabilities	28
unidoc/unidoc	A Go library for extracting text from PDF files, particularly invoices.	708
tabulapdf/tabula-java	Extracts tables from PDF files using Java	1,859
uglytoad/pdfpig	A C# library for extracting and analyzing text from PDF files	1,794
ckorzen/pdf-text-extraction-benchmark	Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles	65
aeksco/aws-pdf-textract-pipeline	A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services	164
jesparza/peepdf	A Python tool for analyzing PDF files to identify potential security risks and malicious content.	1,319
malfrats/xeuledoc	A tool to fetch information about public Google documents from various services	856
docraptor/docraptor-ruby	A Ruby client library for converting HTML to PDF using the DocRaptor API.	33
hiddenillusion/analyzepdf	A tool to analyze PDF files by examining their characteristics to determine if they are malicious or benign.	178
pdf-archiver/pdf-archiver	A tool for digitizing and organizing paper documents by scanning and tagging files for easy searching.	308
jonmagic/grim	A tool for extracting pages from PDFs and converting them to images and text strings.	216
bikash/documentunderstanding	Research and development of tools and techniques for extracting information from images and PDFs using deep learning and graph neural networks.	96
philsturgeon/codeigniter-unzip	A CodeIgniter extension that extracts ZIP files without requiring PECL extensions	78
enferex/pdfresurrect	Analyzes and extracts previous versions of a PDF document to reconstruct its modification history	81