pdf2pdfocr

PDF extractor

A tool to extract text from PDFs and add a searchable layer to them

A free tool that OCRs a PDF and adds a text "layer" to the original file, making the PDF searchable. Uses only open source tools. Please tip!
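
As a rough illustration of the general idea (not pdf2pdfocr's actual implementation or command line), the sketch below builds a searchable PDF by chaining three open source command-line tools: `pdftoppm` (Poppler) to rasterize pages, `tesseract` to OCR each page image into a one-page PDF with an invisible text layer, and `pdftk` to join the pages. The function names, the 300 DPI default, and the `eng` language default are assumptions made for this example. Unlike the tool described above, this simplified pipeline re-renders every page as an image instead of adding the layer to the original file.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    # "tesseract IMAGE OUTPUTBASE -l LANG pdf" writes OUTPUTBASE.pdf,
    # containing the page image plus an invisible, searchable text layer
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang, "pdf"]


def merge_cmd(page_pdfs, output):
    # pdftk concatenates the per-page PDFs in the order given
    return ["pdftk", *[str(p) for p in page_pdfs], "cat", "output", str(output)]


def make_searchable(pdf, output, lang="eng"):
    """Hypothetical helper: rasterize, OCR, and merge (needs all three tools)."""
    for tool in ("pdftoppm", "tesseract", "pdftk"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"required tool not found on PATH: {tool}")
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order
        images = sorted(Path(tmp).glob("page-*.png"))
        for image in images:
            subprocess.run(ocr_cmd(image, lang), check=True)
        page_pdfs = [image.with_suffix(".pdf") for image in images]
        subprocess.run(merge_cmd(page_pdfs, output), check=True)
```

Calling `make_searchable("scan.pdf", "scan-ocr.pdf")` would run the whole chain, where both file names are placeholders. The command builders are kept separate from the code that runs them so the command lines can be inspected without the tools installed.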

GitHub

279 stars
12 watching
35 forks
Language: Python
Last commit: 11 months ago
Linked from 1 awesome list

docker, ocr, pdf, pdftk, python, tesseract

Related projects:

Repository | Description | Stars
steelthread/mimeograph | A CoffeeScript library to extract text and create searchable PDF files using OCR when necessary. | 28
unidoc/unidoc | A Go library for extracting text from PDF files, particularly invoices. | 708
tabulapdf/tabula-java | Extracts tables from PDF files using Java | 1,859
uglytoad/pdfpig | A C# library for extracting and analyzing text from PDF files | 1,771
ckorzen/pdf-text-extraction-benchmark | Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles | 65
aeksco/aws-pdf-textract-pipeline | A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services | 164
jesparza/peepdf | A Python tool for analyzing PDF files to identify potential security risks and malicious content. | 1,319
malfrats/xeuledoc | A tool to fetch information about public Google documents from various services | 856
docraptor/docraptor-ruby | A Ruby client library for converting HTML to PDF using the DocRaptor API. | 33
hiddenillusion/analyzepdf | A tool to analyze PDF files by examining their characteristics to determine if they are malicious or benign. | 178
pdf-archiver/pdf-archiver | A tool for digitizing and organizing paper documents by scanning and tagging files for easy searching. | 308
jonmagic/grim | A tool for extracting pages from PDFs and converting them to images and text strings. | 216
bikash/documentunderstanding | Research and development of tools and techniques for extracting information from images and PDFs using deep learning and graph neural networks. | 96
philsturgeon/codeigniter-unzip | A CodeIgniter extension that extracts ZIP files without requiring PECL extensions | 78
enferex/pdfresurrect | Analyzes and extracts previous versions of a PDF document to reconstruct its modification history | 81