awesome-ocr

OCR toolkit

A curated list of OCR engines, tools, and formats for extracting text from images and documents.

Links to awesome OCR projects

github.com/kba/awesome-ocr

Awesome OCR / Software / OCR engines
tesseract	63,142	7 months ago	The definitive Open Source OCR engine
EasyOCR	24,876	10 months ago	OCR engine built on PyTorch by JaidedAI
ocropus	3,426	about 4 years ago	OCR engine based on LSTM
ocropus 0.4	17	almost 14 years ago	Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
kraken	757	8 months ago	Ocropus fork with sane defaults
gocr			OCR engine under the GNU Public License, led by Joerg Schulenburg
Ocrad			The GNU OCR
ocular	256	about 1 year ago	Machine-learning OCR for historic documents
SwiftOCR	4,623	over 4 years ago	fast and simple OCR library written in Swift
attention-ocr	1,079	almost 2 years ago	OCR engine using visual attention mechanisms
RWTH-OCR			The RWTH Aachen University Optical Character Recognition System
simple-ocr-opencv	525	over 1 year ago	A simple pythonic OCR engine using OpenCV and NumPy
Calamari	1,056	8 months ago	OCR Engine based on OCRopy and Kraken
doctr	4,011	7 months ago	A seamless & high-performing OCR library powered by Deep Learning
Awesome OCR / Software / Older and possibly abandoned OCR engines
Clara OCR			Open source OCR in C
Cuneiform			CuneiForm OCR was developed by Cognitive Technologies
Eye			an experimental Java OCR (image-to-text) application
kognition			An omnifont OCR software for KDE
OCRchie			Modular Optical Character Recognition Software
ocre			o.c.r. easy
xplab			A GTK 2 tool for pattern matching
hebOCR	5	over 9 years ago	Hebrew character recognition library (previously named hocr)
Awesome OCR / Software / OCR file formats
abby2hocr.xslt XSLT script
ocr-conversion-scripts	72	about 2 years ago
hocr-tools	373	12 months ago	Tools for doing various useful things with hOCR files
hocr-spec	74	12 months ago	hOCR 1.2 specification
ocr-transform	182	10 months ago	CLI tool to convert between hOCR and ALTO
hocr-parser	13	almost 10 years ago	hOCR Specification Python Parser
hOCRTools	6	almost 7 years ago	hOCR to ALTO conversion XSLT
ALTO XML Schema	52	about 1 year ago	XML Schema and development of the ALTO XML format
ALTO XML Documentation	39	almost 7 years ago	Documentation and use cases for ALTO
alto-tools	40	almost 2 years ago	Various tools to work with ALTO files, Python
AbbyyToAlto	9	about 14 years ago	PHP script converting from Abbyy 6 to ALTO XML
TEI-OCR	1	over 9 years ago	TEI customization for OCR generated layout and content information
TEI SIG on Libraries			Best Practices for TEI in Libraries
GDZ			METS/TEI-based GDZ document format
PAGE-XML Schema	66	about 4 years ago	XML schema of the PAGE XML format along with documentation and examples
omni:us Pages Format (OPF)			XML schema very similar to PAGE XML that has some additional features
py-pagexml	13	9 months ago	Python library for handling PAGE XML and OPF files
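As a rough illustration of what the hOCR tools listed above work with: hOCR stores layout information in the `title` attribute of HTML elements, for example `bbox x0 y0 x1 y1` on elements with class `ocrx_word`. The sketch below extracts word text and bounding boxes using only the Python standard library. The names `HocrWordParser`, `parse_hocr_words`, and `SAMPLE` are hypothetical, the sample markup is made up for illustration, and the parser assumes each word is a `span` without nested spans, so it is not a substitute for the dedicated parsers in this list.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from <span class="ocrx_word"> elements.

    Assumption: words are span elements that contain no nested spans.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)",
                              attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open word element.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Made-up hOCR fragment, for illustration only.
SAMPLE = """<span class='ocr_line' title='bbox 10 10 200 40'>
<span class='ocrx_word' title='bbox 10 10 80 40; x_wconf 93'>Hello</span>
<span class='ocrx_word' title='bbox 90 10 200 40; x_wconf 88'>world</span>
</span>"""

if __name__ == "__main__":
    for word, bbox in parse_hocr_words(SAMPLE):
        print(word, bbox)
```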
Awesome OCR / Software / OCR CLI
OCRmyPDF	14,363	7 months ago	OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pdf2PdfOCR	279	over 1 year ago	A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported
Ocrocis			Project manager interface for Ocropy
tesseract-recognize	44	over 1 year ago	Tesseract-based tool that outputs result in Page XML format
Awesome OCR / Software / OCR GUI
moz-hocr-editor	10	over 10 years ago	Firefox Addon for editing hOCR files
qt-box-editor	173	9 months ago	QT4 editor of tesseract-ocr box files
ocr-gt-tools	48	over 4 years ago	Client-Server application for editing OCR ground truth
Paperwork	2,431	about 7 years ago	Using scanners and OCR to grep paper documents the easy way
Paperless	7,864	over 4 years ago	Scan, index, and archive all of your paper documents
gImageReader	1,653	7 months ago	gImageReader is a simple Gtk/Qt front-end to tesseract-ocr
VietOCR			A Java/.NET GUI frontend for Tesseract OCR engine, including a graphical Tesseract editor
PoCoTo	40	over 2 years ago	Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents
OCRFeeder			GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more
PRImA PAGE Viewer	35	about 2 years ago	Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR
LAREX	181	8 months ago	A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books
archiscribe	17	over 7 years ago	Web application for transcribing OCR ground truth from Archive.org; a deployed instance and its results are available online
nw-page-editor	30	over 1 year ago	Simple app for visual editing of Page XML files. Provides desktop and web versions
Awesome OCR / Software / OCR Preprocessing
NoiseRemove.java in MathOCR	168	over 2 years ago	Java implementation of Adaptive degraded document image binarization by B. Gatos, I. Pratikakis, S.J. Perantonis
binarize.c in ZBar	2,503	over 1 year ago	C implementations of two binarization algorithms, based on Sauvola
typeface-corpus	7	over 10 years ago	A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities
binarizewolfjolion	30	almost 8 years ago	Comparison of binarization algorithms
crop_morphology.py in oldnyc	289	8 months ago	Cropping a page to just the text block
Whiteboard Picture Cleaner			Shell one-liner/script to clean up and beautify photos of whiteboards
textcleaner			Fred's ImageMagick script - Processes a scanned document of text to clean the text background
localcontrast			Fast O(1) local contrast optimization
Awesome OCR / Software / OCR as a Service
Open OCR	1,346	almost 2 years ago	Run Tesseract in Docker containers
tesseract-web-service	135	about 2 years ago	An implementation of RESTful web service for tesseract-OCR using tornado
docker-ocropy	9	over 7 years ago	A Docker container for running ocropy
ABBYY Cloud OCR SDK Code samples	504	about 2 years ago	Code samples for using the proprietary commercial ABBYY OCR API
nidaba	86	over 7 years ago	An expandable and scalable OCR pipeline
gamera	39	almost 3 years ago	A meta-framework for building document processing applications, e.g. OCR
ocr-tools	7	about 4 years ago	Project to provide CLI and web service interfaces to common OCR engines
ocrad-docker	2	almost 9 years ago	Run the Ocrad OCR engine in a Docker container
kraken-docker	5	over 7 years ago	Run the kraken OCR engine in a Docker container
Konfuzio			Free online OCR for up to 2,000 pages per month and OCR API by @atraining (code is not open)
ocr.space			Free online OCR and OCR API based on Tesseract (code is not open)
OCR4all	244	over 1 year ago	Provides OCR services through web applications
Awesome OCR / Software / OCR evaluation
ISRI OCR Evaluation Tools
Awesome OCR / Software / OCR evaluation / ISRI OCR Evaluation Tools
isri-ocr-evaluation-tools	57	over 4 years ago	further development (2015, 2016)
ancientgreekocr-evaluation-tools	22	over 7 years ago	further development (2013, 2014)
Awesome OCR / Software / OCR evaluation
ocrevalUAtion	67	almost 3 years ago	Cross-format evaluation, CLI and GUI
ngram-ocr-eval	1	over 11 years ago	Brute and simple OCR evaluation using ngrams
quack	22	over 2 years ago	Quality-Assurance-tool for scans with corresponding ALTO-files
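A common way to score OCR output against ground truth is the character error rate (CER): the edit distance between the two strings divided by the length of the ground truth. The sketch below is a minimal standard-library version of that idea. The function names are hypothetical, and the evaluation tools listed above may normalise text or count errors differently, so treat it as an illustration rather than a replacement for them.

```python
def levenshtein(a, b):
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # delete char_a
                curr[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),   # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the number of ground-truth characters."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    print(character_error_rate("Fraktur", "Frakfur"))
```

Note that with this definition the CER can exceed 1.0 when the OCR output is much longer than the ground truth.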
Awesome OCR / Software / OCR libraries by programming language
tesseract-ocr	13	about 3 years ago	A Crystal wrapper for tesseract-ocr
tesseract_ocr	55	about 3 years ago	Elixir library wrapping the tesseract executable
gosseract	2,751	12 months ago	Golang OCR library, wrapping Tesseract-ocr
Tess4J	1,619	8 months ago	Java Native Access bindings to Tesseract
tess-two	3,761	over 3 years ago	Tools for compiling Tesseract on Android and Java API
tesseract for .net	2,308	over 1 year ago	A .Net wrapper for tesseract-ocr
TTesseractOCR4	145	about 2 years ago	Object Pascal binding for tesseract-ocr 4.x
Tesseract OCR for PHP	2,897	almost 2 years ago	Tesseract PHP bindings
pytesseract	5,919	8 months ago	A Python wrapper for Google Tesseract
pyocr	930	about 7 years ago	A Python wrapper for Tesseract and Cuneiform
ocrodjvu	46	almost 3 years ago	A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
tesserocr	2,026	8 months ago	A Python wrapper for the tesseract-ocr API
ocracy	37	over 10 years ago	Pure JavaScript LSTM RNN implementation based on ocropus
gocr.js	98	over 11 years ago	JavaScript port (emscripten) of gocr
ocrad.js	3,494	almost 5 years ago	JavaScript port (emscripten) of ocrad
tesseract.js	35,553	8 months ago	JavaScript port (emscripten) of Tesseract
node-tesseract-ocr	308	about 2 years ago	A simple wrapper for the Tesseract OCR package
node-tesseract-native	51	over 6 years ago	C++ module for node providing OCR with tesseract and leptonica
rtesseract	838	almost 2 years ago	Ruby library wrapping the tesseract and imagemagick executables
ruby-tesseract	629	about 8 years ago	Native Tesseract bindings for Ruby MRI and JRuby
ocr_space	70	over 6 years ago	API wrapper for the free OCR service ocr.space. Includes CLI
tesseract.rs	148	over 1 year ago	Rust bindings for tesseract OCR
leptess			Productive and safe Rust bindings/wrappers for tesseract and leptonica
tesseract	245	10 months ago	R bindings for tesseract OCR
Tesseract OCR iOS	4,220	about 4 years ago	Swift and Objective-C wrapper for Tesseract OCR
SwiftOCR	4,623	over 4 years ago	Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes
Awesome OCR / Software / OCR training tools
glyph-miner	34	almost 9 years ago	A system for extracting glyphs from early typeset prints
ocrodeg	161	about 5 years ago	Document image degradation for OCR data augmentation
Awesome OCR / Datasets / Ground Truth
archiscribe-corpus	8	over 6 years ago	>4,200 lines transcribed from 19th-century German prints via archiscribe
CIS OCR Test Set	15	about 4 years ago	2 example documents each in German/Latin/Greek with ground truth
Rescribe	11	almost 3 years ago	Transcriptions of Caroline Minuscule Manuscripts
CLTK			Corpora from the Classical Language Toolkit
DIVA-HisDB			150 pages of three medieval manuscripts
EarlyPrintedBooks	10	over 7 years ago	~8,800 lines from several early printed books
EEBO-TCP	18	over 4 years ago	25,363 EEBO documents transcribed by the Text Creation Partnership
ECCO-TCP	18	over 4 years ago	2,188 ECCO documents transcribed by the Text Creation Partnership
eMOP-TCP	3	over 9 years ago	2,188 ECCO-TCP documents, cleaned up by eMOP
Evans-TCP	18	over 4 years ago	4,977 Evans documents transcribed by the Text Creation Partnership
FDHN			Finnish Digitised Historical Newspapers, (free) registration required
FROC-MSS	0	over 6 years ago	4 Old French Medieval Manuscripts
GERMANA			764 Spanish manuscript pages, (free) registration required
GT4HistOCR			Ground Truth for German Fraktur and Early Modern Latin
imagessan	4	almost 7 years ago	Sanskrit images & ground truth (Devanagari script)
IMPACT-BHL			2,418 pages from the Biodiversity Heritage Library
IMPACT-BL			294 pages from the British Library, (free) registration required
IMPACT-BNE			215 pages from the National Library of Spain, (free) registration required
IMPACT-BNF			151 pages from the National Library of France, (free) registration required
IMPACT-KB			142 pages from the National Library of the Netherlands
IMPACT-NKC			187 pages from the Czech National Library, (free) registration required
IMPACT-NLB			19 pages from the National Library of Bulgaria, (free) registration required
IMPACT-NUK			209 pages from the National Library of Slovenia, (free) registration required
IMPACT-PSNC			478 pages from four Polish digital libraries
LascivaRoma/lexical	1	about 2 years ago	Transcription of 19th century lexical resources for Latin learning
MJSynth			9m synthetic images covering 90k English words
OCR19thSAC			19,000 pages of Swiss Alpine Club yearbooks with transcriptions
OCR-D			180 pages of German historical prints
OCR_GS_Data	15	over 2 years ago	Double-checked Arabic Gold Standard
old-books	12	almost 8 years ago	322 old books
PRImA-ENP			528 pages of historic newspapers, (free) registration required
RODRIGO			853 Spanish manuscript pages, (free) registration required
Toebler-OCR	1	over 6 years ago	(Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch
Awesome OCR / Literature / OCR-related publication and link lists
IMPACT: Tools for text digitisation			List of tools and software projects, some related to OCR
OCR-D			List of OCR-related academic articles in the context of the OCR-D project
Mendeley Group "OCR - Optical Character Recognition"			Collection of 34 papers on OCR
eadh.org projects			List of Digital Humanities-related projects in Europe, some related to OCR
Wikipedia: Comparison of optical character recognition software
OCR [and Deep Learning]
Ocropus Wiki: Publications	3,426	about 4 years ago
Awesome OCR / Literature / Blog Posts and Tutorials
Tesseract Blends Old and New OCR Technology	262	almost 4 years ago	(2016)
What You Always Wanted To Know About Tesseract			(2014)
Extracting text from an image using Ocropus			(2015)
Training an Ocropus OCR model			(2015)
Ocropus Wiki: Compute errors and confusions	3,426	about 4 years ago	(2016)
Ocropus Wiki: Working with Ground Truth	3,426	about 4 years ago	(2016)
OCRopus			(2016)
10 Tips for making your OCR project succeed			(2013)
Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
Extracting Text from PDFs; Doing OCR; all within R			How to work with OCR from PDFs in the R programming environment
Tutorial: Command-line OCR on a Mac
Practical Experience with OCRopus Model Training			(2016)
Homemade Manuscript OCR (1): OCRopy			(2017)
Optimizing Binarization for OCRopus			(2017)
Prototype demo for OCR postfix in Danish Newspapers			(2016)
How Can I OCR My Dictionary?			(2016)
"Needlessly complex" blog			(2016) Several image processing how-tos (Python-based), particularly:
Awesome OCR / Literature / Blog Posts and Tutorials / "Needlessly complex" blog
Page dewarping
Compressing and enhancing hand-written notes
Unprojecting text with ellipses
Awesome OCR / Literature / Blog Posts and Tutorials
(Open-Source-)OCR-Workflows			(2017) overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), lots of example images and information on the project
A gentle introduction to OCR			(2018)
Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR			(2019) A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts
Awesome OCR / Literature / OCR Showcases
abbyy-finereader-ocr-senate	129	over 9 years ago	Using OCR to parse scanned Senate Financial Disclosure forms
cvOCR	18	over 8 years ago	An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
MathOCR	168	over 2 years ago	A printed scientific document recognition system
Awesome OCR / Literature / Academic articles
High performance document layout analysis			(2003) Breuel
Adaptive degraded document image binarization			(2006) Gatos, Pratikakis, Perantonis
[Internship Report]			(2007) Gupta
OCRopus Addons (Internship Report)			(2007) Dantrey
Local Logistic Classifiers for Large Scale Learning			(2012) Yousefi, Breuel
High Performance OCR for Printed English and Fraktur using LSTM Networks			(2013) Breuel, Ul-Hasan, Al Azawi, Shafait
Can we build language-independent OCR using LSTM networks?			(2013) Ul-Hasan, Breuel
Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks			(2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel
OCR of historical printings of Latin texts: Problems, Prospects, Progress.			(2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink
Correcting Noisy OCR: Context beats Confusion			(2014) Evershed, Fitch
TypeWright: An Experiment in Participatory Curation			(2015) Bilansky
Benchmarking of LSTM Networks			(2015) Breuel
Recognition of Historical Greek Polytonic Scripts Using LSTM			(2015) Simistira, Ul-Hasan, Papavassiliou, Gatos, Katsouros, Liwicki
A Segmentation-Free Approach for Printed Devanagari Script Recognition			(2015) Karayil, Ul-Hasan, Breuel
A Sequence Learning Approach for Multiple Script Identification			(2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel
Important New Developments in Arabographic Optical Character Recognition (OCR)			(2016) Romanov, Miller, Savant, Kiessling
Awesome OCR / Literature / Academic articles / Important New Developments in Arabographic Optical Character Recognition (OCR)
OpenArabic/OCR_GS_Data	13	about 8 years ago	Ground truth data used in the article
Awesome OCR / Literature / Academic articles
OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus			(2016) Springmann, Lüdeling
Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents			(2016) Springmann, Fink, Schulz
Generic Text Recognition using Long Short-Term Memory Networks			(2016) Ul-Hasan -- Ph.D Thesis
OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters			(2016) Dengel, Ul-Hasan, Bukhari
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild			(2016) Lee, Osindero
Telugu OCR Framework using Deep Learning			(2015/2017) , Hastie
Awesome OCR / Literature / Academic articles / Telugu OCR Framework using Deep Learning
TeluguOCR
Awesome OCR / Literature / Academic articles
A Two-Stage Method for Text Line Detection in Historical Documents			(2018) , Leifert, Strauß, Labahn. Code is available.
