awesome-ocr

OCR toolkit

A curated list of OCR engines, tools, and formats for extracting text from images and documents.

Links to awesome OCR projects


Awesome OCR / Software / OCR engines

tesseract 62,363 10 days ago The definitive Open Source OCR engine
EasyOCR 24,528 about 2 months ago OCR engine built on PyTorch by JaidedAI
ocropus 3,422 over 3 years ago OCR engine based on LSTM
ocropus 0.4 17 about 13 years ago Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
kraken 748 17 days ago Ocropus fork with sane defaults
gocr OCR engine under the GNU Public License led by Joerg Schulenburg
Ocrad The GNU OCR
ocular 255 6 months ago Machine-learning OCR for historic documents
SwiftOCR 4,622 almost 4 years ago fast and simple OCR library written in Swift
attention-ocr 1,077 about 1 year ago OCR engine using visual attention mechanisms
RWTH-OCR The RWTH Aachen University Optical Character Recognition System
simple-ocr-opencv 525 10 months ago A simple pythonic OCR engine using OpenCV and NumPy
Calamari 1,049 9 days ago OCR Engine based on OCRopy and Kraken
doctr 3,859 8 days ago A seamless & high-performing OCR library powered by Deep Learning

Awesome OCR / Software / Older and possibly abandoned OCR engines

Clara OCR Open source OCR in C
Cuneiform CuneiForm OCR was developed by Cognitive Technologies
Eye an experimental Java OCR (image-to-text) application
kognition An omnifont OCR software for KDE
OCRchie Modular Optical Character Recognition Software
ocre o.c.r. easy
xplab A GTK 2 tool for pattern matching
hebOCR 5 almost 9 years ago Hebrew character recognition library (previously named hocr)

Awesome OCR / Software / OCR file formats

abby2hocr.xslt XSLT script
ocr-conversion-scripts 71 over 1 year ago
hocr-tools 370 3 months ago Tools for doing various useful things with hOCR files
hocr-spec 74 3 months ago hOCR 1.2 specification
ocr-transform 180 about 1 month ago CLI tool to convert between hOCR and ALTO
hocr-parser 13 about 9 years ago hOCR Specification Python Parser
hOCRTools 6 over 6 years ago hOCR to ALTO conversion XSLT
ALTO XML Schema 51 4 months ago XML Schema and development of the ALTO XML format
ALTO XML Documentation 39 about 6 years ago Documentation and use cases for ALTO
alto-tools 39 about 1 year ago Various tools to work with ALTO files, Python
AbbyyToAlto 9 over 13 years ago PHP script converting from Abbyy 6 to ALTO XML
TEI-OCR 1 over 8 years ago TEI customization for OCR generated layout and content information
TEI SIG on Libraries Best Practices for TEI in Libraries
GDZ METS/TEI-based GDZ document format
PAGE-XML Schema 66 over 3 years ago XML schema of the PAGE XML format along with documentation and examples
omni:us Pages Format (OPF) XML schema very similar to PAGE XML that has some additional features
py-pagexml 13 about 1 month ago Python library for handling PAGE XML and OPF files

Awesome OCR / Software / OCR CLI

OCRmyPDF 14,140 4 days ago OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pdf2PdfOCR 274 10 months ago A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported
Ocrocis Project manager interface for Ocropy
tesseract-recognize 44 7 months ago Tesseract-based tool that outputs results in Page XML format

Awesome OCR / Software / OCR GUI

moz-hocr-editor 10 over 9 years ago Firefox Addon for editing hOCR files
qt-box-editor 173 about 1 month ago QT4 editor of tesseract-ocr box files
ocr-gt-tools 48 almost 4 years ago Client-Server application for editing OCR ground truth
Paperwork 2,433 over 6 years ago Using scanners and OCR to grep paper documents the easy way
Paperless 7,855 over 3 years ago Scan, index, and archive all of your paper documents
gImageReader 1,634 9 days ago gImageReader is a simple Gtk/Qt front-end to tesseract-ocr
VietOCR A Java/.NET GUI frontend for Tesseract OCR engine, including a graphical Tesseract editor
PoCoTo 40 about 2 years ago Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents
OCRFeeder GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more
PRImA PAGE Viewer 35 over 1 year ago Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR
LAREX 180 10 days ago A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books
archiscribe 17 over 6 years ago Web application for transcribing OCR ground truth from Archive.org
nw-page-editor 30 10 months ago Simple app for visual editing of Page XML files. Provides desktop and versions

Awesome OCR / Software / OCR Preprocessing

NoiseRemove.java in MathOCR 167 about 2 years ago Java implementation of Adaptive degraded document image binarization by B. Gatos, I. Pratikakis, S.J. Perantonis
binarize.c in ZBar 2,499 8 months ago C implementations of two binarization algorithms, based on Sauvola
typeface-corpus 7 almost 10 years ago A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities
binarizewolfjolion 30 over 7 years ago Comparison of binarization algorithms
crop_morphology.py in oldnyc 288 7 days ago Cropping a page to just the text block
Whiteboard Picture Cleaner Shell one-liner/script to clean up and beautify photos of whiteboards
textcleaner Fred's ImageMagick script - Processes a scanned document of text to clean the text background
localcontrast Fast O(1) local contrast optimization

Awesome OCR / Software / OCR as a Service

Open OCR 1,342 about 1 year ago Run Tesseract in Docker containers
tesseract-web-service 135 over 1 year ago An implementation of RESTful web service for tesseract-OCR using tornado
docker-ocropy 9 almost 7 years ago A Docker container for running Ocropy
ABBYY Cloud OCR SDK Code samples 504 over 1 year ago Code samples for using the proprietary commercial ABBYY OCR API
nidaba 86 about 7 years ago An expandable and scalable OCR pipeline
gamera 39 over 2 years ago A meta-framework for building document processing applications, e.g. OCR
ocr-tools 7 over 3 years ago Project to provide CLI and web service interfaces to common OCR engines
ocrad-docker 2 over 8 years ago Run the Ocrad OCR engine in a Docker container
kraken-docker 5 almost 7 years ago Run the Kraken OCR engine in a Docker container
Konfuzio Free online OCR for up to 2,000 pages per month and an OCR API by @atraining (code is not open)
ocr.space Free online OCR and OCR API based on Tesseract (code is not open)
OCR4all 238 10 months ago Provides OCR services through web applications

Awesome OCR / Software / OCR evaluation

ISRI OCR Evaluation Tools

Awesome OCR / Software / OCR evaluation / ISRI OCR Evaluation Tools

isri-ocr-evaluation-tools 57 over 3 years ago Further development (2015, 2016)
ancientgreekocr-evaluation-tools 22 over 6 years ago Further development (2013, 2014)

Awesome OCR / Software / OCR evaluation

ocrevalUAtion 67 about 2 years ago Cross-format evaluation, CLI and GUI
ngram-ocr-eval 1 over 10 years ago Brute and simple OCR evaluation using ngrams
quack 22 almost 2 years ago Quality-Assurance-tool for scans with corresponding ALTO-files

Awesome OCR / Software / OCR libraries by programming language

tesseract-ocr 13 over 2 years ago A Crystal wrapper for tesseract-ocr
tesseract_ocr 54 over 2 years ago Elixir library wrapping the tesseract executable
gosseract 2,718 4 months ago Golang OCR library, wrapping Tesseract-ocr
Tess4J 1,612 26 days ago Java Native Access bindings to Tesseract
tess-two 3,759 over 2 years ago Tools for compiling Tesseract on Android and Java API
tesseract for .net 2,291 7 months ago A .Net wrapper for tesseract-ocr
TTesseractOCR4 145 over 1 year ago Object Pascal binding for tesseract-ocr 4.x
Tesseract OCR for PHP 2,861 about 1 year ago Tesseract PHP bindings
pytesseract 5,861 24 days ago A Python wrapper for Google Tesseract
pyocr 930 over 6 years ago A Python wrapper for Tesseract and Cuneiform
ocrodjvu 45 about 2 years ago A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
tesserocr 2,016 3 months ago A Python wrapper for the tesseract-ocr API
ocracy 37 almost 10 years ago pure javascript lstm rnn implementation based on ocropus
gocr.js 98 almost 11 years ago Javascript port (emscripten) of gocr
ocrad.js 3,492 about 4 years ago Javascript port (emscripten) of ocrad
tesseract.js 35,304 about 1 month ago Javascript port (emscripten) of Tesseract
node-tesseract-ocr 305 over 1 year ago A simple wrapper for the Tesseract OCR package
node-tesseract-native 51 about 6 years ago C++ module for node providing OCR with tesseract and leptonica
rtesseract 828 about 1 year ago Ruby library wrapping the tesseract and imagemagick executables
ruby-tesseract 629 over 7 years ago Native Tesseract bindings for Ruby MRI and JRuby
ocr_space 70 almost 6 years ago API wrapper for free ocr service ocr.space. Includes CLI
tesseract.rs 146 11 months ago Rust bindings for tesseract OCR
leptess Productive and safe Rust bindings/wrappers for tesseract and leptonica
tesseract 245 about 2 months ago R bindings for tesseract OCR
Tesseract OCR iOS 4,220 over 3 years ago Swift and Objective-C wrapper for Tesseract OCR
SwiftOCR 4,622 almost 4 years ago Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes

Awesome OCR / Software / OCR training tools

glyph-miner 34 about 8 years ago A system for extracting glyphs from early typeset prints
ocrodeg 160 over 4 years ago Document image degradation for OCR data augmentation

Awesome OCR / Datasets / Ground Truth

archiscribe-corpus 8 almost 6 years ago >4,200 lines transcribed from 19th Century German prints
CIS OCR Test Set 15 over 3 years ago 2 example documents each in German/Latin/Greek with ground truth
Rescribe 11 about 2 years ago Transcriptions of Caroline Minuscule Manuscripts
CLTK Corpora
DIVA-HisDB 150 pages of three medieval manuscripts
EarlyPrintedBooks 10 almost 7 years ago ~8,800 lines from several early printed books
EEBO-TCP 18 over 3 years ago 25,363 transcribed EEBO documents
ECCO-TCP 18 over 3 years ago 2,188 transcribed ECCO documents
eMOP-TCP 3 almost 9 years ago 2,188 ECCO-TCP documents, cleaned up
Evans-TCP 18 over 3 years ago 4,977 transcribed Evans documents
FDHN Finnish Digitised Historical Newspapers, (free) registration required
FROC-MSS 0 almost 6 years ago 4 Old French Medieval Manuscripts
GERMANA 764 Spanish manuscript pages, (free) registration required
GT4HistOCR Ground Truth for German Fraktur and Early Modern Latin
imagessan 4 about 6 years ago Sanskrit images & ground truth (Devanagari script)
IMPACT-BHL 2,418 pages from the Biodiversity Heritage Library
IMPACT-BL 294 pages from the British Library, (free) registration required
IMPACT-BNE 215 pages from the National Library of Spain, (free) registration required
IMPACT-BNF 151 pages from the National Library of France, (free) registration required
IMPACT-KB 142 pages from the National Library of the Netherlands
IMPACT-NKC 187 pages from the Czech National Library, (free) registration required
IMPACT-NLB 19 pages from the National Library of Bulgaria, (free) registration required
IMPACT-NUK 209 pages from the National Library of Slovenia, (free) registration required
IMPACT-PSNC 478 pages from four Polish digital libraries
LascivaRoma/lexical 1 over 1 year ago Transcription of 19th century lexical resources for Latin learning
MJSynth 9m synthetic images covering 90k English words
OCR19thSAC 19,000 pages of transcribed Swiss Alpine Club yearbooks
OCR-D 180 pages of German historical prints
OCR_GS_Data 15 almost 2 years ago Double-checked Arabic Gold Standard
old-books 12 about 7 years ago 322 old books
PRImA-ENP 528 pages of historic newspapers, (free) registration required
RODRIGO 853 Spanish manuscript pages, (free) registration required
Toebler-OCR 1 almost 6 years ago (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch

Awesome OCR / Literature / OCR-related publication and link lists

IMPACT: Tools for text digitisation List of tools and software projects, some related to OCR
OCR-D List of OCR-related academic articles in the context of the OCR-D project
Mendeley Group "OCR - Optical Character Recognition" Collection of 34 papers on OCR
eadh.org projects List of Digital Humanities-related projects in Europe, some related to OCR
Wikipedia: Comparison of optical character recognition software
OCR [and Deep Learning]
Ocropus Wiki: Publications 3,422 over 3 years ago

Awesome OCR / Literature / Blog Posts and Tutorials

Tesseract Blends Old and New OCR Technology 260 about 3 years ago (2016)
What You Always Wanted To Know About Tesseract (2014)
Extracting text from an image using Ocropus (2015)
Training an Ocropus OCR model (2015)
Ocropus Wiki: Compute errors and confusions 3,422 over 3 years ago (2016)
Ocropus Wiki: Working with Ground Truth 3,422 over 3 years ago (2016)
OCRopus (2016)
10 Tips for making your OCR project succeed (2013)
Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
Extracting Text from PDFs; Doing OCR; all within R

Awesome OCR / Literature / Blog Posts and Tutorials / Extracting Text from PDFs; Doing OCR; all within R

How to work with OCR from PDFs in the R programming environment

Awesome OCR / Literature / Blog Posts and Tutorials

Tutorial: Command-line OCR on a Mac
Practical Experience with OCRopus Model Training (2016)
Homemade Manuscript OCR (1): OCRopy (2017)
Optimizing Binarization for OCRopus (2017)
Prototype demo for OCR postfix in Danish Newspapers (2016)
How Can I OCR My Dictionary? (2016)
"Needlessly complex" blog (2016). Several image processing how-tos (Python based), particularly:

Awesome OCR / Literature / Blog Posts and Tutorials / "Needlessly complex" blog

Page dewarping
Compressing and enhancing hand-written notes
Unprojecting text with ellipses

Awesome OCR / Literature / Blog Posts and Tutorials

(Open-Source-)OCR-Workflows (2017) overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), lots of example images and information on the project
A gentle introduction to OCR (2018)
Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR (2019) A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts

Awesome OCR / Literature / OCR Showcases

abbyy-finereader-ocr-senate 129 over 8 years ago Using OCR to parse scanned Senate Financial Disclosure forms
cvOCR 18 about 8 years ago An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
MathOCR 167 about 2 years ago A printed scientific document recognition system

Awesome OCR / Literature / Academic articles

High performance document layout analysis (2003) Breuel
Adaptive degraded document image binarization (2006) Gatos, Pratikakis, Perantonis
[Internship Report] (2007) Gupta
OCRopus Addons (Internship Report) (2007) Dantrey
Local Logistic Classifiers for Large Scale Learning (2012) Yousefi, Breuel
High Performance OCR for Printed English and Fraktur using LSTM Networks (2013) Breuel, Ul-Hasan, Al Azawi, Shafait
Can we build language-independent OCR using LSTM networks? (2013) Ul-Hasan, Breuel
Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks (2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel
OCR of historical printings of Latin texts: Problems, Prospects, Progress. (2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink
Correcting Noisy OCR: Context beats Confusion (2014) Evershed, Fitch
TypeWright: An Experiment in Participatory Curation (2015) Bilansky
Benchmarking of LSTM Networks (2015) Breuel
Recognition of Historical Greek Polytonic Scripts Using LSTM (2015) Simistira, Ul-Hasan, Papavassiliou, Gatos, Katsouros, Liwicki
A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015) Karayil, Ul-Hasan, Breuel
A Sequence Learning Approach for Multiple Script Identification (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel
Important New Developments in Arabographic Optical Character Recognition (OCR) (2016) Romanov, Miller, Savant, Kiessling

Awesome OCR / Literature / Academic articles / Important New Developments in Arabographic Optical Character Recognition (OCR)

OpenArabic/OCR_GS_Data 13 over 7 years ago used for ground truth data

Awesome OCR / Literature / Academic articles

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus (2016) Springmann, Lüdeling
Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents (2016) Springmann, Fink, Schulz
Generic Text Recognition using Long Short-Term Memory Networks (2016) Ul-Hasan -- Ph.D Thesis
OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters (2016) Dengel, Ul-Hasan, Bukhari
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016) Lee, Osindero
Telugu OCR Framework using Deep Learning (2015/2017) , Hastie

Awesome OCR / Literature / Academic articles / Telugu OCR Framework using Deep Learning

TeluguOCR

Awesome OCR / Literature / Academic articles

A Two-Stage Method for Text Line Detection in Historical Documents (2018) , Leifert, Strauß, Labahn. Code available at
