awesome-ocr
OCR toolkit
A curated list of OCR engines, tools, and formats for extracting text from images and documents.
Links to awesome OCR projects
3k stars
129 watching
349 forks
last commit: 5 months ago
Linked from 4 awesome lists
Awesome OCR / Software / OCR engines | |||
tesseract | 62,363 | 10 days ago | The definitive Open Source OCR engine |
EasyOCR | 24,528 | about 2 months ago | OCR engine built on PyTorch by JaidedAI |
ocropus | 3,422 | over 3 years ago | OCR engine based on LSTM |
ocropus 0.4 | 17 | about 13 years ago | Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++ |
kraken | 748 | 17 days ago | Ocropus fork with sane defaults |
gocr | OCR engine under the GNU Public License led by Joerg Schulenburg | ||
Ocrad | The GNU OCR | ||
ocular | 255 | 6 months ago | Machine-learning OCR for historic documents |
SwiftOCR | 4,622 | almost 4 years ago | fast and simple OCR library written in Swift |
attention-ocr | 1,077 | about 1 year ago | OCR engine using visual attention mechanisms |
RWTH-OCR | The RWTH Aachen University Optical Character Recognition System | ||
simple-ocr-opencv | 525 | 10 months ago | A simple pythonic OCR engine using OpenCV and NumPy |
Calamari | 1,049 | 9 days ago | OCR Engine based on OCRopy and Kraken |
doctr | 3,859 | 8 days ago | A seamless & high-performing OCR library powered by Deep Learning |
Awesome OCR / Software / Older and possibly abandoned OCR engines | |||
Clara OCR | Open source OCR in C | ||
Cuneiform | CuneiForm OCR was developed by Cognitive Technologies | ||
Eye | an experimental Java OCR (image-to-text) application | ||
kognition | An omnifont OCR software for KDE | ||
OCRchie | Modular Optical Character Recognition Software | ||
ocre | o.c.r. easy | ||
xplab | A GTK 2 tool for pattern matching | ||
hebOCR | 5 | almost 9 years ago | Hebrew character recognition library (previously named hocr) |
Awesome OCR / Software / OCR file formats | |||
abby2hocr.xslt XSLT script | |||
ocr-conversion-scripts | 71 | over 1 year ago | |
hocr-tools | 370 | 3 months ago | Tools for doing various useful things with hOCR files |
hocr-spec | 74 | 3 months ago | hOCR 1.2 specification |
ocr-transform | 180 | about 1 month ago | CLI tool to convert between hOCR and ALTO |
hocr-parser | 13 | about 9 years ago | hOCR Specification Python Parser |
hOCRTools | 6 | over 6 years ago | hOCR to ALTO conversion XSLT |
ALTO XML Schema | 51 | 4 months ago | XML Schema and development of the ALTO XML format |
ALTO XML Documentation | 39 | about 6 years ago | Documentation and use cases for ALTO |
alto-tools | 39 | about 1 year ago | Various tools to work with ALTO files, Python |
AbbyyToAlto | 9 | over 13 years ago | PHP script converting from Abbyy 6 to ALTO XML |
TEI-OCR | 1 | over 8 years ago | TEI customization for OCR generated layout and content information |
TEI SIG on Libraries | Best Practices for TEI in Libraries | ||
GDZ | METS/TEI-based GDZ document format | ||
PAGE-XML Schema | 66 | over 3 years ago | XML schema of the PAGE XML format along with documentation and examples |
omni:us Pages Format (OPF) | XML schema very similar to PAGE XML that has some additional features | ||
py-pagexml | 13 | about 1 month ago | Python library for handling PAGE XML and OPF files |
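
Several of the formats above store layout as bounding boxes attached to text. As a rough illustration for hOCR, the sketch below pulls words and their `bbox` values out of an hOCR fragment using only the Python standard library. It assumes words are marked with the `ocrx_word` class and carry a `bbox x0 y0 x1 y1` property in the `title` attribute; the names `HocrWordParser`, `parse_hocr_words`, and `SAMPLE` are made up for this example and are not part of any project listed here. Unclosed void tags such as `<br>` inside a word are not handled.

```python
import re
from html.parser import HTMLParser

# "bbox x0 y0 x1 y1" inside the title attribute (assumed layout of the property).
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, bbox) tuples
        self._depth = 0    # > 0 while inside a word element
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup inside a word, e.g. <strong>.
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._chunks = []
            self._depth = 1

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.words.append(("".join(self._chunks).strip(), self._bbox))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def parse_hocr_words(hocr: str):
    """Return a list of (word, (x0, y0, x1, y1)) tuples from an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, written by hand for this example.
SAMPLE = """<span class='ocr_line' title='bbox 10 20 300 60'>
<span class='ocrx_word' title='bbox 10 20 120 60; x_wconf 93'>Hello</span>
<span class='ocrx_word' title='bbox 140 20 300 60; x_wconf 88'>world</span>
</span>"""

if __name__ == "__main__":
    for text, bbox in parse_hocr_words(SAMPLE):
        print(text, bbox)
```

For real conversion or validation work, the dedicated tools in this section are the better starting point.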
Awesome OCR / Software / OCR CLI | |||
OCRmyPDF | 14,140 | 4 days ago | OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched |
Pdf2PdfOCR | 274 | 10 months ago | A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported |
Ocrocis | Project manager interface for Ocropy | ||
tesseract-recognize | 44 | 7 months ago | Tesseract-based tool that outputs results in Page XML format |
Awesome OCR / Software / OCR GUI | |||
moz-hocr-editor | 10 | over 9 years ago | Firefox Addon for editing hOCR files |
qt-box-editor | 173 | about 1 month ago | Qt4 editor of tesseract-ocr box files |
ocr-gt-tools | 48 | almost 4 years ago | Client-Server application for editing OCR ground truth |
Paperwork | 2,433 | over 6 years ago | Using scanners and OCR to grep paper documents the easy way |
Paperless | 7,855 | over 3 years ago | Scan, index, and archive all of your paper documents |
gImageReader | 1,634 | 9 days ago | gImageReader is a simple Gtk/Qt front-end to tesseract-ocr |
VietOCR | A Java/.NET GUI frontend for the Tesseract OCR engine, including a graphical Tesseract editor | ||
PoCoTo | 40 | about 2 years ago | Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents |
OCRFeeder | GTK graphical user interface that allows users to correct characters or bounding boxes, with ODT export and more | ||
PRImA PAGE Viewer | 35 | over 1 year ago | Java-based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and hOCR |
LAREX | 180 | 10 days ago | A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books |
archiscribe | 17 | over 6 years ago | Web application for transcribing OCR ground truth from Archive.org. A deployed instance and the resulting transcriptions are available online |
nw-page-editor | 30 | 10 months ago | Simple app for visual editing of Page XML files. Provides desktop and web versions |
Awesome OCR / Software / OCR Preprocessing | |||
NoiseRemove.java in MathOCR | 167 | about 2 years ago | Java implementation of Adaptive degraded document image binarization by B. Gatos, I. Pratikakis, S.J. Perantonis |
binarize.c in ZBar | 2,499 | 8 months ago | C implementations of two binarization algorithms, based on Sauvola |
typeface-corpus | 7 | almost 10 years ago | A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities |
binarizewolfjolion | 30 | over 7 years ago | Comparison of binarization algorithms |
crop_morphology.py in oldnyc | 288 | 7 days ago | Cropping a page to just the text block |
Whiteboard Picture Cleaner | Shell one-liner/script to clean up and beautify photos of whiteboards | ||
textcleaner | Fred's ImageMagick script - Processes a scanned document of text to clean the text background | ||
localcontrast | Fast O(1) local contrast optimization | ||
Awesome OCR / Software / OCR as a Service | |||
Open OCR | 1,342 | about 1 year ago | Run Tesseract in Docker containers |
tesseract-web-service | 135 | over 1 year ago | An implementation of RESTful web service for tesseract-OCR using tornado |
docker-ocropy | 9 | almost 7 years ago | A Docker container for running ocropy |
ABBYY Cloud OCR SDK Code samples | 504 | over 1 year ago | Code samples for using the proprietary commercial ABBYY OCR API |
nidaba | 86 | about 7 years ago | An expandable and scalable OCR pipeline |
gamera | 39 | over 2 years ago | A meta-framework for building document processing applications, e.g. OCR |
ocr-tools | 7 | over 3 years ago | Project to provide CLI and web service interfaces to common OCR engines |
ocrad-docker | 2 | over 8 years ago | Run the OCR engine in a docker container |
kraken-docker | 5 | almost 7 years ago | Run the OCR engine in a docker container |
Konfuzio | Free online OCR for up to 2,000 pages per month and an OCR API by @atraining (code is not open) | ||
ocr.space | Free online OCR and OCR API based on Tesseract (code is not open) | ||
OCR4all | 238 | 10 months ago | Provides OCR services through web applications |
Awesome OCR / Software / OCR evaluation | |||
ISRI OCR Evaluation Tools | |||
Awesome OCR / Software / OCR evaluation / ISRI OCR Evaluation Tools | |||
isri-ocr-evaluation-tools | 57 | over 3 years ago | further development (2015, 2016) |
ancientgreekocr-evaluation-tools | 22 | over 6 years ago | further development (2013, 2014) |
Awesome OCR / Software / OCR evaluation | |||
ocrevalUAtion | 67 | about 2 years ago | Cross-format evaluation, CLI and GUI |
ngram-ocr-eval | 1 | over 10 years ago | Brute and simple OCR evaluation using ngrams |
quack | 22 | almost 2 years ago | Quality-Assurance-tool for scans with corresponding ALTO-files |
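
A common measure in OCR evaluation is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The following is a minimal standard-library sketch of that idea, not the method used by any of the tools above. The function names `levenshtein` and `character_error_rate` are chosen for this example, and the sketch assumes unit cost for insertions, deletions, and substitutions with no text normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # One substituted character out of four.
    print(character_error_rate("abcd", "abxd"))
```

Because the OCR output can be longer than the ground truth, this ratio can exceed 1.0. Real evaluation tools typically also report per-character confusions and word-level accuracy.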
Awesome OCR / Software / OCR libraries by programming language | |||
tesseract-ocr | 13 | over 2 years ago | A Crystal wrapper for tesseract-ocr |
tesseract_ocr | 54 | over 2 years ago | Elixir library wrapping the tesseract executable |
gosseract | 2,718 | 4 months ago | Golang OCR library, wrapping Tesseract-ocr |
Tess4J | 1,612 | 26 days ago | Java Native Access bindings to Tesseract |
tess-two | 3,759 | over 2 years ago | Tools for compiling Tesseract on Android and Java API |
tesseract for .net | 2,291 | 7 months ago | A .Net wrapper for tesseract-ocr |
TTesseractOCR4 | 145 | over 1 year ago | Object Pascal binding for tesseract-ocr 4.x |
Tesseract OCR for PHP | 2,861 | about 1 year ago | Tesseract PHP bindings |
pytesseract | 5,861 | 24 days ago | A Python wrapper for Google Tesseract |
pyocr | 930 | over 6 years ago | A Python wrapper for Tesseract and Cuneiform |
ocrodjvu | 45 | about 2 years ago | A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract |
tesserocr | 2,016 | 3 months ago | A Python wrapper for the tesseract-ocr API |
ocracy | 37 | almost 10 years ago | pure javascript lstm rnn implementation based on ocropus |
gocr.js | 98 | almost 11 years ago | Javascript port (emscripten) of gocr |
ocrad.js | 3,492 | about 4 years ago | Javascript port (emscripten) of ocrad |
tesseract.js | 35,304 | about 1 month ago | Javascript port (emscripten) of Tesseract |
node-tesseract-ocr | 305 | over 1 year ago | A simple wrapper for the Tesseract OCR package |
node-tesseract-native | 51 | about 6 years ago | C++ module for node providing OCR with tesseract and leptonica |
rtesseract | 828 | about 1 year ago | Ruby library wrapping the tesseract and imagemagick executables |
ruby-tesseract | 629 | over 7 years ago | Native Tesseract bindings for Ruby MRI and JRuby |
ocr_space | 70 | almost 6 years ago | API wrapper for free ocr service ocr.space. Includes CLI |
tesseract.rs | 146 | 11 months ago | Rust bindings for tesseract OCR |
leptess | Productive and safe Rust bindings/wrappers for tesseract and leptonica | ||
tesseract | 245 | about 2 months ago | R bindings for tesseract OCR |
Tesseract OCR iOS | 4,220 | over 3 years ago | Swift and Objective-C wrapper for Tesseract OCR |
SwiftOCR | 4,622 | almost 4 years ago | Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes |
Awesome OCR / Software / OCR training tools | |||
glyph-miner | 34 | about 8 years ago | A system for extracting glyphs from early typeset prints |
ocrodeg | 160 | over 4 years ago | Document image degradation for OCR data augmentation |
Awesome OCR / Datasets / Ground Truth | |||
archiscribe-corpus | 8 | almost 6 years ago | >4,200 lines transcribed from 19th-century German prints |
CIS OCR Test Set | 15 | over 3 years ago | Two example documents each in German, Latin, and Greek, with ground truth |
Rescribe | 11 | about 2 years ago | Transcriptions of Caroline Minuscule Manuscripts |
CLTK | Corpora from the Classical Language Toolkit | ||
DIVA-HisDB | 150 pages of three medieval manuscripts | ||
EarlyPrintedBooks | 10 | almost 7 years ago | ~8,800 lines from several early printed books |
EEBO-TCP | 18 | over 3 years ago | 25,363 transcribed EEBO documents |
ECCO-TCP | 18 | over 3 years ago | 2,188 transcribed ECCO documents |
eMOP-TCP | 3 | almost 9 years ago | 2,188 ECCO-TCP documents, cleaned up |
Evans-TCP | 18 | over 3 years ago | 4,977 transcribed Evans documents |
FDHN | Finnish Digitised Historical Newspapers, (free) registration required | ||
FROC-MSS | 0 | almost 6 years ago | 4 Old French Medieval Manuscripts |
GERMANA | 764 Spanish manuscript pages, (free) registration required | ||
GT4HistOCR | Ground Truth for German Fraktur and Early Modern Latin | ||
imagessan | 4 | about 6 years ago | Sanskrit images & ground truth (Devanagari script) |
IMPACT-BHL | 2,418 pages from the Biodiversity Heritage Library | ||
IMPACT-BL | 294 pages from the British Library, (free) registration required | ||
IMPACT-BNE | 215 pages from the National Library of Spain, (free) registration required | ||
IMPACT-BNF | 151 pages from the National Library of France, (free) registration required | ||
IMPACT-KB | 142 pages from the National Library of the Netherlands | ||
IMPACT-NKC | 187 pages from the Czech National Library, (free) registration required | ||
IMPACT-NLB | 19 pages from the National Library of Bulgaria, (free) registration required | ||
IMPACT-NUK | 209 pages from the National Library of Slovenia, (free) registration required | ||
IMPACT-PSNC | 478 pages from four Polish digital libraries | ||
LascivaRoma/lexical | 1 | over 1 year ago | Transcription of 19th century lexical resources for Latin learning |
MJSynth | 9m synthetic images covering 90k English words | ||
OCR19thSAC | 19,000 transcribed pages of Swiss Alpine Club yearbooks | ||
OCR-D | 180 pages of German historical prints | ||
OCR_GS_Data | 15 | almost 2 years ago | Double-checked Arabic Gold Standard |
old-books | 12 | about 7 years ago | 322 old books |
PRImA-ENP | 528 pages of historic newspapers, (free) registration required | ||
RODRIGO | 853 Spanish manuscript pages, (free) registration required | ||
Toebler-OCR | 1 | almost 6 years ago | (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch |
Awesome OCR / Literature / OCR-related publication and link lists | |||
IMPACT: Tools for text digitisation | List of software tools and projects for text digitisation, some related to OCR | ||
OCR-D | List of OCR-related academic articles in the context of the OCR-D project | ||
Mendeley Group "OCR - Optical Character Recognition" | Collection of 34 papers on OCR | ||
eadh.org projects | List of Digital Humanities-related projects in Europe, some related to OCR | ||
Wikipedia: Comparison of optical character recognition software | |||
OCR [and Deep Learning] | |||
Ocropus Wiki: Publications | 3,422 | over 3 years ago | |
Awesome OCR / Literature / Blog Posts and Tutorials | |||
Tesseract Blends Old and New OCR Technology | 260 | about 3 years ago | (2016) |
What You Always Wanted To Know About Tesseract | (2014) | ||
Extracting text from an image using Ocropus | (2015) | ||
Training an Ocropus OCR model | (2015) | ||
Ocropus Wiki: Compute errors and confusions | 3,422 | over 3 years ago | (2016) |
Ocropus Wiki: Working with Ground Truth | 3,422 | over 3 years ago | (2016) |
OCRopus | (2016) | ||
10 Tips for making your OCR project succeed | (2013) | ||
Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology | - | ||
Extracting Text from PDFs; Doing OCR; all within R | How to work with OCR from PDFs in the R programming environment | ||
Tutorial: Command-line OCR on a Mac | |||
Practical Experience with OCRopus Model Training | (2016) | ||
Homemade Manuscript OCR (1): OCRopy | (2017) | ||
Optimizing Binarization for OCRopus | (2017) | ||
Prototype demo for OCR postfix in Danish Newspapers | (2016) | ||
How Can I OCR My Dictionary? | (2016) | ||
"Needlessly complex" blog | (2016) Several image processing how-tos (Python-based), particularly: | ||
Awesome OCR / Literature / Blog Posts and Tutorials / "Needlessly complex" blog | |||
Page dewarping | |||
Compressing and enhancing hand-written notes | |||
Unprojecting text with ellipses | |||
Awesome OCR / Literature / Blog Posts and Tutorials | |||
(Open-Source-)OCR-Workflows | (2017) Overview of the state of the art in open-source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), with many example images and information on the project | ||
A gentle introduction to OCR | (2018) | ||
Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR | (2019) A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts | ||
Awesome OCR / Literature / OCR Showcases | |||
abbyy-finereader-ocr-senate | 129 | over 8 years ago | Using OCR to parse scanned Senate Financial Disclosure forms |
cvOCR | 18 | about 8 years ago | An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract |
MathOCR | 167 | about 2 years ago | A printed scientific document recognition system |
Awesome OCR / Literature / Academic articles | |||
High performance document layout analysis | (2003) Breuel | ||
Adaptive degraded document image binarization | (2006) Gatos, Pratikakis, Perantonis | ||
[Internship Report] | (2007) Gupta | ||
OCRopus Addons (Internship Report) | (2007) Dantrey | ||
Local Logistic Classifiers for Large Scale Learning | (2012) Yousefi, Breuel | ||
High Performance OCR for Printed English and Fraktur using LSTM Networks | (2013) Breuel, Ul-Hasan, Al Azawi, Shafait | ||
Can we build language-independent OCR using LSTM networks? | (2013) Ul-Hasan, Breuel | ||
Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks | (2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel | ||
OCR of historical printings of Latin texts: Problems, Prospects, Progress. | (2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink | ||
Correcting Noisy OCR: Context beats Confusion | (2014) Evershed, Fitch | ||
TypeWright: An Experiment in Participatory Curation | (2015) Bilansky | ||
Benchmarking of LSTM Networks | (2015) Breuel | ||
Recognition of Historical Greek Polytonic Scripts Using LSTM | (2015) Simistira, Ul-Hassan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki | ||
A Segmentation-Free Approach for Printed Devanagari Script Recognition | (2015) Karayil, Ul-Hasan, Breuel | ||
A Sequence Learning Approach for Multiple Script Identification | (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel | ||
Important New Developments in Arabographic Optical Character Recognition (OCR) | (2016) Romanov, Miller, Savant, Kiessling | ||
Awesome OCR / Literature / Academic articles / Important New Developments in Arabographic Optical Character Recognition (OCR) | |||
OpenArabic/OCR_GS_Data | 13 | over 7 years ago | used for ground truth data |
Awesome OCR / Literature / Academic articles | |||
OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus | (2016) Springmann, Lüdeling | ||
Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents | (2016) Springmann, Fink, Schulz | ||
Generic Text Recognition using Long Short-Term Memory Networks | (2016) Ul-Hasan, Ph.D. thesis | ||
OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters | (2016) Dengel, Ul-Hasan, Bukhari | ||
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild | (2016) Lee, Osindero | ||
Telugu OCR Framework using Deep Learning | (2015/2017) , Hastie | ||
Awesome OCR / Literature / Academic articles / Telugu OCR Framework using Deep Learning | |||
TeluguOCR | |||
Awesome OCR / Literature / Academic articles | |||
A Two-Stage Method for Text Line Detection in Historical Documents | (2018) , Leifert, Strauß, Labahn. Code is available online