awesome-ocr
OCR toolkit
A curated list of OCR engines, tools, and formats for extracting text from images and documents.
Links to awesome OCR projects
3k stars
128 watching
352 forks
last commit: over 1 year ago
Linked from 4 awesome lists
Awesome OCR / Software / OCR engines | |||
| tesseract | 63,142 | 11 months ago | The definitive Open Source OCR engine |
| EasyOCR | 24,876 | about 1 year ago | OCR engine built on PyTorch by JaidedAI |
| ocropus | 3,426 | over 4 years ago | OCR engine based on LSTM |
| ocropus 0.4 | 17 | about 14 years ago | Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++ |
| kraken | 757 | 11 months ago | Ocropus fork with sane defaults |
| gocr | OCR engine under the GNU General Public License, led by Joerg Schulenburg | ||
| Ocrad | The GNU OCR | ||
| ocular | 256 | over 1 year ago | Machine-learning OCR for historic documents |
| SwiftOCR | 4,623 | almost 5 years ago | fast and simple OCR library written in Swift |
| attention-ocr | 1,079 | about 2 years ago | OCR engine using visual attention mechanisms |
| RWTH-OCR | The RWTH Aachen University Optical Character Recognition System | ||
| simple-ocr-opencv | 525 | almost 2 years ago | A simple pythonic OCR engine using opencv and numpy |
| Calamari | 1,056 | 12 months ago | OCR Engine based on OCRopy and Kraken |
| doctr | 4,011 | 11 months ago | A seamless & high-performing OCR library powered by Deep Learning |
Awesome OCR / Software / Older and possibly abandoned OCR engines | |||
| Clara OCR | Open source OCR in C | ||
| Cuneiform | CuneiForm OCR was developed by Cognitive Technologies | ||
| Eye | an experimental Java OCR (image-to-text) application | ||
| kognition | An omnifont OCR software for KDE | ||
| OCRchie | Modular Optical Character Recognition Software | ||
| ocre | o.c.r. easy | ||
| xplab | A GTK 2 tool for pattern matching | ||
| hebOCR | 5 | over 9 years ago | Hebrew character recognition library (previously named hocr) |
Awesome OCR / Software / OCR file formats | |||
| abby2hocr.xslt | XSLT script | ||
| ocr-conversion-scripts | 72 | over 2 years ago | |
| hocr-tools | 373 | about 1 year ago | Tools for doing various useful things with hOCR files |
| hocr-spec | 74 | about 1 year ago | hOCR 1.2 specification |
| ocr-transform | 182 | about 1 year ago | CLI tool to convert between hOCR and ALTO |
| hocr-parser | 13 | about 10 years ago | hOCR Specification Python Parser |
| hOCRTools | 6 | over 7 years ago | hOCR to ALTO conversion XSLT |
| ALTO XML Schema | 52 | over 1 year ago | XML Schema and development of the ALTO XML format |
| ALTO XML Documentation | 39 | about 7 years ago | Documentation and use cases for ALTO |
| alto-tools | 40 | about 2 years ago | Various tools to work with ALTO files, Python |
| AbbyyToAlto | 9 | over 14 years ago | PHP script converting from Abbyy 6 to ALTO XML |
| TEI-OCR | 1 | over 9 years ago | TEI customization for OCR generated layout and content information |
| TEI SIG on Libraries | Best Practices for TEI in Libraries | ||
| GDZ | METS/TEI-based GDZ document format | ||
| PAGE-XML Schema | 66 | over 4 years ago | XML schema of the PAGE XML format along with documentation and examples |
| omni:us Pages Format (OPF) | XML schema very similar to PAGE XML that has some additional features | ||
| py-pagexml | 13 | about 1 year ago | Python library for handling PAGE XML and OPF files |
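For readers new to hOCR, the following is a minimal sketch of reading word boxes from an hOCR fragment using only the Python standard library. It relies on the hOCR convention that layout properties such as `bbox x0 y0 x1 y1` live in the `title` attribute of elements like `ocrx_word`. The class name `HocrWordParser`, the helper `parse_hocr_words`, and the sample markup are illustrative assumptions, not part of any tool listed above. The sketch also assumes that words contain no nested `span` elements.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for elements with class "ocrx_word"."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, (x0, y0, x1, y1)) tuples
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            # hOCR stores properties in the title attribute, e.g.
            # "bbox 10 10 90 40; x_wconf 96".
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes no <span> is nested inside a word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (text, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


# Hypothetical sample markup for illustration only.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 200 40'>
  <span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 100 10 200 40; x_wconf 91'>world</span>
</span>
"""

for text, bbox in parse_hocr_words(SAMPLE):
    print(text, bbox)
```

For real documents, the dedicated parsers and converters listed in this section are likely a better choice than a hand-rolled parser.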
Awesome OCR / Software / OCR CLI | |||
| OCRmyPDF | 14,363 | 11 months ago | OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched |
| Pdf2PdfOCR | 279 | almost 2 years ago | A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported |
| Ocrocis | Project manager interface for Ocropy | ||
| tesseract-recognize | 44 | over 1 year ago | Tesseract-based tool that outputs result in Page XML format |
Awesome OCR / Software / OCR GUI | |||
| moz-hocr-editor | 10 | over 10 years ago | Firefox Addon for editing hOCR files |
| qt-box-editor | 173 | about 1 year ago | QT4 editor of tesseract-ocr box files |
| ocr-gt-tools | 48 | almost 5 years ago | Client-Server application for editing OCR ground truth |
| Paperwork | 2,431 | over 7 years ago | Using scanners and OCR to grep paper documents the easy way |
| Paperless | 7,864 | over 4 years ago | Scan, index, and archive all of your paper documents |
| gImageReader | 1,653 | 11 months ago | gImageReader is a simple Gtk/Qt front-end to tesseract-ocr |
| VietOCR | A Java/.NET GUI frontend for Tesseract OCR engine, including a graphical Tesseract editor | ||
| PoCoTo | 40 | about 3 years ago | Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents |
| OCRFeeder | GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more | ||
| PRImA PAGE Viewer | 35 | over 2 years ago | Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR |
| LAREX | 181 | 11 months ago | A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books |
| archiscribe | 17 | over 7 years ago | Web application for transcribing OCR ground truth from Archive.org |
| nw-page-editor | 30 | almost 2 years ago | Simple app for visual editing of Page XML files. Provides desktop and web versions |
Awesome OCR / Software / OCR Preprocessing | |||
| NoiseRemove.java in MathOCR | 168 | almost 3 years ago | Java implementation of "Adaptive degraded document image binarization" by B. Gatos, I. Pratikakis, S.J. Perantonis |
| binarize.c in ZBar | 2,503 | over 1 year ago | C implementations of two binarization algorithms, based on Sauvola |
| typeface-corpus | 7 | almost 11 years ago | A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities |
| binarizewolfjolion | 30 | about 8 years ago | Comparison of binarization algorithms |
| crop_morphology.py in oldnyc | 289 | 11 months ago | Cropping a page to just the text block |
| Whiteboard Picture Cleaner | Shell one-liner/script to clean up and beautify photos of whiteboards | ||
| textcleaner | Fred's ImageMagick script - Processes a scanned document of text to clean the text background | ||
| localcontrast | Fast O(1) local contrast optimization | ||
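Several entries above deal with binarization. As a rough illustration of the Sauvola method named in the ZBar entry, the sketch below computes a local threshold `T = m * (1 + k * (s / R - 1))` from the mean `m` and standard deviation `s` of a window around each pixel. It is a deliberately slow pure-Python version for tiny images, not the implementation used by any listed project. The function name `sauvola_binarize` and the defaults (`k=0.5`, `r=128` for 8-bit images, a 3x3 window) are assumptions chosen for the example.

```python
from math import sqrt


def sauvola_binarize(image, window=3, k=0.5, r=128.0):
    """Return a grid of booleans, True where a pixel is classified as ink.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the window around (x, y), clipped at the image border.
            values = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            # Sauvola threshold: lower in flat regions, closer to the mean
            # where local contrast (std) is high.
            threshold = mean * (1 + k * (std / r - 1))
            row.append(image[y][x] <= threshold)
        result.append(row)
    return result


# Hypothetical 3x3 test image: one dark pixel on a light background.
PAGE = [
    [200, 200, 200],
    [200, 20, 200],
    [200, 200, 200],
]

for row in sauvola_binarize(PAGE):
    print("".join("#" if ink else "." for ink in row))
```

Production code would typically compute the window statistics with integral images or an image-processing library rather than nested loops.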
Awesome OCR / Software / OCR as a Service | |||
| Open OCR | 1,346 | about 2 years ago | Run Tesseract in Docker containers |
| tesseract-web-service | 135 | over 2 years ago | An implementation of RESTful web service for tesseract-OCR using tornado |
| docker-ocropy | 9 | almost 8 years ago | A Docker container for running ocropy |
| ABBYY Cloud OCR SDK Code samples | 504 | over 2 years ago | Code samples for using the proprietary commercial ABBYY OCR API |
| nidaba | 86 | almost 8 years ago | An expandable and scalable OCR pipeline |
| gamera | 39 | about 3 years ago | A meta-framework for building document processing applications, e.g. OCR |
| ocr-tools | 7 | over 4 years ago | Project to provide CLI and web service interfaces to common OCR engines |
| ocrad-docker | 2 | about 9 years ago | Run the Ocrad OCR engine in a Docker container |
| kraken-docker | 5 | almost 8 years ago | Run the kraken OCR engine in a Docker container |
| Konfuzio | Free online OCR (up to 2,000 pages per month) and OCR API by @atraining (code is not open) | ||
| ocr.space | Free online OCR and OCR API based on Tesseract (code is not open) | ||
| OCR4all | 244 | almost 2 years ago | Provides OCR services through web applications |
Awesome OCR / Software / OCR evaluation | |||
| ISRI OCR Evaluation Tools | |||
Awesome OCR / Software / OCR evaluation / ISRI OCR Evaluation Tools | |||
| isri-ocr-evaluation-tools | 57 | over 4 years ago | further development (2015, 2016) |
| ancientgreekocr-evaluation-tools | 22 | over 7 years ago | further development (2013, 2014) |
Awesome OCR / Software / OCR evaluation | |||
| ocrevalUAtion | 67 | about 3 years ago | Cross-format evaluation, CLI and GUI |
| ngram-ocr-eval | 1 | over 11 years ago | Brute and simple OCR evaluation using ngrams |
| quack | 22 | almost 3 years ago | Quality-Assurance-tool for scans with corresponding ALTO-files |
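A common metric in OCR evaluation is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below shows one simple way to compute it in plain Python. The function names are illustrative, and the tools listed above may normalize text or weight errors differently.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance between the texts divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical ground truth and OCR output with one misread character.
print(character_error_rate("Fraktur", "Fraktnr"))
```

Note that CER can exceed 1.0 when the OCR output contains many spurious insertions.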
Awesome OCR / Software / OCR libraries by programming language | |||
| tesseract-ocr | 13 | over 3 years ago | A Crystal wrapper for tesseract-ocr |
| tesseract_ocr | 55 | over 3 years ago | Elixir library wrapping the tesseract executable |
| gosseract | 2,751 | over 1 year ago | Golang OCR library, wrapping Tesseract-ocr |
| Tess4J | 1,619 | 12 months ago | Java Native Access bindings to Tesseract |
| tess-two | 3,761 | over 3 years ago | Tools for compiling Tesseract on Android and Java API |
| tesseract for .net | 2,308 | over 1 year ago | A .Net wrapper for tesseract-ocr |
| TTesseractOCR4 | 145 | over 2 years ago | Object Pascal binding for tesseract-ocr 4.x |
| Tesseract OCR for PHP | 2,897 | about 2 years ago | Tesseract PHP bindings |
| pytesseract | 5,919 | 12 months ago | A Python wrapper for Google Tesseract |
| pyocr | 930 | over 7 years ago | A Python wrapper for Tesseract and Cuneiform |
| ocrodjvu | 46 | about 3 years ago | A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract |
| tesserocr | 2,026 | 12 months ago | A Python wrapper for the tesseract-ocr API |
| ocracy | 37 | almost 11 years ago | Pure JavaScript LSTM RNN implementation based on ocropus |
| gocr.js | 98 | almost 12 years ago | Javascript port (emscripten) of gocr |
| ocrad.js | 3,494 | about 5 years ago | Javascript port (emscripten) of ocrad |
| tesseract.js | 35,553 | 11 months ago | Javascript port (emscripten) of Tesseract |
| node-tesseract-ocr | 308 | over 2 years ago | A simple wrapper for the Tesseract OCR package |
| node-tesseract-native | 51 | almost 7 years ago | C++ module for node providing OCR with tesseract and leptonica |
| rtesseract | 838 | about 2 years ago | Ruby library wrapping the tesseract and imagemagick executables |
| ruby-tesseract | 629 | over 8 years ago | Native Tesseract bindings for Ruby MRI and JRuby |
| ocr_space | 70 | almost 7 years ago | API wrapper for free ocr service ocr.space. Includes CLI |
| tesseract.rs | 148 | almost 2 years ago | Rust bindings for tesseract OCR |
| leptess | Productive and safe Rust bindings/wrappers for tesseract and leptonica | ||
| tesseract | 245 | about 1 year ago | R bindings for tesseract OCR |
| Tesseract OCR iOS | 4,220 | over 4 years ago | Swift and Objective-C wrapper for Tesseract OCR |
| SwiftOCR | 4,623 | almost 5 years ago | Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes |
Awesome OCR / Software / OCR training tools | |||
| glyph-miner | 34 | about 9 years ago | A system for extracting glyphs from early typeset prints |
| ocrodeg | 161 | over 5 years ago | Document image degradation for OCR data augmentation |
Awesome OCR / Datasets / Ground Truth | |||
| archiscribe-corpus | 8 | almost 7 years ago | >4,200 lines transcribed from 19th Century German prints via archiscribe |
| CIS OCR Test Set | 15 | over 4 years ago | 2 example documents each in German/Latin/Greek with ground truth |
| Rescribe | 11 | about 3 years ago | Transcriptions of Caroline Minuscule Manuscripts |
| CLTK | Corpora from the Classical Language Toolkit | ||
| DIVA-HisDB | 150 pages of three medieval manuscripts | ||
| EarlyPrintedBooks | 10 | almost 8 years ago | ~8,800 lines from several early printed books |
| EEBO-TCP | 18 | over 4 years ago | 25,363 EEBO documents transcribed by the Text Creation Partnership |
| ECCO-TCP | 18 | over 4 years ago | 2,188 ECCO documents transcribed by the Text Creation Partnership |
| eMOP-TCP | 3 | almost 10 years ago | 2,188 ECCO-TCP documents, cleaned up by eMOP |
| Evans-TCP | 18 | over 4 years ago | 4,977 Evans documents transcribed by the Text Creation Partnership |
| FDHN | Finnish Digitised Historical Newspapers, login (free) required | ||
| FROC-MSS | 0 | almost 7 years ago | 4 Old French Medieval Manuscripts |
| GERMANA | 764 Spanish manuscript pages, login (free) required | ||
| GT4HistOCR | Ground Truth for German Fraktur and Early Modern Latin | ||
| imagessan | 4 | about 7 years ago | Sanskrit images & ground truth (Devanagari script) |
| IMPACT-BHL | 2,418 pages from the Biodiversity Heritage Library | ||
| IMPACT-BL | 294 pages from the British Library, login (free) required | ||
| IMPACT-BNE | 215 pages from the National Library of Spain, login (free) required | ||
| IMPACT-BNF | 151 pages from the National Library of France, login (free) required | ||
| IMPACT-KB | 142 pages from the National Library of the Netherlands | ||
| IMPACT-NKC | 187 pages from the Czech National Library, login (free) required | ||
| IMPACT-NLB | 19 pages from the National Library of Bulgaria, login (free) required | ||
| IMPACT-NUK | 209 pages from the National Library of Slovenia, login (free) required | ||
| IMPACT-PSNC | 478 pages from four Polish digital libraries | ||
| LascivaRoma/lexical | 1 | over 2 years ago | Transcription of 19th century lexical resources for Latin learning |
| MJSynth | 9 million synthetic images covering 90k English words | ||
| OCR19thSAC | 19,000 transcribed pages of Swiss Alpine Club yearbooks | ||
| OCR-D | 180 pages of German historical prints | ||
| OCR_GS_Data | 15 | almost 3 years ago | Double-checked Arabic Gold Standard |
| old-books | 12 | about 8 years ago | 322 old books |
| PRImA-ENP | 528 pages of historic newspapers, login (free) required | ||
| RODRIGO | 853 Spanish manuscript pages, login (free) required | ||
| Toebler-OCR | 1 | almost 7 years ago | (Kraken) Ground truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch |
Awesome OCR / Literature / OCR-related publication and link lists | |||
| IMPACT: Tools for text digitisation | List of software tools for text digitisation, some related to OCR | ||
| OCR-D | List of OCR-related academic articles in the context of the project | ||
| Mendeley Group "OCR - Optical Character Recognition" | Collection of 34 papers on OCR | ||
| eadh.org projects | List of Digital Humanities-related projects in Europe, some related to OCR | ||
| Wikipedia: Comparison of optical character recognition software | |||
| OCR [and Deep Learning] | |||
| Ocropus Wiki: Publications | 3,426 | over 4 years ago | |
Awesome OCR / Literature / Blog Posts and Tutorials | |||
| Tesseract Blends Old and New OCR Technology | 262 | about 4 years ago | (2016) |
| What You Always Wanted To Know About Tesseract | (2014) | ||
| Extracting text from an image using Ocropus | (2015) | ||
| Training an Ocropus OCR model | (2015) | ||
| Ocropus Wiki: Compute errors and confusions | 3,426 | over 4 years ago | (2016) |
| Ocropus Wiki: Working with Ground Truth | 3,426 | over 4 years ago | (2016) |
| OCRopus | (2016) | ||
| 10 Tips for making your OCR project succeed | (2013) | ||
| Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology | |||
| Extracting Text from PDFs; Doing OCR; all within R | How to work with OCR from PDFs in the R programming environment | ||
| Tutorial: Command-line OCR on a Mac | |||
| Practical Experience with OCRopus Model Training | (2016) | ||
| Homemade Manuscript OCR (1): OCRopy | (2017) | ||
| Optimizing Binarization for OCRopus | (2017) | ||
| Prototype demo for OCR postfix in Danish Newspapers | (2016) | ||
| How Can I OCR My Dictionary? | (2016) | ||
| "Needlessly complex" blog | (2016). Several image processing how-tos (Python based), particularly: | ||
Awesome OCR / Literature / Blog Posts and Tutorials / "Needlessly complex" blog | |||
| Page dewarping | |||
| Compressing and enhancing hand-written notes | |||
| Unprojecting text with ellipses | |||
Awesome OCR / Literature / Blog Posts and Tutorials | |||
| (Open-Source-)OCR-Workflows | (2017) overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), lots of example images and information on the project | ||
| A gentle introduction to OCR | (2018) | ||
| Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR | (2019) A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts | ||
Awesome OCR / Literature / OCR Showcases | |||
| abbyy-finereader-ocr-senate | 129 | over 9 years ago | Using OCR to parse scanned Senate Financial Disclosure forms |
| cvOCR | 18 | about 9 years ago | An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract |
| MathOCR | 168 | almost 3 years ago | A printed scientific document recognition system |
Awesome OCR / Literature / Academic articles | |||
| High performance document layout analysis | (2003) Breuel | ||
| Adaptive degraded document image binarization | (2006) Gatos, Pratikakis, Perantonis | ||
| [Internship Report] | (2007) Gupta | ||
| OCRopus Addons (Internship Report) | (2007) Dantrey | ||
| Local Logistic Classifiers for Large Scale Learning | (2012) Yousefi, Breuel | ||
| High Performance OCR for Printed English and Fraktur using LSTM Networks | (2013) Breuel, Ul-Hasan, Al Azawi, Shafait | ||
| Can we build language-independent OCR using LSTM networks? | (2013) Ul-Hasan, Breuel | ||
| Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks | (2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel | ||
| OCR of historical printings of Latin texts: Problems, Prospects, Progress. | (2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink | ||
| Correcting Noisy OCR: Context beats Confusion | (2014) Evershed, Fitch | ||
| TypeWright: An Experiment in Participatory Curation | (2015) Bilansky | ||
| Benchmarking of LSTM Networks | (2015) Breuel | ||
| Recognition of Historical Greek Polytonic Scripts Using LSTM | (2015) Simistira, Ul-Hasan, Papavassiliou, Gatos, Katsouros, Liwicki | ||
| A Segmentation-Free Approach for Printed Devanagari Script Recognition | (2015) Karayil, Ul-Hasan, Breuel | ||
| A Sequence Learning Approach for Multiple Script Identification | (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel | ||
| Important New Developments in Arabographic Optical Character Recognition (OCR) | (2016) Romanov, Miller, Savant, Kiessling | ||
Awesome OCR / Literature / Academic articles / Important New Developments in Arabographic Optical Character Recognition (OCR) | |||
| OpenArabic/OCR_GS_Data | 13 | over 8 years ago | Used for ground truth data |
Awesome OCR / Literature / Academic articles | |||
| OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus | (2016) Springmann, Lüdeling | ||
| Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents | (2016) Springmann, Fink, Schulz | ||
| Generic Text Recognition using Long Short-Term Memory Networks | (2016) Ul-Hasan -- Ph.D Thesis | ||
| OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters | (2016) Dengel, Ul-Hasan, Bukhari | ||
| Recursive Recurrent Nets with Attention Modeling for OCR in the Wild | (2016) Lee, Osindero | ||
| Telugu OCR Framework using Deep Learning | (2015/2017) Achanta, Hastie | ||
Awesome OCR / Literature / Academic articles / Telugu OCR Framework using Deep Learning | |||
| TeluguOCR | |||
Awesome OCR / Literature / Academic articles | |||
| A Two-Stage Method for Text Line Detection in Historical Documents | (2018) Grüning, Leifert, Strauß, Labahn | ||