old-books-dataset

Book datasets

A collection of scanned book pages with ground-truth annotations for OCR research and text analysis.

Old book pages with ground truth, formerly used for OCR studies. The set comes in several versions that differ in resolution and binarization. Noised and denoised sets, produced by several methods, are to be uploaded later.

GitHub

12 stars
2 watching
2 forks
Language: HTML
Last commit: about 7 years ago
Linked from 1 awesome list

Topics: binarization, binarized-dataset, books-dataset, dataset, ground-truth, groundtruth, ocr-database, ocr-dataset, old-books, old-documents, text, text-data, text-database

Related projects:

Repository | Description | Stars
chreul/ocr_testdata_earlyprintedbooks | Provides test data and models for training Optical Character Recognition (OCR) systems on historical printed books. | 10
texworld/betterbib | A collection of command-line tools to help manage and format bibliographic data. | 817
openarabic/ocr_gs_data | A collection of double-checked gold standard data for training and testing OCR engines. | 13
ponteineptique/toebler-ocr | An OCR project using historical French book data to train models and generate transcriptions. | 1
yusuftaufiq/laravel-books-api | A Laravel-based RESTful API to manage book data scraped from Gramedia. | 67
jbaiter/archiscribe-corpus | A repository of transcribed 19th century German texts from various sources. | 8
dativebase/old | Software for creating collaborative databases of language data. | 1
gopherdata/resources | A collection of Go-based resources and tools for data science tasks. | 876
ymcui/cmrc2018 | A collection of data for evaluating Chinese machine reading comprehension systems. | 415
tbrugz/ribge | A package for downloading and manipulating data from IBGE's open datasets in Brazil. | 57
arthur151/relative_human | Provides a toolbox for loading, visualizing, and evaluating a dataset of images with human annotations, including depth layers and age group classification. | 138
bndr/gotabulate | A library that generates pretty-printed tabular data from various input formats. | 334
rucaibox/recsysdatasets | A repository of public data sources for Recommender Systems. | 856
cidree/forestdata | A package providing easy access to forestry and land use datasets. | 13
kakaobrain/coyo-dataset | A large-scale image-text pair dataset designed to support training of foundation models in computer vision and natural language processing. | 1,163