old-books-dataset
Book datasets
A collection of scanned book pages with ground truth annotations for OCR research and text analysis
Scanned pages from old books, with ground truth, formerly used for OCR studies. The set comes in several versions that differ in resolution and binarization. Noised and denoised sets (produced by several methods) are to be uploaded eventually.
12 stars
2 watching
2 forks
Language: HTML
Last commit: about 7 years ago
Linked from 1 awesome list
Topics: binarization, binarized-dataset, books-dataset, dataset, ground-truth, groundtruth, ocr-database, ocr-dataset, old-books, old-documents, text, text-data, text-database
Related projects:
Repository | Description | Stars |
---|---|---|
chreul/ocr_testdata_earlyprintedbooks | Provides test data and models for training Optical Character Recognition (OCR) systems on historical printed books. | 10 |
texworld/betterbib | A collection of command-line tools to help manage and format bibliographic data. | 817 |
openarabic/ocr_gs_data | A collection of double-checked gold standard data for training and testing OCR engines. | 13 |
ponteineptique/toebler-ocr | An OCR project using historical French book data to train models and generate transcriptions. | 1 |
yusuftaufiq/laravel-books-api | A Laravel-based RESTful API to manage book data scraped from Gramedia | 67 |
jbaiter/archiscribe-corpus | A repository of transcribed 19th century German texts from various sources. | 8 |
dativebase/old | Software for creating collaborative databases of language data | 1 |
gopherdata/resources | A collection of Go-based resources and tools for data science tasks | 876 |
ymcui/cmrc2018 | A collection of data for evaluating Chinese machine reading comprehension systems | 415 |
tbrugz/ribge | A package for downloading and manipulating data from IBGE's open datasets in Brazil. | 57 |
arthur151/relative_human | Provides a toolbox for loading, visualizing, and evaluating a dataset of images with human annotations, including depth layers and age group classification. | 138 |
bndr/gotabulate | A library that generates pretty-printed tabular data from various input formats | 334 |
rucaibox/recsysdatasets | A repository of public data sources for Recommender Systems. | 856 |
cidree/forestdata | A package providing easy access to forestry and land use datasets. | 13 |
kakaobrain/coyo-dataset | A large-scale image-text pair dataset designed to support training of foundation models in computer vision and natural language processing. | 1,163 |