awesome-document-understanding

A curated list of resources on Document Understanding (DU)

Topics: awesome, awesome-list, deep-learning, document-ai, document-analysis, document-intelligence, document-layout-analysis, document-understanding, information-extraction, intelligent-processing, key-information-extraction, machine-learning, natural-language-processing, nlp, ocr, pdf, pdf-documents, robotic-process-automation, rpa, unstructured-data

Awesome Document Understanding

Introduction / Papers

DocILE Benchmark for Document Information Localization and Extraction
Future paradigms of automated processing of business documents

Research topics

Key Information Extraction (KIE)
Document Layout Analysis (DLA)
Document Question Answering (DQA)
Scientific Document Understanding (SDU)
Optical Character Recognition (OCR)
Related
General
Tabular Data Comprehension (TDC)
Robotic Process Automation (RPA)

Others / Resources

The RVL-CDIP Dataset: 400,000 grayscale images in 16 classes, with 25,000 images per class
The Industry Documents Library: a portal to millions of documents created by industries that influence public health, hosted by the UCSF Library
Color Document Dataset: from the Intelligent Sensory Information Systems, University of Amsterdam
The IIT CDIP Collection: around 7 million documents from the states' lawsuit against the tobacco industry in the 1990s
borb (3,373 stars, last commit about 1 month ago): a pure Python library to read, write and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries and primitives (numbers, strings, booleans, etc.)
pawls (385 stars, last commit 5 months ago): PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document
pdfplumber (6,454 stars, last commit about 1 month ago): plumb a PDF for detailed information about each text character, rectangle, and line. Plus: table extraction and visual debugging
Pdfminer.six (5,848 stars, last commit 2 months ago): a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data
Layout Parser (4,805 stars, last commit about 2 months ago): a deep-learning-based tool for document image layout analysis tasks
Tabulo (199 stars, last commit almost 2 years ago): table extraction from images
OCRmyPDF (13,699 stars, last commit 20 days ago): adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted
PDFBox (2,633 stars, last commit 2 days ago): the Apache PDFBox library is an open source Java tool for working with PDF documents. It allows creation of new PDF documents, manipulation of existing documents and extraction of content from documents
PdfPig (1,664 stars, last commit 26 days ago): lets users read and extract text and other content from PDF files. The library can also be used to create simple PDF documents containing text and geometrical shapes. The project aims to port PDFBox to C#
parsing-prickly-pdfs: resources and worksheet for the NICAR 2016 workshop of the same name
pdf-text-extraction-benchmark (63 stars, last commit almost 4 years ago): a benchmark of PDF tools
Born digital pdf scanner (7 stars, last commit about 4 years ago): checks whether a PDF is born-digital
OpenContracts (684 stars, last commit 5 days ago): an Apache2-licensed PDF annotation platform for visually rich documents that preserves the original layout and exports x,y positional data for tokens as well as span starts and stops. Based on PAWLS, but with a Python-based backend, and readily deployable on your local machine, company intranet or the web via Docker Compose
deepdoctection (2,519 stars, last commit 17 days ago): a Python library that orchestrates document extraction and document layout analysis tasks for images and PDF documents using deep learning models. It does not implement models itself, but lets you build pipelines using highly acknowledged libraries for object detection, OCR and selected NLP tasks, and provides an integrated framework for fine-tuning, evaluating and running models
pydoxtools (71 stars, last commit about 1 month ago): an AI-composition library for document analysis. It features an extensive toolset for building complex document analysis pipelines and recognizes most document formats out of the box. It supports typical NLP tasks such as keyword extraction, summarization and question answering, features a high-quality, low-CPU/memory table extraction algorithm, and makes NLP batch operations on a cluster easy

Others / Conferences, workshops

2021 [ , , ]
2021 Workshop on Document Intelligence (DI)
2021 Financial Narrative Processing Workshop (FNP)
2021 Workshop on Economics and Natural Language Processing (ECONLP)
2020 International Workshop on Document Analysis Systems (DAS)
ACM International Conference on AI in Finance (ICAIF)
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
CVPR 2020 Workshop on Text and Documents in the Deep Learning Era
KDD Workshop on Machine Learning in Finance (KDD MLF 2020)
FinIR 2020: The First Workshop on Information Retrieval in Finance
2nd KDD Workshop on Anomaly Detection in Finance (KDD 2019)
Document Understanding Conference (DUC 2007)
The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)
First Workshop on Scholarly Document Processing (SDProc 2020)
2020 International Workshop on SCIentific DOCument Analysis (SCIDOCA)

Others / Blogs

A Survey of Document Understanding Models, 2021
Document Form Extraction, 2021
How to automate processes with unstructured data, 2021
A Comprehensive Guide to OCR with RPA and Document Understanding, 2021
Information Extraction from Receipts with Graph Convolutional Networks, 2021
How to extract structured data from invoices, 2021
Extracting Structured Data from Templatic Documents, 2020
To apply AI for good, think form extraction, 2020
UiPath Document Understanding Solution Architecture and Approach, 2020
How Can I Automate Data Extraction from Complex Documents?, 2020
LegalTech: Information Extraction in legal documents, 2020

Others / Solutions

ABBYY
Accenture
Amazon
Google
Microsoft
UiPath
Applica.ai
Base64.ai
Docstack
Element AI
Indico
Instabase
Konfuzio
Metamaze
Nanonets
Rossum
Silo

Inspirations

https://github.com/kba/awesome-ocr (2,761 stars, last commit 3 months ago)
https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics (606 stars, last commit 27 days ago)
https://github.com/icoxfog417/awesome-financial-nlp (402 stars, last commit over 4 years ago)
https://github.com/BobLd/DocumentLayoutAnalysis (572 stars, last commit about 1 year ago)
https://github.com/bikash/DocumentUnderstanding (95 stars, last commit almost 2 years ago)
https://github.com/harpribot/awesome-information-retrieval (1,049 stars, last commit over 1 year ago)
https://github.com/roomylee/awesome-relation-extraction (1,178 stars, last commit over 2 years ago)
https://github.com/caufieldjh/awesome-bioie (337 stars, last commit 4 months ago)
https://github.com/HelloRusk/entity-related-papers (90 stars, last commit about 1 month ago)
https://github.com/pliang279/awesome-multimodal-ml (5,931 stars, last commit about 2 months ago)
https://github.com/thunlp/LegalPapers (463 stars, last commit over 3 years ago)
https://github.com/heartexlabs/awesome-data-labeling (3,724 stars, last commit 4 months ago)
https://github.com/jsbroks/awesome-dataset-tools (840 stars, last commit over 1 year ago)
https://github.com/EthicalML/awesome-production-machine-learning (17,404 stars, last commit 4 days ago)
https://github.com/eugeneyan/applied-ml (27,189 stars, last commit 3 months ago)
https://github.com/awesomedata/awesome-public-datasets (60,356 stars, last commit 29 days ago)
https://github.com/keon/awesome-nlp (16,523 stars, last commit 11 months ago)
https://github.com/thunlp/PLMpapers (3,319 stars, last commit almost 2 years ago)
https://github.com/jbhuang0604/awesome-computer-vision#awesome-lists (20,811 stars, last commit 5 months ago)
https://github.com/papers-we-love/papers-we-love (87,246 stars, last commit 3 days ago)
https://github.com/BAILOOL/DoYouEvenLearn (1,033 stars, last commit over 2 years ago)
https://github.com/hibayesian/awesome-automl-papers (3,998 stars, last commit 4 months ago)
