awesome-document-understanding

A curated collection of resources and papers on Document Understanding technology

last commit: about 3 years ago

Topics: awesome, awesome-list, deep-learning, document-ai, document-analysis, document-intelligence, document-layout-analysis, document-understanding, information-extraction, intelligent-processing, key-information-extraction, machine-learning, natural-language-processing, nlp, ocr, pdf, pdf-documents, robotic-process-automation, rpa, unstructured-data

Awesome Document Understanding
Introduction / Papers
DocILE Benchmark for Document Information Localization and Extraction
Future paradigms of automated processing of business documents
Research topics
Key Information Extraction (KIE)
Document Layout Analysis (DLA)
Document Question Answering (DQA)
Scientific Document Understanding (SDU)
Optical Character Recognition (OCR)
Related
Research topics / Related
General
Tabular Data Comprehension (TDC)
Robotic Process Automation (RPA)
Others / Resources
The RVL-CDIP Dataset			dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class
The Industry Documents Library			a portal to millions of documents created by industries that influence public health, hosted by the UCSF Library
Color Document Dataset			from the Intelligent Sensory Information Systems, University of Amsterdam
The IIT CDIP Collection			dataset of around 7 million documents from the states' lawsuit against the tobacco industry in the 1990s
borb	3,413	over 1 year ago	A pure Python library to read, write and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries and primitives (numbers, strings, booleans, etc.)
pawls	397	about 2 years ago	PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document
pdfplumber	6,898	over 1 year ago	Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging
Pdfminer.six	6,046	almost 2 years ago	Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data
Layout Parser	4,962	almost 2 years ago	Layout Parser is a deep learning based tool for document image layout analysis tasks
Tabulo	198	over 3 years ago	Table extraction from images
OCRmyPDF	14,363	over 1 year ago	OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted
PDFBox	2,700	over 1 year ago	The Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents
PdfPig	1,794	over 1 year ago	This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents containing text and geometrical shapes. This project aims to port PDFBox to C#
parsing-prickly-pdfs			Resources and worksheet for the NICAR 2016 workshop of the same name
pdf-text-extraction-benchmark	65	over 5 years ago	PDF tools benchmark
Born digital pdf scanner	8	almost 6 years ago	Checks whether a PDF is born-digital
OpenContracts	728	over 1 year ago	Apache 2-licensed PDF annotation platform for visually rich documents that preserves the original layout and exports x,y positional data for tokens as well as span starts and stops. Based on PAWLS, but with a Python-based backend and readily deployable on your local machine, company intranet or the web via Docker Compose
deepdoctection	2,628	over 1 year ago	deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks for images and PDF documents using deep learning models. It does not implement models itself; instead it lets you build pipelines from widely used libraries for object detection, OCR and selected NLP tasks, and provides an integrated framework for fine-tuning, evaluating and running models
pydoxtools	78	almost 2 years ago	Pydoxtools is an AI-composition library for document analysis. It features an extensive toolset for building complex document analysis pipelines and recognizes most document formats out of the box. It supports typical NLP tasks such as keyword extraction, summarization and question answering out of the box, features a high-quality, low-CPU/memory table extraction algorithm, and makes NLP batch operations on a cluster easy
Others / Conferences, workshops
2021			[ , , ]
2021			Workshop on Document Intelligence (DI)
2021			Financial Narrative Processing Workshop (FNP)
2021			Workshop on Economics and Natural Language Processing (ECONLP)
2020			International Workshop on Document Analysis Systems (DAS)
ACM International Conference on AI in Finance (ICAIF)
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
CVPR 2020 Workshop on Text and Documents in the Deep Learning Era
KDD Workshop on Machine Learning in Finance (KDD MLF 2020)
FinIR 2020: The First Workshop on Information Retrieval in Finance
2nd KDD Workshop on Anomaly Detection in Finance (KDD 2019)
Document Understanding Conference (DUC 2007)
The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)
First Workshop on Scholarly Document Processing (SDProc 2020)
2020			International Workshop on SCIentific DOCument Analysis (SCIDOCA)
Others / Blogs
A Survey of Document Understanding Models			2021
Document Form Extraction			2021
How to automate processes with unstructured data			2021
A Comprehensive Guide to OCR with RPA and Document Understanding			2021
Information Extraction from Receipts with Graph Convolutional Networks			2021
How to extract structured data from invoices			2021
Extracting Structured Data from Templatic Documents			2020
To apply AI for good, think form extraction			2020
UiPath Document Understanding Solution Architecture and Approach			2020
How Can I Automate Data Extraction from Complex Documents?			2020
LegalTech: Information Extraction in legal documents			2020
Others / Solutions
ABBYY
Accenture
Amazon
Google
Microsoft
UiPath
Applica.ai
Base64.ai
Docstack
Element AI
Indico
Instabase
Konfuzio
Metamaze
Nanonets
Rossum
Silo
Inspirations
https://github.com/kba/awesome-ocr	2,843	about 2 years ago
https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics	629	over 1 year ago
https://github.com/icoxfog417/awesome-financial-nlp	405	over 6 years ago
https://github.com/BobLd/DocumentLayoutAnalysis	591	almost 3 years ago
https://github.com/bikash/DocumentUnderstanding	96	over 3 years ago
https://github.com/harpribot/awesome-information-retrieval	1,076	about 3 years ago
https://github.com/roomylee/awesome-relation-extraction	1,186	over 4 years ago
https://github.com/caufieldjh/awesome-bioie	353	about 2 years ago
https://github.com/HelloRusk/entity-related-papers	94	almost 2 years ago
https://github.com/pliang279/awesome-multimodal-ml	6,151	almost 2 years ago
https://github.com/thunlp/LegalPapers	469	over 5 years ago
https://github.com/heartexlabs/awesome-data-labeling	3,834	about 2 years ago
https://github.com/jsbroks/awesome-dataset-tools	859	about 3 years ago
https://github.com/EthicalML/awesome-production-machine-learning	17,721	over 1 year ago
https://github.com/eugeneyan/applied-ml	27,407	almost 2 years ago
https://github.com/awesomedata/awesome-public-datasets	61,377	over 1 year ago
https://github.com/keon/awesome-nlp	16,830	over 2 years ago
https://github.com/thunlp/PLMpapers	3,331	over 3 years ago
https://github.com/jbhuang0604/awesome-computer-vision#awesome-lists	21,139	about 2 years ago
https://github.com/papers-we-love/papers-we-love	88,844	over 1 year ago
https://github.com/BAILOOL/DoYouEvenLearn	1,039	over 4 years ago
https://github.com/hibayesian/awesome-automl-papers	4,035	about 2 years ago
