awesome-document-understanding
DU tech
A curated list of resources and papers on Document Understanding (DU) technology
awesome, awesome-list, deep-learning, document-ai, document-analysis, document-intelligence, document-layout-analysis, document-understanding, information-extraction, intelligent-processing, key-information-extraction, machine-learning, natural-language-processing, nlp, ocr, pdf, pdf-documents, robotic-process-automation, rpa, unstructured-data
Awesome Document Understanding | |||
Title of a publication / dataset / resource | |||
Introduction / Papers | |||
DocILE Benchmark for Document Information Localization and Extraction | |||
Future paradigms of automated processing of business documents | |||
Research topics | |||
Key Information Extraction (KIE) | |||
Document Layout Analysis (DLA) | |||
Document Question Answering (DQA) | |||
Scientific Document Understanding (SDU) | |||
Optical Character Recognition (OCR) | |||
Related | |||
Research topics / Related | |||
General | |||
Tabular Data Comprehension (TDC) | |||
Robotic Process Automation (RPA) | |||
Others / Resources | |||
The RVL-CDIP Dataset | dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class | ||
The Industry Documents Library | a portal to millions of documents created by industries that influence public health, hosted by the UCSF Library | ||
Color Document Dataset | from the Intelligent Sensory Information Systems, University of Amsterdam | ||
The IIT CDIP Collection | dataset of around 7 million documents from the states' lawsuit against the tobacco industry in the 1990s | ||
borb | 3,398 | 19 days ago | borb is a pure Python library to read, write and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries and primitives (numbers, strings, booleans, etc.) |
pawls | 390 | 6 months ago | PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document |
pdfplumber | 6,744 | 10 days ago | Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging |
Pdfminer.six | 5,955 | 4 months ago | Pdfminer.six is a community maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data |
Layout Parser | 4,910 | 3 months ago | Layout Parser is a deep learning based tool for document image layout analysis tasks |
Tabulo | 198 | almost 2 years ago | Table extraction from images |
OCRmyPDF | 14,140 | 4 days ago | OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted |
PDFBox | 2,675 | 6 days ago | The Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents |
PdfPig | 1,733 | 8 days ago | This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents containing text and geometrical shapes. This project aims to port PDFBox to C# |
parsing-prickly-pdfs | Resources and worksheet for the NICAR 2016 workshop of the same name | ||
pdf-text-extraction-benchmark | 65 | about 4 years ago | PDF tools benchmark |
Born digital pdf scanner | 8 | about 4 years ago | Checks whether a PDF is born-digital |
OpenContracts | 717 | 4 days ago | Apache2-licensed, PDF annotating platform for visually-rich documents that preserves the original layout and exports x,y positional data for tokens as well as span starts and stops. Based on PAWLs, but with a Python-based backend and readily deployable on your local machine, company intranet or the web via Docker Compose |
deepdoctection | 2,588 | 5 days ago | deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks for images and PDF documents using deep learning models. It does not implement models but enables you to build pipelines using highly acknowledged libraries for object detection, OCR and selected NLP tasks and provides an integrated framework for fine-tuning, evaluating and running models |
pydoxtools | 77 | 3 months ago | Pydoxtools is an AI-composition library for document analysis. It features an extensive toolset for building complex document analysis pipelines and recognizes most document formats out of the box. It supports typical NLP tasks such as keyword extraction, summarization and question answering out of the box, features a high-quality, low-CPU/memory table extraction algorithm, and makes NLP batch operations on a cluster easy |
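
Several of the libraries above expose a parsed PDF as nested containers; the borb entry, for example, describes a JSON-like structure of nested lists, dictionaries and primitives. The following is a minimal sketch of walking such a structure to collect its text values. The `toy_document` dictionary and its keys are hypothetical, written by hand for illustration, and the code does not use or reproduce borb's actual API.

```python
from typing import Any, Iterator

# Hypothetical, hand-written stand-in for a parsed PDF. The keys are
# illustrative only and are not any library's real object model.
toy_document = {
    "Info": {"Title": "Invoice 001", "Pages": 1},
    "Pages": [
        {"MediaBox": [0, 0, 595, 842], "Contents": ["Total due:", "42.00"]},
    ],
}


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value found in nested dicts and lists, depth-first."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    # Numbers, booleans and None carry no text, so nothing is yielded.


if __name__ == "__main__":
    print(list(iter_strings(toy_document)))
```

A recursive generator keeps the traversal independent of nesting depth, which is usually what is wanted when the exact shape of a document tree is not known in advance.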
Others / Conferences, workshops | |||
2021 | [ , , ] | ||
2021 | Workshop on Document Intelligence (DI) | ||
2021 | Financial Narrative Processing Workshop (FNP) | ||
2021 | Workshop on Economics and Natural Language Processing (ECONLP) | ||
2020 | International Workshop on Document Analysis Systems (DAS) | ||
ACM International Conference on AI in Finance (ICAIF) | |||
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services | |||
CVPR 2020 Workshop on Text and Documents in the Deep Learning Era | |||
KDD Workshop on Machine Learning in Finance (KDD MLF 2020) | |||
FinIR 2020: The First Workshop on Information Retrieval in Finance | |||
2nd KDD Workshop on Anomaly Detection in Finance (KDD 2019) | |||
Document Understanding Conference (DUC 2007) | |||
The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021) | |||
First Workshop on Scholarly Document Processing (SDProc 2020) | |||
2020 | International Workshop on SCIentific DOCument Analysis (SCIDOCA) | ||
Others / Blogs | |||
A Survey of Document Understanding Models | 2021 | ||
Document Form Extraction | 2021 | ||
How to automate processes with unstructured data | 2021 | ||
A Comprehensive Guide to OCR with RPA and Document Understanding | 2021 | ||
Information Extraction from Receipts with Graph Convolutional Networks | 2021 | ||
How to extract structured data from invoices | 2021 | ||
Extracting Structured Data from Templatic Documents | 2020 | ||
To apply AI for good, think form extraction | 2020 | ||
UiPath Document Understanding Solution Architecture and Approach | 2020 | ||
How Can I Automate Data Extraction from Complex Documents? | 2020 | ||
LegalTech: Information Extraction in legal documents | 2020 | ||
Others / Solutions | |||
ABBYY | |||
Accenture | |||
Amazon | |||
Microsoft | |||
UiPath | |||
Applica.ai | |||
Base64.ai | |||
Docstack | |||
Element AI | |||
Indico | |||
Instabase | |||
Konfuzio | |||
Metamaze | |||
Nanonets | |||
Rossum | |||
Silo | |||
Inspirations | |||
https://github.com/kba/awesome-ocr | 2,820 | 5 months ago | |
https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics | 625 | 16 days ago | |
https://github.com/icoxfog417/awesome-financial-nlp | 404 | almost 5 years ago | |
https://github.com/BobLd/DocumentLayoutAnalysis | 583 | about 1 year ago | |
https://github.com/bikash/DocumentUnderstanding | 96 | almost 2 years ago | |
https://github.com/harpribot/awesome-information-retrieval | 1,069 | over 1 year ago | |
https://github.com/roomylee/awesome-relation-extraction | 1,184 | almost 3 years ago | |
https://github.com/caufieldjh/awesome-bioie | 349 | 6 months ago | |
https://github.com/HelloRusk/entity-related-papers | 94 | 3 months ago | |
https://github.com/pliang279/awesome-multimodal-ml | 6,094 | 3 months ago | |
https://github.com/thunlp/LegalPapers | 466 | almost 4 years ago | |
https://github.com/heartexlabs/awesome-data-labeling | 3,803 | 5 months ago | |
https://github.com/jsbroks/awesome-dataset-tools | 856 | over 1 year ago | |
https://github.com/EthicalML/awesome-production-machine-learning | 17,606 | 4 days ago | |
https://github.com/eugeneyan/applied-ml | 27,322 | 4 months ago | |
https://github.com/awesomedata/awesome-public-datasets | 60,953 | 8 days ago | |
https://github.com/keon/awesome-nlp | 16,768 | about 1 year ago | |
https://github.com/thunlp/PLMpapers | 3,328 | about 2 years ago | |
https://github.com/jbhuang0604/awesome-computer-vision#awesome-lists | 21,049 | 6 months ago | |
https://github.com/papers-we-love/papers-we-love | 88,242 | 13 days ago | |
https://github.com/BAILOOL/DoYouEvenLearn | 1,038 | over 2 years ago | |
https://github.com/hibayesian/awesome-automl-papers | 4,023 | 5 months ago |