pdf-text-extraction-benchmark
Text extractor benchmark
Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles
A project for benchmarking and evaluating existing PDF extraction tools on their ability to extract the body text of PDF documents, especially scientific articles, in a semantically meaningful way.
65 stars
6 watching
11 forks
Language: TeX
Last commit: over 4 years ago
Tags: arxiv, benchmark, evaluation, extraction, pdf, tex, text-extraction
Related projects:
| Description | Stars |
|---|---|
| Research and development of tools and techniques for extracting information from images and PDFs using deep learning and graph neural networks. | 96 |
| A CoffeeScript library for extracting text from PDF files and creating searchable documents with OCR capabilities | 28 |
| A tool to extract text from PDFs and add a searchable layer to them | 279 |
| A framework for extracting information from tables in scientific literature using a rule-based approach. | 42 |
| A .NET framework for extracting text from various document formats across multiple platforms. | 362 |
| A plugin to extract keywords and key phrases from text documents. | 330 |
| A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services | 164 |
| An open-source wrapper around LLMs to extract structured data from text | 1,638 |
| A Quarkus-based microservice to extract text from PDF files | 24 |
| A C# library for extracting and analyzing text from PDF files | 1,794 |
| Extracts tables from PDF files using Java | 1,859 |
| A tool to extract relevant information from text | 17 |
| Extracts and organizes Indicators of Compromise from unstructured text files into structured formats. | 135 |
| Extracts data from HTML or XML documents to JSON using a CSS selector-like query language | 70 |
| Extracts readable content from web pages using Open Graph and traditional readability rules. | 69 |