pdf-text-extraction-benchmark
Text extractor benchmark
Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles
A project for benchmarking and evaluating existing PDF extraction tools on their semantic ability to extract body text from PDF documents, especially scientific articles.
65 stars
6 watching
11 forks
Language: TeX
last commit: about 4 years ago
Tags: arxiv, benchmark, evaluation, extraction, pdf, tex, text-extraction
Related projects:
Repository | Description | Stars |
---|---|---|
bikash/documentunderstanding | Research and development of tools and techniques for extracting information from images and PDFs using deep learning and graph neural networks. | 96 |
steelthread/mimeograph | A CoffeeScript library for extracting text from PDF files and creating searchable documents with OCR capabilities | 28 |
leofcardoso/pdf2pdfocr | A tool to extract text from PDFs and add a searchable layer to them | 279 |
nikolamilosevic86/tabinout | A framework for extracting information from tables in scientific literature using a rule-based approach. | 42 |
nissl-lab/toxy | A .NET framework for extracting text from various document formats across multiple platforms. | 362 |
retextjs/retext-keywords | A plugin to extract keywords and key phrases from text documents. | 330 |
aeksco/aws-pdf-textract-pipeline | A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services | 164 |
eyurtsev/kor | An open-source wrapper around LLMs to extract structured data from text | 1,638 |
gunnarmorling/quarkus-pdf-extract | A Quarkus-based microservice to extract text from PDF files | 24 |
uglytoad/pdfpig | A C# library for extracting and analyzing text from PDF files | 1,794 |
tabulapdf/tabula-java | Extracts tables from PDF files using Java | 1,859 |
aymericbeaumet/squeeze | A tool to extract relevant information from text | 17 |
stephenbrannon/iocextractor | Extracts and organizes Indicators of Compromise from unstructured text files into structured formats. | 135 |
danburzo/hred | Extracts data from HTML or XML documents to JSON using a CSS selector-like query language | 70 |
philipjkim/goreadability | Extracts readable content from web pages using Open Graph and traditional readability rules. | 69 |