pdf-text-extraction-benchmark

Text extractor benchmark

Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles

A project that benchmarks and evaluates existing PDF extraction tools on their semantic ability to extract body text from PDF documents, especially scientific articles.

GitHub

65 stars

6 watching

11 forks

Language: TeX

Last commit: over 5 years ago

arxiv, benchmark, evaluation, extraction, pdf, tex, text-extraction

Related projects:

Repository	Description	Stars
bikash/documentunderstanding	Research and development of tools and techniques for extracting information from images and PDFs using deep learning and graph neural networks	96
steelthread/mimeograph	A CoffeeScript library for extracting text from PDF files and creating searchable documents with OCR capabilities	28
leofcardoso/pdf2pdfocr	A tool to extract text from PDFs and add a searchable layer to them	279
nikolamilosevic86/tabinout	A framework for extracting information from tables in scientific literature using a rule-based approach	42
nissl-lab/toxy	A .NET framework for extracting text from various document formats across multiple platforms	362
retextjs/retext-keywords	A plugin to extract keywords and key phrases from text documents	330
aeksco/aws-pdf-textract-pipeline	A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services	164
eyurtsev/kor	An open-source wrapper around LLMs to extract structured data from text	1,638
gunnarmorling/quarkus-pdf-extract	A Quarkus-based microservice to extract text from PDF files	24
uglytoad/pdfpig	A C# library for extracting and analyzing text from PDF files	1,794
tabulapdf/tabula-java	Extracts tables from PDF files using Java	1,859
aymericbeaumet/squeeze	A tool to extract relevant information from text	17
stephenbrannon/iocextractor	Extracts and organizes Indicators of Compromise from unstructured text files into structured formats	135
danburzo/hred	Extracts data from HTML or XML documents to JSON using a CSS selector-like query language	70
philipjkim/goreadability	Extracts readable content from web pages using Open Graph and traditional readability rules	69