pdf-text-extraction-benchmark

Text extractor benchmark

Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles

This project benchmarks existing PDF extraction tools on how well they recover the body text of PDF documents, particularly scientific articles, with a focus on the semantic quality of the extracted text.

GitHub

65 stars
6 watching
11 forks
Language: TeX
last commit: about 4 years ago
Tags: arxiv, benchmark, evaluation, extraction, pdf, tex, text-extraction

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| bikash/documentunderstanding | Research and development of tools and techniques for extracting information from images and PDFs using deep learning and graph neural networks | 96 |
| steelthread/mimeograph | A CoffeeScript library for extracting text from PDF files and creating searchable documents with OCR capabilities | 28 |
| leofcardoso/pdf2pdfocr | A tool to extract text from PDFs and add a searchable layer to them | 279 |
| nikolamilosevic86/tabinout | A framework for extracting information from tables in scientific literature using a rule-based approach | 42 |
| nissl-lab/toxy | A .NET framework for extracting text from various document formats across multiple platforms | 362 |
| retextjs/retext-keywords | A plugin to extract keywords and key phrases from text documents | 330 |
| aeksco/aws-pdf-textract-pipeline | A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services | 164 |
| eyurtsev/kor | An open-source wrapper around LLMs to extract structured data from text | 1,638 |
| gunnarmorling/quarkus-pdf-extract | A Quarkus-based microservice to extract text from PDF files | 24 |
| uglytoad/pdfpig | A C# library for extracting and analyzing text from PDF files | 1,794 |
| tabulapdf/tabula-java | Extracts tables from PDF files using Java | 1,859 |
| aymericbeaumet/squeeze | A tool to extract relevant information from text | 17 |
| stephenbrannon/iocextractor | Extracts and organizes Indicators of Compromise from unstructured text files into structured formats | 135 |
| danburzo/hred | Extracts data from HTML or XML documents to JSON using a CSS selector-like query language | 70 |
| philipjkim/goreadability | Extracts readable content from web pages using Open Graph and traditional readability rules | 69 |