pdf-text-extraction-benchmark

Text extractor benchmark

Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles

This project benchmarks and evaluates existing PDF extraction tools on how well they extract the semantically meaningful body text from PDF documents, especially scientific articles.
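As a rough illustration of what scoring an extraction tool against a reference text can look like, here is a minimal sketch. It is not the benchmark's own method: the function names (`normalize`, `word_similarity`, `rank_tools`) and the choice of a word-level `difflib` similarity ratio are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace so pure layout differences are ignored."""
    return text.lower().split()


def word_similarity(extracted: str, ground_truth: str) -> float:
    """Return a similarity ratio in [0, 1] between two texts, compared word by word.

    Hypothetical metric for illustration; a real benchmark may use other measures.
    """
    a = normalize(extracted)
    b = normalize(ground_truth)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def rank_tools(outputs: dict[str, str], ground_truth: str) -> list[tuple[str, float]]:
    """Score each tool's output against the ground truth and sort best first."""
    scores = [(name, word_similarity(text, ground_truth)) for name, text in outputs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample strings; "tool_a" and "tool_b" are placeholder names.
    truth = "the quick brown fox"
    outputs = {
        "tool_a": "the quick brown fox",
        "tool_b": "the quick 1 brown fox page 2",
    }
    for name, score in rank_tools(outputs, truth):
        print(f"{name}: {score:.2f}")
```

A tool that leaks page numbers or footnote markers into the body text scores lower here because the extra tokens count as mismatches, which is the kind of semantic error a body-text benchmark is meant to expose.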

GitHub

65 stars
6 watching
11 forks
Language: TeX
last commit: about 4 years ago
arxiv, benchmark, evaluation, extraction, pdf, tex, text-extraction

Related projects:

Repository | Description | Stars
bikash/documentunderstanding | Research and development of tools and techniques for extracting information from images and PDFs using deep learning and graph neural networks. | 96
steelthread/mimeograph | A CoffeeScript library for extracting text from PDFs and creating searchable files | 28
leofcardoso/pdf2pdfocr | A tool to extract text from PDFs and add a searchable layer to them | 274
nikolamilosevic86/tabinout | A framework for extracting information from tables in scientific literature using a rule-based approach. | 41
nissl-lab/toxy | A .NET framework for extracting text from various document formats across multiple platforms. | 359
retextjs/retext-keywords | A plugin to extract keywords and key phrases from text documents. | 327
aeksco/aws-pdf-textract-pipeline | A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services | 164
eyurtsev/kor | Extracts structured data from unstructured text using large language models | 1,629
gunnarmorling/quarkus-pdf-extract | A Quarkus-based microservice to extract text from PDF files | 24
uglytoad/pdfpig | A C# library for extracting and analyzing text from PDF files | 1,733
tabulapdf/tabula-java | Extracts tables from PDF files using Java | 1,843
aymericbeaumet/squeeze | A tool to extract relevant information from text | 17
stephenbrannon/iocextractor | Extracts and organizes Indicators of Compromise from unstructured text files into structured formats. | 135
danburzo/hred | Extracts data from HTML or XML documents to JSON using a CSS selector-like query language | 69
philipjkim/goreadability | Extracts readable content from web pages using Open Graph and traditional readability rules. | 69