pdf-text-extraction-benchmark
Text extractor benchmark
Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles
A project that benchmarks and evaluates existing PDF extraction tools on their ability to extract semantically meaningful body text from PDF documents, especially scientific articles.
65 stars
6 watching
11 forks
Language: TeX
last commit: about 4 years ago
Tags: arxiv, benchmark, evaluation, extraction, pdf, tex, text-extraction
Related projects:
Repository | Description | Stars |
---|---|---|
bikash/documentunderstanding | Research and development of tools and techniques for extracting information from images and PDFs using deep learning and graph neural networks. | 96 |
steelthread/mimeograph | A CoffeeScript library for extracting text from PDFs and creating searchable files | 28 |
leofcardoso/pdf2pdfocr | A tool to extract text from PDFs and add a searchable layer to them | 274 |
nikolamilosevic86/tabinout | A framework for extracting information from tables in scientific literature using a rule-based approach. | 41 |
nissl-lab/toxy | A .NET framework for extracting text from various document formats across multiple platforms. | 359 |
retextjs/retext-keywords | A plugin to extract keywords and key phrases from text documents. | 327 |
aeksco/aws-pdf-textract-pipeline | A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services | 164 |
eyurtsev/kor | Extracts structured data from unstructured text using large language models | 1,629 |
gunnarmorling/quarkus-pdf-extract | A Quarkus-based microservice to extract text from PDF files | 24 |
uglytoad/pdfpig | A C# library for extracting and analyzing text from PDF files | 1,733 |
tabulapdf/tabula-java | Extracts tables from PDF files using Java | 1,843 |
aymericbeaumet/squeeze | A tool to extract relevant information from text | 17 |
stephenbrannon/iocextractor | Extracts and organizes Indicators of Compromise from unstructured text files into structured formats. | 135 |
danburzo/hred | Extracts data from HTML or XML documents to JSON using a CSS selector-like query language | 69 |
philipjkim/goreadability | Extracts readable content from web pages using Open Graph and traditional readability rules. | 69 |