aws-pdf-textract-pipeline

PDF extractor

A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services

Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.
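To make "transforming their contents into structured data" concrete, here is a minimal sketch of one possible post-processing step. It is not taken from this repository. It assumes Textract-style output, where each block carries `BlockType`, `Text`, `Page`, and `Confidence` fields, and groups the `LINE` blocks by page. The `TextractBlock` and `PageText` interfaces and the `linesByPage` helper are hypothetical names introduced only for illustration.

```typescript
// Minimal local model of the Textract block fields used here.
// The real Block type returned by Textract has many more fields.
interface TextractBlock {
  BlockType?: string; // e.g. "PAGE", "LINE", "WORD"
  Text?: string;
  Page?: number;
  Confidence?: number;
}

// Hypothetical output shape: the text lines found on one page.
interface PageText {
  page: number;
  lines: string[];
}

// Hypothetical helper: collect LINE blocks per page, in input order,
// dropping lines whose confidence is below minConfidence.
function linesByPage(blocks: TextractBlock[], minConfidence = 0): PageText[] {
  const pages = new Map<number, string[]>();
  for (const block of blocks) {
    if (block.BlockType !== "LINE" || block.Text === undefined) continue;
    // Assumption: a block without a Confidence value is kept.
    if ((block.Confidence ?? 100) < minConfidence) continue;
    // Assumption: a block without a Page value belongs to page 1.
    const page = block.Page ?? 1;
    const lines = pages.get(page) ?? [];
    lines.push(block.Text);
    pages.set(page, lines);
  }
  // Return pages in ascending page order.
  return Array.from(pages.entries())
    .sort(([a], [b]) => a - b)
    .map(([page, lines]) => ({ page, lines }));
}

// Usage with a hand-written sample (not real Textract output).
const sample: TextractBlock[] = [
  { BlockType: "PAGE", Page: 1 },
  { BlockType: "LINE", Text: "Invoice 42", Page: 2, Confidence: 99 },
  { BlockType: "LINE", Text: "Total: 10", Page: 1, Confidence: 98 },
  { BlockType: "WORD", Text: "Total:", Page: 1, Confidence: 98 },
  { BlockType: "LINE", Text: "blurry", Page: 1, Confidence: 20 },
];
console.log(JSON.stringify(linesByPage(sample, 50)));
```

Keeping this step as a pure function means it can be unit-tested without calling AWS, which suits a project that lists jest among its topics.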


164 stars
3 watching
18 forks
Language: TypeScript
Last commit: 6 months ago
Linked from 1 awesome list

aws, aws-cdk, aws-textract, cdk, cloudformation, data-pipeline, dynamodb, jest, lambda, pdf, puppeteer, s3, serverless, sns, textract, typescript, webscraping


Related projects:

Repository | Description | Stars
steelthread/mimeograph | A CoffeeScript library for extracting text from PDFs and creating searchable files | 28
fourdigits/wagtail_textract | A Django package that enhances Wagtail's document search with text extraction capabilities using Tesseract and Textract libraries. | 33
leofcardoso/pdf2pdfocr | A tool to extract text from PDFs and add a searchable layer to them | 274
ckorzen/pdf-text-extraction-benchmark | Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles | 65
uglytoad/pdfpig | A C# library for extracting and analyzing text from PDF files | 1,733
aymericbeaumet/squeeze | A tool to extract relevant information from text | 17
gamallo/galextra | A multi-language term extractor that uses morphosyntax tagging and filtering to identify multi-word terms from plain text input. | 2
tabulapdf/tabula-java | Extracts tables from PDF files using Java | 1,843
malfrats/xeuledoc | A tool to fetch information about public Google documents from various services | 846
kevbite/kevsoft.pdftk | A .NET library that uses the PDFtk binary to manipulate and process PDF files | 37
tecracer/cdk-templates | Provides reusable templates and tools for deploying AWS CDK applications | 118
jgranstrom/sass-extract | Extracts structured variables from Sass files and makes them available in JavaScript for use in styles or dynamic content. | 186
ravsii/textra | A package that extracts and works with Go struct fields as values, including type information. | 6
gunnarmorling/quarkus-pdf-extract | A Quarkus-based microservice to extract text from PDF files | 24
idea-fasoc/datasheet-scrubber | Automates extraction of key circuit information from PDF datasheets/documents to build a database of commercial off-the-shelf IP. | 51