aws-pdf-textract-pipeline

PDF extractor

A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services

Data pipeline for crawling PDFs from the web and transforming their contents into structured data using AWS Textract. Built with AWS CDK and TypeScript.

GitHub

164 stars

3 watching

18 forks

Language: TypeScript

last commit: about 2 years ago

Linked from 1 awesome list

Tags: aws, aws-cdk, aws-textract, cdk, cloudformation, data-pipeline, dynamodb, jest, lambda, pdf, puppeteer, s3, serverless, sns, textract, typescript, webscraping

Backlinks from these awesome lists:

kalaiser/awesome-cdk

Related projects:

Repository	Description	Stars
steelthread/mimeograph	A CoffeeScript library for extracting text from PDF files and creating searchable documents with OCR capabilities	28
fourdigits/wagtail_textract	A Django package that enhances Wagtail's document search with text extraction capabilities using Tesseract and Textract libraries.	33
leofcardoso/pdf2pdfocr	A tool to extract text from PDFs and add a searchable layer to them	279
ckorzen/pdf-text-extraction-benchmark	Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles	65
uglytoad/pdfpig	A C# library for extracting and analyzing text from PDF files	1,794
aymericbeaumet/squeeze	A tool to extract relevant information from text	17
gamallo/galextra	A multi-language term extractor that uses morphosyntax tagging and filtering to identify multi-word terms from plain text input.	2
tabulapdf/tabula-java	Extracts tables from PDF files using Java	1,859
malfrats/xeuledoc	A tool to fetch information about public Google documents from various services	856
kevbite/kevsoft.pdftk	A .NET library that uses the PDFtk binary to manipulate and process PDF files	37
tecracer/cdk-templates	Provides reusable templates and tools for deploying AWS CDK applications	119
jgranstrom/sass-extract	Extracts structured variables from Sass files and makes them available in JavaScript for use in styles or dynamic content.	186
ravsii/textra	A package that extracts and works with Go struct fields as values, including type information.	6
gunnarmorling/quarkus-pdf-extract	A Quarkus-based microservice to extract text from PDF files	24
idea-fasoc/datasheet-scrubber	Automates extraction of key circuit information from PDF datasheets/documents to build a database of commercial off-the-shelf IP.	51