aws-pdf-textract-pipeline
PDF extractor
A data pipeline for extracting structured data from PDFs using AWS Textract and cloud-based services
Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.
164 stars
3 watching
18 forks
Language: TypeScript
last commit: 6 months ago
Linked from 1 awesome list
aws, aws-cdk, aws-textract, cdk, cloudformation, data-pipeline, dynamodb, jest, lambda, pdf, puppeteer, s3, serverless, sns, textract, typescript, webscraping
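To illustrate the "structured data" step in the description, here is a minimal sketch of how Textract output might be post-processed in TypeScript. It is not taken from the repository. The `TextractBlock` interface is a trimmed-down local model of the `Block` objects Textract returns (only `BlockType`, `Text`, and `Page` are kept), and `linesByPage` is a hypothetical helper name. Treating a missing `Page` as page 1 is an assumption made for this example.

```typescript
// Minimal local model of the Textract Block fields used in this sketch.
// Real responses carry many more fields (Geometry, Relationships, Confidence, ...).
interface TextractBlock {
  BlockType: string; // e.g. "PAGE", "LINE", "WORD"
  Text?: string;     // present on LINE and WORD blocks
  Page?: number;     // page number of the block, when provided
}

// Hypothetical helper: collect the text of LINE blocks, grouped by page.
function linesByPage(blocks: TextractBlock[]): Map<number, string[]> {
  const pages = new Map<number, string[]>();
  for (const block of blocks) {
    // Skip PAGE and WORD blocks, and any LINE block without text.
    if (block.BlockType !== "LINE" || block.Text === undefined) continue;
    // Assumption for this sketch: a block without a Page value is on page 1.
    const page = block.Page ?? 1;
    const lines = pages.get(page) ?? [];
    lines.push(block.Text);
    pages.set(page, lines);
  }
  return pages;
}
```

In a pipeline like the one described, a function of this kind would plausibly run after the Textract job completes and before the results are stored, but the repository's actual processing steps may differ.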
Related projects:
Repository | Description | Stars |
---|---|---|
steelthread/mimeograph | A CoffeeScript library for extracting text from PDFs and creating searchable files | 28 |
fourdigits/wagtail_textract | A Django package that enhances Wagtail's document search with text extraction capabilities using Tesseract and Textract libraries. | 33 |
leofcardoso/pdf2pdfocr | A tool to extract text from PDFs and add a searchable layer to them | 274 |
ckorzen/pdf-text-extraction-benchmark | Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles | 65 |
uglytoad/pdfpig | A C# library for extracting and analyzing text from PDF files | 1,733 |
aymericbeaumet/squeeze | A tool to extract relevant information from text | 17 |
gamallo/galextra | A multi-language term extractor that uses morphosyntax tagging and filtering to identify multi-word terms from plain text input. | 2 |
tabulapdf/tabula-java | Extracts tables from PDF files using Java | 1,843 |
malfrats/xeuledoc | A tool to fetch information about public Google documents from various services | 846 |
kevbite/kevsoft.pdftk | A .NET library that uses the PDFtk binary to manipulate and process PDF files | 37 |
tecracer/cdk-templates | Provides reusable templates and tools for deploying AWS CDK applications | 118 |
jgranstrom/sass-extract | Extracts structured variables from Sass files and makes them available in JavaScript for use in styles or dynamic content. | 186 |
ravsii/textra | A package that extracts and works with Go struct fields as values, including type information. | 6 |
gunnarmorling/quarkus-pdf-extract | A Quarkus-based microservice to extract text from PDF files | 24 |
idea-fasoc/datasheet-scrubber | Automates extraction of key circuit information from PDF datasheets/documents to build a database of commercial off-the-shelf IP. | 51 |