datatrove

Data pipeline framework

A platform-agnostic data processing framework for large-scale text data pipelines

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
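
The idea of composable pipeline processing blocks can be sketched in plain Python. The sketch below is only an illustration under stated assumptions: it does not use datatrove's actual API, and every name in it (`read_docs`, `min_length_filter`, `lowercase`, `run_pipeline`) is hypothetical. Each block takes a stream of documents and yields a new stream, so blocks can be reordered or swapped without touching the others.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical names for illustration only; this is NOT datatrove's API.
Doc = dict
Block = Callable[[Iterable[Doc]], Iterator[Doc]]


def read_docs(texts: List[str]) -> Block:
    """Source block: ignores its input and emits one document per text."""
    def block(_docs: Iterable[Doc]) -> Iterator[Doc]:
        for i, text in enumerate(texts):
            yield {"id": i, "text": text}
    return block


def min_length_filter(min_chars: int) -> Block:
    """Filter block: keeps documents with at least `min_chars` characters."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc
    return block


def lowercase(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Transform block: lowercases the text of every document."""
    for doc in docs:
        yield {**doc, "text": doc["text"].lower()}


def run_pipeline(blocks: List[Block]) -> List[Doc]:
    """Chain the blocks lazily, then materialize the final stream."""
    stream: Iterable[Doc] = iter(())
    for block in blocks:
        stream = block(stream)
    return list(stream)


result = run_pipeline([
    read_docs(["Hello World", "hi", "Data Pipelines"]),
    min_length_filter(5),
    lowercase,
])
print(result)
```

Because each block is a generator, documents flow through one at a time and nothing is held in memory until the final `list(...)` call.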


2k stars
46 watching
146 forks
Language: Python
Last commit: 6 days ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
giacbrd/smartpipeline | A framework for designing and executing concurrent data pipelines with a focus on simplicity and efficiency | 23
vectaport/flowgraph | A software framework for building scalable, asynchronous data pipelines with explicit back-pressure management and logging capabilities | 60
pdpipe/pdpipe | A tool for creating and managing data pipelines with pandas DataFrames | 716
ypares/porcupine | A tool that enables data manipulation and analysis pipelines to be flexible, reusable, and reproducible in different environments | 89
databiosphere/toil | A workflow management system designed to efficiently run pipelines in various environments | 901
log2timeline/dftimewolf | A framework for orchestrating data collection, processing, and export | 296
dataform-co/dataform | A framework for managing data operations in BigQuery using SQL and software engineering best practices | 850
galaxyproject/galaxy | An integrated framework for data-intensive scientific analysis and workflow management | 1,410
mara/mara-pipelines | A lightweight ETL framework providing a simple way to define and execute data transformation pipelines using declarative Python code | 2,081
olirice/flupy | A library that provides a fluent interface for processing data pipelines in Python without holding large amounts of memory | 193
johnsonc/lambdo | A workflow engine for unifying feature engineering and machine learning operations in data analysis pipelines | 1
valeriobasile/learningbyreading | A software framework for building NLP and entity linking pipelines with semantic parsing, word sense disambiguation, and entity linking capabilities | 82
intentmedia/mario | A library that enables the definition of complex data pipelines in a functional, typesafe, and efficient way using a declarative syntax | 139
druths/xp | A tool for creating flexible and self-documenting data science pipelines | 56
datasalt/pangool | A Java framework that simplifies Hadoop's MapReduce API to build efficient data processing pipelines | 57