datatrove
Data pipeline framework
A platform-agnostic data processing framework for large-scale text data pipelines
Frees data processing from scripting madness by providing a set of platform-agnostic, customizable pipeline processing blocks.
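To illustrate what "pipeline processing blocks" can mean in practice, here is a minimal, generic sketch in plain Python. It is not datatrove's API: the names `min_length_filter`, `lowercase`, and `run_pipeline` are hypothetical, and documents are assumed to be simple dicts with a `text` field. Each block takes a stream of documents and yields a new stream, so blocks can be chained in any order.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical types for illustration only (not datatrove's classes).
Doc = dict
Block = Callable[[Iterable[Doc]], Iterator[Doc]]


def min_length_filter(min_chars: int) -> Block:
    """Build a block that drops documents shorter than min_chars."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc
    return block


def lowercase() -> Block:
    """Build a block that lowercases the text of every document."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            yield {**doc, "text": doc["text"].lower()}
    return block


def run_pipeline(source: Iterable[Doc], blocks: List[Block]) -> List[Doc]:
    """Chain the blocks lazily, then materialize the final stream."""
    stream: Iterable[Doc] = iter(source)
    for block in blocks:
        stream = block(stream)
    return list(stream)


docs = [{"text": "Hi"}, {"text": "Hello World"}]
result = run_pipeline(docs, [min_length_filter(5), lowercase()])
```

Because every block shares the same stream-in, stream-out shape, the same list of blocks could in principle be handed to different runners (local, multi-process, cluster), which is the sense of "platform-agnostic" assumed here.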
2k stars
46 watching
146 forks
Language: Python
last commit: 6 days ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
giacbrd/smartpipeline | A framework for designing and executing concurrent data pipelines with a focus on simplicity and efficiency | 23 |
vectaport/flowgraph | A software framework for building scalable, asynchronous data pipelines with explicit back-pressure management and logging capabilities. | 60 |
pdpipe/pdpipe | A tool for creating and managing data pipelines with pandas DataFrames | 716 |
ypares/porcupine | A tool that enables data manipulation and analysis pipelines to be flexible, reusable, and reproducible in different environments | 89 |
databiosphere/toil | A workflow management system designed to efficiently run pipelines in various environments. | 901 |
log2timeline/dftimewolf | A framework for orchestrating data collection, processing, and export | 296 |
dataform-co/dataform | A framework for managing data operations in BigQuery using SQL and software engineering best practices | 850 |
galaxyproject/galaxy | An integrated framework for data-intensive scientific analysis and workflow management | 1,410 |
mara/mara-pipelines | A lightweight ETL framework providing a simple way to define and execute data transformation pipelines using declarative Python code. | 2,081 |
olirice/flupy | A library that provides a fluent interface for processing data pipelines in Python without holding large amounts of memory | 193 |
johnsonc/lambdo | A workflow engine for unifying feature engineering and machine learning operations in data analysis pipelines | 1 |
valeriobasile/learningbyreading | A software framework for building NLP pipelines with semantic parsing, word sense disambiguation, and entity linking capabilities. | 82 |
intentmedia/mario | A library that enables the definition of complex data pipelines in a functional, typesafe, and efficient way using a declarative syntax | 139 |
druths/xp | A tool for creating flexible and self-documenting data science pipelines | 56 |
datasalt/pangool | A Java framework that simplifies Hadoop's MapReduce API to build efficient data processing pipelines | 57 |