optimus

Data prep library

A Python library that provides a simple API for data preparation and analysis on various big-data engines

Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

GitHub

1k stars
38 watching
232 forks
Language: Python
Last commit: about 2 months ago
Linked from 4 awesome lists

Topics: big-data-cleaning, bigdata, cudf, dask, dask-cudf, data-analysis, data-cleaner, data-cleaning, data-cleansing, data-exploration, data-extraction, data-preparation, data-profiling, data-science, data-transformation, data-wrangling, machine-learning, pyspark, spark


Related projects:

Repository | Description | Stars
sfu-db/dataprep | A Python library for rapidly collecting, cleaning, and visualizing data with minimal code | 2,088
ibm/data-prep-kit | A toolkit for streamlining data preparation for developers building large language model applications | 363
vagmcs/optimus | A mathematical optimization library written in Scala, supporting linear and quadratic programming with various solver options | 141
iceye-ltd/icecube | A Python library designed to organize SAR images and annotations for supervised machine learning applications | 81
tum-i4/oedipus | A framework that uses machine learning to uncover metadata from obfuscated programs | 11
pytorch/data | Provides scalable, performant data loading solutions and utilities to be shared by PyTorch domain libraries | 1,149
zygmuntz/kaggle-merck | Provides tools to prepare and process data for the Merck challenge at Kaggle | 10
primlabs/bucket | A library providing a simple storage solution using stable memory, allowing canisters to store data without GC costs and upgradeability | 31
msamogh/nonechucks | Library that provides dynamic data cleaning and filtering capabilities for PyTorch datasets and samplers | 378
dropbox/pyhive | Provides interfaces to connect and interact with data sources like Hive and Presto using Python | 1,676
catalyst-cooperative/pudl | Provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists | 492
opendatacube/datacube-core | A Python-based platform for integrated gridded data analysis from decades of Earth observation satellite data | 518
maximtrp/scikit-posthocs | Provides tools for conducting pairwise multiple comparisons tests in statistical data analysis | 354
pydap/pydap | A Python library for accessing and manipulating scientific data over the internet using the OPeNDAP protocol | 139
ekami/torchlite | High-level library to simplify machine learning tasks by abstracting repetitive code | 32