ambrosia

Dataset cleaner

A command-line tool for improving text datasets used in machine learning by removing duplicates and filtering out unwanted data.

Clean up your LLM datasets

GitHub

114 stars
0 watching
2 forks
Language: Go
last commit: over 1 year ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
|---|---|---|
| constantamateur/soupx | A tool to quantify and remove cell-specific mRNA contamination from single-cell RNA-seq data. | 259 |
| msamogh/nonechucks | Library that provides dynamic data cleaning and filtering capabilities for PyTorch datasets and samplers | 377 |
| databasecleaner/database_cleaner-mongoid | A tool for cleaning up data in MongoDB databases. | 9 |
| gianlucam76/k8s-cleaner | A Kubernetes controller that identifies and removes unhealthy or unused resources to maintain a clean and efficient cluster. | 313 |
| sendgrid/krampus | A tool designed to automatically delete and disable unwanted AWS resources, allowing for automated security management. | 59 |
| f34nk/tidy_ex | A C-based Elixir binding for a popular HTML cleaning tool. | 9 |
| cgnorthcutt/rankpruning | An algorithm and package for handling noisy labels in binary classification problems | 82 |
| sfirke/janitor | A collection of R functions for simplifying data cleaning and preparation tasks. | 1,392 |
| robhabraken/shrink | Analyzes and cleans up media libraries in Sitecore databases to optimize storage usage. | 2 |
| aprilyuge/respan | A Python package implementing a deep learning model for batch correction in single-cell RNA sequencing data | 13 |
| code-kern-ai/refinery | A tool to help data scientists manage and annotate natural language data for training AI models | 1,402 |
| ropensci/tidyhydat | A package providing functions to access and process Canadian hydrometric data from various sources. | 71 |
| niehs/amadeus | A package providing a standardized mechanism for accessing and utilizing large-scale environmental data in R. | 7 |
| khaiql/dbcleaner | A tool to clean up and manage databases during testing by temporarily locking tables to prevent race conditions. | 161 |
| databricks/lilac | A tool to improve data quality and efficiency for large language models | 969 |