ambrosia
Dataset cleaner
A command-line tool for improving text datasets used in machine learning by removing duplicates and filtering out unwanted data.
Clean up your LLM datasets.
114 stars
0 watching
2 forks
Language: Go
Last commit: over 1 year ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
constantamateur/soupx | A tool to quantify and remove cell-specific mRNA contamination from single-cell RNA-seq data. | 259 |
msamogh/nonechucks | Library that provides dynamic data cleaning and filtering capabilities for PyTorch datasets and samplers | 377 |
databasecleaner/database_cleaner-mongoid | A tool for cleaning up data in MongoDB databases. | 9 |
gianlucam76/k8s-cleaner | A Kubernetes controller that identifies and removes unhealthy or unused resources to maintain a clean and efficient cluster. | 313 |
sendgrid/krampus | A tool designed to automatically delete and disable unwanted AWS resources, allowing for automated security management. | 59 |
f34nk/tidy_ex | A C-based Elixir binding for a popular HTML cleaning tool. | 9 |
cgnorthcutt/rankpruning | An algorithm and package for handling noisy labels in binary classification problems | 82 |
sfirke/janitor | A collection of R functions for simplifying data cleaning and preparation tasks. | 1,392 |
robhabraken/shrink | Analyzes and cleans up media libraries in Sitecore databases to optimize storage usage. | 2 |
aprilyuge/respan | A Python package implementing a deep learning model for batch correction in single-cell RNA sequencing data | 13 |
code-kern-ai/refinery | A tool to help data scientists manage and annotate natural language data for training AI models | 1,402 |
ropensci/tidyhydat | A package providing functions to access and process Canadian hydrometric data from various sources. | 71 |
niehs/amadeus | A package providing a standardized mechanism for accessing and utilizing large-scale environmental data in R. | 7 |
khaiql/dbcleaner | A tool to clean up and manage databases during testing by temporarily locking tables to prevent race conditions. | 161 |
databricks/lilac | A tool to improve data quality and efficiency for large language models | 969 |