 ambrosia
 Dataset cleaner
 A command-line tool for improving text datasets used in machine learning by removing duplicates and filtering out unwanted data.
Clean up your LLM datasets.
114 stars
 0 watching
 2 forks
 
Language: Go 
Last commit: over 2 years ago
Linked from 1 awesome list
 Related projects:
| Repository | Description | Stars | 
|---|---|---|
|  | A tool to quantify and remove cell-specific mRNA contamination from single-cell RNA-seq data. | 260 | 
|  | A library that provides dynamic data cleaning and filtering capabilities for PyTorch datasets and samplers. | 378 |
|  | A tool for cleaning up data in MongoDB databases. | 9 | 
|  | An automated tool for identifying and cleaning up unused or unhealthy Kubernetes resources to maintain efficient cluster performance. | 323 | 
|  | A tool designed to automatically delete and disable unwanted AWS resources, allowing for automated security management. | 59 | 
|  | A C-based Elixir binding for a popular HTML cleaning tool. | 9 | 
|  | An algorithm and package for handling noisy labels in binary classification problems. | 82 |
|  | A collection of R functions for simplifying data cleaning and preparation tasks. | 1,398 | 
|  | Analyzes and cleans up media libraries in Sitecore databases to optimize storage usage. | 2 | 
|  | A Python package implementing a deep learning model for batch correction in single-cell RNA sequencing data. | 13 |
|  | A tool to help data scientists manage and annotate natural language data for training AI models. | 1,405 |
|  | A package providing functions to access and process Canadian hydrometric data from various sources. | 71 | 
|  | A package providing a standardized mechanism for accessing and utilizing large-scale environmental data in R. | 8 | 
|  | A tool to clean up and manage databases during testing by temporarily locking tables to prevent race conditions. | 161 | 
|  | A tool to improve data quality and efficiency for large language models. | 987 |