NeMo-Curator
AI dataset curator
A tool for efficiently preparing and curating large datasets for AI model training, leveraging GPU acceleration.
Scalable data preprocessing and curation toolkit for LLMs
609 stars
15 watching
83 forks
Language: Jupyter Notebook
Last commit: 5 days ago
Linked from 1 awesome list
data, data-curation, data-prep, data-preparation, data-processing, data-processing-pipelines, data-quality, datacuration, datarecipes, deduplication, fast-data-processing, fine-tuning, large-language-models, large-scale-data-processing, llm, llm-data-quality, llmapps, python, semantic-deduplication
Related projects:
Repository | Description | Stars |
---|---|---|
databricks/lilac | A tool to improve data quality and efficiency for large language models | 969 |
nvidia/dataset_synthesizer | Generates synthetic images and associated data for training deep learning models | 573 |
ayush1997/visualize_ml | A Python package for data analysis and visualization in machine learning | 200 |
iterative/datachain | An AI-data warehouse that transforms and analyzes unstructured data from various formats | 1,935 |
01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,699 |
code-kern-ai/refinery | A tool to help data scientists manage and annotate natural language data for training AI models | 1,402 |
curiosity-ai/catalyst | A C# Natural Language Processing library with pre-trained models and tools for building custom models | 739 |
pharo-ai/datasets | A Smalltalk library for loading and managing datasets as data frames | 9 |
fairdataihub/fairshare | Software for organizing and sharing biomedical research data according to FAIR guidelines | 75 |
trypromptly/llmstack | A tool for building and deploying generative AI applications with a no-code multi-agent framework | 1,610 |
laion-ai/clip_benchmark | Evaluates and compares the performance of various CLIP-like models on different tasks and datasets | 615 |
mage-os-lab/module-catalog-data-ai | Automates product content generation using AI to improve SEO and customer experience | 25 |
neumino/chateau | An admin interface for managing data in RethinkDB | 207 |
nvidia/sentiment-discovery | Large-scale unsupervised language modeling for robust sentiment classification and related NLP tasks | 1,062 |
alexa/massive | A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 538 |