NeMo-Curator

AI dataset curator

A tool for efficiently preparing and curating large datasets for AI model training, leveraging GPU acceleration.

Scalable data preprocessing and curation toolkit for LLMs

GitHub

609 stars
15 watching
83 forks
Language: Jupyter Notebook
Last commit: 5 days ago
Linked from 1 awesome list

data, data-curation, data-prep, data-preparation, data-processing, data-processing-pipelines, data-quality, datacuration, datarecipes, deduplication, fast-data-processing, fine-tuning, large-language-models, large-scale-data-processing, llm, llm-data-quality, llmapps, python, semantic-deduplication

Related projects:

Repository | Description | Stars
databricks/lilac | A tool to improve data quality and efficiency for large language models | 969
nvidia/dataset_synthesizer | Generates synthetic images and associated data for training deep learning models | 573
ayush1997/visualize_ml | A Python package for data analysis and visualization in machine learning | 200
iterative/datachain | An AI-data warehouse that transforms and analyzes unstructured data from various formats | 1,935
01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,699
code-kern-ai/refinery | A tool to help data scientists manage and annotate natural language data for training AI models | 1,402
curiosity-ai/catalyst | A C# Natural Language Processing library with pre-trained models and tools for building custom models | 739
pharo-ai/datasets | A Smalltalk library for loading and managing datasets as data frames | 9
fairdataihub/fairshare | Software for organizing and sharing biomedical research data according to FAIR guidelines | 75
trypromptly/llmstack | A tool for building and deploying generative AI applications with a no-code multi-agent framework | 1,610
laion-ai/clip_benchmark | Evaluates and compares the performance of various CLIP-like models on different tasks and datasets | 615
mage-os-lab/module-catalog-data-ai | Automates product content generation using AI to improve SEO and customer experience | 25
neumino/chateau | An admin interface for managing data in RethinkDB | 207
nvidia/sentiment-discovery | Large-scale unsupervised language modeling for robust sentiment classification and related NLP tasks | 1,062
alexa/massive | A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 538