NeMo-Curator

Data curator

A toolkit for fast and scalable data preparation and curation for large language models

Scalable data preprocessing and curation toolkit for LLMs

GitHub

672 stars

14 watching

91 forks

Language: Jupyter Notebook

Last commit: over 1 year ago

Linked from 1 awesome list

Topics: data, data-curation, data-prep, data-preparation, data-processing, data-processing-pipelines, data-quality, datacuration, datarecipes, deduplication, fast-data-processing, fine-tuning, large-language-models, large-scale-data-processing, llm, llm-data-quality, llmapps, python, semantic-deduplication

Backlinks from these awesome lists:

ethicalml/awesome-production-machine-learning

Related projects:

Repository	Description	Stars
databricks/lilac	A tool to improve data quality and efficiency for large language models	987
nvidia/dataset_synthesizer	Generates synthetic images and associated data for training deep learning models	574
ayush1997/visualize_ml	A Python package for data analysis and visualization in machine learning	198
iterative/datachain	A Python-based framework for transforming and analyzing unstructured data from various formats like images, audio, videos, text, and PDFs	2,088
01-ai/yi	A series of large language models trained from scratch to excel in multiple NLP tasks	7,743
code-kern-ai/refinery	A tool to help data scientists manage and annotate natural language data for training AI models	1,405
curiosity-ai/catalyst	A C# Natural Language Processing library with pre-trained models and tools for building custom models	752
pharo-ai/datasets	A Smalltalk library for loading and managing datasets as data frames	9
fairdataihub/fairshare	Software for organizing and sharing biomedical research data according to FAIR guidelines	75
trypromptly/llmstack	A tool for building and deploying generative AI applications with a no-code multi-agent framework	1,659
laion-ai/clip_benchmark	Evaluates and compares the performance of various CLIP-like models on different tasks and datasets	632
mage-os-lab/module-catalog-data-ai	Automates product content generation using AI to improve SEO and customer experience	26
neumino/chateau	A data explorer tool for RethinkDB databases	207
nvidia/sentiment-discovery	Large-scale unsupervised language modeling for robust sentiment classification and related NLP tasks	1,061
alexa/massive	A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset	541