NeMo-Curator

Data curator

A toolkit for fast and scalable data preparation and curation for large language models

Scalable data preprocessing and curation toolkit for LLMs

672 stars
14 watching
91 forks
Language: Jupyter Notebook
Last commit: 1 day ago
Linked from 1 awesome list

Topics: data, data-curation, data-prep, data-preparation, data-processing, data-processing-pipelines, data-quality, datacuration, datarecipes, deduplication, fast-data-processing, fine-tuning, large-language-models, large-scale-data-processing, llm, llm-data-quality, llmapps, python, semantic-deduplication

Related projects:

Repository | Description | Stars
databricks/lilac | A tool to improve data quality and efficiency for large language models | 987
nvidia/dataset_synthesizer | Generates synthetic images and associated data for training deep learning models | 574
ayush1997/visualize_ml | A Python package for data analysis and visualization in machine learning | 198
iterative/datachain | A Python-based framework for transforming and analyzing unstructured data from various formats like images, audio, videos, text, and PDFs | 2,088
01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,743
code-kern-ai/refinery | A tool to help data scientists manage and annotate natural language data for training AI models | 1,405
curiosity-ai/catalyst | A C# Natural Language Processing library with pre-trained models and tools for building custom models | 752
pharo-ai/datasets | A Smalltalk library for loading and managing datasets as data frames | 9
fairdataihub/fairshare | Software for organizing and sharing biomedical research data according to FAIR guidelines | 75
trypromptly/llmstack | A tool for building and deploying generative AI applications with a no-code multi-agent framework | 1,659
laion-ai/clip_benchmark | Evaluates and compares the performance of various CLIP-like models on different tasks and datasets | 632
mage-os-lab/module-catalog-data-ai | Automates product content generation using AI to improve SEO and customer experience | 26
neumino/chateau | A data explorer tool for RethinkDB databases | 207
nvidia/sentiment-discovery | Large-scale unsupervised language modeling for robust sentiment classification and related NLP tasks | 1,061
alexa/massive | A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 541