NeMo-Curator
Data curator
A toolkit for fast and scalable data preparation and curation for large language models
Scalable data preprocessing and curation toolkit for LLMs
672 stars
14 watching
91 forks
Language: Jupyter Notebook
last commit: 1 day ago
Linked from 1 awesome list
Tags: data, data-curation, data-prep, data-preparation, data-processing, data-processing-pipelines, data-quality, datacuration, datarecipes, deduplication, fast-data-processing, fine-tuning, large-language-models, large-scale-data-processing, llm, llm-data-quality, llmapps, python, semantic-deduplication
Related projects:
Repository | Description | Stars |
---|---|---|
databricks/lilac | A tool to improve data quality and efficiency for large language models | 987 |
nvidia/dataset_synthesizer | Generates synthetic images and associated data for training deep learning models | 574 |
ayush1997/visualize_ml | A Python package for data analysis and visualization in machine learning | 198 |
iterative/datachain | A Python-based framework for transforming and analyzing unstructured data from various formats like images, audio, videos, text, and PDFs | 2,088 |
01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,743 |
code-kern-ai/refinery | A tool to help data scientists manage and annotate natural language data for training AI models | 1,405 |
curiosity-ai/catalyst | A C# Natural Language Processing library with pre-trained models and tools for building custom models | 752 |
pharo-ai/datasets | A Smalltalk library for loading and managing datasets as data frames | 9 |
fairdataihub/fairshare | Software for organizing and sharing biomedical research data according to FAIR guidelines | 75 |
trypromptly/llmstack | A tool for building and deploying generative AI applications with a no-code multi-agent framework | 1,659 |
laion-ai/clip_benchmark | Evaluates and compares the performance of various CLIP-like models on different tasks and datasets | 632 |
mage-os-lab/module-catalog-data-ai | Automates product content generation using AI to improve SEO and customer experience | 26 |
neumino/chateau | A data explorer tool for RethinkDB databases | 207 |
nvidia/sentiment-discovery | Large-scale unsupervised language modeling for robust sentiment classification and related NLP tasks | 1,061 |
alexa/massive | A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 541 |