NeMo-Curator
Data curator
A toolkit for fast and scalable data preparation and curation for large language models
Scalable data preprocessing and curation toolkit for LLMs
672 stars
14 watching
91 forks
Language: Jupyter Notebook
Last commit: 2 months ago
Linked from 1 awesome list
data, data-curation, data-prep, data-preparation, data-processing, data-processing-pipelines, data-quality, datacuration, datarecipes, deduplication, fast-data-processing, fine-tuning, large-language-models, large-scale-data-processing, llm, llm-data-quality, llmapps, python, semantic-deduplication
Related projects:
Repository | Description | Stars |
---|---|---|
| A tool to improve data quality and efficiency for large language models | 987 |
| Generates synthetic images and associated data for training deep learning models | 574 |
| A Python package for data analysis and visualization in machine learning | 198 |
| A Python-based framework for transforming and analyzing unstructured data from various formats like images, audio, videos, text, and PDFs. | 2,088 |
| A series of large language models trained from scratch to excel in multiple NLP tasks | 7,743 |
| A tool to help data scientists manage and annotate natural language data for training AI models | 1,405 |
| A C# Natural Language Processing library with pre-trained models and tools for building custom models | 752 |
| A Smalltalk library for loading and managing datasets as data frames. | 9 |
| Software for organizing and sharing biomedical research data according to FAIR guidelines | 75 |
| A tool for building and deploying generative AI applications with a no-code multi-agent framework | 1,659 |
| Evaluates and compares the performance of various CLIP-like models on different tasks and datasets. | 632 |
| Automates product content generation using AI to improve SEO and customer experience. | 26 |
| A data explorer tool for RethinkDB databases | 207 |
| Large-scale unsupervised language modeling for robust sentiment classification and related NLP tasks | 1,061 |
| A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 541 |