AlpacaDataCleaned

Language data set

A cleaned and curated version of the Stanford Alpaca dataset, used to train large language models

Alpaca dataset from Stanford, cleaned and curated

2k stars

27 watching

153 forks

Language: Python

Last commit: over 3 years ago

Linked from 1 awesome list

Backlinks from these awesome lists:

yaodongc/awesome-instruction-dataset

Related projects:

Repository	Description	Stars
alvations/seedling	A corpus and API for human language data	11
carbonz0/alpaca-chinese-dataset	A dataset for training and fine-tuning large language models on Chinese text prompts.	392
pointnetwork/point-alpaca	Recreated weights of the Stanford Alpaca model, fine-tuned for a specific task	406
alvations/sugarlike	A tool that identifies languages in text by comparing them to a reference set of patterns.	1
alvations/sugali	A system designed to identify the language of an arbitrary text string using machine learning and multiple data sources.	2
google-research/flan	A repository providing tools and datasets to fine-tune language models for specific tasks	1,484
code-kern-ai/refinery	A tool to help data scientists manage and annotate natural language data for training AI models	1,405
flagai-open/aquila2	Provides pre-trained language models and tools for fine-tuning and evaluation	439
matbahasa/talpco	A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research.	49
datacanvasio/alaya	A pre-trained AI model that can engage in natural language conversations with high accuracy and understanding.	43
airaria/visual-chinese-llama-alpaca	Develops a multimodal Chinese language model with visual capabilities	429
karthikncode/nlp-datasets	A curated list of Natural Language Processing datasets used to train and evaluate NLP models.	919
alpacahq/alpaca-trade-api-python	A Python client for Alpaca's trade API	1,745
vhellendoorn/code-lms	A guide to using pre-trained large language models in source code analysis and generation	1,789
sparklingpandas/sparklingpandas	Enables distributed data analysis using PySpark and Pandas APIs	362