awesome-instruction-dataset
A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca).
Topics: awesome-lists, datasets, gpt-3, gpt-4, instruction-following, instruction-tuning, language-model, llama
Related lists:
awesome-text/visual-instruction-tuning-dataset
nichtdax/awesome-totally-open-chatgpt: A codebase of totally open alternatives to ChatGPT
Tag format: (author/dataset)|size|lang|task|gen, where lang is EN (English), CN (Chinese), or ML (multilingual); task is MT (multi-task) or TS (task-specific); and gen is HG (human-generated), SI (self-instruct, model-generated), MIX (human and model), or COL (collection of existing datasets).
Table of Contents / The Multi-modal Instruction Datasets
(Vision-CAIR/MiniGPT-4)|5K|EN|MT|MIX
(haotian-liu/LLaVA)|150K|EN|MT|MIX
Table of Contents / The Instruction-following Datasets
(tatsu-lab/Alpaca)|52K|EN|MT|SI
(gururise/Cleaned Alpaca)|52K|EN|MT|SI
(XueFuzhao/InstructionWild)|52K|EN|CN|MT|SI
(JosephusCheung/GuanacoDataset)|534K|ML|MT|SI
(Hello-SimpleAI/HC3)|24K|EN|MT|MIX
(Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX
(allenai/prosocial-dialog)|58K|EN|MT|MIX
(allenai/natural-instructions)|1.6K|ML|MT|HG
(bigscience/xP3)|N/A|ML|MT|MIX
(nomic-ai/gpt4all)|437k|EN|MT|COL
(PhoebusSi/Alpaca-CoT)|500k|ML|MT|COL
(google-research/FLAN)|N/A|EN|MT|MIX
(thunlp/UltraChat)|280k|EN|TS|MIX
(cascip/ChatAlpaca)|10k|EN|MT|MIX
(YeungNLP/firefly-train-1.1M)|1100k|CN|MT|COL
(orhonovich/unnatural-instructions)|240K|EN|MT|MIX
(Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI
(databrickslabs/dolly)|15K|EN|MT|HG
(OpenAssistant/oasst1)|161K|ML|MT|HG
(RyokoAI/ShareGPT52K)|90K|ML|MT|SI
(zjunlp/Mol-Instructions)|2043K|ML|MT|MIX
Table of Contents / Reinforcement Learning from Human Feedback (RLHF) Datasets
(Anthropic/hh-rlhf)|22k|EN|MT|MIX
(thu-coai/Safety-Prompts)|100k|CN|MT|MIX
(HuggingFaceH4/stack-exchange-preferences)|10741k|EN|TS|HG
(stanfordnlp/SHP)|385k|EN|MT|HG
(Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|MT|MIX
The Multi-modal Instruction Datasets / (Vision-CAIR/MiniGPT-4)|5K|EN|MT|MIX
Summary: A high-quality, well-aligned (e.g. more detailed image description) image-text dataset created using a conversation between two bots, similar to ChatCaptioner. The resulting image-text pairs can then be combined with a predefined instruction template for image-instruction-answer finetuning (see the sketch after this entry).
paper: MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
License: BSD 3-Clause
The Multi-modal Instruction Datasets / (Vision-CAIR/MiniGPT-4)|5K|EN|MT|MIX / Related:
Interactive ChatCaptioner for image and video
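To make the template idea above concrete, here is a minimal sketch of an image-instruction-answer prompt builder. The placeholder token <ImageHere>, the function name build_prompt, and the template wording are illustrative assumptions, not the exact format shipped with MiniGPT-4 or LLaVA.

```python
# Minimal sketch of an image-instruction-answer template for visual instruction
# tuning. "<ImageHere>" marks the slot where encoded image features are spliced
# in at training time; the token and wording are assumptions, not the exact
# MiniGPT-4/LLaVA format.
def build_prompt(instruction: str, answer: str) -> str:
    return (
        "###Human: <Img><ImageHere></Img> "
        f"{instruction}\n"
        f"###Assistant: {answer}"
    )

print(build_prompt(
    "Describe this image in detail.",
    "A detailed, well-aligned caption produced by the two-bot conversation.",
))
```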
The Multi-modal Instruction Datasets / (haotian-liu/LLaVA)|150K|EN|MT|MIX
paper: Visual Instruction Tuning
License: CC BY-NC 4.0
The Multi-modal Instruction Datasets / (sunrainyg/InstructCV)|EN|MT|MIX (https://github.com/AlaaLab/InstructCV)
paper: InstructCV
License: CC BY-NC 4.0
The Instruction-following Datasets / (tatsu-lab/Alpaca)|52K|EN|MT|SI
paper: alpaca-blog
License: CC BY-NC 4.0
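Since many entries below reuse Alpaca's record layout, a short sketch may help: each Alpaca record is a JSON object with instruction, input, and output fields, conventionally rendered into a single training prompt. The helper name and the wording below follow the commonly used Alpaca-style template and are meant as an illustration, not as the repository's canonical code.

```python
# Sketch: render one Alpaca-style record (instruction/input/output) into a
# training prompt. The wording follows the widely used Alpaca template;
# adapt it to whatever your training pipeline expects.
def format_alpaca(record: dict) -> str:
    has_input = bool(record.get("input"))
    header = (
        "Below is an instruction that describes a task"
        + (", paired with an input that provides further context" if has_input else "")
        + ". Write a response that appropriately completes the request.\n\n"
    )
    prompt = header + f"### Instruction:\n{record['instruction']}\n\n"
    if has_input:
        prompt += f"### Input:\n{record['input']}\n\n"
    return prompt + f"### Response:\n{record['output']}"

print(format_alpaca({
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}))
```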
The Instruction-following Datasets / (gururise/Cleaned Alpaca)|52K|EN|MT|SI
License: CC BY-NC 4.0
The Instruction-following Datasets / (JosephusCheung/GuanacoDataset)|534K|ML|MT|SI
License: GPL-3.0
The Instruction-following Datasets / (Hello-SimpleAI/HC3)|24K|EN|MT|MIX
paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
License: CC BY-SA 4.0
The Instruction-following Datasets / (Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX
paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
License: CC BY-SA 4.0
The Instruction-following Datasets / (allenai/prosocial-dialog)|58K|EN|MT|MIX
paper: ProsocialDialog: A Prosocial Backbone for Conversational Agents
License: CC BY 4.0
The Instruction-following Datasets / (allenai/natural-instructions)|1.6K|ML|MT|HG
paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
License: Apache License 2.0
The Instruction-following Datasets / (bigscience/xP3)|N/A|ML|MT|MIX
paper: Crosslingual Generalization through Multitask Finetuning
License: Apache License 2.0
The Instruction-following Datasets / (PhoebusSi/Alpaca-CoT)|500k|ML|MT|COL
Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: their GitHub repo continuously collects and combines various instruction-tuning datasets.
License: Apache License 2.0
The Instruction-following Datasets / (nomic-ai/gpt4all)|437k|EN|MT|COL
Summary: gpt4all leverages three publicly available datasets, including laion/OIG.
paper: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
License: MIT License
The Instruction-following Datasets / (teknium1/GPTeacher)|20k+|EN|MT|SI
License: MIT License
The Instruction-following Datasets / (google-research/FLAN)|N/A|EN|MT|MIX
paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
License: Apache License 2.0
The Instruction-following Datasets / (thunlp/UltraChat)|280k|EN|TS|MIX
License: CC BY-NC 4.0
The Instruction-following Datasets / (cascip/ChatAlpaca)|10k|EN|MT|MIX
License: Apache License 2.0
Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
The Instruction-following Datasets / (orhonovich/unnatural-instructions)|240K|EN|MT|MIX
paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
License: MIT License
The Instruction-following Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI
paper: Instruction Tuning with GPT-4
License: CC BY-NC 4.0
The Instruction-following Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI / Related:
(tatsu-lab/Alpaca)|52K|EN|MT|SI
(orhonovich/unnatural-instructions)|240K|EN|MT|MIX
The Instruction-following Datasets / (databrickslabs/dolly)|15K|EN|MT|HG
paper: Free Dolly
License: CC BY-SA 3.0
The Instruction-following Datasets / (OpenAssistant/oasst1)|161K|ML|MT|HG
paper: OpenAssistant Conversations - Democratizing Large Language Model Alignment
License: Apache License 2.0
The Instruction-following Datasets / (RyokoAI/ShareGPT52K)|90K|ML|MT|SI
License: CC0 1.0 Universal
The Instruction-following Datasets / (zjunlp/Mol-Instructions)|2043K|ML|MT|MIX
paper: Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
License: CC BY 4.0
Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (Anthropic/hh-rlhf)|22k|EN|MT|MIX
paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
License: MIT License
Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (Anthropic/hh-rlhf)|22k|EN|MT|MIX / Related:
(Hello-SimpleAI/HC3)|24K|EN|MT|MIX
(Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX
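As a pointer for consuming preference data such as hh-rlhf: each record pairs a chosen with a rejected conversation transcript, which is the pairwise signal a reward model is typically trained on. A minimal sketch of loading it with the Hugging Face datasets library, assuming the dataset id Anthropic/hh-rlhf on the Hub:

```python
# Sketch: load Anthropic/hh-rlhf preference pairs with Hugging Face `datasets`.
# Each example has a "chosen" and a "rejected" conversation transcript.
from datasets import load_dataset

dataset = load_dataset("Anthropic/hh-rlhf", split="train")
example = dataset[0]
print(example["chosen"][:200])    # preferred transcript
print(example["rejected"][:200])  # dispreferred transcript
```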
Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (thu-coai/Safety-Prompts)|100k|CN|MT|MIX
paper: Safety Assessment of Chinese Large Language Models
License: Apache License 2.0
Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (HuggingFaceH4/stack-exchange-preferences)|10741k|EN|TS|HG
paper: A General Language Assistant as a Laboratory for Alignment
License: CC BY-SA 4.0
Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (HuggingFaceH4/stack-exchange-preferences)|10741k|EN|TS|HG / Related:
stack-exchange-paired
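The related stack-exchange-paired set reorganizes the same scored answers into chosen/rejected pairs. A rough sketch of that kind of transformation follows; the field names (question, answers, text, score) are hypothetical placeholders rather than the actual schema of either dataset.

```python
# Rough sketch: convert a question with several scored answers into pairwise
# preference examples (higher score = chosen). Field names are hypothetical
# placeholders, not the real stack-exchange-preferences schema.
from itertools import combinations

def to_preference_pairs(record: dict) -> list[dict]:
    pairs = []
    for a, b in combinations(record["answers"], 2):
        if a["score"] == b["score"]:
            continue  # equal scores carry no preference signal
        chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
        pairs.append({
            "prompt": record["question"],
            "chosen": chosen["text"],
            "rejected": rejected["text"],
        })
    return pairs

print(to_preference_pairs({
    "question": "How do I reverse a list in Python?",
    "answers": [
        {"text": "Use list.reverse() or reversed().", "score": 12},
        {"text": "Copy it by hand in a for loop.", "score": 2},
    ],
}))
```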
Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|MT|MIX
paper: Instruction Tuning with GPT-4
License: CC BY-NC 4.0
Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|MT|MIX / Related:
(tatsu-lab/Alpaca)|52K|EN|MT|SI
Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (Reddit/eli5)|500k|EN|MT|HG
Summary: This dataset contains questions and answers from the subreddits r/explainlikeimfive, r/askscience, and r/AskHistorians.
Related: eli5 dataset, a transformation of the dataset into a similar format.