awesome-instruction-dataset

Instruction datasets

A collection of datasets to train instruction-following language models

A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca)

last commit: almost 2 years ago

Linked from 1 awesome list

awsome-lists, datasets, gpt-3, gpt-4, instruction-following, instruction-tuning, language-model, llama

awesome-text/visual-instruction-tuning-dataset
nichtdax/awesome-totally-open-chatgpt	4,556	over 2 years ago	: A codebase of totally open alternatives to ChatGPT
Table of Contents / The Multi-modal Instruction Dataset
(Vision-CAIR/MiniGPT-4)\|5K\|EN\|MT\|MIX
(haotian-liu/LLaVA)\|150K\|EN\|MT\|MIX
Table of Contents / The Instruction tuning Dataset
(tatsu-lab/Alpaca)\|52K\|EN\|MT\|SI	29,663	over 1 year ago
(gururise/Cleaned Alpaca)\|52K\|EN\|MT\|SI	1,525	over 2 years ago
(XueFuzhao/InstructionWild)\|52K\|EN\|CN\|MT\|SI	455	over 1 year ago
(JosephusCheung/GuanacoDataset)\|534K\|ML\|MT\|SI
(Hello-SimpleAI/HC3)\|24K\|EN\|MT\|MIX
(Hello-SimpleAI/HC3-Chinese)\|13K\|CN\|MT\|MIX
(allenai/prosocial-dialog)\|58K\|EN\|MT\|MIX
(allenai/natural-instructions)\|1.6K\|ML\|MT\|HG	963	almost 2 years ago
(bigscience/xP3)\|N/A\|ML\|MT\|MIX
(nomic-ai/gpt4all)\|437k\|EN\|MT\|COL	71,176	11 months ago
(PhoebusSi/Alpaca-CoT)\|500k\|ML\|MT\|COL
(google-research/FLAN)\|N/A\|EN\|MT\|MIX	1,484	about 1 year ago
(thunlp/UltraChat)\|280k\|EN\|TS\|MIX	2,276	over 1 year ago
(cascip/ChatAlpaca)\|10k\|EN\|MT\|MIX	164	over 2 years ago
(YeungNLP/firefly-train-1.1M)\|1100k\|CN\|MT\|COL
(orhonovich/unnatural-instructions)\|240K\|EN\|MT\|MIX	176	over 2 years ago
(Instruction-Tuning-with-GPT-4/GPT-4-LLM)\|52K\|EN\|CN\|MT\|SI	4,244	over 2 years ago
(databrickslabs/dolly)\|15K\|EN\|MT\|HG	10,820	over 2 years ago
(OpenAssistant/oasst1)\|161K\|ML\|MT\|HG
(RyokoAI/ShareGPT52K)\|90K\|ML\|MT\|SI
(zjunlp/Mol-Instructions)\|2043K\|ML\|MT\|MIX
Table of Contents / Reinforcement Learning from Human Feedback (RLHF) Datasets
(Anthropic/hh-rlhf)\|22k\|EN\|MT\|MIX
(thu-coai/Safety-Prompts)\|100k\|CN\|MT\|MIX	880	over 1 year ago
(HuggingFaceH4/stack-exchange-preferences)\|10741k\|EN\|TS\|HG
(stanfordnlp/SHP)\|385k\|EN\|MT\|HG
(Instruction-Tuning-with-GPT-4/GPT-4-LLM)\|52K\|EN\|MT\|MIX	4,244	over 2 years ago
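Each entry above follows the same shape: a parenthesized `owner/name`, then pipe-separated fields for size and tags, optionally followed by tab-separated star counts and dates. A minimal sketch of a parser for that shape is given below. The `Entry` class, the `parse_entry` function, and the grouping of codes into language, task, and generation-method tags are assumptions made for illustration; this page does not define the codes, so the comments reflect one plausible reading of them.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Tag vocabularies observed in the listing. Their grouping is an assumption:
# language codes, task-scope codes, and generation-method codes.
LANG_TAGS = {"EN", "CN", "ML"}
TASK_TAGS = {"MT", "TS"}
GEN_TAGS = {"HG", "SI", "MIX", "COL"}


@dataclass
class Entry:
    name: str                 # e.g. "tatsu-lab/Alpaca"
    size: Optional[str]       # e.g. "52K" or "N/A"; None when the entry has no size
    langs: List[str]          # one or more language codes
    task: Optional[str]       # task-scope code, if present
    gen: Optional[str]        # generation-method code, if present


def parse_entry(line: str) -> Entry:
    """Parse one catalog line such as '(owner/name)\\|52K\\|EN\\|MT\\|SI'."""
    # Only the first tab-separated column carries the tagged entry.
    head = line.split("\t")[0].strip()
    # Pipes may be escaped (\|) or bare (|) depending on how the page was exported.
    parts = [p.strip() for p in head.replace("\\|", "|").split("|")]
    match = re.fullmatch(r"\((.+)\)", parts[0])
    if match is None:
        raise ValueError(f"entry does not start with '(owner/name)': {line!r}")

    size: Optional[str] = None
    langs: List[str] = []
    task: Optional[str] = None
    gen: Optional[str] = None
    for field in parts[1:]:
        if field in LANG_TAGS:
            langs.append(field)
        elif field in TASK_TAGS:
            task = field
        elif field in GEN_TAGS:
            gen = field
        else:
            # Anything that is not a known tag is treated as the size field.
            size = field
    return Entry(match.group(1), size, langs, task, gen)
```

Classifying fields by vocabulary rather than by position keeps the sketch tolerant of entries that carry two language codes (such as `EN\|CN`) or omit the size altogether.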
The Multi-modal Instruction Datasets / (Vision-CAIR/MiniGPT-4)\|5K\|EN\|MT\|MIX
ChatCaptioner	457	over 2 years ago	Summary: A high-quality, well-aligned (e.g. more detailed image description) image-text dataset created using conversation between two bots, similar to ChatCaptioner. This image-text dataset can then be used with a predefined instruction template for image-instruction-answer finetuning.
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models	25,490	about 1 year ago	paper:
BSD 3-Clause			License:
The Multi-modal Instruction Datasets / (Vision-CAIR/MiniGPT-4)\|5K\|EN\|MT\|MIX / Related:
Interactive ChatCaptioner for image and video	457	over 2 years ago
The Multi-modal Instruction Datasets / (haotian-liu/LLaVA)\|150K\|EN\|MT\|MIX
Visual Instruction Tuning			paper:
CC BY-NC 4.0			License:
The Multi-modal Instruction Datasets / (sunrainyg/InstructCV)\|EN\|MT\|MIX (https://github.com/AlaaLab/InstructCV)
InstructCV			paper:
CC BY-NC 4.0			License:
The Instruction-following Datasets / (tatsu-lab/Alpaca)\|52K\|EN\|MT\|SI
alpaca-blog			paper:
CC BY-NC 4.0			License:
The Instruction-following Datasets / (gururise/Cleaned Alpaca)\|52K\|EN\|MT\|SI
CC BY-NC 4.0			License:
The Instruction-following Datasets / (JosephusCheung/GuanacoDataset)\|534K\|ML\|MT\|SI
GPL-3.0			License:
The Instruction-following Datasets / (Hello-SimpleAI/HC3)\|24K\|EN\|MT\|MIX
How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection			paper:
CC BY-SA 4.0			License:
The Instruction-following Datasets / (Hello-SimpleAI/HC3-Chinese)\|13K\|CN\|MT\|MIX
How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection			paper:
CC BY-SA 4.0			License:
The Instruction-following Datasets / (allenai/prosocial-dialog)\|58K\|EN\|MT\|MIX
ProsocialDialog: A Prosocial Backbone for Conversational Agents			paper:
CC BY 4.0			License:
The Instruction-following Datasets / (allenai/natural-instructions)\|1.6K\|ML\|MT\|HG
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks			paper:
Apache License 2.0			License:
The Instruction-following Datasets / (bigscience/xP3)\|N/A\|ML\|MT\|MIX
Crosslingual Generalization through Multitask Finetuning			paper:
Apache License 2.0			License:
The Instruction-following Datasets / (PhoebusSi/Alpaca-CoT)\|500k\|ML\|MT\|COL
Github Repo	2,640	almost 2 years ago	Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: the repository continuously collects and combines various instruction-tuning datasets.
Apache License 2.0			License:
The Instruction-following Datasets / (nomic-ai/gpt4all)\|437k\|EN\|MT\|COL
laion/OIG			Summary: gpt4all leverages three publicly available datasets: 1. , 2. 3. subset of
GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo			paper:
MIT License			License:
The Instruction-following Datasets / (teknium1/GPTeacher)\|20k+\|EN\|MT\|SI
MIT License			License:
The Instruction-following Datasets / (google-research/FLAN)\|N/A\|EN\|MT\|MIX
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning			paper:
Apache License 2.0			License:
The Instruction-following Datasets / (thunlp/UltraChat)\|280k\|EN\|TS\|MIX
CC BY-NC 4.0			License:
The Instruction-following Datasets / (cascip/ChatAlpaca)\|10k\|EN\|MT\|MIX
Apache License 2.0			License:
(tatsu-lab/Alpaca)\|52K\|EN\|MT\|SI	29,663	over 1 year ago	Related:
The Instruction-following Datasets / (orhonovich/unnatural-instructions)\|240K\|EN\|MT\|MIX
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor			paper:
MIT License			License:
The Instruction-following Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)\|52K\|EN\|CN\|MT\|SI
Instruction Tuning with GPT-4			paper:
CC BY-NC 4.0			License:
The Instruction-following Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)\|52K\|EN\|CN\|MT\|SI / Related:
(tatsu-lab/Alpaca)\|52K\|EN\|MT\|SI	29,663	over 1 year ago
(orhonovich/unnatural-instructions)\|240K\|EN\|MT\|MIX	176	over 2 years ago
The Instruction-following Datasets / (databrickslabs/dolly)\|15K\|EN\|MT\|HG
Free Dolly			paper:
CC BY-SA 3.0			License:
The Instruction-following Datasets / (OpenAssistant/oasst1)\|161K\|ML\|MT\|HG
OpenAssistant Conversations - Democratizing Large Language Model Alignment			paper:
Apache License 2.0			License:
The Instruction-following Datasets / (RyokoAI/ShareGPT52K)\|90K\|ML\|MT\|SI
CC0 1.0 Universal			License:
The Instruction-following Datasets / (zjunlp/Mol-Instructions)\|2043K\|ML\|MT\|MIX
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models			paper:
CC BY 4.0	255	about 1 year ago	License:
Reinforcement Learning from Human Feedback (RLHF) \| Red-Teaming Datasets / (Anthropic/hh-rlhf)\|22k\|EN\|MT\|MIX
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback			paper:
MIT License			License:
Reinforcement Learning from Human Feedback (RLHF) \| Red-Teaming Datasets / (Anthropic/hh-rlhf)\|22k\|EN\|MT\|MIX / Related:
(Hello-SimpleAI/HC3)\|24K\|EN\|MT\|MIX
(Hello-SimpleAI/HC3-Chinese)\|13K\|CN\|MT\|MIX
Reinforcement Learning from Human Feedback (RLHF) \| Red-Teaming Datasets / (thu-coai/Safety-Prompts)\|100k\|CN\|MT\|MIX
Safety Assessment of Chinese Large Language Models			paper:
Apache License 2.0			License:
Reinforcement Learning from Human Feedback (RLHF) \| Red-Teaming Datasets / (HuggingFaceH4/stack-exchange-preferences)\|10741k\|EN\|TS\|HG
A General Language Assistant as a Laboratory for Alignment			paper:
CC BY-SA 4.0			License:
Reinforcement Learning from Human Feedback (RLHF) \| Red-Teaming Datasets / (HuggingFaceH4/stack-exchange-preferences)\|10741k\|EN\|TS\|HG / Related:
stack-exchange-paired
Reinforcement Learning from Human Feedback (RLHF) \| Red-Teaming Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)\|52K\|EN\|MT\|MIX
Instruction Tuning with GPT-4			paper:
CC BY-NC 4.0			License:
Reinforcement Learning from Human Feedback (RLHF) \| Red-Teaming Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)\|52K\|EN\|MT\|MIX / Related:
(tatsu-lab/Alpaca)\|52K\|EN\|MT\|SI	29,663	over 1 year ago
Reinforcement Learning from Human Feedback (RLHF) \| Red-Teaming Datasets / (Reddit/eli5)\|500k\|EN\|MT\|HG
r/explainlikeimfive			summary: This dataset contains questions and answers from the subreddits , and
eli5 dataset			Related: a transformation of the dataset in a format similar to

Backlinks from these awesome lists:

nichtdax/awesome-totally-open-chatgpt
