awesome-instruction-dataset

A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca)

GitHub

1k stars
16 watching
59 forks
last commit: 9 months ago
Linked from 1 awesome list

awsome-lists, datasets, gpt-3, gpt-4, instruction-following, instruction-tuning, language-model, llama

awesome-text/visual-instruction-tuning-dataset

nichtdax/awesome-totally-open-chatgpt 4,507 over 1 year ago : A codebase of totally open alternatives to ChatGPT

Table of Contents / The Multi-modal Instruction Dataset

(Vision-CAIR/MiniGPT-4)|5K|EN|MT|MIX
(haotian-liu/LLaVA)|150K|EN|MT|MIX

Table of Contents / The Instruction tuning Dataset

(tatsu-lab/Alpaca)|52K|EN|MT|SI 29,380 3 months ago
(gururise/Cleaned Alpaca)|52K|EN|MT|SI 1,500 over 1 year ago
(XueFuzhao/InstructionWild)|52K|EN|CN|MT|SI 451 4 months ago
(JosephusCheung/GuanacoDataset)|534K|ML|MT|SI
(Hello-SimpleAI/HC3)|24K|EN|MT|MIX
(Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX
(allenai/prosocial-dialog)|58K|EN|MT|MIX
(allenai/natural-instructions)|1.6K|ML|MT|HG 950 10 months ago
(bigscience/xP3)|N/A|ML|MT|MIX
(nomic-ai/gpt4all)|437k|EN|MT|COL 69,594 9 days ago
(PhoebusSi/Alpaca-CoT)|500k|ML|MT|COL
(google-research/FLAN)|N/A|EN|MT|MIX 1,463 about 2 months ago
(thunlp/UltraChat)|280k|EN|TS|MIX 2,222 7 months ago
(cascip/ChatAlpaca)|10k|EN|MT|MIX 164 over 1 year ago
(YeungNLP/firefly-train-1.1M)|1100k|CN|MT|COL
(orhonovich/unnatural-instructions)|240K|EN|MT|MIX 175 over 1 year ago
(Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI 4,175 over 1 year ago
(databrickslabs/dolly)|15K|EN|MT|HG 10,811 over 1 year ago
(OpenAssistant/oasst1)|161K|ML|MT|HG
(RyokoAI/ShareGPT52K)|90K|ML|MT|SI
(zjunlp/Mol-Instructions)|2043K|ML|MT|MIX

Table of Contents / Reinforcement Learning from Human Feedback (RLHF) Datasets

(Anthropic/hh-rlhf)|22k|EN|MT|MIX
(thu-coai/Safety-Prompts)|100k|CN|MT|MIX 853 7 months ago
(HuggingFaceH4/stack-exchange-preferences)|10741k|EN|TS|HG
(stanfordnlp/SHP)|385k|EN|MT|HG
(Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|MT|MIX 4,175 over 1 year ago
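
The rows in the tables above follow a pipe-delimited pattern: repository, size, one or more language tags, a task tag, and a generation-method tag, sometimes followed by repository stats. A minimal Python sketch for parsing such rows is given below. The tag vocabularies (`LANG_TAGS`, `TASK_TAGS`, `GEN_TAGS`) are assumptions inferred from the rows themselves, and `Entry`, `parse_size`, and `parse_row` are hypothetical names chosen for illustration, not part of any listed project.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed tag vocabularies, inferred from the rows in this listing.
LANG_TAGS = {"EN", "CN", "ML"}
TASK_TAGS = {"MT", "TS"}
GEN_TAGS = {"SI", "HG", "MIX", "COL"}


@dataclass
class Entry:
    repo: str
    size: Optional[int]  # None when the row says N/A or has no size field
    langs: List[str]
    task: Optional[str]
    gen: Optional[str]


def parse_size(text: str) -> Optional[int]:
    """Convert '52K', '437k', '1.6K' or '20k+' to an integer count."""
    m = re.fullmatch(r"([\d.]+)[kK]\+?", text.strip())
    if m is None:
        return None
    return int(round(float(m.group(1)) * 1000))


def parse_row(row: str) -> Entry:
    """Parse one '(owner/name)|size|lang|task|gen' row."""
    m = re.match(r"\(([^)]+)\)\|(.+)", row.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {row!r}")
    repo, rest = m.group(1), m.group(2)
    fields = rest.split("|")
    # The last field may carry trailing stats, e.g. "SI 29,380 3 months ago".
    fields[-1] = fields[-1].split()[0]

    entry = Entry(repo=repo, size=None, langs=[], task=None, gen=None)
    for field in fields:
        if field in LANG_TAGS:
            entry.langs.append(field)
        elif field in TASK_TAGS:
            entry.task = field
        elif field in GEN_TAGS:
            entry.gen = field
        elif entry.size is None:
            entry.size = parse_size(field)
    return entry
```

Classifying each field by tag membership, rather than by position, lets the same function handle rows with two language tags (such as `EN|CN`) and rows with no size field.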

The Multi-modal Instruction Datasets / (Vision-CAIR/MiniGPT-4)|5K|EN|MT|MIX

Summary: A high-quality, well-aligned (e.g. more detailed image description) image-text dataset created from a conversation between two bots, similar to ChatCaptioner. This image-text dataset can then be combined with predefined instruction templates for image-instruction-answer finetuning.
paper: MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models 25,327 about 1 month ago
License: BSD 3-Clause
Interactive ChatCaptioner for image and video 450 over 1 year ago
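
The summary above mentions pairing an image-text dataset with predefined instruction templates to obtain image-instruction-answer examples. A minimal Python sketch of that step follows. The templates, the record field names, and the `build_records` helper are all hypothetical; the listing does not specify which templates or format MiniGPT-4 actually uses.

```python
import random
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical instruction templates, for illustration only.
TEMPLATES = [
    "Describe this image in detail.",
    "What can you see in this picture?",
    "Write a detailed caption for the image.",
]


def build_records(
    pairs: Iterable[Tuple[str, str]],
    templates: Sequence[str] = TEMPLATES,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Turn (image_path, caption) pairs into image-instruction-answer records."""
    rng = random.Random(seed)  # seeded so the template assignment is repeatable
    records = []
    for image_path, caption in pairs:
        records.append(
            {
                "image": image_path,
                "instruction": rng.choice(templates),
                "answer": caption.strip(),
            }
        )
    return records


if __name__ == "__main__":
    demo = [("img/0001.jpg", "A brown dog runs across a grassy field. ")]
    for record in build_records(demo):
        print(record)
```

Sampling a template per example, instead of reusing a single prompt, is one common way to keep the finetuned model from overfitting to one phrasing of the instruction.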

The Multi-modal Instruction Datasets / (haotian-liu/LLaVA)|150K|EN|MT|MIX

paper: Visual Instruction Tuning
License: CC BY-NC 4.0

The Multi-modal Instruction Datasets / (sunrainyg/InstructCV)|EN|MT|MIX (https://github.com/AlaaLab/InstructCV)

paper: InstructCV
License: CC BY-NC 4.0

The Instruction-following Datasets / (tatsu-lab/Alpaca)|52K|EN|MT|SI

paper: alpaca-blog
License: CC BY-NC 4.0

The Instruction-following Datasets / (gururise/Cleaned Alpaca)|52K|EN|MT|SI

License: CC BY-NC 4.0

The Instruction-following Datasets / (JosephusCheung/GuanacoDataset)|534K|ML|MT|SI

License: GPL-3.0

The Instruction-following Datasets / (Hello-SimpleAI/HC3)|24K|EN|MT|MIX

paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
License: CC BY-SA 4.0

The Instruction-following Datasets / (Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX

paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
License: CC BY-SA 4.0

The Instruction-following Datasets / (allenai/prosocial-dialog)|58K|EN|MT|MIX

paper: ProsocialDialog: A Prosocial Backbone for Conversational Agents
License: CC BY 4.0

The Instruction-following Datasets / (allenai/natural-instructions)|1.6K|ML|MT|HG

paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
License: Apache License 2.0

The Instruction-following Datasets / (bigscience/xP3)|N/A|ML|MT|MIX

paper: Crosslingual Generalization through Multitask Finetuning
License: Apache License 2.0

The Instruction-following Datasets / (PhoebusSi/Alpaca-CoT)|500k|ML|MT|COL

Github Repo 2,589 10 months ago
Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: the repository continuously collects and combines various instruction tuning datasets.
License: Apache License 2.0

The Instruction-following Datasets / (nomic-ai/gpt4all)|437k|EN|MT|COL

Summary: gpt4all leverages three publicly available datasets, including laion/OIG.
paper: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
License: MIT License

The Instruction-following Datasets / (teknium1/GPTeacher)|20k+|EN|MT|SI

License: MIT License

The Instruction-following Datasets / (google-research/FLAN)|N/A|EN|MT|MIX

paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
License: Apache License 2.0

The Instruction-following Datasets / (thunlp/UltraChat)|280k|EN|TS|MIX

License: CC BY-NC 4.0

The Instruction-following Datasets / (cascip/ChatAlpaca)|10k|EN|MT|MIX

License: Apache License 2.0
Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI 29,380 3 months ago

The Instruction-following Datasets / (orhonovich/unnatural-instructions)|240K|EN|MT|MIX

paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
License: MIT License

The Instruction-following Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI

paper: Instruction Tuning with GPT-4
License: CC BY-NC 4.0
(tatsu-lab/Alpaca)|52K|EN|MT|SI 29,380 3 months ago
(orhonovich/unnatural-instructions)|240K|EN|MT|MIX 175 over 1 year ago

The Instruction-following Datasets / (databrickslabs/dolly)|15K|EN|MT|HG

paper: Free Dolly
License: CC BY-SA 3.0

The Instruction-following Datasets / (OpenAssistant/oasst1)|161K|ML|MT|HG

paper: OpenAssistant Conversations - Democratizing Large Language Model Alignment
License: Apache License 2.0

The Instruction-following Datasets / (RyokoAI/ShareGPT52K)|90K|ML|MT|SI

License: CC0 1.0 Universal

The Instruction-following Datasets / (zjunlp/Mol-Instructions)|2043K|ML|MT|MIX

paper: Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
License: CC BY 4.0 236 5 months ago

Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (Anthropic/hh-rlhf)|22k|EN|MT|MIX

paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
License: MIT License
(Hello-SimpleAI/HC3)|24K|EN|MT|MIX
(Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX

Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (thu-coai/Safety-Prompts)|100k|CN|MT|MIX

paper: Safety Assessment of Chinese Large Language Models
License: Apache License 2.0

Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (HuggingFaceH4/stack-exchange-preferences)|10741k|EN|TS|HG

paper: A General Language Assistant as a Laboratory for Alignment
License: CC BY-SA 4.0
stack-exchange-paired

Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|MT|MIX

paper: Instruction Tuning with GPT-4
License: CC BY-NC 4.0
(tatsu-lab/Alpaca)|52K|EN|MT|SI 29,380 3 months ago

Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets / (Reddit/eli5)|500k|EN|MT|HG

summary: This dataset contains questions and answers from several subreddits, including r/explainlikeimfive.
Related: eli5 dataset, a transformation of the dataset in a format similar to
