MNBVC

Chinese Corpus Collection

Collects and provides access to a vast corpus of Chinese text data from various sources

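MNBVC (Massive Never-ending BT Vast Chinese corpus) is an ultra-large-scale Chinese corpus. Its target is to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even text written in "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki entries, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.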

GitHub

4k stars
65 watching
246 forks
last commit: 19 days ago
chinese, chinese-language, chinese-nlp, chinese-simplified, corpus-data, nlp, nlp-machine-learning

Related projects:

Repository | Description | Stars
datawhalechina/llm-cookbook | A comprehensive guide to building applications with Large Language Models (LLMs) for developers | 11,929
kaqijiang/auto-gpt-zh | An experimental application showcasing GPT-4's capabilities through automation and AI-driven workflows. | 2,404
blinkdl/ai-writer | Generates Chinese novel text using a pre-trained language model | 2,976
ymcui/chinese-bert-wwm | Develops and publishes pre-trained Chinese language models using Whole Word Masking technology. | 9,687
fudannlp/fnlp | A toolkit for Chinese natural language processing tasks | 2,647
exacity/deeplearningbook-chinese | Translation of a popular deep learning book into Chinese, aiming to improve accuracy and accessibility. | 35,804
thudm/glm-4 | Develops and releases pre-trained models for conversational AI tasks with enhanced capabilities on long text generation, multimodal interaction, and domain adaptation. | 5,277
thudm/visualglm-6b | A multimodal dialog language model that generates responses based on images and text | 4,094
michael-wzhu/shennong-tcm-llm | Develops and deploys a large language model for Chinese traditional medicine applications | 299
liaokongvfx/langchain-chinese-getting-started-guide | Guides users to build applications using LangChain's framework and integrate it with Large Language Models (LLMs) for tasks like text generation, summarization, and search. | 7,470
ourongxing/chatgpt-vercel | An open-source chatbot project built on Solid.js and OpenAI's GPT technology, with features like PWA support and customizable prompts. | 3,194
tencent/kbone | An adapter layer for running web frontend code in WeChat Mini Programs | 4,802
opengoofy/hippo4j | A framework for managing asynchronous threads with dynamic configuration and real-time monitoring | 5,595
morizeyao/gpt2-chinese | Training code for Chinese versions of the GPT2 language model using BERT tokenizer or BPE model. | 7,467
idea-ccnl/fengshenbang-lm | A comprehensive, user-centered ecosystem of pre-trained NLP models for the Chinese language | 4,022