MNBVC

Chinese text corpus

A massive corpus of Chinese text data covering various forms and styles

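MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even text written in "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.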

GitHub

4k stars

66 watching

249 forks

last commit: over 1 year ago

chinese, chinese-language, chinese-nlp, chinese-simplified, corpus-data, nlp, nlp-machine-learning

Related projects:

Repository	Description	Stars
datawhalechina/llm-cookbook	A comprehensive guide to building applications with Large Language Models (LLMs) for developers	12,377
kaqijiang/auto-gpt-zh	An experimental application showcasing GPT-4's capabilities through automation and AI-driven workflows.	2,410
blinkdl/ai-writer	Generates Chinese novel text using a pre-trained language model	3,001
ymcui/chinese-bert-wwm	Develops and publishes pre-trained Chinese language models using Whole Word Masking technology.	9,746
fudannlp/fnlp	A toolkit for Chinese natural language processing tasks	2,648
exacity/deeplearningbook-chinese	Translation of a popular deep learning book into Chinese, aiming to improve accuracy and accessibility.	35,890
thudm/glm-4	A large language model designed for multilingual and multimodal chat applications with advanced features such as long-text reasoning and high-performance inference.	5,525
thudm/visualglm-6b	A multimodal dialog language model that generates responses based on images and text	4,110
michael-wzhu/shennong-tcm-llm	Develops and deploys a large language model for Chinese traditional medicine applications	316
liaokongvfx/langchain-chinese-getting-started-guide	Guides users to build applications using LangChain's framework and integrate it with Large Language Models (LLMs) for tasks like text generation, summarization, and search.	7,533
ourongxing/chatgpt-vercel	An open-source chatbot project built on Solid.js and OpenAI's GPT technology, with features like PWA support and customizable prompts.	3,200
tencent/kbone	An adapter layer for running web frontend code in WeChat Mini Programs	4,811
opengoofy/hippo4j	A framework for managing asynchronous threads with dynamic configuration and real-time monitoring	5,631
morizeyao/gpt2-chinese	A repository providing a Chinese version of the GPT2 training code, utilizing BERT tokenizer.	7,488
idea-ccnl/fengshenbang-lm	A comprehensive, user-centered ecosystem of pre-trained NLP models for the Chinese language	4,049