MNBVC

Chinese text corpus

A massive corpus of Chinese text data covering various forms and styles

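MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, including even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.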

GitHub

4k stars
66 watching
249 forks
last commit: 3 months ago
Topics: chinese, chinese-language, chinese-nlp, chinese-simplified, corpus-data, nlp, nlp-machine-learning

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| datawhalechina/llm-cookbook | A comprehensive guide to building applications with Large Language Models (LLMs) for developers | 12,377 |
| kaqijiang/auto-gpt-zh | An experimental application showcasing GPT-4's capabilities through automation and AI-driven workflows. | 2,410 |
| blinkdl/ai-writer | Generates Chinese novel text using a pre-trained language model | 3,001 |
| ymcui/chinese-bert-wwm | Develops and publishes pre-trained Chinese language models using Whole Word Masking technology. | 9,746 |
| fudannlp/fnlp | A toolkit for Chinese natural language processing tasks | 2,648 |
| exacity/deeplearningbook-chinese | Translation of a popular deep learning book into Chinese, aiming to improve accuracy and accessibility. | 35,890 |
| thudm/glm-4 | A large language model designed for multilingual and multimodal chat applications with advanced features such as long-text reasoning and high-performance inference. | 5,525 |
| thudm/visualglm-6b | A multimodal dialog language model that generates responses based on images and text | 4,110 |
| michael-wzhu/shennong-tcm-llm | Develops and deploys a large language model for Chinese traditional medicine applications | 316 |
| liaokongvfx/langchain-chinese-getting-started-guide | Guides users to build applications using LangChain's framework and integrate it with Large Language Models (LLMs) for tasks like text generation, summarization, and search. | 7,533 |
| ourongxing/chatgpt-vercel | An open-source chatbot project built on Solid.js and OpenAI's GPT technology, with features like PWA support and customizable prompts. | 3,200 |
| tencent/kbone | An adapter layer for running web frontend code in WeChat Mini Programs | 4,811 |
| opengoofy/hippo4j | A framework for managing asynchronous threads with dynamic configuration and real-time monitoring | 5,631 |
| morizeyao/gpt2-chinese | A repository providing a Chinese version of the GPT2 training code, utilizing BERT tokenizer. | 7,488 |
| idea-ccnl/fengshenbang-lm | A comprehensive, user-centered ecosystem of pre-trained NLP models for the Chinese language | 4,049 |