MNBVC
Chinese Corpus Collection
Collects and provides access to a vast corpus of Chinese text data from various sources
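MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus. It aims to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, down to "Martian script" (火星文) internet writing. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts and dialogue, forum posts, wiki pages, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.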
4k stars
65 watching
246 forks
last commit: 19 days ago
chinese chinese-language chinese-nlp chinese-simplified corpus-data nlp nlp-machine-learning
Related projects:
Repository | Description | Stars |
---|---|---|
datawhalechina/llm-cookbook | A comprehensive guide to building applications with Large Language Models (LLMs) for developers | 11,929 |
kaqijiang/auto-gpt-zh | An experimental application showcasing GPT-4's capabilities through automation and AI-driven workflows. | 2,404 |
blinkdl/ai-writer | Generates Chinese novel text using a pre-trained language model | 2,976 |
ymcui/chinese-bert-wwm | Develops and publishes pre-trained Chinese language models using Whole Word Masking technology. | 9,687 |
fudannlp/fnlp | A toolkit for Chinese natural language processing tasks | 2,647 |
exacity/deeplearningbook-chinese | Translation of a popular deep learning book into Chinese, aiming to improve accuracy and accessibility. | 35,804 |
thudm/glm-4 | Develops and releases pre-trained models for conversational AI tasks with enhanced capabilities on long text generation, multimodal interaction, and domain adaptation. | 5,277 |
thudm/visualglm-6b | A multimodal dialog language model that generates responses based on images and text | 4,094 |
michael-wzhu/shennong-tcm-llm | Develops and deploys a large language model for traditional Chinese medicine applications | 299 |
liaokongvfx/langchain-chinese-getting-started-guide | Guides users to build applications using LangChain's framework and integrate it with Large Language Models (LLMs) for tasks like text generation, summarization, and search. | 7,470 |
ourongxing/chatgpt-vercel | An open-source chatbot project built on Solid.js and OpenAI's GPT technology, with features like PWA support and customizable prompts. | 3,194 |
tencent/kbone | An adapter layer for running web frontend code in WeChat Mini Programs | 4,802 |
opengoofy/hippo4j | A framework for managing asynchronous threads with dynamic configuration and real-time monitoring | 5,595 |
morizeyao/gpt2-chinese | Training code for Chinese versions of the GPT2 language model, using either a BERT tokenizer or a BPE model. | 7,467 |
idea-ccnl/fengshenbang-lm | A comprehensive, user-centered ecosystem of pre-trained NLP models for the Chinese language | 4,022 |