MNBVC
Chinese text corpus
A massive corpus of Chinese text data covering various forms and styles
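MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even "Martian script" (火星文) internet slang. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.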
4k stars
66 watching
249 forks
last commit: 3 months ago
Tags: chinese, chinese-language, chinese-nlp, chinese-simplified, corpus-data, nlp, nlp-machine-learning
Related projects:
| Description | Stars |
|---|---|
| A comprehensive guide to building applications with Large Language Models (LLMs) for developers | 12,377 |
| An experimental application showcasing GPT-4's capabilities through automation and AI-driven workflows. | 2,410 |
| Generates Chinese novel text using a pre-trained language model | 3,001 |
| Develops and publishes pre-trained Chinese language models using Whole Word Masking technology. | 9,746 |
| A toolkit for Chinese natural language processing tasks | 2,648 |
| Translation of a popular deep learning book into Chinese, aiming to improve accuracy and accessibility. | 35,890 |
| A large language model designed for multilingual and multimodal chat applications with advanced features such as long-text reasoning and high-performance inference. | 5,525 |
| A multimodal dialog language model that generates responses based on images and text | 4,110 |
| Develops and deploys a large language model for Chinese traditional medicine applications | 316 |
| Guides users to build applications using LangChain's framework and integrate it with Large Language Models (LLMs) for tasks like text generation, summarization, and search. | 7,533 |
| An open-source chatbot project built on Solid.js and OpenAI's GPT technology, with features like PWA support and customizable prompts. | 3,200 |
| An adapter layer for running web frontend code in WeChat Mini Programs | 4,811 |
| A framework for managing asynchronous threads with dynamic configuration and real-time monitoring | 5,631 |
| A repository providing a Chinese version of the GPT2 training code, utilizing BERT tokenizer. | 7,488 |
| A comprehensive, user-centered ecosystem of pre-trained NLP models for the Chinese language | 4,049 |