CLUECorpus2020

Chinese corpus

A large-scale pre-training corpus for Chinese language models

Large-scale Pre-training Corpus for Chinese: 100 GB Chinese pre-training corpus

GitHub

925 stars
21 watching
82 forks
last commit: about 2 years ago
Topics: albert, bert, chinese, chinese-corpus, corpus, datasets, nlp, pretrain, roberta

Related projects:

Repository | Description | Stars
cluebenchmark/cluepretrainedmodels | Provides pre-trained models for Chinese language tasks with improved performance and smaller model sizes compared to existing models. | 804
cluebenchmark/electra | Trains and evaluates a Chinese language model using adversarial training on a large corpus. | 140
clue-ai/promptclue | A pre-trained language model for multiple natural language processing tasks with support for few-shot learning and transfer learning. | 654
cluebenchmark/supercluelyb | A benchmarking platform for evaluating Chinese general-purpose models through anonymous, random battles. | 141
brightmart/xlnet_zh | Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks. | 230
clue-ai/chatyuan | A large language model for dialogue support in multiple languages. | 1,902
crownpku/small-chinese-corpus | A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering. | 531
cluebenchmark/pclue | A large-scale dataset for training models to perform multiple tasks and zero-shot learning in natural language processing. | 468
clue-ai/chatyuan-7b | An updated version of a large language model designed to improve performance on multiple tasks and datasets. | 13
shannonai/chinesebert | A deep learning model that incorporates visual and phonetic features of Chinese characters to better capture nuances of the Chinese language. | 542
several27/fakenewscorpus | A large dataset of news articles with labeled categories for training fake news recognition algorithms. | 385
soloice/chinese-character-recognition | Demonstrates how to build and train a convolutional neural network (CNN) to recognize Chinese characters. | 200
hkust-knowcomp/jwe | Trains and evaluates word embeddings for Chinese words, characters, and fine-grained subcharacter components. | 99
ymcui/macbert | Improves pre-trained Chinese language models by incorporating a correction task to alleviate inconsistency issues with downstream tasks. | 645
zake7749/gossiping-chinese-corpus | A collection of question-answer pairs extracted from online Chinese forums. | 238