CLUECorpus2020

Corpus

A large-scale Chinese corpus for pre-training language models.

Large-scale Pre-training Corpus for Chinese (100 GB Chinese pre-training corpus)

GitHub

927 stars

20 watching

81 forks

last commit: almost 4 years ago

albert, bert, chinese, chinese-corpus, corpus, datasets, nlp, pretrain, roberta

arxiv.org/abs/2003.01355

Related projects:

Repository	Description	Stars
cluebenchmark/cluepretrainedmodels	Provides pre-trained models for Chinese language tasks with improved performance and smaller model sizes compared to existing models.	806
cluebenchmark/electra	Trains and evaluates a Chinese language model using adversarial training on a large corpus.	140
clue-ai/promptclue	A pre-trained language model for multiple natural language processing tasks with support for few-shot learning and transfer learning.	656
cluebenchmark/supercluelyb	A benchmarking platform for evaluating Chinese general-purpose models through anonymous, random battles.	143
brightmart/xlnet_zh	Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks.	230
clue-ai/chatyuan	A large language model that supports dialogue in multiple languages.	1,903
crownpku/small-chinese-corpus	A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering.	529
cluebenchmark/pclue	A large-scale dataset for training models to perform multiple tasks and zero-shot learning in natural language processing.	473
clue-ai/chatyuan-7b	An updated version of a large language model designed to improve performance on multiple tasks and datasets.	13
shannonai/chinesebert	A deep learning model that incorporates visual and phonetic features of Chinese characters to better capture nuances of the Chinese language.	545
several27/fakenewscorpus	A large dataset of news articles with labeled categories for training fake news recognition algorithms.	385
soloice/chinese-character-recognition	Demonstrates how to build and train a convolutional neural network (CNN) to recognize Chinese characters.	200
hkust-knowcomp/jwe	Trains and evaluates word embeddings for Chinese words, characters, and fine-grained subcharacter components.	99
ymcui/macbert	Improves pre-trained Chinese language models by incorporating a correction task to alleviate inconsistency issues with downstream tasks.	646
zake7749/gossiping-chinese-corpus	A collection of question-answer pairs extracted from online Chinese forums.	236