CLUECorpus2020
Corpus
A large-scale Chinese corpus for pre-training language models.
Large-scale pre-training corpus for Chinese (100 GB).
927 stars
20 watching
81 forks
last commit: over 2 years ago
Tags: albert, bert, chinese, chinese-corpus, corpus, datasets, nlp, pretrain, roberta
Related projects:
Repository | Description | Stars |
---|---|---|
cluebenchmark/cluepretrainedmodels | Provides pre-trained models for Chinese language tasks with improved performance and smaller model sizes compared to existing models. | 806 |
cluebenchmark/electra | Trains and evaluates a Chinese language model on a large corpus using ELECTRA-style replaced-token detection (a generator paired with a discriminator). | 140 |
clue-ai/promptclue | A pre-trained language model for multiple natural language processing tasks with support for few-shot learning and transfer learning. | 656 |
cluebenchmark/supercluelyb | A benchmarking platform for evaluating Chinese general-purpose models through anonymous, random battles. | 143 |
brightmart/xlnet_zh | Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks. | 230 |
clue-ai/chatyuan | Large language model for dialogue support in multiple languages. | 1,903 |
crownpku/small-chinese-corpus | A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering. | 529 |
cluebenchmark/pclue | A large-scale dataset for training models to perform multiple tasks and zero-shot learning in natural language processing. | 473 |
clue-ai/chatyuan-7b | An updated version of a large language model designed to improve performance on multiple tasks and datasets. | 13 |
shannonai/chinesebert | A deep learning model that incorporates visual and phonetic features of Chinese characters to improve its ability to understand Chinese language nuances. | 545 |
several27/fakenewscorpus | A large dataset of news articles with labeled categories to train fake news recognition algorithms. | 385 |
soloice/chinese-character-recognition | Demonstrates how to build and train a convolutional neural network (CNN) to recognize Chinese characters. | 200 |
hkust-knowcomp/jwe | Trains and evaluates word embeddings for Chinese words, characters, and fine-grained subcharacter components. | 99 |
ymcui/macbert | Improves pre-trained Chinese language models by incorporating a correction task to alleviate inconsistency issues with downstream tasks. | 646 |
zake7749/gossiping-chinese-corpus | A collection of question-answer pairs extracted from online Chinese forums. | 236 |