Small-Chinese-Corpus

Chinese text dataset suite

A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering.

Some useful Chinese corpus datasets (中文语料小数据, "small Chinese corpus data")

GitHub

531 stars
33 watching
162 forks
last commit: over 4 years ago
Linked from 1 awesome list

chinese-nlp, corpus

Related projects:

| Repository | Description | Stars |
|---|---|---:|
| cluebenchmark/cluecorpus2020 | A large-scale pre-training corpus for Chinese language models | 925 |
| ymcui/cmrc2018 | A collection of data for evaluating Chinese machine reading comprehension systems | 415 |
| zake7749/gossiping-chinese-corpus | A collection of question-answer pairs extracted from online Chinese forums | 238 |
| cluebenchmark/cluepretrainedmodels | Pre-trained models for Chinese language tasks with improved performance and smaller model sizes compared to existing models | 804 |
| chinese-poetry/chinese-poetry | A comprehensive database of Chinese poetry and related classical texts | 48,171 |
| cluebenchmark/pclue | A large-scale dataset for training models on multi-task and zero-shot learning in natural language processing | 468 |
| karthikncode/nlp-datasets | A curated list of natural language processing datasets used to train and evaluate NLP models | 919 |
| littleyuyu/stackoverflow-question-code-dataset | A collection of mined question-code pairs from Stack Overflow used for training and testing AI models | 165 |
| hit-scir/chinese-mixtral-8x7b | A large language model for Chinese text processing, built on a Mixture-of-Experts (MoE) architecture and incorporating a vast vocabulary | 641 |
| matbahasa/talpco | A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research | 49 |
| nkcs-iclab/linglong | A pre-trained Chinese language model with a modest parameter count, designed to be accessible to researchers with limited computing resources | 17 |
| ydli-ai/csl | A large-scale dataset of Chinese scientific literature, with tools and benchmarks for NLP research | 568 |
| thu-coai/cdial-gpt | A large-scale Chinese conversation dataset and pre-trained dialog models for text generation | 1,782 |
| brightmart/xlnet_zh | A large Chinese language model trained on massive data, released as a pre-trained model for downstream tasks | 230 |
| ymcui/chinese-xlnet | Pre-trained models for Chinese natural language processing tasks using the XLNet architecture | 1,653 |