Small-Chinese-Corpus

Chinese text dataset suite

A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering.

Some useful small Chinese corpus datasets

GitHub

529 stars

33 watching

162 forks

last commit: over 6 years ago

Linked from 1 awesome list

chinese-nlpcorpus

Backlinks from these awesome lists:

endymecy/awesome-deeplearning-resources

Related projects:

Repository	Description	Stars
cluebenchmark/cluecorpus2020	A large-scale Chinese corpus for pre-training language models.	927
ymcui/cmrc2018	A collection of data for evaluating Chinese machine reading comprehension systems.	419
zake7749/gossiping-chinese-corpus	A collection of question-answer pairs extracted from online Chinese forums.	236
cluebenchmark/cluepretrainedmodels	Provides pre-trained models for Chinese language tasks with improved performance and smaller model sizes compared to existing models.	806
chinese-poetry/chinese-poetry	A comprehensive JSON-based repository of Chinese poetry and related texts, aiming to facilitate development of applications using these ancient texts.	48,381
cluebenchmark/pclue	A large-scale dataset for training models to perform multiple tasks and zero-shot learning in natural language processing.	473
karthikncode/nlp-datasets	A curated list of Natural Language Processing datasets used to train and evaluate NLP models.	919
littleyuyu/stackoverflow-question-code-dataset	A collection of mined question-code pairs from Stack Overflow used for training and testing AI models.	166
hit-scir/chinese-mixtral-8x7b	An implementation of a large language model for Chinese text processing, built on an MoE (Mixture-of-Experts) architecture with an expanded vocabulary.	645
matbahasa/talpco	A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research.	49
nkcs-iclab/linglong	A pre-trained Chinese language model with a modest parameter count, designed to be accessible and useful for researchers with limited computing resources.	18
ydli-ai/csl	A large-scale dataset for natural language processing tasks focused on Chinese scientific literature, providing tools and benchmarks for NLP research.	582
thu-coai/cdial-gpt	A large-scale Chinese conversation dataset and pre-trained dialog models for text generation.	1,799
brightmart/xlnet_zh	Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks.	230
ymcui/chinese-xlnet	Provides pre-trained models for Chinese natural language processing tasks using the XLNet architecture.	1,652