Small-Chinese-Corpus
Chinese text dataset suite
A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering.
Some useful small Chinese corpus datasets
531 stars
33 watching
162 forks
last commit: over 4 years ago
Linked from 1 awesome list: chinese-nlpcorpus
Related projects:
Repository | Description | Stars |
---|---|---|
cluebenchmark/cluecorpus2020 | A large-scale pre-training corpus for Chinese language models | 925 |
ymcui/cmrc2018 | A collection of data for evaluating Chinese machine reading comprehension systems | 415 |
zake7749/gossiping-chinese-corpus | A collection of question-answer pairs extracted from online Chinese forums | 238 |
cluebenchmark/cluepretrainedmodels | Provides pre-trained models for Chinese language tasks with improved performance and smaller model sizes compared to existing models. | 804 |
chinese-poetry/chinese-poetry | A comprehensive database of Chinese poetry and related texts, providing structured data for use in software development projects. | 48,210 |
cluebenchmark/pclue | A large-scale dataset for training models to perform multiple tasks and zero-shot learning in natural language processing. | 468 |
karthikncode/nlp-datasets | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |
littleyuyu/stackoverflow-question-code-dataset | A collection of mined question-code pairs from Stack Overflow used for training and testing AI models | 165 |
hit-scir/chinese-mixtral-8x7b | An implementation of a large language model for Chinese text processing, focusing on a MoE (Mixture of Experts) architecture and incorporating a vast vocabulary. | 641 |
matbahasa/talpco | A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research. | 49 |
nkcs-iclab/linglong | A pre-trained Chinese language model with a modest parameter count, designed to be accessible and useful for researchers with limited computing resources. | 17 |
ydli-ai/csl | A large-scale dataset for natural language processing tasks focused on Chinese scientific literature, providing tools and benchmarks for NLP research. | 568 |
thu-coai/cdial-gpt | A large-scale Chinese conversation dataset and pre-trained dialog models for text generation | 1,782 |
brightmart/xlnet_zh | Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks | 230 |
ymcui/chinese-xlnet | Provides pre-trained models for Chinese natural language processing tasks using the XLNet architecture | 1,653 |