Small-Chinese-Corpus
Chinese text dataset suite
A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering.
Some useful small Chinese corpus datasets
529 stars
33 watching
162 forks
last commit: almost 5 years ago
Linked from 1 awesome list
chinese-nlp, corpus
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A large-scale Chinese corpus for pre-training language models. | 927 |
| | A collection of data for evaluating Chinese machine reading comprehension systems. | 419 |
| | A collection of question-answer pairs extracted from online Chinese forums. | 236 |
| | Provides pre-trained models for Chinese language tasks that are smaller than existing models and perform better. | 806 |
| | A comprehensive JSON-based repository of Chinese poetry and related texts, intended to support applications built on these classical texts. | 48,381 |
| | A large-scale dataset for training models on multi-task and zero-shot learning in natural language processing. | 473 |
| | A curated list of natural language processing datasets used to train and evaluate NLP models. | 919 |
| | A collection of mined question-code pairs from Stack Overflow used for training and testing AI models. | 166 |
| | An implementation of a large language model for Chinese text processing, built on a Mixture-of-Experts (MoE) architecture with a large vocabulary. | 645 |
| | A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research. | 49 |
| | A pre-trained Chinese language model with a modest parameter count, designed to be accessible and useful for researchers with limited computing resources. | 18 |
| | A large-scale dataset for natural language processing tasks focused on Chinese scientific literature, providing tools and benchmarks for NLP research. | 582 |
| | A large-scale Chinese conversation dataset and pre-trained dialog models for text generation. | 1,799 |
| | Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks. | 230 |
| | Provides pre-trained models for Chinese natural language processing tasks using the XLNet architecture. | 1,652 |