wongnai-corpus
Thai NLP Datasets
A collection of datasets for natural language processing research in Thai, including word segmentation and review rating prediction.
Collection of Wongnai's datasets
76 stars
6 watching
23 forks
last commit: about 5 years ago
Linked from 1 awesome list
Tags: datasets, nlp, nlp-machine-learning, tokenization
Related projects:
Repository | Description | Stars |
---|---|---|
louisowen6/nlp_bahasa_resources | A curated collection of NLP datasets and resources for Bahasa Indonesia | 489 |
karthikncode/nlp-datasets | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |
krakenai/synthai | A deep learning-based project for segmenting Thai text into words and annotating parts of speech with high accuracy. | 41 |
pythainlp/lexicon-thai | A Thai language corpus and lexicon repository for natural language processing | 141 |
mirfan899/urdu | A collection of Urdu language datasets for various NLP tasks and applications | 71 |
pythainlp/pythainlp | A Python package for text processing and linguistic analysis focused on the Thai language. | 987 |
tmu-nlp/thaitoxicitytweetcorpus | Corpus of annotated Thai tweets to analyze toxicity and sentiment | 10 |
vinairesearch/phobert | Pre-trained language models for Vietnamese NLP tasks | 663 |
crownpku/small-chinese-corpus | A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering. | 531 |
wannaphong/thai-ner | A Named Entity Recognition tool for the Thai language. | 53 |
pythainlp/prachathai-67k | An article classification dataset built from news articles scraped from Prachathai.com, with several benchmark models for multi-label classification | 16 |
rkcosmos/deepcut | A Thai word tokenization library that uses a deep neural network | 420 |
ymcui/chinese-xlnet | Provides pre-trained models for Chinese natural language processing tasks using the XLNet architecture | 1,653 |
matbahasa/talpco | A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research. | 49 |
zhuiyitechnology/pretrained-models | A collection of pre-trained language models for natural language processing tasks | 987 |