wongnai-corpus

Thai NLP Datasets

A collection of datasets for Thai natural language processing research, including word segmentation and review rating prediction.

Collection of Wongnai's datasets

GitHub

76 stars

6 watching

23 forks

last commit: almost 7 years ago

Linked from 1 awesome list

datasets, nlp, nlp-machine-learning, tokenization

Backlinks from these awesome lists:

kobkrit/nlp_thai_resources

Related projects:

Repository	Description	Stars
louisowen6/nlp_bahasa_resources	A curated collection of NLP datasets and resources for Bahasa Indonesia	496
karthikncode/nlp-datasets	A curated list of Natural Language Processing datasets used to train and evaluate NLP models.	919
krakenai/synthai	A deep learning-based project for segmenting Thai text into words and annotating parts of speech with high accuracy.	41
pythainlp/lexicon-thai	A Thai language corpus and lexicon repository for natural language processing	142
mirfan899/urdu	A collection of Urdu language datasets for various NLP tasks and applications	71
pythainlp/pythainlp	A Python package for text processing and linguistic analysis focused on the Thai language	993
tmu-nlp/thaitoxicitytweetcorpus	Corpus of annotated Thai tweets to analyze toxicity and sentiment	10
vinairesearch/phobert	Pre-trained language models for Vietnamese NLP tasks	671
crownpku/small-chinese-corpus	A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering.	529
wannaphong/thai-ner	Named entity recognition for Thai text using PyThaiNLP and a custom implementation.	53
pythainlp/prachathai-67k	An article classification dataset built from news articles scraped from Prachathai.com, with multiple benchmark models for multi-label classification	16
rkcosmos/deepcut	A Thai word tokenization library based on a deep neural network	421
ymcui/chinese-xlnet	Provides pre-trained models for Chinese natural language processing tasks using the XLNet architecture	1,652
matbahasa/talpco	A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research.	49
zhuiyitechnology/pretrained-models	A collection of pre-trained language models for natural language processing tasks	989