TALPCo

Asian language dataset

A parallel corpus of Asian languages with linguistic annotations, provided in data formats for natural language processing research.

TUFS Asian Language Parallel Corpus

GitHub

49 stars

2 watching

13 forks

Language: TeX

Last commit: about 3 years ago

Topics: addressee, bahasa-indonesia, bahasa-melayu, burmese, constituency-tree, english, indonesian, interpersonal, japanese, javanese, korean, malay, meaning, myanmar, parallel-corpus, thai, tiengviet, tokenized-sentences, treebank, vietnamese

Related projects:

Repository	Description	Stars
crownpku/small-chinese-corpus	A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering.	529
kata-ai/indosum	Provides a benchmark dataset and tools for training text summarization models in the Indonesian language.	77
louisowen6/nlp_bahasa_resources	A curated collection of NLP datasets and resources for Bahasa Indonesia.	496
carbonz0/alpaca-chinese-dataset	A dataset for training and fine-tuning large language models on Chinese text prompts.	392
kyubyong/css10	A collection of speech datasets for 10 languages to support text-to-speech tasks.	467
mirfan899/urdu	A collection of Urdu language datasets for various NLP tasks and applications.	71
atik-05/bangla_datasets_absa	A collection of pre-processed datasets in the Bangla language for natural language processing tasks.	0
qhungngo/evbcorpus	A large-scale bilingual corpus collection for language technology and NLP tasks, containing English-Vietnamese translations and bitexts.	42
hit-scir/elmoformanylangs	Provides pre-trained ELMo representations for multiple languages to improve NLP tasks.	1,462
karthikncode/nlp-datasets	A curated list of Natural Language Processing datasets used to train and evaluate NLP models.	919
famrashel/idn-tagged-corpus	A manually tagged Indonesian language corpus in tab-separated file format.	88
lantip/baku-tidak-baku	A repository of linguistic data for Indonesian words categorized as either standard or non-standard.	29
kangfend/bahasa	A natural language processing toolkit for the Indonesian language.	19
brightmart/xlnet_zh	Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks.	230
chatopera/insuranceqa-corpus-zh	An insurance industry conversation corpus with pre-processed data for natural language processing and question answering tasks.	1,019