TALPCo

Asian language dataset

A parallel corpus of Asian languages, with linguistic annotations, provided in data formats suited to natural language processing research.

TUFS Asian Language Parallel Corpus

GitHub

49 stars
2 watching
13 forks
Language: TeX
last commit: over 1 year ago
addressee, bahasa-indonesia, bahasa-melayu, burmese, constituency-tree, english, indonesian, interpersonal, japanese, javanese, korean, malay, meaning, myanmar, parallel-corpus, thai, tiengviet, tokenized-sentences, treebank, vietnamese

Related projects:

Repository | Description | Stars
crownpku/small-chinese-corpus | A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering. | 531
kata-ai/indosum | Provides a benchmark dataset and tools for training text summarization models in the Indonesian language. | 76
louisowen6/nlp_bahasa_resources | A curated collection of NLP datasets and resources for Bahasa Indonesia | 489
carbonz0/alpaca-chinese-dataset | A dataset for training and fine-tuning large language models on Chinese text prompts. | 390
kyubyong/css10 | A collection of speech datasets for 10 languages to support text-to-speech tasks | 465
mirfan899/urdu | A collection of Urdu language datasets for various NLP tasks and applications | 71
atik-05/bangla_datasets_absa | A collection of pre-processed datasets in Bangla language for natural language processing tasks | 0
qhungngo/evbcorpus | A large-scale bilingual corpus collection for language technology and NLP tasks, containing English-Vietnamese translations and bitexts. | 42
hit-scir/elmoformanylangs | Provides pre-trained ELMo representations for multiple languages to improve NLP tasks. | 1,463
karthikncode/nlp-datasets | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919
famrashel/idn-tagged-corpus | A manually tagged Indonesian language corpus in tab-separated file format | 88
lantip/baku-tidak-baku | A repository of linguistic data for Indonesian words categorized as either standard or non-standard | 29
kangfend/bahasa | A natural language processing toolkit for the Indonesian language. | 19
brightmart/xlnet_zh | Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks | 230
chatopera/insuranceqa-corpus-zh | An insurance industry conversation corpus with pre-processed data for natural language processing and question answering tasks. | 1,020