TALPCo
Asian language dataset
A parallel corpus of Asian languages, with linguistic annotations and data formats intended for natural language processing research.
TUFS Asian Language Parallel Corpus
49 stars
2 watching
13 forks
Language: TeX
last commit: over 1 year ago
Topics: addressee, bahasa-indonesia, bahasa-melayu, burmese, constituency-tree, english, indonesian, interpersonal, japanese, javanese, korean, malay, meaning, myanmar, parallel-corpus, thai, tiengviet, tokenized-sentences, treebank, vietnamese
Related projects:
Repository | Description | Stars |
---|---|---|
crownpku/small-chinese-corpus | A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering. | 531 |
kata-ai/indosum | Provides a benchmark dataset and tools for training text summarization models in the Indonesian language. | 76 |
louisowen6/nlp_bahasa_resources | A curated collection of NLP datasets and resources for Bahasa Indonesia | 489 |
carbonz0/alpaca-chinese-dataset | A dataset for training and fine-tuning large language models on Chinese text prompts. | 390 |
kyubyong/css10 | A collection of speech datasets for 10 languages to support text-to-speech tasks | 465 |
mirfan899/urdu | A collection of Urdu language datasets for various NLP tasks and applications | 71 |
atik-05/bangla_datasets_absa | A collection of pre-processed datasets in the Bangla language for natural language processing tasks | 0 |
qhungngo/evbcorpus | A large-scale bilingual corpus collection for language technology and NLP tasks, containing English-Vietnamese translations and bitexts. | 42 |
hit-scir/elmoformanylangs | Provides pre-trained ELMo representations for multiple languages to improve NLP tasks. | 1,463 |
karthikncode/nlp-datasets | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |
famrashel/idn-tagged-corpus | A manually tagged Indonesian language corpus in tab-separated file format | 88 |
lantip/baku-tidak-baku | A repository of linguistic data for Indonesian words categorized as either standard or non-standard | 29 |
kangfend/bahasa | A natural language processing toolkit for the Indonesian language. | 19 |
brightmart/xlnet_zh | Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks | 230 |
chatopera/insuranceqa-corpus-zh | An insurance industry conversation corpus with pre-processed data for natural language processing and question answering tasks. | 1,020 |