TALPCo
Asian language dataset
A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research.
TUFS Asian Language Parallel Corpus
49 stars
2 watching
13 forks
Language: TeX
last commit: almost 2 years ago
Topics: addressee, bahasa-indonesia, bahasa-melayu, burmese, constituency-tree, english, indonesian, interpersonal, japanese, javanese, korean, malay, meaning, myanmar, parallel-corpus, thai, tiengviet, tokenized-sentences, treebank, vietnamese
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A collection of datasets and tools for NLP tasks on Chinese texts, including part-of-speech tagging, named entity recognition, and question answering. | 529 |
| | Provides a benchmark dataset and tools for training text summarization models in the Indonesian language. | 77 |
| | A curated collection of NLP datasets and resources for Bahasa Indonesia | 496 |
| | A dataset for training and fine-tuning large language models on Chinese text prompts. | 392 |
| | A collection of speech datasets for 10 languages to support text-to-speech tasks | 467 |
| | A collection of Urdu language datasets for various NLP tasks and applications | 71 |
| | A collection of pre-processed datasets in Bangla language for natural language processing tasks | 0 |
| | A large-scale bilingual corpus collection for language technology and NLP tasks, containing English-Vietnamese translations and bitexts. | 42 |
| | Provides pre-trained ELMo representations for multiple languages to improve NLP tasks. | 1,462 |
| | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |
| | A manually tagged Indonesian language corpus in tab-separated file format | 88 |
| | A repository of linguistic data for Indonesian words categorized as either standard or non-standard | 29 |
| | A natural language processing toolkit for the Indonesian language. | 19 |
| | Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks | 230 |
| | An insurance industry conversation corpus with pre-processed data for natural language processing and question answering tasks. | 1,019 |