EVBCorpus
Bilingual Corpus
A large-scale bilingual corpus collection for language technology and NLP tasks, containing English-Vietnamese translations and bitexts.
The English-Vietnamese Bilingual Corpus (EVBCorpus) is a collection of English and Vietnamese parallel translations and bitexts.
42 stars
3 watching
8 forks
last commit: over 5 years ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
bertez/corpora | A collection of Galician language data in JSON format. | 2 |
kyubyong/wordvectors | Provides pre-trained word vectors for multiple languages to facilitate NLP tasks. | 2,215 |
cluebenchmark/cluecorpus2020 | A large-scale Chinese corpus for pre-training language models. | 926 |
vinairesearch/phobert | Pre-trained language models for Vietnamese NLP tasks. | 667 |
christos-c/bible-corpus | A multilingual parallel corpus created from translations of the Bible. | 177 |
elte-dh/regenykorpusz | A large corpus of Hungarian novels with annotated texts and metadata, developed by the Department of Digital Humanities at Eötvös Loránd University. | 4 |
edobashira/speech-language-processing | A curated collection of resources for building and utilizing speech and natural language processing systems. | 2,206 |
nytud/hucola | A large corpus of Hungarian sentences annotated for linguistic acceptability. | 1 |
matbahasa/talpco | A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research. | 49 |
damoebius/haxebench | A benchmarking project comparing the performance of different programming languages and their compiled outputs in various formats. | 52 |
poltextlab/hunempoli_corpus | A manually annotated corpus for training and testing machine learning models for Aspect-Based Sentiment Analysis (ABSA) in Hungarian. | 0 |
vadno/korkor_pilot | A large annotated corpus of Hungarian text with various linguistic annotations, split into development and test datasets for natural language processing tasks. | 2 |
nytud/hucopa | A corpus used to evaluate the ability of language models to select plausible alternatives based on causal relationships between premises and consequences. | 1 |
universaldependencies/ud_galician-ctg | A collection of annotated text data for the Galician language. | 1 |
crscardellino/sbwce | A collection of linguistic resources and trained word embeddings for the Spanish language. | 45 |