EVBCorpus

Bilingual Corpus

A large-scale bilingual corpus collection for language technology and NLP tasks, containing English-Vietnamese translations and bitexts.

The English-Vietnamese Bilingual Corpus (EVBCorpus) is a collection of English and Vietnamese parallel translations and bitexts.


42 stars
3 watching
8 forks
last commit: over 5 years ago
Linked from 1 awesome list



Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| bertez/corpora | A collection of Galician language data in JSON format. | 2 |
| kyubyong/wordvectors | Provides pre-trained word vectors for multiple languages to facilitate NLP tasks. | 2,215 |
| cluebenchmark/cluecorpus2020 | A large-scale Chinese corpus for pre-training language models. | 926 |
| vinairesearch/phobert | Pre-trained language models for Vietnamese NLP tasks. | 667 |
| christos-c/bible-corpus | A multilingual parallel corpus created from translations of the Bible. | 177 |
| elte-dh/regenykorpusz | A large corpus of Hungarian novels with annotated texts and metadata, developed by the Department of Digital Humanities at Eötvös Loránd University. | 4 |
| edobashira/speech-language-processing | A curated collection of resources for building and utilizing speech and natural language processing systems. | 2,206 |
| nytud/hucola | A large corpus of Hungarian sentences annotated for linguistic acceptability. | 1 |
| matbahasa/talpco | A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research. | 49 |
| damoebius/haxebench | A benchmarking project comparing the performance of different programming languages and their compiled outputs in various formats. | 52 |
| poltextlab/hunempoli_corpus | A manually annotated corpus for training and testing machine learning models of Aspect Based Sentiment Analysis (ABSA) in the Hungarian language. | 0 |
| vadno/korkor_pilot | A large annotated corpus of Hungarian text with various linguistic annotations, split into development and test datasets for natural language processing tasks. | 2 |
| nytud/hucopa | A corpus used to evaluate the ability of language models to select plausible alternatives based on causal relationships between premises and consequences. | 1 |
| universaldependencies/ud_galician-ctg | A collection of annotated text data for the Galician language. | 1 |
| crscardellino/sbwce | A collection of linguistic resources and trained word embeddings for the Spanish language. | 45 |