EVBCorpus

Bilingual Corpus

A large-scale bilingual corpus collection for language technology and NLP tasks, containing English-Vietnamese translations and bitexts.

The English-Vietnamese Bilingual Corpus (EVBCorpus) is a collection of English and Vietnamese parallel translations and bitexts.

GitHub

42 stars

3 watching

8 forks

last commit: about 7 years ago

Linked from 1 awesome list

Backlinks from these awesome lists:

keon/awesome-nlp

Related projects:

Repository	Description	Stars
bertez/corpora	A collection of Galician language data in JSON format.	2
kyubyong/wordvectors	Pre-trained word vectors for multiple languages to facilitate NLP tasks.	2,216
cluebenchmark/cluecorpus2020	A large-scale Chinese corpus for pre-training language models.	927
vinairesearch/phobert	Pre-trained language models for Vietnamese NLP tasks.	671
christos-c/bible-corpus	A multilingual parallel corpus created from translations of the Bible.	177
elte-dh/regenykorpusz	A large corpus of Hungarian novels with annotated texts and metadata, developed by the Department of Digital Humanities at Eötvös Loránd University.	4
edobashira/speech-language-processing	A curated collection of resources for building and utilizing speech and natural language processing systems.	2,206
nytud/hucola	A collection of 9,076 annotated Hungarian sentences for evaluating linguistic acceptability and grammaticality.	1
matbahasa/talpco	A parallel corpus of Asian languages with linguistic annotations and data formats for natural language processing research.	49
damoebius/haxebench	A benchmarking project comparing the performance of different programming languages and their compiled outputs in various formats.	52
poltextlab/hunempoli_corpus	A manually annotated corpus for training and testing machine learning models for aspect-based sentiment analysis (ABSA) in Hungarian.	0
vadno/korkor_pilot	A large annotated corpus of Hungarian text with various linguistic annotations, split into development and test datasets for natural language processing tasks.	2
nytud/hucopa	A dataset and annotation scheme for Hungarian causal reasoning tasks.	1
universaldependencies/ud_galician-ctg	A collection of annotated text data for the Galician language.	1
crscardellino/sbwce	A collection of linguistic resources and trained word embeddings for the Spanish language.	45