awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)


Topics: awesome-list, datasets, natural-language-processing, nlp, ukrainian, ukrainian-nlp

awesome-ukrainian-nlp / 1. Datasets / Corpora / Monolingual

Malyuk — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News
Brown-UK — carefully curated corpus of modern Ukrainian language with disambiguated tokens, 1 million words
UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
Wikipedia
OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28GB deduplicated
CC-100 — documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences or 10GB of deduplicated text
mC4 — another filtered version of Common Crawl, 196GB of Ukrainian text
Ukrainian Twitter corpus — Ukrainian Twitter corpus for toxic text detection
Ukrainian forums — 250k sentences scraped from forums
Ukrainian news headlines — 5.2M news headlines

awesome-ukrainian-nlp / 1. Datasets / Corpora / Parallel

OPUS
Tatoeba MT Challenge data sets
Polish-Ukrainian Parallel Corpus
Back-translated monolingual Wiki data
Wiki Edits — 5M sentence edits extracted from the Ukrainian Wikipedia revision history

awesome-ukrainian-nlp / 1. Datasets / Corpora / Labeled

ZNO — ~4000 questions and answers from Ukrainian External independent testing (ЗНО/ZNO)
UA-GEC — grammatical error correction (GEC) and fluency corpus
NER-uk — Brown-UK labeled for named entities
Yakaboo Book Reviews — book reviews, ratings and descriptions
Universal Dependencies — dependency trees corpus
ua-news — 150k news articles in 5 categories
UA-SQuAD — Ukrainian version of Stanford Question Answering Dataset
Ukrainian Winograd schema challenge (WSC) Dataset — manually translated
Ukrainian OntoNotes Dataset — scripts to build large silver dataset for coreference resolution

awesome-ukrainian-nlp / 1. Datasets / Corpora / Dictionaries

ВЕСУМ — POS tag dictionary. Can generate a list of all word forms valid for spelling
Tonal dictionary
Multilingualsentiment, includes Ukrainian — a list of positive/negative words
obscene-ukr — profanity dictionary
Word stress dictionary — word stress for 2.7M word forms
Heteronyms — words that share the same spelling but have different meaning/pronunciation
Abbreviations — map abbreviation to expansion
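
As a rough illustration of how an abbreviation-to-expansion map like the Abbreviations dictionary might be applied during text normalization, here is a minimal Python sketch. The sample entries and the `expand_abbreviations` helper are hypothetical; the actual file format and contents of the dictionary are not described in this list, so loading it is left out.

```python
import re

# Hypothetical sample entries for illustration only; they are not taken
# from the Abbreviations dictionary itself.
ABBREVIATIONS = {
    "напр.": "наприклад",
    "див.": "дивись",
    "рис.": "рисунок",
}


def expand_abbreviations(text, mapping):
    """Replace every standalone abbreviation in `text` with its expansion."""
    # Try longer keys first so a long abbreviation wins over a shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)
    # The lookarounds keep matches from starting or ending inside a word.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text)


print(expand_abbreviations("див. рис. 2, напр. тут", ABBREVIATIONS))
```

Matching here is case-sensitive and the expansions are not inflected to fit the sentence, so a real pipeline would likely need morphological agreement on top of a plain lookup.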

awesome-ukrainian-nlp / 1. Datasets / Corpora / Prompts

Aya — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts

awesome-ukrainian-nlp / 2. Tools

tree_stem — stemmer
pymorphy2 — POS tagger and lemmatizer
LanguageTool — grammar, style and spell checker
Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
nlp-uk — tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
NLP-Cube — Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing

awesome-ukrainian-nlp / 3. Pretrained models / Language models

aya-101 — massively multilingual LM, 13B parameters
pythia-uk — mT5 finetuned on wiki and oasst1 for chats in Ukrainian
UAlpaca — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset
XGLM — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian
Tereveni-AI/GPT-2
uk4b — GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books
xlm-roberta-base-uk — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left
youscan/ukr-roberta-base
Electra

awesome-ukrainian-nlp / 3. Pretrained models / Machine translation

Helsinki-NLP / OPUS-MT models — Ukrainian to/from 25 languages

awesome-ukrainian-nlp / 3. Pretrained models / Machine translation / Helsinki-NLP / OPUS-MT models

OPUS-MT models at HuggingFace
OPUS-MT models evaluated on flores101

awesome-ukrainian-nlp / 3. Pretrained models / Machine translation

M2M-100 — Ukrainian to/from 100 languages

awesome-ukrainian-nlp / 3. Pretrained models / Sequence-to-sequence models

mBART50
mT5

awesome-ukrainian-nlp / 3. Pretrained models / Named-entity recognition (NER)

MITIE NER Model
ukr-models/uk-ner
lang-uk/flair-uk-ner
dchaplinsky/uk_ner_web_trf_large

awesome-ukrainian-nlp / 3. Pretrained models / Part-of-speech tagging (POS)

lang-uk/flair-uk-pos

awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings / fastText

Official fastText trained on CommonCrawl and Wiki — 157 languages, including Ukrainian
Older official fastText trained on Wiki — 294 languages, including Ukrainian
fastText_multilingual — 78 languages, aligned to the same vector space
fasttext_uk (2023) — trained on UberText 2.0

awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings

Word2Vec
GloVe
LexVec
BPEmb: Subword Embeddings, includes Ukrainian easy to use with
Flair — added in 2022

awesome-ukrainian-nlp / 3. Pretrained models / Other

uk-punctcase — punctuation and case restoration model based on XLM-RoBERTa-Uk
punctuation_uk_bert — another punctuation and case restoration model based on bert-base-multilingual-cased
ukrainian-word-stress — adds word stress

awesome-ukrainian-nlp / 4. Paid

LORELEI Ukrainian Representative Language Pack — Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities

awesome-ukrainian-nlp / 5. Other resources and links

Helsinki-NLP / UkrainianLT — another collection of links to Ukrainian language tools
egorsmkv / speech-recognition-uk — speech recognition and text-to-speech models and datasets

awesome-ukrainian-nlp / 6. Workshops and conferences

Ukrainian Natural Language Processing Workshop

awesome-ukrainian-nlp / 6. Workshops and conferences / UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian

Training data and evaluation scripts
Public leaderboard

awesome-ukrainian-nlp / 6. Workshops and conferences

UNLP 2024 shared task — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian
