awesome-ukrainian-nlp

NLP toolkit

A curated collection of Ukrainian NLP resources for research and development.

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

166 stars
11 watching
14 forks
last commit: about 1 month ago
Linked from 1 awesome list

awesome-list, datasets, natural-language-processing, nlp, ukrainian, ukrainian-nlp

awesome-ukrainian-nlp / 1. Datasets / Corpora / Monolingual

Malyuk — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News
Brown-UK 110 2 months ago — carefully curated corpus of modern Ukrainian language with disambiguated tokens, 1 million words
UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
Wikipedia
OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28GB deduplicated
CC-100 — documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences or 10GB of deduplicated text
mC4 11,757 almost 2 years ago — filtered CommonCrawl again, 196GB of Ukrainian text
Ukrainian Twitter corpus 15 over 5 years ago — Ukrainian Twitter corpus for toxic text detection
Ukrainian forums 3 over 7 years ago — 250k sentences scraped from forums
Ukrainian news headlines — 5.2M news headlines

awesome-ukrainian-nlp / 1. Datasets / Corpora / Parallel

OPUS
Tatoeba MT Challenge data sets 803 3 months ago
Polish-Ukrainian Parallel Corpus
Back-translated monolingual Wiki data 803 3 months ago
Wiki Edits — 5M sentence edits extracted from the Ukrainian Wikipedia revision history

awesome-ukrainian-nlp / 1. Datasets / Corpora / Labeled

ZNO — ~4000 questions and answers from Ukrainian External independent testing (ЗНО/ZNO)
UA-GEC 255 9 months ago — grammatical error correction (GEC) and fluency corpus
NER-uk 90 8 months ago — Brown-UK labeled for named entities
Yakaboo Book Reviews — book reviews, ratings and descriptions
Universal Dependencies 28 9 days ago — dependency trees corpus
ua-news 55 4 months ago — 150k news articles in 5 categories
UA-SQuAD 55 4 months ago — Ukrainian version of Stanford Question Answering Dataset
Ukrainian Winograd schema challenge (WSC) Dataset 7 11 months ago — manually translated
Ukrainian OntoNotes Dataset 7 11 months ago — scripts to build large silver dataset for coreference resolution
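
Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated fields, `#` comment lines, and a blank line between sentences. The following is a minimal sketch of reading that format with only the Python standard library. The helper name `read_conllu` and the three-token sample sentence are illustrative assumptions, not taken from the Ukrainian treebank, and a dedicated CoNLL-U library is likely a better fit for real work.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U columns, in the order they appear on each token line.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        # Multiword-token ranges ("1-2") and empty nodes ("1.1") are kept as-is.
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Emit the last sentence when the input lacks a trailing blank line.
        yield sentence


# Made-up sample for illustration only.
SAMPLE = (
    "# text = Я читаю.\n"
    "1\tЯ\tя\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tчитаю\tчитати\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["id"], token["form"], token["upos"], token["deprel"])
```

To read a downloaded treebank file, the same function can be given an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points at a local `.conllu` file.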

awesome-ukrainian-nlp / 1. Datasets / Corpora / Dictionaries

ВЕСУМ 561 24 days ago — POS tag dictionary. Can generate a list of all word forms valid for spelling
Tonal dictionary 47 about 8 years ago
Multilingualsentiment, includes Ukrainian — a list of positive/negative words
obscene-ukr 17 over 3 years ago — profanity dictionary
Word stress dictionary 19 about 2 months ago — word stress for 2.7M word forms. See
Heteronyms 3 over 2 years ago — words that share the same spelling but have different meaning/pronunciation
Abbreviations 3 almost 3 years ago — map abbreviation to expansion

awesome-ukrainian-nlp / 1. Datasets / Corpora / Prompts

Aya — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts

awesome-ukrainian-nlp / 2. Tools

tree_stem 28 almost 2 years ago — stemmer
pymorphy2 1,123 5 months ago + — POS tagger and lemmatizer
LanguageTool — grammar, style and spell checker
Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
nlp-uk 72 23 days ago — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
NLP-Cube 554 18 days ago — Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing

awesome-ukrainian-nlp / 3. Pretrained models / Language models

aya-101 — massively multilingual LM, 13B parameters
pythia-uk — Pythia fine-tuned on wiki and oasst1 for chats in Ukrainian
UAlpaca 84 4 months ago — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset
XGLM 30,522 about 1 month ago — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian
Tereveni-AI/GPT-2
uk4b 18 over 1 year ago and — GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books
xlm-roberta-base-uk — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left
youscan/ukr-roberta-base
Electra

awesome-ukrainian-nlp / 3. Pretrained models / Machine translation

Helsinki-NLP / OPUS-MT models 30 over 2 years ago — Ukrainian to/from 25 languages

awesome-ukrainian-nlp / 3. Pretrained models / Machine translation / Helsinki-NLP / OPUS-MT models

OPUS-MT models at HuggingFace
OPUS-MT models evaluated on flores101 30 over 2 years ago

awesome-ukrainian-nlp / 3. Pretrained models / Machine translation

M2M-100 30,522 about 1 month ago — Ukrainian to/from 100 languages
Uk-En folktale corpus 0 about 2 years ago — small sentence-aligned corpus of fairy tales

awesome-ukrainian-nlp / 3. Pretrained models / Sequence-to-sequence models

mBART50 30,522 about 1 month ago
mT5 1,252 almost 2 years ago

awesome-ukrainian-nlp / 3. Pretrained models / Named-entity recognition (NER)

MITIE NER Model
ukr-models/uk-ner
lang-uk/flair-uk-ner
dchaplinsky/uk_ner_web_trf_large

awesome-ukrainian-nlp / 3. Pretrained models / Part-of-speech tagging (POS)

lang-uk/flair-uk-pos

awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings / fastText

Official fastText trained on CommonCrawl and Wiki — 157 languages, including Ukrainian
Older official fastText trained on Wiki 25,945 8 months ago — 294 languages, including Ukrainian
fastText_multilingual 1,197 over 1 year ago — 78 languages, aligned to the same vector space
fasttext_uk (2023) and — trained on UberText 2.0
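
fastText word vectors are commonly distributed in a plain-text `.vec` format: a header line with the vocabulary size and vector dimension, then one line per word holding the word followed by its vector components. The sketch below reads that format with only the Python standard library and compares two words by cosine similarity. The helper names `load_vec` and `cosine` and the three two-dimensional vectors are made up for illustration; real files are far larger and are usually loaded with a dedicated library.

```python
import io
import math
from typing import Dict, List, TextIO


def load_vec(handle: TextIO) -> Dict[str, List[float]]:
    """Read text-format word vectors into a dict (hypothetical helper)."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            # Skip malformed lines instead of failing on them.
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Tiny made-up sample standing in for a real .vec file.
SAMPLE = "3 2\nкіт 1.0 0.0\nпес 0.8 0.6\nдім 0.0 1.0\n"

vectors = load_vec(io.StringIO(SAMPLE))
print(cosine(vectors["кіт"], vectors["пес"]))
print(cosine(vectors["кіт"], vectors["дім"]))
```

For a downloaded file, `load_vec(open(path, encoding="utf-8"))` would work the same way, with `path` being a local `.vec` file; loading every vector into a Python dict is memory-hungry, so this is only suitable for small vocabularies.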

awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings

Word2Vec
GloVe
LexVec
BPEmb: Subword Embeddings, includes Ukrainian easy to use with
Flair 13,939 6 days ago — added in 2022

awesome-ukrainian-nlp / 3. Pretrained models / Other

uk-punctcase — punctuation and case restoration model based on XLM-RoBERTa-Uk
punctuation_uk_bert — another punctuation and case restoration model based on bert-base-multilingual-cased
ukrainian-word-stress 45 about 2 months ago — adds word stress

awesome-ukrainian-nlp / 4. Paid

LORELEI Ukrainian Representative Language Pack — Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities

awesome-ukrainian-nlp / 5. Other resources

Helsinki-NLP/ UkrainianLT 30 over 2 years ago — another collection of links to Ukrainian language tools
egorsmkv / speech-recognition-uk 342 24 days ago — speech recognition and text-to-speech models and datasets

awesome-ukrainian-nlp / 6. Workshops and conferences

Ukrainian Natural Language Processing Workshop

awesome-ukrainian-nlp / 6. Workshops and conferences / UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian

Training data and evaluation scripts 7 over 1 year ago
Public leaderboard

awesome-ukrainian-nlp / 6. Workshops and conferences

UNLP 2024 shared task 13 7 months ago — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian
