awesome-ukrainian-nlp
NLP toolkit
A curated collection of Ukrainian NLP resources for research and development.
Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)
awesome-ukrainian-nlp / 1. Datasets / Corpora / Monolingual | |||
Malyuk | — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News | ||
Brown-UK | 110 | 2 months ago | — carefully curated corpus of modern Ukrainian language with disambiguated tokens, 1 million words |
UberText 2.0 | — over 5 GB of news, Wikipedia, social, fiction, and legal texts | ||
Wikipedia | |||
OSCAR | — shuffled sentences extracted from CommonCrawl and classified with a language detection model. The Ukrainian portion is 28GB deduplicated | ||
CC-100 | — documents extracted from CommonCrawl, automatically classified and filtered. The Ukrainian part is 200M sentences or 10GB of deduplicated text | ||
mC4 | 11,757 | almost 2 years ago | — filtered CommonCrawl again, 196GB of Ukrainian text |
Ukrainian Twitter corpus | 15 | over 5 years ago | Ukrainian Twitter corpus for toxic text detection |
Ukrainian forums | 3 | over 7 years ago | — 250k sentences scraped from forums |
Ukrainian news headlines | — 5.2M news headlines | ||
awesome-ukrainian-nlp / 1. Datasets / Corpora / Parallel | |||
OPUS | |||
Tatoeba MT Challenge data sets | 803 | 3 months ago | |
Polish-Ukrainian Parallel Corpus | |||
Back-translated monolingual Wiki data | 803 | 3 months ago | |
Wiki Edits | — 5M sentence edits extracted from the Ukrainian Wikipedia revision history | ||
awesome-ukrainian-nlp / 1. Datasets / Corpora / Labeled | |||
ZNO | — ~4000 questions and answers from the Ukrainian External Independent Testing (ЗНО/ZNO) | ||
UA-GEC | 255 | 9 months ago | — grammatical error correction (GEC) and fluency corpus |
NER-uk | 90 | 8 months ago | — Brown-UK labeled for named entities |
Yakaboo Book Reviews | — book reviews, ratings and descriptions | ||
Universal Dependencies | 28 | 9 days ago | — dependency trees corpus |
ua-news | 55 | 4 months ago | — 150k news articles in 5 categories |
UA-SQuAD | 55 | 4 months ago | — Ukrainian version of Stanford Question Answering Dataset |
Ukrainian Winograd schema challenge (WSC) Dataset | 7 | 11 months ago | — manually translated |
Ukrainian OntoNotes Dataset | 7 | 11 months ago | — scripts to build large silver dataset for coreference resolution |
awesome-ukrainian-nlp / 1. Datasets / Corpora / Dictionaries | |||
ВЕСУМ | 561 | 24 days ago | — POS tag dictionary. Can generate a list of all word forms valid for spelling |
Tonal dictionary | 47 | about 8 years ago | |
Multilingual sentiment, includes Ukrainian | a list of positive/negative words | ||
obscene-ukr | 17 | over 3 years ago | — profanity dictionary |
Word stress dictionary | 19 | about 2 months ago | — word stress for 2.7M word forms. See |
Heteronyms | 3 | over 2 years ago | — words that share the same spelling but have different meaning/pronunciation |
Abbreviations | 3 | almost 3 years ago | — map abbreviation to expansion |
awesome-ukrainian-nlp / 1. Datasets / Corpora / Prompts | |||
Aya | — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts | ||
awesome-ukrainian-nlp / 2. Tools | |||
tree_stem | 28 | almost 2 years ago | — stemmer |
pymorphy2 | 1,123 | 5 months ago | + — POS tagger and lemmatizer |
LanguageTool | — grammar, style and spell checker | ||
Stanza | — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER | ||
nlp-uk | 72 | 23 days ago | — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation |
NLP-Cube | 554 | 18 days ago | Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing |
awesome-ukrainian-nlp / 3. Pretrained models / Language models | |||
aya-101 | — massively multilingual LM, 13B parameters | ||
pythia-uk | — mT5 finetuned on wiki and oasst1 for chats in Ukrainian | ||
UAlpaca | 84 | 4 months ago | — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset |
XGLM | 30,522 | about 1 month ago | — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian |
Tereveni-AI/GPT-2 | |||
uk4b | 18 | over 1 year ago | — GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books |
xlm-roberta-base-uk | — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left | ||
youscan/ukr-roberta-base | |||
Electra | |||
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation | |||
Helsinki-NLP / OPUS-MT models | 30 | over 2 years ago | — Ukrainian to/from 25 languages |
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation / Helsinki-NLP / OPUS-MT models | |||
OPUS-MT models at HuggingFace | |||
OPUS-MT models evaluated on flores101 | 30 | over 2 years ago | |
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation | |||
M2M-100 | 30,522 | about 1 month ago | — Ukrainian to/from 100 languages |
Uk-En folktale corpus | 0 | about 2 years ago | — small sentence-aligned corpus of fairy tales |
awesome-ukrainian-nlp / 3. Pretrained models / Sequence-to-sequence models | |||
mBART50 | 30,522 | about 1 month ago | |
mT5 | 1,252 | almost 2 years ago | |
awesome-ukrainian-nlp / 3. Pretrained models / Named-entity recognition (NER) | |||
MITIE NER Model | |||
ukr-models/uk-ner | |||
lang-uk/flair-uk-ner | |||
dchaplinsky/uk_ner_web_trf_large | |||
awesome-ukrainian-nlp / 3. Pretrained models / Part-of-speech tagging (POS) | |||
lang-uk/flair-uk-pos | |||
awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings / fastText | |||
Official fastText trained on CommonCrawl and Wiki | — 157 languages, including Ukrainian | ||
Older official fastText trained on Wiki | 25,945 | 8 months ago | — 294 languages, including Ukrainian |
fastText_multilingual | 1,197 | over 1 year ago | — 78 languages, aligned to the same vector space |
fasttext_uk (2023) | — trained on UberText 2.0 | ||
awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings | |||
Word2Vec | |||
GloVe | |||
LexVec | |||
BPEmb: Subword Embeddings, includes Ukrainian | easy to use with | ||
Flair | 13,939 | 6 days ago | — added in 2022 |
awesome-ukrainian-nlp / 3. Pretrained models / Other | |||
uk-punctcase | — punctuation and case restoration model based on XLM-RoBERTa-Uk | ||
punctuation_uk_bert | — another punctuation and case restoration model based on bert-base-multilingual-cased | ||
ukrainian-word-stress | 45 | about 2 months ago | — adds word stress |
awesome-ukrainian-nlp / 4. Paid | |||
LORELEI Ukrainian Representative Language Pack | Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities | ||
awesome-ukrainian-nlp / 5. Other resources and links | |||
Helsinki-NLP/ UkrainianLT | 30 | over 2 years ago | — another collection of links to Ukrainian language tools |
egorsmkv / speech-recognition-uk | 342 | 24 days ago | — speech recognition and text-to-speech models and datasets |
awesome-ukrainian-nlp / 6. Workshops and conferences | |||
Ukrainian Natural Language Processing Workshop | |||
awesome-ukrainian-nlp / 6. Workshops and conferences / UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian | |||
Training data and evaluation scripts | 7 | over 1 year ago | |
Public leaderboard | |||
awesome-ukrainian-nlp / 6. Workshops and conferences | |||
UNLP 2024 shared task | 13 | 7 months ago | — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian |