awesome-ukrainian-nlp
Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)
awesome-ukrainian-nlp / 1. Datasets / Corpora / Monolingual | |||
Malyuk | — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News | ||
Brown-UK | 110 | 15 days ago | — carefully curated corpus of modern Ukrainian language with disambiguated tokens, 1 million words |
UberText 2.0 | — over 5 GB of news, Wikipedia, social, fiction, and legal texts | ||
Wikipedia | |||
OSCAR | — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28GB after deduplication | ||
CC-100 | — documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences or 10GB of deduplicated text | ||
mC4 | 11,741 | almost 2 years ago | — filtered CommonCrawl again, 196GB of Ukrainian text |
Ukrainian Twitter corpus | 15 | over 5 years ago | Ukrainian Twitter corpus for toxic text detection |
Ukrainian forums | 3 | over 7 years ago | — 250k sentences scraped from forums |
Ukrainian news headlines | — 5.2M news headlines | ||
awesome-ukrainian-nlp / 1. Datasets / Corpora / Parallel | |||
OPUS | |||
Tatoeba MT Challenge data sets | 799 | about 2 months ago | |
Polish-Ukrainian Parallel Corpus | |||
Back-translated monolingual Wiki data | 799 | about 2 months ago | |
Wiki Edits | — 5M sentence edits extracted from the Ukrainian Wikipedia revision history | ||
awesome-ukrainian-nlp / 1. Datasets / Corpora / Labeled | |||
ZNO | — ~4000 questions and answers from the Ukrainian External Independent Testing (ЗНО/ZNO) | ||
UA-GEC | 255 | 8 months ago | — grammatical error correction (GEC) and fluency corpus |
NER-uk | 89 | 6 months ago | — Brown-UK labeled for named entities |
Yakaboo Book Reviews | — book reviews, ratings and descriptions | ||
Universal Dependencies | 28 | 5 months ago | — dependency trees corpus |
ua-news | 53 | 2 months ago | — 150k news articles in 5 categories |
UA-SQuAD | 53 | 2 months ago | — Ukrainian version of Stanford Question Answering Dataset |
Ukrainian Winograd schema challenge (WSC) Dataset | 6 | 10 months ago | — manually translated |
Ukrainian OntoNotes Dataset | 6 | 10 months ago | — scripts to build large silver dataset for coreference resolution |
awesome-ukrainian-nlp / 1. Datasets / Corpora / Dictionaries | |||
ВЕСУМ | 556 | 6 days ago | — POS tag dictionary. Can generate a list of all word forms valid for spelling |
Tonal dictionary | 46 | about 8 years ago | |
Multilingualsentiment, includes Ukrainian | a list of positive/negative words | ||
obscene-ukr | 16 | over 3 years ago | — profanity dictionary |
Word stress dictionary | 18 | 6 days ago | — word stress for 2.7M word forms. See |
Heteronyms | 3 | about 2 years ago | — words that share the same spelling but have different meaning/pronunciation |
Abbreviations | 3 | over 2 years ago | — map abbreviation to expansion |
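As a rough illustration of how an abbreviation-to-expansion map like the one above might be applied to text, here is a minimal Python sketch. The three sample entries and the `expand_abbreviations` helper are hypothetical and used only for illustration; the actual dataset's file format and contents may differ.

```python
import re

# Hypothetical sample entries for illustration only; the real dataset's
# format and contents may differ.
ABBREVIATIONS = {
    "вул.": "вулиця",
    "м.": "місто",
    "грн": "гривень",
}


def expand_abbreviations(text: str, mapping: dict[str, str]) -> str:
    """Replace each standalone abbreviation in `text` with its expansion."""
    # Try longer abbreviations first so they win over shorter ones.
    keys = sorted(mapping, key=len, reverse=True)
    # The lookarounds stop a match from firing inside a longer word,
    # e.g. the "м." at the end of "дім." is left alone.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(expand_abbreviations("вул. Хрещатик, м. Київ", ABBREVIATIONS))
```

This sketch does plain string substitution: it does not inflect the expansion to fit the sentence and does not resolve ambiguous abbreviations, both of which a real text normalizer would need to handle.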
awesome-ukrainian-nlp / 1. Datasets / Corpora / Prompts | |||
Aya | — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts | ||
awesome-ukrainian-nlp / 2. Tools | |||
tree_stem | 28 | almost 2 years ago | — stemmer |
pymorphy2 | 1,114 | 3 months ago | + — POS tagger and lemmatizer |
LanguageTool | — grammar, style and spell checker | ||
Stanza | — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER | ||
nlp-uk | 70 | 6 days ago | — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation |
NLP-Cube | 550 | 6 months ago | Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing |
awesome-ukrainian-nlp / 3. Pretrained models / Language models | |||
aya-101 | — massively multilingual LM, 13B parameters | ||
pythia-uk | — mT5 fine-tuned on wiki and oasst1 for chats in Ukrainian | ||
UAlpaca | 79 | 3 months ago | — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset |
XGLM | 30,200 | 26 days ago | — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian |
Tereveni-AI/GPT-2 | |||
uk4b | 16 | about 1 year ago | and - GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books |
xlm-roberta-base-uk | — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left | ||
youscan/ukr-roberta-base | |||
Electra | |||
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation | |||
Helsinki-NLP / OPUS-MT models | 29 | over 2 years ago | — Ukrainian to/from 25 languages |
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation / Helsinki-NLP / OPUS-MT models | |||
OPUS-MT models at HuggingFace | |||
OPUS-MT models evaluated on flores101 | 29 | over 2 years ago | |
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation | |||
M2M-100 | 30,200 | 26 days ago | — Ukrainian to/from 100 languages |
awesome-ukrainian-nlp / 3. Pretrained models / Sequence-to-sequence models | |||
mBART50 | 30,200 | 26 days ago | |
mT5 | 1,244 | almost 2 years ago | |
awesome-ukrainian-nlp / 3. Pretrained models / Named-entity recognition (NER) | |||
MITIE NER Model | |||
ukr-models/uk-ner | |||
lang-uk/flair-uk-ner | |||
dchaplinsky/uk_ner_web_trf_large | |||
awesome-ukrainian-nlp / 3. Pretrained models / Part-of-speech tagging (POS) | |||
lang-uk/flair-uk-pos | |||
awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings / fastText | |||
Official fastText trained on CommonCrawl and Wiki | — 157 languages, including Ukrainian | ||
Older official fastText trained on Wiki | 25,869 | 7 months ago | — 294 languages, including Ukrainian |
fastText_multilingual | 1,195 | over 1 year ago | — 78 languages, aligned to the same vector space |
fasttext_uk (2023) | and — trained on UberText 2.0 | ||
awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings | |||
Word2Vec | |||
GloVe | |||
LexVec | |||
BPEmb: Subword Embeddings, includes Ukrainian | easy to use with | ||
Flair | 13,852 | 7 days ago | — added in 2022 |
awesome-ukrainian-nlp / 3. Pretrained models / Other | |||
uk-punctcase | — punctuation and case restoration model based on XLM-RoBERTa-Uk | ||
punctuation_uk_bert | — another punctuation and case restoration model based on bert-base-multilingual-cased | ||
ukrainian-word-stress | 44 | 6 days ago | — adds word stress |
awesome-ukrainian-nlp / 4. Paid | |||
LORELEI Ukrainian Representative Language Pack | Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities | ||
awesome-ukrainian-nlp / 5. Other resources and links | |||
Helsinki-NLP/ UkrainianLT | 29 | over 2 years ago | — another collection of links to Ukrainian language tools |
egorsmkv / speech-recognition-uk | 336 | about 1 month ago | — speech recognition and text-to-speech models and datasets |
awesome-ukrainian-nlp / 6. Workshops and conferences | |||
Ukrainian Natural Language Processing Workshop | |||
awesome-ukrainian-nlp / 6. Workshops and conferences / UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian | |||
Training data and evaluation scripts | 6 | over 1 year ago | |
Public leaderboard | |||
awesome-ukrainian-nlp / 6. Workshops and conferences | |||
UNLP 2024 shared task | 13 | 6 months ago | — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian |