awesome-ukrainian-nlp
Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)
155 stars
11 watching
14 forks
last commit: 7 months ago
awesome-ukrainian-nlp / 1. Datasets / Corpora / Monolingual | |||
| Malyuk | — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News | ||
| Brown-UK | 110 | 13 days ago | — carefully curated corpus of modern Ukrainian language with disambiguated tokens, 1 million words |
| UberText 2.0 | — over 5 GB of news, Wikipedia, social, fiction, and legal texts | ||
| Wikipedia | |||
| OSCAR | — shuffled sentences extracted from CommonCrawl and classified with a language detection model. The Ukrainian portion is 28GB deduplicated | ||
| CC-100 | — documents extracted from CommonCrawl, automatically classified and filtered. The Ukrainian part is 200M sentences or 10GB of deduplicated text | ||
| mC4 | 11,727 | almost 2 years ago | — filtered CommonCrawl again, 196GB of Ukrainian text |
| Ukrainian Twitter corpus | 15 | about 5 years ago | Ukrainian Twitter corpus for toxic text detection |
| Ukrainian forums | 3 | about 7 years ago | — 250k sentences scraped from forums |
| Ukrainian news headlines | — 5.2M news headlines | ||
awesome-ukrainian-nlp / 1. Datasets / Corpora / Parallel | |||
| OPUS | |||
| Tatoeba MT Challenge data sets | 796 | about 1 month ago | |
| Polish-Ukrainian Parallel Corpus | |||
| Back-translated monolingual Wiki data | 796 | about 1 month ago | |
| Wiki Edits | — 5M sentence edits extracted from the Ukrainian Wikipedia revision history | ||
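To illustrate what a "sentence edit" from a revision history might look like, here is a minimal sketch that diffs two versions of a sentence at the token level using Python's standard `difflib`. It is not the extraction pipeline used for Wiki Edits; the function name `extract_edits` and the example sentences are assumptions made for illustration.

```python
import difflib


def extract_edits(before: str, after: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_text, new_text) for each changed token span."""
    old_tokens = before.split()
    new_tokens = after.split()
    # autojunk=False keeps the heuristic from discarding frequent tokens
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged spans are not edits
        edits.append((tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return edits


# Hypothetical pair of revisions of the same sentence
before = "Київ є столиця України"
after = "Київ є столицею України"
print(extract_edits(before, after))
```

Whitespace splitting is the simplest possible tokenization; a real pipeline would likely use a proper tokenizer and filter out vandalism or non-linguistic edits.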
awesome-ukrainian-nlp / 1. Datasets / Corpora / Labeled | |||
| ZNO | — ~4000 questions and answers from Ukrainian External independent testing (ЗНО/ZNO) | ||
| UA-GEC | 255 | 8 months ago | — grammatical error correction (GEC) and fluency corpus |
| NER-uk | 89 | 6 months ago | — Brown-UK labeled for named entities |
| Yakaboo Book Reviews | — book reviews, ratings and descriptions | ||
| Universal Dependencies | 27 | 5 months ago | — dependency trees corpus |
| ua-news | 53 | 2 months ago | — 150k news articles in 5 categories |
| UA-SQuAD | 53 | 2 months ago | — Ukrainian version of Stanford Question Answering Dataset |
| Ukrainian Winograd schema challenge (WSC) Dataset | 6 | 9 months ago | — manually translated |
| Ukrainian OntoNotes Dataset | 6 | 9 months ago | — scripts to build large silver dataset for coreference resolution |
awesome-ukrainian-nlp / 1. Datasets / Corpora / Dictionaries | |||
| ВЕСУМ | 552 | 16 days ago | — POS tag dictionary. Can generate a list of all word forms valid for spelling |
| Tonal dictionary | 46 | about 8 years ago | |
| Multilingualsentiment, includes Ukrainian | a list of positive/negative words | ||
| obscene-ukr | 16 | over 3 years ago | — profanity dictionary |
| Word stress dictionary | 18 | about 2 years ago | — word stress for 2.7M word forms. See |
| Heteronyms | 3 | about 2 years ago | — words that share the same spelling but have different meaning/pronunciation |
| Abbreviations | 3 | over 2 years ago | — map abbreviation to expansion |
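An abbreviation-to-expansion map like the one listed above could be applied to text roughly as follows. This is a sketch under stated assumptions: the three entries in `ABBREVIATIONS` are hypothetical samples, not taken from the listed dictionary, and the helper `expand_abbreviations` is an illustrative name.

```python
import re

# Hypothetical sample entries; a real mapping would be loaded from a dictionary file
ABBREVIATIONS = {
    "напр.": "наприклад",
    "тис.": "тисяч",
    "вул.": "вулиця",
}


def expand_abbreviations(text: str, mapping: dict[str, str]) -> str:
    """Replace each known abbreviation with its expansion."""
    if not mapping:
        return text
    # Longest keys first so a longer abbreviation wins over a shorter prefix
    keys = sorted(mapping, key=len, reverse=True)
    # (?<!\w) prevents matching in the middle of a longer word
    pattern = re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


print(expand_abbreviations("Місто має 300 тис. мешканців", ABBREVIATIONS))
```

A plain lookup ignores grammatical case, so the expanded form may not agree with its context. Choosing the right inflection would need morphological information, for example from a POS tag dictionary.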
awesome-ukrainian-nlp / 1. Datasets / Corpora / Prompts | |||
| Aya | — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts | ||
awesome-ukrainian-nlp / 2. Tools | |||
| tree_stem | 28 | almost 2 years ago | — stemmer |
| pymorphy2 | 1,114 | 3 months ago | + — POS tagger and lemmatizer |
| LanguageTool | — grammar, style and spell checker | ||
| Stanza | — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER | ||
| nlp-uk | 70 | 16 days ago | — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation |
| NLP-Cube | 550 | 6 months ago | Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing |
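As a rough idea of what text normalization and tokenization involve for Ukrainian, the sketch below unifies apostrophe variants and then splits text with a regular expression. It is a naive standard-library illustration, not the approach of any tool listed above; `normalize_apostrophes`, `tokenize`, and the chosen set of apostrophe characters are assumptions.

```python
import re

# Apostrophe look-alikes often found in Ukrainian text:
# U+2019 (right single quote), U+02BC (modifier letter apostrophe), backtick
APOSTROPHE_RE = re.compile("[\u2019\u02bc`]")

# A token is a run of letters with internal apostrophes or hyphens,
# a run of digits, or a single punctuation character
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*|\d+|[^\w\s]")


def normalize_apostrophes(text: str) -> str:
    """Map apostrophe variants to the ASCII apostrophe."""
    return APOSTROPHE_RE.sub("'", text)


def tokenize(text: str) -> list[str]:
    """Split normalized text into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(normalize_apostrophes(text))


print(tokenize("М\u02bcята коштує 25 грн."))
```

Normalizing before tokenizing keeps words such as "м'ята" or "будь-який" as single tokens regardless of which apostrophe character the source text used.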
awesome-ukrainian-nlp / 3. Pretrained models / Language models | |||
| aya-101 | — massively multilingual LM, 13B parameters | ||
| pythia-uk | — mT5 finetuned on wiki and oasst1 for chats in Ukrainian | ||
| UAlpaca | 78 | 3 months ago | — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset |
| XGLM | 30,200 | 19 days ago | — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian |
| Tereveni-AI/GPT-2 | |||
| uk4b | 16 | about 1 year ago | and - GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books |
| xlm-roberta-base-uk | — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left | ||
| youscan/ukr-roberta-base | |||
| Electra | |||
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation | |||
| Helsinki-NLP / OPUS-MT models | 29 | over 2 years ago | — Ukrainian to/from 25 languages |
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation / Helsinki-NLP / OPUS-MT models | |||
| OPUS-MT models at HuggingFace | |||
| OPUS-MT models evaluated on flores101 | 29 | over 2 years ago | |
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation | |||
| M2M-100 | 30,200 | 19 days ago | — Ukrainian to/from 100 languages |
awesome-ukrainian-nlp / 3. Pretrained models / Sequence-to-sequence models | |||
| mBART50 | 30,200 | 19 days ago | |
| mT5 | 1,240 | almost 2 years ago | |
awesome-ukrainian-nlp / 3. Pretrained models / Named-entity recognition (NER) | |||
| MITIE NER Model | |||
| ukr-models/uk-ner | |||
| lang-uk/flair-uk-ner | |||
| dchaplinsky/uk_ner_web_trf_large | |||
awesome-ukrainian-nlp / 3. Pretrained models / Part-of-speech tagging (POS) | |||
| lang-uk/flair-uk-pos | |||
awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings / fastText | |||
| Official fastText trained on CommonCrawl and Wiki | — 157 languages, including Ukrainian | ||
| Older official fastText trained on Wiki | 25,847 | 6 months ago | — 294 languages, including Ukrainian |
| fastText_multilingual | 1,193 | over 1 year ago | — 78 languages, aligned to the same vector space |
| fasttext_uk (2023) | and — trained on UberText 2.0 | ||
awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings | |||
| Word2Vec | |||
| GloVe | |||
| LexVec | |||
| BPEmb: Subword Embeddings, includes Ukrainian | easy to use with | ||
| Flair | 13,814 | 11 days ago | — added in 2022 |
awesome-ukrainian-nlp / 3. Pretrained models / Other | |||
| uk-punctcase | — punctuation and case restoration model based on XLM-RoBERTa-Uk | ||
| punctuation_uk_bert | — another punctuation and case restoration model based on bert-base-multilingual-cased | ||
| ukrainian-word-stress | 42 | 8 months ago | — adds word stress |
awesome-ukrainian-nlp / 4. Paid | |||
| LORELEI Ukrainian Representative Language Pack | Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities | ||
awesome-ukrainian-nlp / 5. Other resources and links | |||
| Helsinki-NLP/ UkrainianLT | 29 | over 2 years ago | — another collection of links to Ukrainian language tools |
| egorsmkv / speech-recognition-uk | 333 | about 1 month ago | — speech recognition and text-to-speech models and datasets |
awesome-ukrainian-nlp / 6. Workshops and conferences | |||
| Ukrainian Natural Language Processing Workshop | |||
awesome-ukrainian-nlp / 6. Workshops and conferences / UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian | |||
| Training data and evaluation scripts | 6 | over 1 year ago | |
| Public leaderboard | |||
awesome-ukrainian-nlp / 6. Workshops and conferences | |||
| UNLP 2024 shared task | 13 | 6 months ago | — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian |