awesome-ukrainian-nlp

NLP toolkit

A curated collection of Ukrainian NLP resources for research and development.

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

GitHub

168 stars

11 watching

14 forks

last commit: 10 months ago

Linked from 1 awesome list

awesome-list, datasets, natural-language-processing, nlp, ukrainian, ukrainian-nlp

awesome-ukrainian-nlp / 1. Datasets / Corpora / Monolingual
Malyuk			— 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News
Brown-UK	110	11 months ago	— carefully curated corpus of modern Ukrainian language with disambiguated tokens, 1 million words
UberText 2.0			— over 5 GB of news, Wikipedia, social, fiction, and legal texts
Wikipedia
OSCAR			— shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28GB deduplicated
CC-100			— documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences or 10GB of deduplicated text
mC4	11,774	over 2 years ago	— filtered CommonCrawl again, 196GB of Ukrainian text
Ukrainian Twitter corpus	15	about 6 years ago	Ukrainian Twitter corpus for toxic text detection
Ukrainian forums	3	about 8 years ago	— 250k sentences scraped from forums
Ukrainian news headlines			— 5.2M news headlines
awesome-ukrainian-nlp / 1. Datasets / Corpora / Parallel
OPUS
Tatoeba MT Challenge data sets	811	12 months ago
Polish-Ukrainian Parallel Corpus
Back-translated monolingual Wiki data	811	12 months ago
Wiki Edits			— 5M sentence edits extracted from the Ukrainian Wikipedia revision history
awesome-ukrainian-nlp / 1. Datasets / Corpora / Labeled
ZNO			— ~4000 questions and answers from Ukrainian External independent testing (ЗНО/ZNO)
UA-GEC	255	over 1 year ago	— grammatical error correction (GEC) and fluency corpus
NER-uk	90	over 1 year ago	— Brown-UK labeled for named entities
Yakaboo Book Reviews			— book reviews, ratings and descriptions
Universal Dependencies	27	9 months ago	— dependency trees corpus
ua-news	57	about 1 year ago	— 150k news articles in 5 categories
UA-SQuAD	57	about 1 year ago	— Ukrainian version of Stanford Question Answering Dataset
Ukrainian Winograd schema challenge (WSC) Dataset	7	over 1 year ago	— manually translated
Ukrainian OntoNotes Dataset	7	over 1 year ago	— scripts to build large silver dataset for coreference resolution
awesome-ukrainian-nlp / 1. Datasets / Corpora / Dictionaries
ВЕСУМ	560	8 months ago	— POS tag dictionary. Can generate a list of all word forms valid for spelling
Tonal dictionary	47	almost 9 years ago
Multilingualsentiment, includes Ukrainian			a list of positive/negative words
obscene-ukr	17	about 4 years ago	— profanity dictionary
Word stress dictionary	19	10 months ago	— word stress for 2.7M word forms. See
Heteronyms	3	about 3 years ago	— words that share the same spelling but have different meaning/pronunciation
Abbreviations	3	over 3 years ago	— map abbreviation to expansion
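
A word list of positive/negative terms, like the sentiment dictionaries above, is typically used by counting matches in a text. The following is a minimal sketch of that idea, not code from any of the listed projects. The two tiny lexicons and the function names are hypothetical and exist only for illustration.

```python
import re

# Hypothetical toy lexicons (assumption): a real list would be loaded
# from one of the sentiment dictionaries listed above.
POSITIVE = {"добрий", "чудовий", "гарний"}
NEGATIVE = {"поганий", "жахливий"}


def tokenize(text):
    """Lowercase the text and return runs of letters, allowing apostrophes inside words."""
    return re.findall(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*", text.lower())


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    tokens = tokenize(text)
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    return positive_hits - negative_hits


print(sentiment_score("Чудовий фільм, гарний сюжет"))
print(sentiment_score("Жахливий день"))
```

This sketch matches exact word forms only. Ukrainian is highly inflected, so a practical version would probably lemmatize tokens first, for example with one of the tools in section 2.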
awesome-ukrainian-nlp / 1. Datasets / Corpora / Prompts
Aya			— crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts
awesome-ukrainian-nlp / 2. Tools
tree_stem	28	over 2 years ago	— stemmer
pymorphy2	1,127	about 1 year ago	+ — POS tagger and lemmatizer
LanguageTool			— grammar, style and spell checker
Stanza			— Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
nlp-uk	72	9 months ago	— Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
NLP-Cube	555	9 months ago	Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing
awesome-ukrainian-nlp / 3. Pretrained models / Language models
aya-101			— massively multilingual LM, 13B parameters
pythia-uk			— mT5 finetuned on wiki and oasst1 for chats in Ukrainian
UAlpaca	86	about 1 year ago	— Llama fine-tuned for instruction following on the machine-translated Alpaca dataset
XGLM	30,675	10 months ago	— multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian
Tereveni-AI/GPT-2
uk4b	18	almost 2 years ago	and - GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books
xlm-roberta-base-uk			— truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left
youscan/ukr-roberta-base
Electra
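
The xlm-roberta-base-uk entry above describes a multilingual model truncated so that only Ukrainian and English embeddings remain. The general idea of vocabulary truncation can be sketched as below. This is a toy illustration on plain Python lists, not the procedure used to build that model; `truncate_embeddings` and all data in the example are hypothetical.

```python
def truncate_embeddings(vocab, embeddings, keep_tokens):
    """Keep only the embedding rows for tokens in keep_tokens.

    vocab: dict mapping token -> row index in embeddings
    embeddings: list of rows (one vector per token)
    keep_tokens: iterable of tokens to retain
    Returns (new_vocab, new_embeddings) with row indices renumbered from 0,
    preserving the original relative order of the kept tokens.
    """
    kept = sorted((vocab[token], token) for token in set(keep_tokens) if token in vocab)
    new_vocab = {token: new_id for new_id, (_, token) in enumerate(kept)}
    new_embeddings = [embeddings[old_id] for old_id, _ in kept]
    return new_vocab, new_embeddings


# Hypothetical five-token vocabulary with one-dimensional "embeddings".
vocab = {"<s>": 0, "the": 1, "кіт": 2, "chat": 3, "дім": 4}
embeddings = [[0.0], [0.1], [0.2], [0.3], [0.4]]

small_vocab, small_embeddings = truncate_embeddings(
    vocab, embeddings, ["<s>", "the", "кіт", "дім"]
)
print(small_vocab)
print(small_embeddings)
```

Because most parameters of a multilingual model with a large vocabulary sit in the embedding matrix, dropping unused rows shrinks the model while leaving the transformer layers unchanged.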
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation
Helsinki-NLP / OPUS-MT models	30	over 3 years ago	— Ukrainian to/from 25 languages
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation / Helsinki-NLP / OPUS-MT models
OPUS-MT models at HuggingFace
OPUS-MT models evaluated on flores101	30	over 3 years ago
awesome-ukrainian-nlp / 3. Pretrained models / Machine translation
M2M-100	30,675	10 months ago	— Ukrainian to/from 100 languages
Uk-En folktale corpus	0	almost 3 years ago	— small sentence-aligned corpus of fairy tales
awesome-ukrainian-nlp / 3. Pretrained models / Sequence-to-sequence models
mBART50	30,675	10 months ago
mT5	1,251	over 2 years ago
awesome-ukrainian-nlp / 3. Pretrained models / Named-entity recognition (NER)
MITIE NER Model
ukr-models/uk-ner
lang-uk/flair-uk-ner
dchaplinsky/uk_ner_web_trf_large
awesome-ukrainian-nlp / 3. Pretrained models / Part-of-speech tagging (POS)
lang-uk/flair-uk-pos
awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings / fastText
Official fastText trained on CommonCrawl and Wiki			— 157 languages, including Ukrainian
Older official fastText trained on Wiki	25,979	over 1 year ago	— 294 languages, including Ukrainian
fastText_multilingual	1,197	over 2 years ago	— 78 languages, aligned to the same vector space
fasttext_uk (2023)			and — trained on UberText 2.0
awesome-ukrainian-nlp / 3. Pretrained models / Word embeddings
Word2Vec
GloVe
LexVec
BPEmb: Subword Embeddings, includes Ukrainian			easy to use with
Flair	13,990	8 months ago	— added in 2022
awesome-ukrainian-nlp / 3. Pretrained models / Other
uk-punctcase			— punctuation and case restoration model based on XLM-RoBERTa-Uk
punctuation_uk_bert			— another punctuation and case restoration model based on bert-base-multilingual-cased
ukrainian-word-stress	45	10 months ago	— adds word stress
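
Adding word stress usually means placing a combining acute accent (U+0301) after the stressed vowel. The sketch below shows that marking step with a lookup table. It is not the ukrainian-word-stress API; the `STRESS` table, its format (index of the stressed vowel, counting from 0), and the function names are assumptions made for illustration.

```python
VOWELS = set("аеєиіїоуюяАЕЄИІЇОУЮЯ")
ACUTE = "\u0301"  # combining acute accent

# Hypothetical lexicon (assumption): word -> 0-based index of the stressed vowel.
STRESS = {"мама": 0, "вода": 1}


def mark_stress(word, stressed_vowel):
    """Insert a combining acute accent after the given vowel of the word."""
    seen = -1
    for position, char in enumerate(word):
        if char in VOWELS:
            seen += 1
            if seen == stressed_vowel:
                return word[: position + 1] + ACUTE + word[position + 1 :]
    raise ValueError(f"{word!r} has no vowel number {stressed_vowel}")


def stress_word(word, lexicon=STRESS):
    """Mark stress if the word is in the lexicon; otherwise return it unchanged."""
    index = lexicon.get(word.lower())
    return word if index is None else mark_stress(word, index)


print(stress_word("мама"))
print(stress_word("вода"))
```

A plain lookup like this cannot choose between heteronyms, which share a spelling but differ in stress; resolving those needs context.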
awesome-ukrainian-nlp / 4. Paid
LORELEI Ukrainian Representative Language Pack			Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities
awesome-ukrainian-nlp / 5. Other resources and links
Helsinki-NLP/ UkrainianLT	30	over 3 years ago	— another collection of links to Ukrainian language tools
egorsmkv / speech-recognition-uk	344	8 months ago	— speech recognition and text-to-speech models and datasets
awesome-ukrainian-nlp / 6. Workshops and conferences
Ukrainian Natural Language Processing Workshop
awesome-ukrainian-nlp / 6. Workshops and conferences / UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian
Training data and evaluation scripts	7	over 2 years ago
Public leaderboard
awesome-ukrainian-nlp / 6. Workshops and conferences
UNLP 2024 shared task	13	over 1 year ago	— shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian

Backlinks from these awesome lists:

keon/awesome-nlp