awesome-nlp-polish

NLP dataset hub

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.

GitHub

294 stars
28 watching
34 forks
last commit: over 3 years ago
Linked from 2 awesome lists

datasets, nlp, nlp-machine-learning, polish-language

awesome-nlp-polish / Polish text datasets / Task-oriented datasets

The KLEJ (Kompleksowa Lista Ewaluacji Językowych) benchmark is a set of nine evaluation tasks for Polish language understanding.

awesome-nlp-polish / Polish text datasets / Task-oriented datasets / PolEval datasets

PolEval 2019 Task 6, hate speech classification: distinguish between normal/non-harmful tweets (class: 0) and tweets that contain any kind of harmful information (class: 1)
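Since the task above is a binary classification (class 0 vs. class 1), predictions can be scored with precision, recall and F1 on the harmful class. The following is a minimal sketch, not the official PolEval scorer: the function name `binary_scores` and the example labels are made up for illustration, and the choice of F1 on class 1 as the metric is an assumption.

```python
def binary_scores(gold, predicted, positive=1):
    """Return (precision, recall, f1) for the `positive` class.

    `gold` and `predicted` are equal-length sequences of 0/1 labels,
    where 1 marks a harmful tweet and 0 a non-harmful one.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, invented for this example only.
gold_labels = [0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1]
precision, recall, f1 = binary_scores(gold_labels, predicted_labels)
```

Harmful tweets are typically the minority class, so accuracy alone can look good for a classifier that always predicts class 0; reporting the scores for class 1 avoids that.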

awesome-nlp-polish / Polish text datasets / Task-oriented datasets

Polish CDSCorpus: a dataset for compositional distributional semantics. It consists of 10K Polish sentence pairs that are human-annotated for semantic relatedness and entailment
Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS): a corpus of Polish reviews annotated with sentiment at the level of the whole text and at the level of sentences, for the following domains: hotels, medicine, products and university (reviews*)
Ermlab Opineo dataset (27 stars, last commit almost 2 years ago): Opineo reviews
HateSpeech corpus (http://zil.ipipan.waw.pl/HateSpeech): contains over 2,000 posts crawled from the public Polish web
Polish analogy dataset: example: "Ateny Grecja Bagdad Irak"; useful for evaluating word embeddings
NKJP: National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available (GNU GPL v.3). Direct contact may be necessary to get the full corpus
PolEmo 2.0: Sentiment Analysis Dataset for CoNLL
Polish Music Dataset: the largest dataset with information about artists, songs and lyrics in Poland (currently only Hip Hop artists)
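An analogy line such as "Ateny Grecja Bagdad Irak" is usually scored with the vector-offset method: the vector for Grecja minus Ateny plus Bagdad should land closest to Irak. The sketch below illustrates that idea under stated assumptions: the 2-dimensional vectors are toy values invented for the example (a real evaluation would load trained Polish embeddings), and the helper names `solve_analogy` and `analogy_accuracy` are hypothetical.

```python
import math

# Toy vectors, invented for illustration only.
TOY_VECTORS = {
    "Ateny": [1.0, 0.0],
    "Grecja": [1.0, 1.0],
    "Bagdad": [3.0, 0.0],
    "Irak": [3.0, 1.0],
    "Warszawa": [-2.0, 0.5],
    "Polska": [-2.0, 1.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


def analogy_accuracy(lines, vectors):
    """Share of 'a b c d' lines where the predicted fourth word equals d.

    Lines containing out-of-vocabulary words are skipped.
    """
    correct = 0
    total = 0
    for line in lines:
        a, b, c, expected = line.split()
        if any(w not in vectors for w in (a, b, c, expected)):
            continue
        total += 1
        if solve_analogy(a, b, c, vectors) == expected:
            correct += 1
    return correct / total if total else 0.0


accuracy = analogy_accuracy(["Ateny Grecja Bagdad Irak"], TOY_VECTORS)
```

Excluding the three query words from the candidates is the common convention in analogy evaluation; without it the nearest neighbour is often one of the inputs.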

awesome-nlp-polish / Polish text datasets / Raw texts

Polish OpenSubtitles v2018: 45.9M sentences, 287.1M Polish tokens; collection of translated movie subtitles from
ParaCrawl v5: 6.4M sentences, 157.1M Polish tokens

awesome-nlp-polish / Models and Embeddings / Polish Transformer models

Polish RoBERTa Model (323 stars, last commit 6 months ago): trained on a corpus consisting of a Polish Wikipedia dump, Polish books and articles, and the Polish Parliamentary Corpus
PoLitBert (33 stars, last commit over 3 years ago): Polish RoBERTa model trained on Polish Wikipedia, Polish literature and Oscar. The main assumption is that quality text will give a good model
PolBert (70 stars, last commit about 4 years ago): Polish BERT model, trained with the code provided in Google BERT's GitHub repository. Merge with
Allegro HerBERT (65 stars, last commit almost 3 years ago): Polish BERT model trained on Polish corpora using only the MLM objective with dynamic masking of whole words
SlavicBert, a multilingual BERT model (73 stars, last commit almost 3 years ago): BERT, Slavic Cased: 4 languages (Bulgarian, Czech, Polish, Russian), 12-layer, 768-hidden, 12-heads, 110M parameters, 600 MB. There is also another SlavicBert model, but I had problems converting it to PyTorch
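The HerBERT entry mentions an MLM objective with dynamic masking of whole words. As a rough illustration of that idea (not HerBERT's actual training code), the sketch below groups subword pieces into words and masks every piece of a selected word, re-sampling the selection on each call. It assumes WordPiece-style "##" continuation markers, which may differ from HerBERT's tokenizer, and it skips the usual 80/10/10 replacement scheme; the function names and the example tokens are hypothetical.

```python
import random

MASK = "[MASK]"


def group_whole_words(tokens):
    """Group token indices so that '##' continuation pieces join the preceding word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def mask_whole_words(tokens, mask_prob=0.15, rng=None):
    """Mask whole words with probability `mask_prob` each.

    Returns (masked_tokens, labels); labels hold the original piece at
    masked positions and None elsewhere. Calling this again on the same
    sequence draws a fresh mask, which is the 'dynamic' part.
    """
    rng = rng or random
    masked = list(tokens)
    labels = [None] * len(tokens)
    for group in group_whole_words(tokens):
        if rng.random() < mask_prob:
            for i in group:
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels


# Hypothetical tokenization of "Warszawa jest piękna", for illustration only.
pieces = ["war", "##sza", "##wa", "jest", "pięk", "##na"]
masked_pieces, target_labels = mask_whole_words(pieces, rng=random.Random(0))
```

Static masking, by contrast, would compute the mask once during preprocessing and reuse it in every epoch.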

awesome-nlp-polish / Models and Embeddings / Other models

ELMo embeddings: a model of ELMo embeddings for the Polish language trained on large textual corpora (KGR10)
Zalando Flair Polish models (13,939 stars, last commit 6 days ago): contextual string embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. There are two models: "pl-forward" and "pl-backward"
IPIPAN Word2vec Polish models
Wrocław University of Science and Technology Word2Vec: distributional language models for Polish trained on different corpora (KGR10, NKJP, Wikipedia)
Common Crawl (25,945 stars, last commit 8 months ago): FastText Polish model FB - train on: ,
FastText KGR10 Polish model binary
Universal Sentence Encoder: multilingual sentence embeddings covering 16 languages (including Polish)
BPEmb: Subword Embeddings: includes Polish, easy to use with
ULMFiT for Tensorflow 2.0: this collection contains ULMFiT recurrent language models trained on Wikipedia dumps for English and Polish. The models themselves were trained using FastAI and then exported to a TensorFlow-usable format. Code is available on

awesome-nlp-polish / Papers, articles, blog posts

Benchmarks of some Polish NLP tools: single-word lemmatization and morphological analysis, multi-word lemmatization, disambiguated POS tagging, dependency parsing, shallow parsing, named entity recognition, summarization, etc.
https://github.com/sdadas/polish-nlp-resources (323 stars, last commit 6 months ago): GitHub repo with a list of Polish word embeddings and language models (Word2vec, fastText, GloVe, ELMo)
Polish Word Embeddings Review (4 stars, last commit almost 4 years ago): evaluation of Polish word embeddings (word2vec, fastText, etc.) prepared by various research groups. Evaluation is done with a word analogy task
Polish Sentence Evaluation (22 stars, last commit almost 2 years ago): contains an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks
TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE: a complete user guide for training a RoBERTa model for Polish with Huggingface/Transformers
