awesome-nlp-polish

NLP dataset hub

A curated collection of NLP resources and datasets for the Polish language.

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.

GitHub

293 stars

28 watching

34 forks

last commit: almost 5 years ago

Linked from 2 awesome lists

datasets, nlp, nlp-machine-learning, polish-language

awesome-nlp-polish / Polish text datasets / Task oriented datasets
The KLEJ (Kompleksowa Lista Ewaluacji Językowych) benchmark is a set of nine evaluation tasks for Polish language understanding.
awesome-nlp-polish / Polish text datasets / Task oriented datasets / PolEval datasets
PolEval 2019 Task6			Hate speech classification: distinguish between normal/non-harmful tweets (class: 0) and tweets that contain any kind of harmful information (class: 1)
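
A binary task like the one above (class 0 for harmless tweets, class 1 for harmful ones) is often scored with precision, recall and F1 on the harmful class. The following is a minimal sketch of that computation, not the official PolEval scorer; the function name `binary_prf` and the toy label lists are made up for illustration.

```python
def binary_prf(gold, pred, positive=1):
    """Return (precision, recall, F1) for the `positive` class.

    `gold` and `pred` are equal-length sequences of 0/1 labels.
    This is an illustrative helper, not the official PolEval evaluation script.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (hypothetical): 0 = non-harmful tweet, 1 = harmful tweet.
gold_labels = [0, 0, 1, 1, 0, 1]
pred_labels = [0, 1, 1, 0, 0, 1]

precision, recall, f1 = binary_prf(gold_labels, pred_labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting F1 on class 1 rather than plain accuracy is a reasonable choice when harmful tweets are a minority, since a model that always predicts class 0 would otherwise look deceptively good.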
awesome-nlp-polish / Polish text datasets / Task oriented datasets
Polish CDSCorpus			A dataset for compositional distributional semantics. Polish CDSCorpus consists of 10K Polish sentence pairs that are human-annotated for semantic relatedness and entailment
Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS)			Corpus of Polish reviews annotated with sentiment at the level of the whole text and at the level of sentences, for the following domains: hotels, medicine, products and university (reviews*)
Ermlab Opineo dataset	27	over 3 years ago	Opineo reviews
http://zil.ipipan.waw.pl/HateSpeech			The HateSpeech corpus contains over 2,000 posts crawled from the public Polish web
Polish analogy dataset			example: "Ateny Grecja Bagdad Irak" - useful for word embeddings evaluation
NKJP			National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available (GNU GPL v.3). Direct contact may be necessary to get the full corpus
PolEmo 2.0 Sentiment Analysis Dataset for CoNLL
Polish Music Dataset			Polish Music Dataset is the largest dataset with information about artists, songs and lyrics in Poland (now only Hip Hop artists)
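
The analogy dataset listed above gives examples such as "Ateny Grecja Bagdad Irak" for evaluating word embeddings. One common way to use such a line is the vector-offset method: compute `b - a + c` and check whether the nearest remaining word is the expected fourth one. The sketch below assumes each line holds four whitespace-separated words and uses tiny hand-made vectors (`toy_vectors`) in place of real embeddings; both are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates, as is usual.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, lines):
    """Fraction of analogy lines answered correctly; lines with unknown words are skipped."""
    correct = total = 0
    for line in lines:
        a, b, c, expected = line.split()
        if not all(w in vectors for w in (a, b, c, expected)):
            continue
        total += 1
        if solve_analogy(vectors, a, b, c) == expected:
            correct += 1
    return correct / total if total else 0.0


# Hand-made 4-dimensional toy vectors (hypothetical, not real embeddings):
# dimensions roughly mean [Greece, Iraq, Poland, is-a-country].
toy_vectors = {
    "Ateny": [1.0, 0.0, 0.0, 0.0],
    "Grecja": [1.0, 0.0, 0.0, 1.0],
    "Bagdad": [0.0, 1.0, 0.0, 0.0],
    "Irak": [0.0, 1.0, 0.0, 1.0],
    "Warszawa": [0.0, 0.0, 1.0, 0.0],
    "Polska": [0.0, 0.0, 1.0, 1.0],
}

analogy_lines = ["Ateny Grecja Bagdad Irak", "Ateny Grecja Warszawa Polska"]
accuracy = analogy_accuracy(toy_vectors, analogy_lines)
print(f"analogy accuracy: {accuracy:.2f}")
```

With real embeddings, `toy_vectors` would be replaced by a lookup into the trained model, and the candidate set would be the model's full vocabulary.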
awesome-nlp-polish / Polish text datasets / Raw texts
Polish OpenSubtitles v2018			45.9M sentences, 287.1M Polish tokens; collection of translated movie subtitles
ParaCrawl v5			6.4M sentences, 157.1M Polish tokens
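
The sentence and token counts listed for the two raw corpora imply very different average sentence lengths, which may matter when choosing pretraining data. The short calculation below uses only the figures quoted above; the variable names are illustrative.

```python
# (sentences, Polish tokens) as quoted in the list above.
corpora = {
    "Polish OpenSubtitles v2018": (45.9e6, 287.1e6),
    "ParaCrawl v5": (6.4e6, 157.1e6),
}

# Average number of tokens per sentence for each corpus.
avg_tokens = {
    name: tokens / sentences for name, (sentences, tokens) in corpora.items()
}

for name, avg in avg_tokens.items():
    print(f"{name}: {avg:.1f} tokens per sentence")
```

This works out to roughly 6.3 tokens per sentence for the subtitles and about 24.5 for ParaCrawl, consistent with subtitles being short conversational lines.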
awesome-nlp-polish / Models and Embeddings / Polish Transformer models
Polish Roberta Model	329	about 2 years ago	model was trained on a corpus consisting of Polish Wikipedia dump, Polish books and articles, Polish Parliamentary Corpus
PoLitBert	33	about 5 years ago	Polish RoBERTa model trained on Polish Wikipedia, Polish literature and OSCAR. The main assumption is that high-quality text yields a good model
PolBert	70	almost 6 years ago	Polish BERT model. Model was trained with code provided in Google BERT's github repository. Merge with
Allegro HerBERT	65	over 4 years ago	Polish BERT model trained on Polish Corpora using only MLM objective with dynamic masking of whole words
SlavicBert - multilingual BERT model	73	over 4 years ago	BERT, Slavic Cased: 4 languages (Bulgarian, Czech, Polish, Russian), 12-layer, 768-hidden, 12-heads, 110M parameters, 600 MB. There is also another SlavicBert model, but I have problems converting it to PyTorch
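
The HerBERT entry above mentions an MLM objective with dynamic masking of whole words. As a rough illustration of that idea (not HerBERT's actual training code), the sketch below masks all subword pieces of a word together and draws a fresh mask on every call, so each epoch sees a different masking. The WordPiece-style `##` continuation prefix, the `[MASK]` string, the example tokenization and the 15% default rate are assumptions chosen for the example, and the usual 80/10/10 replacement scheme is omitted for brevity.

```python
import random

MASK = "[MASK]"  # placeholder mask token, assumed for this sketch


def group_whole_words(tokens):
    """Group token indices into words; a '##' prefix marks a continuation piece."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def mask_whole_words(tokens, mask_prob=0.15, rng=None):
    """Mask each whole word with probability `mask_prob`.

    Returns (masked_tokens, labels), where labels[i] is the original token at
    masked positions and None elsewhere. Calling this again on the same input
    yields a new random mask, which is what makes the masking dynamic.
    """
    rng = rng or random
    masked = list(tokens)
    labels = [None] * len(tokens)
    for group in group_whole_words(tokens):
        if rng.random() < mask_prob:
            for i in group:  # every piece of the chosen word is masked together
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels


# Hypothetical subword tokenization of "Warszawa jest stolicą Polski".
tokens = ["War", "##sza", "##wa", "jest", "stol", "##icą", "Polski"]

for epoch in range(3):
    masked, _ = mask_whole_words(tokens, mask_prob=0.5)
    print(epoch, masked)
```

Static masking would instead compute the mask once during preprocessing and reuse it in every epoch.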
awesome-nlp-polish / Models and Embeddings / Other models
ELMo embeddings			A model of ELMo embeddings for the Polish language trained on large textual corpora (KGR10)
Zalando Flair Polish models	13,990	over 1 year ago	Contextual string embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. There are two models: "pl-forward" and "pl-backward"
IPIPAN Word2vec Polish models
Wrocław University of Science and Technology Word2Vec			Distributional language models for Polish trained on different corpora (KGR10, NKJP, Wikipedia)
Common Crawl	25,979	over 2 years ago	FastText Polish model (FB) - trained on: ,
FastText KGR10 Polish model binary
Universal Sentence Encoder Multilingual			Sentence embeddings covering 16 languages (including Polish)
BPEmb: Subword Embeddings, includes Polish			easy to use with
ULMFiT for Tensorflow 2.0			this collection contains ULMFiT recurrent language models trained on Wikipedia dumps for English and Polish. The models themselves were trained using FastAI and then exported to a TensorFlow-usable format. Code is available on
awesome-nlp-polish / Papers, articles, blog posts
Benchmarks of some Polish NLP tools			Single-word lemmatization and morphological analysis, multi-word lemmatization, disambiguated POS tagging, dependency parsing, shallow parsing, named entity recognition, summarization etc.
https://github.com/sdadas/polish-nlp-resources	329	about 2 years ago	GitHub repo with a list of Polish word embeddings and language models (Word2vec, fastText, GloVe, ELMo)
Polish Word Embeddings Review	4	over 5 years ago	Evaluation of Polish word embeddings (word2vec, fastText etc.) prepared by various research groups. Evaluation is done on a word analogy task
Polish Sentence Evaluation	22	over 3 years ago	Contains an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks
TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE			Complete user guide for training a RoBERTa model with Huggingface/Transformers for Polish
