awesome-nlp-polish
NLP dataset hub
A curated collection of NLP resources and datasets for the Polish language.
A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.
last commit: over 4 years ago
awesome-nlp-polish / Polish text datasets / Task-oriented datasets
| The KLEJ (Kompleksowa Lista Ewaluacji Językowych) benchmark is a set of nine evaluation tasks for Polish language understanding. | |||
awesome-nlp-polish / Polish text datasets / Task-oriented datasets / PolEval datasets
| PolEval 2019 Task 6 | Hate speech classification: distinguish between normal/non-harmful tweets (class: 0) and tweets that contain any kind of harmful information (class: 1) | ||
awesome-nlp-polish / Polish text datasets / Task-oriented datasets
| Polish CDSCorpus | A dataset for compositional distributional semantics. Polish CDSCorpus consists of 10K Polish sentence pairs that are human-annotated for semantic relatedness and entailment | ||
| Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS) | Corpus of Polish reviews annotated with sentiment at the level of the whole text and at the level of sentences for the following domains: hotels, medicine, products and university (reviews) | ||
| Ermlab Opineo dataset | 27 | about 3 years ago | Opineo reviews |
| http://zil.ipipan.waw.pl/HateSpeech | The HateSpeech corpus contains over 2000 posts crawled from the public Polish web | ||
| Polish analogy dataset | example: "Ateny Grecja Bagdad Irak" - useful for word embeddings evaluation | ||
| NKJP | National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available (GNU GPL v.3). Direct contact may be necessary to get the full corpus | ||
| PolEmo 2.0 Sentiment Analysis Dataset for CoNLL | |||
| Polish Music Dataset | Polish Music Dataset is the largest dataset with information about artists, songs and lyrics in Poland (now only Hip Hop artists) | ||
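The Polish analogy dataset entry above gives the example "Ateny Grecja Bagdad Irak". A minimal sketch of how such a line can be scored against word embeddings follows, using the common vector-offset approach (b - a + c, then nearest neighbour by cosine similarity). The `solve_analogy` helper and the two-dimensional toy vectors are made up for illustration; a real evaluation would load one of the embedding models listed further down.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    Hypothetical helper: builds the target vector b - a + c and picks the
    closest word by cosine similarity, excluding the three query words.
    """
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors invented for this example: the first axis separates the
# "places", the second axis encodes city (0) versus country (1).
toy_vectors = {
    "Ateny": [1.0, 0.0],
    "Grecja": [1.0, 1.0],
    "Bagdad": [3.0, 0.0],
    "Irak": [3.0, 1.0],
    "Warszawa": [5.0, 0.0],
    "Polska": [5.0, 1.0],
}

predicted = solve_analogy(toy_vectors, "Ateny", "Grecja", "Bagdad")
print(predicted)
```

An analogy line counts as correct when the predicted word equals the fourth word of the line; accuracy over the whole file is the usual summary number.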
awesome-nlp-polish / Polish text datasets / Raw texts
| Polish OpenSubtitles v2018 | 45.9M sentences, 287.1M Polish tokens; collection of translated movie subtitles | ||
| ParaCrawl v5 | 6.4M sentences, 157.1M Polish tokens | ||
awesome-nlp-polish / Models and Embeddings / Polish Transformer models
| Polish Roberta Model | 329 | over 1 year ago | model was trained on a corpus consisting of Polish Wikipedia dump, Polish books and articles, Polish Parliamentary Corpus |
| PoLitBert | 33 | over 4 years ago | Polish RoBERTa model trained on Polish Wikipedia, Polish literature and OSCAR. The major assumption is that high-quality text will give a good model |
| PolBert | 70 | about 5 years ago | Polish BERT model. The model was trained with code provided in Google BERT's GitHub repository. Merge with |
| Allegro HerBERT | 65 | almost 4 years ago | Polish BERT model trained on Polish Corpora using only MLM objective with dynamic masking of whole words |
| SlavicBert - multilingual BERT model | 73 | almost 4 years ago | BERT, Slavic Cased: 4 languages (Bulgarian, Czech, Polish, Russian), 12-layer, 768-hidden, 12-heads, 110M parameters, 600 MB. There is also another SlavicBert model, but I had problems converting it to PyTorch |
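The Allegro HerBERT entry above mentions an MLM objective with dynamic masking of whole words. The sketch below illustrates the general idea only and is not HerBERT's actual preprocessing: it assumes WordPiece-style tokens where a `##` prefix marks a continuation piece, and it always substitutes `[MASK]` instead of applying the 80/10/10 replacement scheme used in BERT-style training. The function names and the sample tokens are hypothetical.

```python
import random

MASK = "[MASK]"


def group_whole_words(tokens):
    """Group token indices into words; a '##' prefix continues the previous word.

    The '##' convention is an assumption made for this illustration.
    """
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words


def mask_whole_words(tokens, mask_prob=0.15, rng=random):
    """Mask each word with probability mask_prob, covering all of its pieces."""
    masked = list(tokens)
    for word in group_whole_words(tokens):
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = MASK
    return masked


# Dynamic masking: a fresh mask is drawn each time the sentence is seen,
# instead of fixing one mask per sentence during preprocessing.
sample = ["war", "##sza", "##wa", "jest", "stolic", "##ą", "polski"]
rng = random.Random(0)
for epoch in range(3):
    print(epoch, mask_whole_words(sample, mask_prob=0.3, rng=rng))
```

Because the decision is made per word rather than per token, a word is either fully visible or fully hidden, so the model cannot recover it from its own remaining pieces.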
awesome-nlp-polish / Models and Embeddings / Other models
| ELMo embeddings | A model of ELMo embeddings for the Polish language trained on large textual corpora (KGR10) | ||
| Zalando Flair Polish models | 13,990 | 12 months ago | Contextual string embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. There are two models: "pl-forward" and "pl-backward" |
| IPIPAN Word2vec Polish models | |||
| Wrocław University of Science and Technology Word2Vec | Distributional language models for Polish trained on different corpora (KGR10, NKJP, Wikipedia) | ||
| Common Crawl | 25,979 | over 1 year ago | FastText Polish model (FB) - trained on: , |
| FastText KGR10 Polish model binary | |||
| Universal Sentence Encoder Multilingual | Sentence embeddings covering 16 languages (including Polish) | ||
| BPEmb: Subword Embeddings (includes Polish) | easy to use with | ||
| ULMFiT for Tensorflow 2.0 | This collection contains ULMFiT recurrent language models trained on Wikipedia dumps for English and Polish. The models themselves were trained using FastAI and then exported to a TensorFlow-usable format. Code is available on | ||
awesome-nlp-polish / Papers, articles, blog posts
| Benchmarks of some Polish NLP tools | Single-word lemmatization and morphological analysis, multi-word lemmatization, disambiguated POS tagging, dependency parsing, shallow parsing, named entity recognition, summarization, etc. | ||
| https://github.com/sdadas/polish-nlp-resources | 329 | over 1 year ago | GitHub repo with a list of Polish word embeddings and language models (Word2vec, FastText, GloVe, ELMo) |
| Polish Word Embeddings Review | 4 | almost 5 years ago | Evaluation of Polish word embeddings (word2vec, fastText, etc.) prepared by various research groups. Evaluation is done with a word analogy task |
| Polish Sentence Evaluation | 22 | almost 3 years ago | Contains an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks |
| TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE | Complete user guide for training a RoBERTa model for Polish using Hugging Face Transformers | ||