awesome-hungarian-nlp
NLP toolkit
A curated collection of NLP resources and tools for Hungarian language processing.
224 stars
19 watching
18 forks
last commit: about 1 year ago
Linked from 2 awesome lists
Topics: awesome, awesome-list, computational-linguistics, corpus, corpus-linguistics, dataset, hungarian, hungarian-language, information-extraction, information-retrieval, named-entity-recognition, natural-language-processing, natural-language-understanding, nlp, nlp-resources, nlu, opinion-mining, parser, tagger, text-mining
Awesome NLP Resources for Hungarian / Tools / Word tokenization, sentence splitting | |||
huntoken | 3 | almost 10 years ago | Hungarian word and sentence splitter |
quntoken | 14 | over 2 years ago | New Hungarian tokenizer based on quex and huntoken |
Awesome NLP Resources for Hungarian / Tools / Morphology | |||
emMorph (Humor) | 14 | almost 3 years ago | Hungarian morphological analyzer based on Humor |
emMorphPy | 3 | about 5 years ago | A wrapper, a lemmatizer and REST API implemented in Python for the emMorph (Humor) Hungarian morphological analyzer |
hunmorph | is an open source tool and programming library for spell-checking, stemming and morphological analysis of agglutinative, German and other languages | ||
hunmorph-foma | 6 | over 8 years ago | Hungarian morphological analyzer and generator based on hunmorph |
hunspell | is an open-source spell-checker, stemmer and morphological analyzer | ||
lara-hungarian-nlp | 29 | over 5 years ago | LARA is a lightweight Python NLP library for ChatBots in Hungarian |
Lemmagen | project aims at providing a standardized open source multilingual platform for lemmatisation | ||
Simplemma | 144 | 7 days ago | is a simple multilingual lemmatizer for Python |
Awesome NLP Resources for Hungarian / Tools / PoS / Morphological taggers | |||
hunpos | Hunpos is an open source reimplementation of TnT, the well known part-of-speech tagger by Thorsten Brants | ||
PurePos | 15 | about 4 years ago | Open source morphological tagger based on HunPos |
purepos.py | 1 | almost 5 years ago | Python wrapper for PurePos |
Awesome NLP Resources for Hungarian / Tools / Taggers / Chunkers | |||
HunTag | 22 | almost 9 years ago | A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models |
HunTag3 | 8 | about 5 years ago | Improved version of the original HunTag |
SzegedNER | Named Entity Recognition tool for Hungarian and English | ||
DBpedia Spotlight | 756 | over 6 years ago | DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text |
emBERT | 2 | 7 months ago | is an emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's transformers package |
Awesome NLP Resources for Hungarian / Tools / Pipelines with Hungarian NLP components | |||
magyarlanc | A toolkit for the basic linguistic processing of Hungarian | ||
magyarlanc_spark | 4 | over 7 years ago | Spark wrapper for magyarlanc |
eszterland | 4 | about 1 year ago | Clojurized access to magyarlanc |
HuSpaCy | 155 | 24 days ago | Industrial-strength Hungarian Natural Language Processing |
huNLP | 11 | about 7 years ago | An experimental unified Java and REST API for magyarlanc and szegedNER |
hunlp-GATE | 8 | almost 6 years ago | GATE plugin containing Hungarian NLP tools as GATE processing resources |
Trendminer Hungarian Processing Pipeline | 5 | about 10 years ago | Hungarian NLP pipeline for social media text analysis (TrendMiner project) |
Google Syntaxnet | Neural Models of Syntax | ||
UDPipe | is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files | ||
polyglot | is a natural language pipeline that supports massive multilingual applications | ||
emtsv | 27 | 11 months ago | is a text processing system with inter-module communication via tsv + REST API |
Stanza | 7,294 | 5 days ago | is a Python NLP Library for Many Human Languages |
spaCy StanfordNLP | 725 | 3 months ago | wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline |
trankit | 736 | about 1 month ago | A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing |
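
UDPipe is described above as working on CoNLL-U files. As a rough illustration of what that format looks like, here is a minimal standard-library sketch that reads CoNLL-U text into per-sentence token dictionaries. The function name `read_conllu` and the sample sentence with its annotation are hypothetical and hand-written for this example; they are not the output of any tool in this list.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of token dicts per sentence (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Hand-written toy annotation of "A kutya ugat." (assumption, for illustration).
SAMPLE = "\n".join([
    "# text = A kutya ugat.",
    "\t".join(["1", "A", "a", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "kutya", "kutya", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["3", "ugat", "ugat", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        for tok in sent:
            print(tok["form"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

The sketch keeps every column as a string and does not treat multiword-token ranges or empty nodes specially; a real application would more likely rely on a dedicated CoNLL-U library.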
Awesome NLP Resources for Hungarian / Tools / Syntactic parsers | |||
hunpars | A rule based Hungarian syntactical analyzer | ||
HunParse | 4 | over 12 years ago | An NLTK-based parser using KR-style morphological annotation |
Anagramma Parser | 1 | about 6 years ago | A parser based on psycholinguistics principles |
benepar | 871 | almost 3 years ago | A high-accuracy parser with models for 11 languages, implemented in Python. Based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018 |
Awesome NLP Resources for Hungarian / Tools / Semantic analysis | |||
SentimentAnalysisHUN | 11 | about 8 years ago | is an open-source sentiment analysis tool for Hungarian language, written in Python |
hun-date-parser | 8 | 3 months ago | A tool for extracting datetime intervals from Hungarian sentences and turning datetime objects into Hungarian text |
mT5-small-HunSum-1 | SZTAKI HunSum-1 models | ||
Awesome NLP Resources for Hungarian / Tools / Other | |||
emLam | 3 | over 4 years ago | Preprocessing scripts for Hungarian Language Modeling |
pywnxml | 5 | over 6 years ago | Python3 API for WordNet XML (Hungarian WordNet / BalkaNet / VisDic format) |
Hun-appointment-chatbot | 7 | almost 2 years ago | A simple Hungarian chatbot for booking an appointment using the Rasa framework |
neural-punctuator | 48 | about 1 year ago | Automatic punctuation restoration with BERT models for English and Hungarian |
hunaccent | 15 | 3 months ago | Small Footprint Diacritic Restoration for Hungarian |
Diacritics_restoration | 4 | over 2 years ago | Lightweight Diacritics Restoration with Dilated Convolutional Neural Networks |
NYTK MT | 5 | over 1 year ago | NYTK Machine translation models |
syntax-augmentation-nmt | 7 | about 1 year ago | Syntax-based data augmentation for Hungarian-English machine translation |
anonymizer_hu | 1 | over 2 years ago | The Hungarian anonymization tool for CURLICAT |
Awesome NLP Resources for Hungarian / Language models / Word embeddings | |||
FastText Wikipedia | pre-trained word vectors for 90 languages, trained on Wikipedia using fastText | ||
FastText Common Crawl & Wikipedia | pre-trained word vectors for 157 languages, trained on Wikipedia and the Common Crawl using fastText's CBOW model | ||
FastText_multilingual | 1,197 | over 1 year ago | Multilingual word vectors in 78 languages |
polyglot vectors | polyglot embeddings trained on Wikipedia | ||
wordvectors | 2,215 | about 6 years ago | Pre-trained word2vec and fasttext word vectors on wikipedia of 30+ languages |
hunembed0.0 | A word2vec word embedding trained on the concatenation of the Hungarian Webcorpus and the Hungarian National Corpus in 600 dimensions with a cut-off of 10 words | ||
Szeged word vectors | Word embeddings (word2vec & fasttext) for Hungarian trained on 4.3 billion tokens | ||
questions-words-hu | Hungarian analogical questions following Mikolov et al | ||
Conceptnet Numberbatch | 1,295 | over 2 years ago | ConceptNet Numberbatch multi- and cross-lingual semantic word embeddings |
Multi-sense word embeddings | |||
BytePair Embeddings | pretrained Subword Embeddings, downloadable in many formats | ||
HuSpaCy 300d | 300d Floret embeddings trained on the Hungarian Webcorpus 2.0 | ||
HuSpaCy 100d | 100d Floret embeddings trained on the Hungarian Webcorpus 2.0 | ||
ELMo Representations | 1,463 | over 3 years ago | Deep contextualized word representation trained for many languages |
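
Many pre-trained vectors of the word2vec and fastText kind are distributed in a plain-text format: a header line with the vocabulary size and dimension, then one word per line followed by its values. The sketch below parses that format and answers a Mikolov-style analogy question (the kind collected in questions-words-hu) with vector arithmetic and cosine similarity. The helper names and the two-dimensional toy vectors are made up for illustration; real embeddings have hundreds of dimensions and are usually loaded with a dedicated library.

```python
import math
from typing import Dict, List

Vector = List[float]


def parse_vec(text: str) -> Dict[str, Vector]:
    """Parse word2vec-style text: 'count dim' header, then 'word v1 v2 ...'."""
    lines = text.strip().splitlines()
    _count, dim = (int(x) for x in lines[0].split())
    vectors: Dict[str, Vector] = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"{word!r}: expected {dim} values, got {len(values)}")
        vectors[word] = values
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(vectors: Dict[str, Vector], a: str, b: str, c: str) -> str:
    """Answer 'a is to b as c is to ?' via the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy 2-dimensional vectors invented for this example (not real embeddings).
TOY_VEC = """5 2
férfi 1.0 0.0
nő 1.0 1.0
király 3.0 0.0
királynő 3.0 1.0
alma 0.0 -1.0
"""

if __name__ == "__main__":
    vecs = parse_vec(TOY_VEC)
    print(analogy(vecs, "férfi", "nő", "király"))
```

Excluding the three query words from the candidates follows the usual evaluation convention for analogy questions.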
Awesome NLP Resources for Hungarian / Language models / Transformer models | |||
huBERT | Hungarian BERT base models trained on Webcorpus 2.0 and the Hungarian Wikipedia | ||
HIL* Transformer models | Pretrained transformer models provided by HILANCO | ||
PULI-BERT-Large | is a Hungarian BERT large model based on MegatronBERT | ||
PULI-GPT-2 | is a Hungarian GPT-2 model | ||
PULI-GPT-3SX | is a Hungarian GPT-NeoX model (6.7 billion parameters) | ||
Awesome NLP Resources for Hungarian / Datasets / Corpora | |||
Hungarian Webcorpus | With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license | ||
Hungarian Webcorpus 2.0 | The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words | ||
OSCAR | is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. (2339 million unique words) | ||
emLam | A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English | ||
Leipzig corpora | contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web | ||
web2corpus | Automatically created multilingual web corpus | ||
CC-100 | Monolingual Datasets from Web Crawl Data | ||
CoNLL 2017: Automatically Annotated Raw Texts and Word Embeddings | Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by , together with word embeddings of dimension 100 computed from lowercased texts by | ||
OpinHuBank | OpinHuBank is a human-annotated corpus to aid the research of opinion mining and sentiment analysis in Hungarian | ||
HunEmPoli | 0 | almost 2 years ago | corpus was built using pre-agenda speeches of the Hungarian National Assembly (2014-2018) and consists of 764008 tokens in 36475 sentences. It has aspect-level emotion annotation with 39840 identified emotions and also marks the keywords that evoked each emotion |
The Hungarian forum corpus for Opinion Mining | This database is the first one dedicated to Opinion Mining in Hungarian. The data for further processing were gathered from the posts of the forum topic of the Hungarian government portal dealing with the referendum about dual citizenship | ||
Hungarian sentiment corpus (HuSent) | is a deeply annotated Hungarian sentiment corpus. It is composed of Hungarian opinion texts written about different types of products, published on the homepage [ | ||
Szeged Treebank | The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language | ||
Szeged Dependency Treebank | The Szeged Dependency Treebank is a dependency-tree format version of the Szeged Treebank | ||
Universal Dependencies | 5 | 9 days ago | |
Hungarian Named Entity Corpora | The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts | ||
KorKor Pilotcorpus | 2 | almost 2 years ago | is a gold standard corpus consisting of multiple layers such as dependency parse and coreference annotations |
NerKor | 14 | about 1 year ago | is a gold standard named entity annotated corpus containing 1 million tokens |
NerKor 1.41e | 1 | almost 3 years ago | A 1M+-token Hungarian named entity dataset with ~30 entity types derived from NYTK-NerKor |
hunNERwiki | a silver standard corpus for Hungarian Named Entity Recognition | ||
Mazsola database | contains 28M sentences from the MNSZ1 corpus annotated with shallow syntactic analysis | ||
PrevCons | 2 | about 3 years ago | is a database of 21K hapaxes of verbs with verbal prefixes |
Hungarian word sense disambiguated corpus | containing 39 suitable word form samples for the purpose of word sense disambiguation | ||
HunLearner | is a learners' corpus of Hungarian containing written data from 35 students majoring in Hungarian studies at the University of Zagreb, Croatia. Texts were morphologically and syntactically analyzed by the magyarlanc tool | ||
HuLU | 9 | 4 months ago | Hungarian Language Understanding Benchmark Kit |
Awesome NLP Resources for Hungarian / Datasets / Corpora / HuLU | |||
HuCOLA | 1 | 4 months ago | Hungarian Corpus of Linguistic Acceptability |
HuCoPA | 1 | 4 months ago | Hungarian Choice of Plausible Alternatives Corpus |
HuSST | 1 | 4 months ago | Hungarian version of the Sentiment Treebank |
HuWNLI | 0 | almost 2 years ago | Anaphora resolution datasets for Hungarian as an inference task |
HuWS | 1 | almost 2 years ago | is the Hungarian set of the Winograd schemas |
Awesome NLP Resources for Hungarian / Datasets / Corpora | |||
HuRC | Hungarian Corpus for Reading Comprehension with Commonsense Reasoning | ||
ELTE Poetry Corpus | 7 | 6 months ago | is a database of complete poems of 50 Hungarian canonical poets together with the sound devices of the poems and the grammatical features of words in XML format |
ELTE Novel Corpus | 4 | 5 days ago | is a database of 400 Hungarian novels (with the annotation of structural units and the grammatical features of words in TEI XML format) |
ELTE Drama Corpus | 1 | 4 months ago | is a database of 58 dramas (with the annotation of structural units and the grammatical features of words in TEI XML format) |
HunSum-1 | is a dataset containing over 1.1M unique news articles with lead and other metadata | ||
HAPP | 1 | 9 months ago | is the Hungarian translation of the |
Hunglish Corpus | The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs | ||
SzegedParallel | The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria | ||
HunOr | A Hungarian-Russian parallel corpus comprising approximately 800 thousand words | ||
CoNLL 2017 Shared Task Hungarian data | Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl | ||
CSS10 | 465 | over 4 years ago | A Collection of Single Speaker Speech Datasets for 10 Languages including Hungarian |
Hungarian-Russian Prisoner of War Database | 23 | over 3 years ago | |
TED talks transcripts parallel corpus | sentence aligned TED talks including Hungarian | ||
TaPaCo Corpus | is a paraphrase corpus for 73 languages, including Hungarian, extracted from the Tatoeba database | ||
Duolingo STAPLE | is a dataset of comprehensive accepted translations from English to 5 different languages, including Hungarian | ||
PPDB | is an automatically extracted database containing millions of paraphrases in 16 different languages, including Hungarian | ||
OpenSubtitles Corpus | contains movie subtitles and alignments for 62 languages, including Hungarian | ||
OPUS Corpus | is a growing collection of translated texts from the web (https://opus.nlpl.eu) | ||
MASSIVE dataset | 538 | almost 2 years ago | is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation |
PWS | 0 | almost 2 years ago | is a parallel collection of the Winograd schemas in seven languages (including Hungarian) |
Awesome NLP Resources for Hungarian / Datasets / Linguistic resources | |||
morphdb.hu | is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions | ||
huwn | 11 | about 6 years ago | Hungarian Wordnet |
Hungarian Sentiment Lexicon | The dictionaries were manually created on the basis of Wordnet-Affect lexicons | ||
poltextLAB's sentiment lexicons | 1 | about 2 years ago | Highly accurate sentiment lexicons for analysing news data |
4lang | 37 | 8 months ago | Concept dictionary using Eilenberg machines |
Named Entity lists for Hungarian | |||
Mazsola ISZ | lists 500K verb frames extracted from the Mazsola database | ||
Manocska | 4 | over 5 years ago | merges verb frames from existing databases |
PrevLex | 0 | over 3 years ago | List of phrasal verbs |
panmorph | 4 | over 3 years ago | Tagsets and description of Hungarian morphological analysers |
hun_ner_checklist | 0 | almost 4 years ago | CHECKLIST diagnostic test cases for Hungarian Named Entity Recognition |
Awesome NLP Resources for Hungarian / Datasets / Linked Open Data | |||
Wikipedia dumps | |||
Wikidata dumps | |||
DBPedia dumps | |||
huwn.rdf | 2 | over 9 years ago | Hungarian WordNet in RDF format for the Linked Open Data cloud |
Conceptnet | An open, multilingual knowledge graph (with partial Hungarian support) | ||
Awesome NLP Resources for Hungarian / Datasets / Geo data | |||
OpenStreetMap(OSM) | In the keys, the | ||
Natural-earth-vector | 1,794 | 7 months ago | ( imported from wikidata labels) |
Who's On First | is a gazetteer of places (with ) | ||
Awesome NLP Resources for Hungarian / Datasets / Speech related data | |||
Hungarian Single Speaker Speech Dataset | |||
Mozilla Common Voice | |||
Awesome NLP Resources for Hungarian / Academy / Journals | |||
Acta Cybernetica | |||
Awesome NLP Resources for Hungarian / Academy / Conferences | |||
MSZNY | Conference on Hungarian Computational Linguistics (since 2003) | ||
Awesome NLP Resources for Hungarian / Academy / Institutes | |||
Natural Language Processing Group of the Pázmány Péter Catholic University Faculty of Information Technology and Bionics | |||
Department of Language Technology and Applied Linguistics, RIL-MTA | |||
Human Language Technology Research Group of the Budapest University of Technology and Economics | |||
Natural Language Processing Group of the University of Szeged | |||
BME - Laboratory of Speech Acoustics | |||
Awesome NLP Resources for Hungarian / Learning resources / Books | |||
Szövegbányászat | |||
Szövegbányászat és mesterséges intelligencia R-ben | |||
Kvantitatív szövegelemzés és szövegbányászat a politikatudományban | |||
Awesome NLP Resources for Hungarian / Learning resources / Courses | |||
NLP Courses by the University Of Szeged | |||
NLP Courses by the HLT Group of the Budapest University of Technology | |||
Awesome NLP Resources for Hungarian / Learning resources / Tutorials | |||
Mini NLP Course by the Center Of Digital Humanities | |||
Tutorial on Text Mining for Hungarian | 20 | over 2 years ago | |
Awesome NLP Resources for Hungarian / Communities | |||
Kereső világ | Official blog of Precognox Inc | ||
Hungarian NLP Meetup | |||
Deep Learning Reading Seminar Meetup | |||
HuNLP Slack | |||
Awesome NLP Resources for Hungarian / Other Hungarian related resource collections | |||
EENLP | 18 | over 2 years ago | The broad index of NLP resources for Eastern European languages |
European Language Grid | |||
Hugging Face Datasets (filtered for Hungarian) |