awesome-hungarian-nlp

NLP toolkit

A curated collection of NLP resources and tools for Hungarian language processing.


GitHub

224 stars
19 watching
18 forks
last commit: about 1 year ago
Linked from 2 awesome lists

awesome, awesome-list, computational-linguistics, corpus, corpus-linguistics, dataset, hungarian, hungarian-language, information-extraction, information-retrieval, named-entity-recognition, natural-language-processing, natural-language-understanding, nlp, nlp-resources, nlu, opinion-mining, parser, tagger, text-mining

Awesome NLP Resources for Hungarian / Tools / Word tokenization, sentence splitting

huntoken 3 almost 10 years ago 👌🚀💯 Hungarian word and sentence splitter
quntoken 14 over 2 years ago 👌🚀💯 New Hungarian tokenizer based on quex and huntoken

Awesome NLP Resources for Hungarian / Tools / Morphology

emMorph (Humor) 14 almost 3 years ago 💯 Hungarian morphological analyzer based on Humor
emMorphPy 3 about 5 years ago 👌💯 A wrapper, a lemmatizer and REST API implemented in Python for the emMorph (Humor) Hungarian morphological analyzer
hunmorph 🚀💯 is an open source tool and programming library for spell-checking, stemming and morphological analysis of agglutinative, German and other languages
hunmorph-foma 6 over 8 years ago 🚀💯 Hungarian morphological analyzer and generator based on hunmorph
hunspell 👌🚀💯 is an open-source spell-checker, stemmer and morphological analyzer
lara-hungarian-nlp 29 over 5 years ago 👌🚀💯 LARA is a lightweight Python NLP library for ChatBots in Hungarian
Lemmagen 👌🚀💯 project aims at providing a standardized open source multilingual platform for lemmatisation
Simplemma 144 7 days ago 👌🚀💯 is a simple multilingual lemmatizer for Python

Awesome NLP Resources for Hungarian / Tools / PoS / Morphological taggers

hunpos 👌🚀💯 Hunpos is an open source reimplementation of TnT, the well known part-of-speech tagger by Thorsten Brants
PurePos 15 about 4 years ago 👌🚀 Open source morphological tagger based on HunPos
purepos.py 1 almost 5 years ago 👌🚀 Python wrapper for PurePos

Awesome NLP Resources for Hungarian / Tools / Taggers / Chunkers

HunTag 22 almost 9 years ago 👌🚀 A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
HunTag3 8 about 5 years ago 👌🚀 Improved version of the original HunTag
SzegedNER 👌🚀💯 Named Entity Recognition tool for Hungarian and English
DBpedia Spotlight 756 over 6 years ago 👌🚀💯 DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text
emBERT 2 7 months ago 👌🚀💯 is an emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's transformers package

Awesome NLP Resources for Hungarian / Tools / Pipelines with Hungarian NLP components

magyarlanc 👌💯 A toolkit for the basic linguistic processing of Hungarian
magyarlanc_spark 4 over 7 years ago 👌💯 Spark wrapper for magyarlanc
eszterland 4 about 1 year ago 👌💯 Clojurized access to magyarlanc
HuSpaCy 155 24 days ago 👌🚀💯 Industrial-strength Hungarian Natural Language Processing
huNLP 11 about 7 years ago 👌💯 An experimental unified Java and REST API for magyarlanc and szegedNER
hunlp-GATE 8 almost 6 years ago 💯 GATE plugin containing Hungarian NLP tools as GATE processing resources
Trendminer Hungarian Processing Pipeline 5 about 10 years ago 🚀 Hungarian NLP pipeline for social media text analysis (TrendMiner project)
Google Syntaxnet 🚀💯 Neural Models of Syntax
UDPipe 👌🚀💯 is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
polyglot 👌🚀💯 is a natural language pipeline that supports massive multilingual applications
emtsv 27 11 months ago 👌💯 is a text processing system with inter-module communication via tsv + REST API
Stanza 7,294 5 days ago 👌🚀💯 is a Python NLP Library for Many Human Languages
spaCy StanfordNLP 725 3 months ago 👌🚀💯 wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline
trankit 736 about 1 month ago 👌🚀💯 A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
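
Several pipelines in this section (UDPipe in particular) read and write CoNLL-U files. As a rough illustration of that format, the Python sketch below parses its ten tab-separated columns using only the standard library. The helper name `parse_conllu` is hypothetical, and the sample sentence with its annotations is hand-written for illustration; it is not output from any tool listed here. Multiword-token ranges and empty nodes are kept as ordinary rows without special handling.

```python
from typing import Dict, List

# The ten column names defined by the CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Hand-written illustrative sample (an assumption, not real tool output).
SAMPLE = (
    "# text = A kutya ugat.\n"
    "1\tA\ta\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tkutya\tkutya\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tugat\tugat\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

For anything beyond a quick inspection, a dedicated CoNLL-U library is likely a better choice than this sketch.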

Awesome NLP Resources for Hungarian / Tools / Syntactic parsers

hunpars 🚀💯 A rule based Hungarian syntactical analyzer
HunParse 4 over 12 years ago 🚀💯 An NLTK-based parser using KR-style morphological annotation
Anagramma Parser 1 about 6 years ago A parser based on psycholinguistic principles
benepar 871 almost 3 years ago 👌🚀💯 A high-accuracy parser with models for 11 languages, implemented in Python. Based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018

Awesome NLP Resources for Hungarian / Tools / Semantic analysis

SentimentAnalysisHUN 11 about 8 years ago 👌🚀💯 is an open-source sentiment analysis tool for Hungarian language, written in Python
hun-date-parser 8 3 months ago 👌🚀💯 A tool for extracting datetime intervals from Hungarian sentences and turning datetime objects into Hungarian text
mT5-small-HunSum-1 SZTAKI HunSum-1 models 👌🚀💯

Awesome NLP Resources for Hungarian / Tools / Other

emLam 3 over 4 years ago 👌🚀💯 Preprocessing scripts for Hungarian Language Modeling
pywnxml 5 over 6 years ago 👌🚀💯 Python3 API for WordNet XML (Hungarian WordNet / BalkaNet / VisDic format)
Hun-appointment-chatbot 7 almost 2 years ago 👌🚀💯 A simple Hungarian chatbot for booking an appointment using the Rasa framework
neural-punctuator 48 about 1 year ago 👌🚀💯 Automatic punctuation restoration with BERT models for English and Hungarian
hunaccent 15 3 months ago 👌🚀💯 Small Footprint Diacritic Restoration for Hungarian
Diacritics_restoration 4 over 2 years ago 🚀💯 Lightweight Diacritics Restoration with Dilated Convolutional Neural Networks
NYTK MT 5 over 1 year ago 👌🚀💯 NYTK Machine translation models
syntax-augmentation-nmt 7 about 1 year ago 🚀💯 Syntax-based data augmentation for Hungarian-English machine translation
anonymizer_hu 1 over 2 years ago 🚀💯 The Hungarian anonymization tool for CURLICAT

Awesome NLP Resources for Hungarian / Language models / Word embeddings

fastText Wikipedia pre-trained word vectors for 90 languages, trained on Wikipedia using fastText
fastText Common Crawl & Wikipedia pre-trained word vectors for 157 languages, trained on Wikipedia and the Common Crawl using fastText's CBOW model
FastText_multilingual 1,197 over 1 year ago Multilingual word vectors in 78 languages
polyglot vectors polyglot embeddings on Wikipedia
wordvectors 2,215 about 6 years ago Pre-trained word2vec and fasttext word vectors on wikipedia of 30+ languages
hunembed0.0 A word2vec word embedding trained on the concatenation of the Hungarian Webcorpus and the Hungarian National Corpus in 600 dimensions with a cut-off of 10 words
Szeged word vectors Word embeddings (word2vec & fasttext) for Hungarian trained on 4.3 billion tokens
questions-words-hu Hungarian analogical questions following Mikolov et al
Conceptnet Numberbatch 1,295 over 2 years ago ConceptNet Numberbatch multi- and cross-lingual semantic word embeddings
Multi-sense word embeddings
BytePair Embeddings pretrained Subword Embeddings, downloadable in many formats
HuSpaCy 300d 300d Floret embeddings trained on the Hungarian Webcorpus 2.0
HuSpaCy 100d 100d Floret embeddings trained on the Hungarian Webcorpus 2.0
ELMo Representations 1,463 over 3 years ago Deep contextualized word representation trained for many languages
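
Analogy question sets such as questions-words-hu are typically used to evaluate embeddings like the ones listed above: for a question "a is to b as c is to ?", the predicted answer is the vocabulary word whose vector is closest (by cosine similarity) to `b - a + c`. The sketch below shows that procedure with the standard library only. The function names are hypothetical, and the two-dimensional vectors are toy values invented for illustration; they are not taken from any of the listed models. The loader assumes the plain-text word2vec/fastText `.vec` layout (a `count dim` header line followed by one word and its values per line).

```python
import math
from typing import Dict, List

Vector = List[float]


def load_vec_text(text: str) -> Dict[str, Vector]:
    """Parse the plain-text .vec layout: a 'count dim' header, then rows."""
    lines = text.strip().splitlines()
    count, dim = (int(x) for x in lines[0].split())
    vectors: Dict[str, Vector] = {}
    for line in lines[1:]:
        parts = line.split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"{word!r}: expected {dim} values")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError("header count does not match number of rows")
    return vectors


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors: Dict[str, Vector], a: str, b: str, c: str) -> str:
    """Return the word closest to b - a + c, excluding the question words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy 2-dimensional vectors, invented purely for illustration.
TOY_VEC = """5 2
férfi 1.0 0.0
nő 1.0 1.0
király 3.0 0.0
királynő 3.0 1.0
alma 0.0 5.0
"""

if __name__ == "__main__":
    toy = load_vec_text(TOY_VEC)
    # "férfi" (man) is to "nő" (woman) as "király" (king) is to ?
    print(solve_analogy(toy, "férfi", "nő", "király"))
```

Accuracy on an analogy set is then the fraction of questions for which the returned word matches the expected one.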

Awesome NLP Resources for Hungarian / Language models / Transformer models

huBERT Hungarian BERT base models trained on Webcorpus 2.0 and the Hungarian Wikipedia
HIL* Transformer models Pretrained transformer models provided by HILANCO
PULI-BERT-Large is a Hungarian BERT large model based on MegatronBERT
PULI-GPT-2 is a Hungarian GPT-2 model
PULI-GPT-3SX is a Hungarian GPT-NeoX model (6.7 billion parameters)

Awesome NLP Resources for Hungarian / Datasets / Corpora

Hungarian Webcorpus With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license
Hungarian Webcorpus 2.0 The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words
OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. (2339 million unique words)
emLam A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English
Leipzig corpora contains randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web
web2corpus Automatically created multilingual web corpus
CC-100 Monolingual Datasets from Web Crawl Data
CoNLL 2017: Automatically Annotated Raw Texts and Word Embeddings Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by , together with word embeddings of dimension 100 computed from lowercased texts by
OpinHuBank OpinHuBank is a human-annotated corpus to aid the research of opinion mining and sentiment analysis in Hungarian
HunEmPoli 0 almost 2 years ago A corpus built from pre-agenda speeches of the Hungarian National Assembly (2014-2018), consisting of 764008 tokens in 36475 sentences. It carries aspect-level emotion annotation with 39840 identified emotions, and the keywords that evoked each emotion are also marked
The Hungarian forum corpus for Opinion Mining This database is the first one dedicated to Opinion Mining in Hungarian. The data for further processing were gathered from the posts of the forum topic of the Hungarian government portal dealing with the referendum about dual citizenship
Hungarian sentiment corpus (HuSent) is a deeply annotated Hungarian sentiment corpus. It is composed of Hungarian opinion texts written about different types of products, published on the homepage [
Szeged Treebank The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language
Szeged Dependency Treebank The Szeged Dependency Treebank is a dependency-tree format version of the Szeged Treebank
Universal Dependencies 5 9 days ago
Hungarian Named Entity Corpora The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts
KorKor Pilotcorpus 2 almost 2 years ago is a gold standard corpus consisting of multiple layers such as dependency parse and coreference annotations
NerKor 14 about 1 year ago is a gold standard named entity annotated corpus containing 1 million tokens
NerKor 1.41e 1 almost 3 years ago A 1M+-token Hungarian named entity dataset with ~30 entity types derived from NYTK-NerKor
hunNERwiki a silver standard corpus for Hungarian Named Entity Recognition
Mazsola database contains 28M sentences from the MNSZ1 corpus annotated with shallow syntactic analysis
PrevCons 2 about 3 years ago is a database of 21K hapaxes of verbs with verbal prefixes
Hungarian word sense disambiguated corpus containing 39 suitable word form samples for the purpose of word sense disambiguation
HunLearner is a learners' corpus of Hungarian containing written data from 35 students majoring in Hungarian studies at the University of Zagreb, Croatia. Texts were morphologically and syntactically analyzed by the magyarlanc tool
HuLU 9 4 months ago Hungarian Language Understanding Benchmark Kit

Awesome NLP Resources for Hungarian / Datasets / Corpora / HuLU

HuCOLA 1 4 months ago Hungarian Corpus of Linguistic Acceptability
HuCoPA 1 4 months ago Hungarian Choice of Plausible Alternatives Corpus
HuSST 1 4 months ago Hungarian version of the Sentiment Treebank
HuWNLI 0 almost 2 years ago Anaphora resolution datasets for Hungarian as an inference task
HuWS 1 almost 2 years ago is the Hungarian set of the Winograd schemas

Awesome NLP Resources for Hungarian / Datasets / Corpora

HuRC Hungarian Corpus for Reading Comprehension with Commonsense Reasoning
ELTE Poetry Corpus 7 6 months ago is a database of complete poems of 50 Hungarian canonical poets together with the sound devices of the poems and the grammatical features of words in XML format
ELTE Novel Corpus 4 5 days ago is a database of 400 Hungarian novels (with the annotation of structural units and the grammatical features of words in TEI XML format)
ELTE Drama Corpus 1 4 months ago is a database of 58 dramas (with the annotation of structural units and the grammatical features of words in TEI XML format)
HunSum-1 is a dataset containing over 1.1M unique news articles with lead and other metadata
HAPP 1 9 months ago is the Hungarian translation of the
Hunglish Corpus The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs
SzegedParallel The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria
HunOr A Hungarian-Russian Parallel corpus comprises approximately 800 thousand words
CoNLL 2017 Shared Task Hungarian data Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl
CSS10 465 over 4 years ago A Collection of Single Speaker Speech Datasets for 10 Languages including Hungarian
Hungarian-Russian Prisoner of War Database 23 over 3 years ago
TED talks transcripts parallel corpus sentence aligned TED talks including Hungarian
TaPaCo Corpus is a paraphrase corpus for 73 languages, including Hungarian, extracted from the Tatoeba database
Duolingo STAPLE is a dataset of comprehensive accepted translations from English to 5 different languages, including Hungarian
PPDB is an automatically extracted database containing millions of paraphrases in 16 different languages, including Hungarian
OpenSubtitles Corpus contains movie subtitles and alignments for 62 languages, including Hungarian
OPUS Corpus (https://opus.nlpl.eu) is a growing collection of translated texts from the web
MASSIVE dataset 538 almost 2 years ago is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation
PWS 0 almost 2 years ago is a parallel collection of the Winograd schemas in seven languages (including Hungarian)

Awesome NLP Resources for Hungarian / Datasets / Linguistic resources

morphdb.hu is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions
huwn 11 about 6 years ago Hungarian Wordnet
Hungarian Sentiment Lexicon The dictionaries were manually created on the basis of Wordnet-Affect lexicons
poltextLAB's sentiment lexicons 1 about 2 years ago Highly accurate sentiment lexicons for analysing news data
4lang 37 8 months ago Concept dictionary using Eilenberg machines
Named Entity lists for Hungarian
Mazsola ISZ lists 500K verb frames extracted from the Mazsola database
Manocska 4 over 5 years ago merges verb frames from existing databases
PrevLex 0 over 3 years ago List of phrasal verbs
panmorph 4 over 3 years ago Tagsets and description of Hungarian morphological analysers
hun_ner_checklist 0 almost 4 years ago CHECKLIST diagnostic test cases for Hungarian Named Entity Recognition

Awesome NLP Resources for Hungarian / Datasets / Linked Open Data

Wikipedia dumps
Wikidata dumps
DBPedia dumps
huwn.rdf 2 over 9 years ago Hungarian WordNet in RDF format for the Linked Open Data cloud
Conceptnet An open, multilingual knowledge graph (with partial Hungarian support)

Awesome NLP Resources for Hungarian / Datasets / Geo data

OpenStreetMap(OSM) In the keys, the
Natural-earth-vector 1,794 7 months ago ( imported from wikidata labels)
Who's On First is a gazetteer of places (with )
Hungarian Single Speaker Speech Dataset
Mozilla Common Voice

Awesome NLP Resources for Hungarian / Academy / Journals

Acta Cybernetica

Awesome NLP Resources for Hungarian / Academy / Conferences

MSZNY Conference on Hungarian Computational Linguistics (since 2003)

Awesome NLP Resources for Hungarian / Academy / Institutes

Natural Language Processing Group of the Pázmány Péter Catholic University Faculty of Information Technology and Bionics
Department of Language Technology and Applied Linguistics, RIL-MTA
Human Language Technology Research Group of the Budapest University of Technology and Economics
Natural Language Processing Group of the University of Szeged
BME - Laboratory of Speech Acoustics

Awesome NLP Resources for Hungarian / Learning resources / Books

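Szövegbányászat (Text Mining)
Szövegbányászat és mesterséges intelligencia R-ben (Text Mining and Artificial Intelligence in R)
Kvantitatív szövegelemzés és szövegbányászat a politikatudományban (Quantitative Text Analysis and Text Mining in Political Science)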

Awesome NLP Resources for Hungarian / Learning resources / Courses

NLP Courses by the University Of Szeged
NLP Courses by the HLT Group of the Budapest University of Technology

Awesome NLP Resources for Hungarian / Learning resources / Tutorials

Mini NLP Course by the Center Of Digital Humanities
Tutorial on Text Mining for Hungarian 20 over 2 years ago

Awesome NLP Resources for Hungarian / Communities

Kereső világ Official blog of Precognox Inc
Hungarian NLP Meetup
Deep Learning Reading Seminar Meetup
HuNLP Slack
EENLP 18 over 2 years ago The broad index of NLP resources for Eastern European languages
European Language Grid
Hugging Face Datasets (filtered for Hungarian)
