awesome-danish
Danish NLP dataset
A curated collection of Danish language resources and datasets for natural language processing tasks.
A curated list of awesome resources for Danish language technology
Awesome Danish / Data / Corpora | |||
Danish Gigaword | Collection of about 10^9 (one billion) words of Danish text. Described in ( ) | ||
Danish review dataset | 2 | about 1 year ago | Trustpilot-crawled dataset by Alessandro Gianfelici with 44,085 reviews |
OSCAR | Danish corpus derived from the Common Crawl corpus. Described in ( ) | ||
Awesome Danish / Data / Corpora / CLARIN-DK-UCPH | |||
The Danish Parliament Corpus 2009 - 2017, v1 | The license is Creative Commons - Attribution 4.0 International | ||
Grundtvig's Works Corpus | Not for commercial use, as the license is Creative Commons - Attribution-NonCommercial 4.0 International | ||
DK-CLARIN Reference Corpus of General Danish | Only for academic use | ||
Awesome Danish / Data / Corpora | |||
DanFEVER | Danish text corpus with over 6,400 claims and support. Described in ( ) | ||
DanNet | wordnet with usage examples. The usage examples have been used for word sense disambiguation, see | ||
SemDaX | 1 | almost 2 years ago | POS-tagged (only adjectives, nouns and verbs), super sense tagged and BIO-tagged sentences. For educational, teaching or research purposes only |
NOMCO | "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. [ ] | ||
Danish Propbank | commercial resource with 87,000 tokens annotated with morphosyntactic, VerbNet classes and semantic roles | ||
Danish Dependency Treebank v. 1.0 | Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE | ||
Mr. Bean corpus | Small Danish-Italian corpus with written and spoken retelling (of Mr Bean episodes) and argumentative text (about smoking). Possibly described in | ||
Køge Corpus | Danish-Turkish transcribed corpus by Jens Normann Jørgensen | ||
Danske taler | Collection of Danish speeches. API available at | ||
DKhate | Corpus of 3,600 hate speech instances from Twitter and Reddit as well as news comments. Described in ( ) | ||
Scholia | DaNewsroom - Danish summarization dataset. Probably to appear in 2020. Described in ( ) | ||
Awesome Danish / Data / Corpora / Wikipedia | |||
wiki40b/da | Cleaned-up text from Danish Wikipedia. Described in . ( ) | ||
Awesome Danish / Data / Corpora | |||
XED | 56 | over 1 year ago | emotion annotated movie subtitles. Described in ( ) |
DaN+ | 5 | almost 2 years ago | annotated for nested named entities on top of the entire Danish Universal Dependencies (UD_Danish-DDT) and 3 new web domains and includes lexical normalization. Described in |
WikiANN | Named entity annotated corpus. Described in ( ) | ||
Corona Dataset | 11 | over 4 years ago | Question dataset from Certainly annotated for domain and intent |
Awesome Danish / Data / Parallel corpora | |||
Europarl | Parallel sentences between Danish and English from the European Parliament | ||
ITU Faroese Pairs Dataset | Faroese-Danish parallel text. Described in ( ) | ||
JW300 | "a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average" | ||
OpenSubtitles2018 | Parallel corpus from movie and tv subtitles. Described in | ||
Tatoeba | Sentences | ||
WikiMatrix | 3,599 | 7 months ago | Parallel sentences from Wikipedias. 1,620 language pairs, including Danish |
Awesome Danish / Data / Spoken language corpora | |||
CoRal | Danish Conversational and Read-aloud Dataset | ||
DanPASS | Described in ( ) | ||
LANCHART | Centre for Language Change In Real Time. Various audio recordings. Whether the data is available is not immediately apparent. Described in, e.g., ( ) | ||
Common Voice | Crowdsourced multilingual annotated speech dataset. As of March 2023, 11 hours of validated speech are distributed. Sentences can be entered collaboratively at . Common Voice is described in ( ) | ||
FT Speech | Described in ( ) | ||
Awesome Danish / Data / Spoken language corpora / NST | |||
NST-speech-22khz | A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation | ||
NST-speech-16kHz | A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing | ||
NST-speech-44kHz | A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis | ||
Awesome Danish / Data / Spoken language corpora | |||
VoxLingua107 | 28 hours audio with unannotated Danish speech sampled from YouTube videos. Described in ( ) | ||
VoxPopuli | 512 | over 1 year ago | Speech from the European Parliament including 13,600 hours of unannotated Danish. Described in ( ) |
Wikimedia Commons Audio files of Danish language | Recordings of readings of articles from the Danish Wikipedia, Danish words and a few Danish literary works | ||
Awesome Danish / Data / Dictionaries and ontologies | |||
Det Centrale Ordregister | identifier for words and their inflections with 516,017 forms (COR) | ||
The Danish Sentiment Lexicon | 8 | almost 2 years ago | Det Danske Sentimentleksikon (DDS) 13,859 headwords assigned with polarity values |
NST-lexical-database | A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service | ||
DanNet, Danish Wordnet (v 2.2) - owl format | DanNet - Danish wordnet with three-clause BSD-like license | ||
Retskrivningsordbogen | The official Danish spelling dictionary, digitally available under its own special license | ||
Awesome Danish / Data / Dictionaries and ontologies / Retskrivningsordbogen | |||
Opslagsord og ordklasser | in CSV format | ||
Excerpt | Lexemes, word classes and inflections, available in the CSF format. Full list presumably available upon request | ||
Awesome Danish / Data / Dictionaries and ontologies | |||
Stavekontrolden | Word list with 160,132 Danish words. Used, e.g., for spelling suggestions in LibreOffice. Licensed under GPL, LGPL, and MPL | ||
The Concise Danish Dictionary | /The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license | ||
Interactive Terminology for Europe | (IATE) - European Union terminology database. October 2020 version contains over 500,000 Danish terms | ||
The Danish FrameNet Lexicon | 40,267-line resource containing 5,300 verbs and 6,490 verbal nouns | ||
Wikidata lexemes | Structured database with metadata about lexemes, their forms and their senses. Over 1,290,000 lexemes, including Danish ones, in April 2024 | ||
Awesome Danish / Data / Dictionaries and ontologies / Wikidata lexemes | |||
Overview over Danish lexemes in Ordia | webapp with overview of content of Wikidata lexemes based on SPARQL queries | ||
Wikidata lexemes latest lexemes dump in ttl | official dump of lexeme-only part of Wikidata | ||
Awesome Danish / Data / Dictionaries and ontologies | |||
NST-ngrams | An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM | ||
AFINN | 449 | over 2 years ago | Danish lexicons annotated for sentiment |
concreteness-estimates-da | 0 | almost 7 years ago | Bill D. Thompson's concreteness estimates for Danish words, as detailed in ( ) |
SAM lexicon | 7 | over 4 years ago | sentiment analysis word list extended from AFINN to 4275 lines. Described in |
Danish Swadesh List | List of Danish words of basic concepts from The Rosetta Project | ||
Sketch Engine | Cloud service with wordlists, thesaurus, collocations, n-grams etc. Free for academic use in the European Union and paid service for commercial use | ||
Awesome Danish / Data / Word sets | |||
Danish-Similarity-Dataset | 8 | over 4 years ago | Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen. Also available in |
Wordsim353-da | 18 | about 4 years ago | Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in |
Four words | 18 | about 4 years ago | 100 odd-one-out sets of 4 words or phrases |
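Word sets like these are typically used to evaluate word embeddings. One common way to solve an odd-one-out task is to pick the word whose mean cosine similarity to the other words is lowest. The sketch below shows that idea with hypothetical three-dimensional toy vectors; the vectors, the helper names `cosine` and `odd_one_out`, and the example words are assumptions for illustration and do not come from the Four words dataset or any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def odd_one_out(words, vectors):
    """Return the word with the lowest mean cosine similarity to the others."""
    def mean_similarity(word):
        others = [other for other in words if other != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)

    return min(words, key=mean_similarity)


# Hypothetical toy vectors, not taken from any real embedding model.
toy_vectors = {
    "hund": [0.90, 0.10, 0.00],
    "kat": [0.80, 0.20, 0.10],
    "hest": [0.85, 0.15, 0.05],
    "bil": [0.10, 0.90, 0.30],
}

result = odd_one_out(["hund", "kat", "hest", "bil"], toy_vectors)
print(result)
```

With real embeddings, `toy_vectors` would be replaced by lookups in a trained model, and accuracy would be the share of sets where the predicted word matches the annotated one.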
Awesome Danish / Data / Embeddings | |||
cc.da.300 | ( ) - fastText-trained embedding on Danish part of and Danish Wikipedia. Read more about the method in ( ) | ||
wiki.da | ( ) - fastText-trained embedding on Danish Wikipedia. Read more about the method in ( ) | ||
Byte-Pair Encoding embedding | 1,184 | about 2 months ago | Gensim-based subword embedding. A large number of Danish embeddings are available. They differ in the size of the vocabulary (from 1000 to 200000) and subspace dimensions (from 25 to 300) |
NLPL word embeddings repository | NLPL word embeddings repository by Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020 | ||
Awesome Danish / Data / Embeddings / NLPL word embeddings repository | |||
Danish NLPL word embedding | 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus | ||
Awesome Danish / Data / Embeddings | |||
Danish DSL and Reddit word2vec word embeddings | 300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie trained on Danish DSL corpus and Reddit | ||
Awesome Danish / Data / Neural text models | |||
A-ttack | 6 | over 1 year ago | Ælæctra-based model for detection of "textual attacks" developed by . Related to the Ha-te model |
Danish BERT | 161 | about 3 years ago | Certainly's (Botxo/Møllerhøj) weights for a BERT model trained on a large Danish corpus |
Danish ELECTRA | 30 | about 3 years ago | Philip Tamimi-Sarnikowski's Danish ELECTRA model. Available in the transformers library |
daT5-summariser | Danish abstractive summarisation of news articles based on mT5-base | ||
ConvBERT | 30 | about 3 years ago | Philip Tamimi-Sarnikowski's model |
Danish ELMo on OSCAR | (Link does not work as of December 2020) | ||
Ha-te | 6 | almost 3 years ago | Hate speech detection based on Ælæctra developed by . Related to the A-ttack model |
mfaq | Multilingual FAQ retrieval model. Described in ( ) | ||
Ælæctra | Malte Højmark-Bertelsen's Danish Gigaword-trained Electra-based model | ||
Multilingual sentence transformers | Pre-trained multilingual sentence transformers, | ||
wiki40b-lm-da | language model trained on Danish from Wiki40B dataset | ||
WikiBERT | 34 | over 4 years ago | BERT model for many languages, including Danish. Described in ( ) |
Awesome Danish / Data / Neural speech models | |||
Hugging Face | List of models for Danish automatic speech recognition | ||
Alvenir Wav2vec2 | Pretrained Danish neural model | ||
Whisper | Multilingual neural model from OpenAI | ||
xls-r-300m-danish-nst-cv9 | Pretrained Danish neural model | ||
Awesome Danish / Tools / Lemmatization | |||
Lemmy | 75 | about 3 years ago | Lemmatizer for Danish in Python |
cstlemma | 35 | 4 months ago | lemmatiser |
spaCy | Python-based package with lemmatization | ||
Awesome Danish / Tools / Punctuation | |||
punctfix | 22 | 9 months ago | "Adds punctuation and capitalization for a given text." |
Awesome Danish / Tools / Named entity recognition | |||
ScandiNER | Scandinavian named entity recognition, achieving state-of-the-art performance in Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese | ||
DaLUKE | 9 | almost 3 years ago | Danish named entity recognition based on LUKE. Described in |
spaCy | Python-based named entity extraction | ||
daner | 17 | over 5 years ago | Named entity extraction from ITU NLP. Described in ( ) |
flair+danlp ner-tagger | 198 | 11 months ago | Flair NER tagger trained by the Alexandra Institute |
Polyglot named entity extraction | - | ||
Awesome Danish / Tools / Entity linking | |||
Babelfy | Web app and service for linking words and entities | ||
DBpedia Spotlight | DBpedia-based entity linker. Described in ( ) | ||
Awesome Danish / Tools / Sentiment analysis | |||
afinn | 449 | over 2 years ago | Python package with AFINN Danish lexicon annotated for sentiment, also installable with |
Hisia | 13 | 6 months ago | Python package with pre-trained machine-learning based Danish sentiment analysis by Prayson Wilfred Daniel |
senda | 19 | over 3 years ago | Python package with transformer-based sentiment analysis from Ekstra Bladet Analyse with as of 2021 on one dataset |
Sentida | 20 | almost 3 years ago | R package with Danish sentiment lexicon and handling of, e.g., negation. Detailed in ( ) |
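Several of the tools above are lexicon-based: each word carries a valence score and a text is scored by combining the scores of the words it contains. The following minimal sketch shows the simplest variant, a plain sum over matched words. The toy lexicon, its values and the function name `sentiment_score` are hypothetical and are not copied from AFINN or any other published lexicon.

```python
import re

# Hypothetical toy lexicon with AFINN-style integer valences (-5 to +5).
# The entries are illustrative and not copied from any published lexicon.
TOY_LEXICON = {"god": 3, "dejlig": 3, "dårlig": -3, "forfærdelig": -3}


def sentiment_score(text, lexicon):
    """Sum the valence of every lexicon word found in the text."""
    words = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


positive = sentiment_score("Hvor er det en dejlig dag", TOY_LEXICON)
negative = sentiment_score("Maden var dårlig og servicen forfærdelig", TOY_LEXICON)
print(positive, negative)
```

A plain sum ignores negation ("ikke god" would still score as positive), which is one reason packages such as Sentida add explicit negation handling.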
Awesome Danish / Tools / Automatic Speech Recognition | |||
danspeech | 28 | almost 2 years ago | DeepSpeech2-based Danish speech recognition in Python |
kaldi-sprakbanken | 14,287 | about 2 months ago | A recipe for training a state-of-the-art (as of 2017) speech recogniser for Danish based on the 16kHz NST database |
Awesome Danish / Tools / Speech Synthesis (text-to-speech) | |||
espeak | An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe | ||
ResponsiveVoice | Commercial Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use | ||
Google Cloud Text-to-Speech | Commercial Web-based text-to-speech synthesis for a number of languages, including Danish | ||
Amazon Polly | Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at | ||
Awesome Danish / Tools / Fundamental processing | |||
DaNLP | 198 | 11 months ago | "a repository for Natural Language Processing resources for the Danish Language." |
dapipe | 7 | over 6 years ago | Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies |
UDPipe | Non-language specific version of dapipe. Newer version of the Danish-DDT model than that which is offered by dapipe is available at | ||
DKIE | GATE pipeline including wrapped Danish models for Stanford CoreNLP | ||
StanfordNLP | Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available | ||
bornholmsk | 2 | over 2 years ago | Datasets and embeddings for the Bornholmsk dialect |
spaCy | Python-based natural language processing package | ||
dacy | 93 | 22 days ago | Danish spaCy pipeline |
Awesome Danish / Competitions | |||
ELEXIS Monolingual Word Sense Alignment Task | Predicting the relationship between two senses in each of several languages, including Danish | ||
OffensEval 2020 - Danish | Offensive Language Identification in Social Media competition. Described in ( ) | ||
Awesome Danish / Benchmarks | |||
Danoliterate | Overview of the performance of language models on a range of individual benchmarks | ||
Awesome Danish / Resources about resources | |||
Danish resources | Finn Årup Nielsen's PDF with pointers to Danish resources | ||
Scholia's topic aspect for Danish | Works (mostly scientific articles) about "Danish" as listed in Wikidata | ||
DaNLP | 198 | 11 months ago | Alexandra Institute's list of Danish resources |
Language Technology Resources for Danish | List from Det Danske Sprog- og Litteraturselskab | ||
European Language Resources Association (ELRA) list for Danish | List of various annotated corpora available for purchase with both commercial and non-commercial licenses | ||
sprogteknologi.dk | List of Danish language resources. Compiled by the Agency for Digitisation |