awesome-danish

Danish NLP dataset

A curated collection of Danish language resources and datasets for natural language processing tasks.

A curated list of awesome resources for Danish language technology

Awesome Danish / Data / Corpora

Danish Gigaword Collection of 10^9 words of Danish text. Described in ( )
Danish review dataset 2 about 1 year ago Trustpilot-crawled dataset by Alessandro Gianfelici with 44,085 reviews
OSCAR Danish corpus derived from the Common Crawl corpus. Described in ( )

Awesome Danish / Data / Corpora / CLARIN-DK-UCPH

The Danish Parliament Corpus 2009 - 2017, v1. The license is Creative Commons - Attribution 4.0 International
Grundtvig's Works Corpus. Not for commercial use, as the license is Creative Commons - Attribution-NonCommercial 4.0 International
DK-CLARIN Reference Corpus of General Danish. Only for academic use

Awesome Danish / Data / Corpora

DanFEVER Danish text corpus with over 6,400 claims and support. Described in ( )
DanNet wordnet with usage examples. The usage examples have been used for word sense disambiguation, see
SemDaX 1 almost 2 years ago POS-tagged (only adjectives, nouns and verbs), super sense tagged and BIO-tagged sentences. For educational, teaching or research purposes only
NOMCO "an annotated multimodal collection of conversational Danish". Apparently not directly available for download.
Danish Propbank commercial resource with 87,000 tokens annotated with morphosyntactic, VerbNet classes and semantic roles
Danish Dependency Treebank v. 1.0 Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE
Mr. Bean corpus Small Danish-Italian corpus with written and spoken retelling (of Mr Bean episodes) and argumentative text (about smoking). Possibly described in
Køge Corpus Danish-Turkish transcribed corpus by Jens Normann Jørgensen
Danske taler Collection of Danish speeches. API available at
DKhate corpus of 3,600 hate speech comments from Facebook and Reddit as well as news comments. Described in ( )
Scholia DaNewsroom - Danish summarization dataset. Probably to appear in 2020. Described in ( )

Awesome Danish / Data / Corpora / Wikipedia

wiki40b/da Cleaned-up text from Danish Wikipedia. Described in ( )

Awesome Danish / Data / Corpora

XED 56 over 1 year ago emotion annotated movie subtitles. Described in ( )
DaN+ 5 almost 2 years ago annotated for nested named entities on top of the entire Danish Universal Dependencies (UD_Danish-DDT) and 3 new web domains and includes lexical normalization. Described in
WikiANN Named entity annotated corpus. Described in ( )
Corona Dataset 11 over 4 years ago Question dataset from Certainly annotated for domain and intent

Awesome Danish / Data / Parallel corpora

Europarl parallel sentences between Danish and English from the European Parliament
ITU Faroese Pairs Dataset Faroese-Danish parallel text. Described in ( )
JW300 "a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average"
OpenSubtitles2018 Parallel corpus from movie and tv subtitles. Described in
Tatoeba Sentences
WikiMatrix 3,599 7 months ago parallel sentences from Wikipedias. 1620 language pairs, including Danish
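
Parallel corpora such as Europarl are commonly distributed as two plain-text files in which line i of one file is the translation of line i of the other. Assuming that layout, the following is a minimal sketch of reading sentence pairs. The helper name `read_parallel` and the sample sentences are hypothetical and are not taken from any of the corpora listed here.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_parallel(da_file: TextIO, en_file: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (Danish, English) pairs from two line-aligned files.

    Pairs where either side is empty are skipped.
    """
    for da_line, en_line in zip(da_file, en_file):
        da, en = da_line.strip(), en_line.strip()
        if da and en:
            yield da, en


# In-memory stand-ins for the two aligned files (made-up sentences).
da_text = io.StringIO("Jeg takker formanden.\n\nMødet er hævet.\n")
en_text = io.StringIO("I thank the President.\n\nThe sitting is closed.\n")

pairs = list(read_parallel(da_text, en_text))
for da, en in pairs:
    print(f"{da}\t{en}")
```

With real data, the two `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`.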

Awesome Danish / Data / Spoken language corpora

CoRal Danish Conversational and Read-aloud Dataset
DanPASS Described in ( )
LANCHART Centre for Language Change In Real Time. Various audio recordings. Whether the data is available is not immediately apparent. Described in, e.g., ( )
Common Voice Crowdsourced multilingual annotated speech dataset. As of March 2023, 11 hours of validated speech are distributed. Sentences can be entered collaboratively at . Common Voice is described in ( )
FT Speech Described in ( )

Awesome Danish / Data / Spoken language corpora / NST

NST-speech-22khz A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation
NST-speech-16kHz A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing
NST-speech-44kHz A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis

Awesome Danish / Data / Spoken language corpora

VoxLingua107 28 hours of audio with unannotated Danish speech sampled from YouTube videos. Described in ( )
VoxPopuli 512 over 1 year ago Speech from the European Parliament including 13,600 hours of unannotated Danish. Described in ( )
Wikimedia Commons Audio files of Danish language Recordings of readings of articles from the Danish Wikipedia, Danish words and a few Danish literary works

Awesome Danish / Data / Dictionaries and ontologies

Det Centrale Ordregister (COR) identifiers for words and their inflections with 516,017 forms
The Danish Sentiment Lexicon 8 almost 2 years ago Det Danske Sentimentleksikon (DDS) 13,859 headwords assigned with polarity values
NST-lexical-database A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service
DanNet, Danish Wordnet (v 2.2) - owl format DanNet - Danish wordnet with three-clause BSD-like license
Retskrivningsordbogen . The official Danish spelling dictionary digitally available under its own special license

Awesome Danish / Data / Dictionaries and ontologies / Retskrivningsordbogen

Opslagsord og ordklasser in CSV format
Lexemes, word classes and inflections. Excerpt available in CSV format. Full list presumably available upon request

Awesome Danish / Data / Dictionaries and ontologies

Stavekontrolden word list with 160,132 Danish words. Used, e.g., for spelling suggestion in LibreOffice. Licensed under GPL, LGPL, and MPL
The Concise Danish Dictionary / The Comprehensive Danish Dictionary / Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license
Interactive Terminology for Europe (IATE) - European Union terminology database. October 2020 version contains over 500,000 Danish terms
The Danish FrameNet Lexicon, 40,267 lines resource containing 5,300 verbs and 6,490 verbal nouns
Wikidata lexemes - structured database with metadata about lexemes, their forms and their senses. Over 1,290,000 lexemes, including Danish ones, in April 2024

Awesome Danish / Data / Dictionaries and ontologies / Wikidata lexemes

Overview over Danish lexemes in Ordia webapp with overview of content of Wikidata lexemes based on SPARQL queries
Wikidata lexemes latest lexemes dump in ttl official dump of lexeme-only part of Wikidata

Awesome Danish / Data / Dictionaries and ontologies

NST-ngrams An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM
AFINN 449 over 2 years ago Danish lexicons annotated for sentiment
concreteness-estimates-da 0 almost 7 years ago Bill D. Thompson's concreteness estimates for Danish words, as detailed in ( )
SAM lexicon 7 over 4 years ago sentiment analysis word list extended from AFINN to 4275 lines. Described in
Danish Swadesh List List of Danish words of basic concepts from The Rosetta Project
Sketch Engine cloud service with wordlists, thesaurus, collocations, n-grams etc. Free for academic use in the European Union and paid service for commercial use

Awesome Danish / Data / Word sets

Danish-Similarity-Dataset 8 over 4 years ago Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen. Also available in
Wordsim353-da 18 about 4 years ago Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in
Four words 18 about 4 years ago 100 odd-one-out sets of 4 words or phrases
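
Word pair sets like the ones above are typically used to evaluate embeddings: the model's cosine similarity for each pair is compared with the human score using Spearman's rank correlation. The following is a minimal sketch of that evaluation using only the standard library. The vectors and gold scores are made-up toy values, not entries from any dataset in this list.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy 2-dimensional "embeddings" (hypothetical values).
vectors = {
    "hund": [1.0, 0.1],
    "kat": [0.9, 0.2],
    "bil": [0.1, 1.0],
    "cykel": [0.3, 0.9],
    "hus": [0.5, 0.5],
}
# Toy gold similarity scores (hypothetical values).
gold = [("hund", "kat", 8.5), ("bil", "cykel", 7.0),
        ("hund", "bil", 1.5), ("kat", "hus", 3.0)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in gold]
human_scores = [score for _, _, score in gold]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

In practice, pairs containing words missing from the embedding vocabulary have to be skipped or handled explicitly, and the number of skipped pairs should be reported alongside the correlation.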

Awesome Danish / Data / Embeddings

cc.da.300 ( ) - fastText-trained embedding on Danish part of Common Crawl and Danish Wikipedia. Read more about the method in ( )
wiki.da ( ) - fastText-trained embedding on Danish Wikipedia. Read more about the method in ( )
Byte-Pair Encoding embedding 1,184 about 2 months ago Gensim-based subword embedding. A large number of Danish embeddings are available. They differ in the size of the vocabulary (from 1000 to 200000) and subspace dimensions (from 25 to 300)
NLPL word embeddings repository NLPL word embeddings repository by Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020

Awesome Danish / Data / Embeddings / NLPL word embeddings repository

Danish NLPL word embedding 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus

Awesome Danish / Data / Embeddings

Danish DSL and Reddit word2vec word embeddings 300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie trained on Danish DSL corpus and Reddit
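
Pre-trained word embeddings are often distributed in a plain-text format where the first line holds the vocabulary size and the dimension, and each following line holds a word and its vector components separated by spaces. fastText's `.vec` files follow this layout. Assuming that format, the following is a minimal sketch of a loader. The function name `load_vec` and the three-dimensional sample data are hypothetical, and real files are far larger, so the `limit` argument keeps only the first entries.

```python
import io
from typing import Dict, List, Optional, TextIO


def load_vec(handle: TextIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read word vectors from a text file with a 'count dim' header line."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return vectors


# Made-up sample in the same layout: 3 words, 3 dimensions.
sample = "3 3\nog 0.1 0.2 0.3\ni -0.1 0.0 0.5\ndet 0.4 -0.2 0.1\n"
vectors = load_vec(io.StringIO(sample), limit=2)
print(vectors)
```

For a downloaded file, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"`.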

Awesome Danish / Data / Neural text models

A-ttack 6 over 1 year ago Ælæctra-based model for detection of "textual attacks" developed by . Related to the Ha-te model
Danish BERT 161 about 3 years ago Certainly's (Botxo/Møllerhøj) weights for a BERT trained on a large Danish corpus
Danish ELECTRA 30 about 3 years ago Philip Tamimi-Sarnikowski's Danish ELECTRA model. Available in the Transformers library
daT5-summariser Danish abstractive summarisation of news articles based on mT5-base
ConvBERT 30 about 3 years ago Philip Tamimi-Sarnikowski's model
Danish ELMo on OSCAR (Link does not work as of December 2020)
Ha-te 6 almost 3 years ago Hate speech detection based on Ælæctra developed by . Related to the A-ttack model
mfaq Multilingual FAQ retrieval model. Described in ( )
Ælæctra Malte Højmark-Bertelsen's Danish Gigaword-trained Electra-based model
Multilingual sentence transformers Pre-trained multilingual sentence transformers
wiki40b-lm-da language model trained on Danish from Wiki40B dataset
WikiBERT 34 over 4 years ago BERT model for many languages, including Danish. Described in ( )

Awesome Danish / Data / Neural speech models

Hugging Face List of models for Danish automatic speech recognition
Alvenir Wav2vec2 Pretrained Danish neural model
Whisper Multilingual neural model from OpenAI
xls-r-300m-danish-nst-cv9 Pretrained Danish neural model

Awesome Danish / Tools / Lemmatization

Lemmy 75 about 3 years ago Lemmatizer for Danish in Python
cstlemma 35 4 months ago lemmatiser
spaCy Python-based package with lemmatization

Awesome Danish / Tools / Punctuation

punctfix 22 9 months ago "Adds punctuation and capitalization for a given text."

Awesome Danish / Tools / Named entity recognition

ScandiNER Scandinavian named entity recognition, achieving state-of-the-art performance in Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese
DaLUKE 9 almost 3 years ago Danish named entity recognition based on LUKE. Described in
spaCy Python-based named entity extraction
daner 17 over 5 years ago Named entity extraction from ITU NLP. Described in ( )
flair+danlp ner-tagger 198 11 months ago Flair NER tagger trained by the Alexandra Institute
Polyglot named entity extraction
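
Named entity annotations and tagger output are often represented as one BIO tag per token, where `B-` opens an entity, `I-` continues it and `O` marks tokens outside any entity. The following is a minimal sketch of converting such tags into entity spans. The function name `bio_to_spans`, the label names and the example sentence are illustrative assumptions and do not reflect the output format of any specific tool listed above.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (label, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        # The open span continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        # A B- tag opens a span; a stray I- tag is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i
    return spans


tokens = ["Mette", "Frederiksen", "besøgte", "Aarhus", "Universitet", "i", "går", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O", "O"]

spans = bio_to_spans(tags)
entities = [(label, " ".join(tokens[start:end])) for label, start, end in spans]
print(entities)
```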

Awesome Danish / Tools / Entity linking

Babelfy Web app and service for linking words and entities
DBpedia Spotlight DBpedia-based entity linker. Described in ( )

Awesome Danish / Tools / Sentiment analysis

afinn 449 over 2 years ago Python package with AFINN Danish lexicon annotated for sentiment, also installable with
Hisia 13 6 months ago Python package with pre-trained machine-learning based Danish sentiment analysis by Prayson Wilfred Daniel
senda 19 over 3 years ago Python package with transformer-based sentiment analysis from Ekstra Bladet Analyse with as of 2021 on one dataset
Sentida 20 almost 3 years ago R package with Danish sentiment lexicon and handling of, e.g., negation. Detailed in ( )
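
Lexicon-based sentiment analysis, the approach behind word lists annotated for sentiment, can be sketched as summing per-word polarity values, optionally flipping the sign after a negation word. The following is a minimal illustration of that idea only. The lexicon values, the negation list and the `score` function are hypothetical: they are not taken from AFINN or Sentida and do not reproduce either package's API or rules.

```python
import re
from typing import Dict

# Hypothetical toy lexicon; real lexicons contain thousands of entries.
TOY_LEXICON: Dict[str, int] = {"god": 3, "dejlig": 3, "dårlig": -3, "kedelig": -2}
NEGATIONS = {"ikke", "aldrig"}


def score(text: str, lexicon: Dict[str, int] = TOY_LEXICON) -> int:
    """Sum word polarities; a negation flips only the token right after it."""
    tokens = re.findall(r"\w+", text.lower())
    total = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        value = lexicon.get(token, 0)
        total += -value if negate else value
        negate = False  # the negation applies to one following token only
    return total


for sentence in ["En dejlig dag", "Filmen var ikke god", "Det var kedelig og dårlig"]:
    print(sentence, score(sentence))
```

Flipping only the next token is a deliberate simplification. It misses cases such as "ikke særlig god", where the negated word is further away.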

Awesome Danish / Tools / Automatic Speech Recognition

danspeech 28 almost 2 years ago DeepSpeech2-based Danish speech recognition in Python
kaldi-sprakbanken 14,287 about 2 months ago A recipe for training a state-of-the-art (2017) speech recogniser for Danish based on the 16kHz NST database

Awesome Danish / Tools / Speech Synthesis (text-to-speech)

espeak An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe
ResponsiveVoice Commercial Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use
Google Cloud Text-to-Speech Commercial Web-based text-to-speech synthesis for a number of languages, including Danish
Amazon Polly Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at

Awesome Danish / Tools / Fundamental processing

DaNLP 198 11 months ago "a repository for Natural Language Processing resources for the Danish Language."
dapipe 7 over 6 years ago Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies
UDPipe Non-language specific version of dapipe. Newer version of the Danish-DDT model than that which is offered by dapipe is available at
DKIE GATE pipeline including wrapped Danish models for Stanford CoreNLP
StanfordNLP. Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available
bornholmsk 2 over 2 years ago Datasets and embeddings for the Bornholmsk dialect
spaCy Python-based natural language processing package
dacy 93 22 days ago Danish spaCy pipeline

Awesome Danish / Competitions

ELEXIS Monolingual Word Sense Alignment Task Predicting the relationship between two senses in each of several languages, including Danish
OffensEval 2020 - Danish Offensive Language Identification in Social Media competition. Described in ( )

Awesome Danish / Benchmarks

Danoliterate Overview of the performance of language models on a range of individual benchmarks

Awesome Danish / Resources about resources

Danish resources Finn Årup Nielsen's PDF with pointers to Danish resources
Scholia's topic aspect for Danish, works (mostly scientific articles) about "Danish" as listed in Wikidata
DaNLP 198 11 months ago Alexandra Institute's list of Danish resources
Language Technology Resources for Danish, list from Det Danske Sprog- og Litteraturselskab
European Language Resources Association (ELRA) list for Danish, list of various annotated corpora available for purchase with both commercial and non-commercial licenses
sprogteknologi.dk List of Danish language resources. Compiled by the Agency for Digitisation
