awesome-danish

Danish NLP dataset

A curated collection of Danish language resources and datasets for natural language processing tasks.

A curated list of awesome resources for Danish language technology


Awesome Danish / Data / Corpora
Danish Gigaword			Collection of about one billion (10^9) words of Danish text. Described in ( )
Danish review dataset	2	almost 3 years ago	Trustpilot-crawled dataset by Alessandro Gianfelici with 44,085 reviews
OSCAR			Danish corpus derived from the Common Crawl corpus. Described in ( )
Awesome Danish / Data / Corpora / CLARIN-DK-UCPH
The Danish Parliament Corpus 2009 - 2017, v1			Licensed under Creative Commons - Attribution 4.0 International
Grundtvig's Works Corpus			Not for commercial use, as the license is Creative Commons - Attribution-NonCommercial 4.0 International
DK-CLARIN Reference Corpus of General Danish			Only for academic use
Awesome Danish / Data / Corpora
DanFEVER			Danish text corpus with over 6,400 claims and support. Described in ( )
DanNet			wordnet with usage examples. The usage examples have been used for word sense disambiguation, see
SemDaX	1	over 3 years ago	POS-tagged (only adjectives, nouns and verbs), super sense tagged and BIO-tagged sentences. For educational, teaching or research purposes only
NOMCO			"an annotated multimodal collection of conversational Danish". Apparently not directly available for download. [ ]
Danish Propbank			commercial resource with 87,000 tokens annotated with morphosyntactic, VerbNet classes and semantic roles
Danish Dependency Treebank v. 1.0			Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE
Mr. Bean corpus			Small Danish-Italian corpus with written and spoken retelling (of Mr Bean episodes) and argumentative text (about smoking). Possibly described in
Køge Corpus			Danish-Turkish transcribed corpus by Jens Normann Jørgensen
Danske taler			Collection of Danish speeches. API available at
DKhate			Corpus of 3,600 hate speech instances from Twitter and Reddit as well as news comments. Described in ( )
Scholia			DaNewsroom - Danish summarization dataset. Probably to appear in 2020. Described in ( )
Awesome Danish / Data / Corpora / Wikipedia
wiki40b/da			Cleaned-up text from Danish Wikipedia. Described in ( )
Awesome Danish / Data / Corpora
XED	56	about 3 years ago	emotion annotated movie subtitles. Described in ( )
DaN+	5	over 3 years ago	annotated for nested named entities on top of the entire Danish Universal Dependencies (UD_Danish-DDT) and 3 new web domains and includes lexical normalization. Described in
WikiANN			Named entity annotated corpus. Described in ( )
Corona Dataset	11	about 6 years ago	Question dataset from Certainly annotated for domain and intent
Awesome Danish / Data / Parallel corpora
Europarl			Parallel sentences between Danish and English from the European Parliament
ITU Faroese Pairs Dataset			Faroese-Danish parallel text. Described in ( )
JW300			"a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average"
OpenSubtitles2018			Parallel corpus from movie and tv subtitles. Described in
Tatoeba			Sentences
WikiMatrix	3,604	about 2 years ago	Parallel sentences from Wikipedias. 1,620 language pairs, including Danish
Awesome Danish / Data / Spoken language corpora
CoRal			Danish Conversational and Read-aloud Dataset
DanPASS			Described in ( )
LANCHART			Centre for Language Change In Real Time. Various audio recordings. Whether the data is available is not immediately apparent. Described in, e.g., ( )
Common Voice			Crowdsourced multilingual annotated speech dataset. As of March 2023, 11 hours of validated speech are distributed. Sentences can be entered collaboratively at . Common Voice is described in ( )
FT Speech			Described in ( )
Awesome Danish / Data / Spoken language corpora / NST
NST-speech-22khz			A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation
NST-speech-16kHz			A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing
NST-speech-44kHz			A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis
Awesome Danish / Data / Spoken language corpora
VoxLingua107			28 hours audio with unannotated Danish speech sampled from YouTube videos. Described in ( )
VoxPopuli	517	over 3 years ago	Speech from the European Parliament including 13,600 hours of unannotated Danish. Described in ( )
Wikimedia Commons Audio files of Danish language			Recordings of readings of articles from the Danish Wikipedia, Danish words and a few Danish literary works
Awesome Danish / Data / Dictionaries and ontologies
Det Centrale Ordregister			identifier for words and their inflections with 516,017 forms (COR)
The Danish Sentiment Lexicon	8	over 3 years ago	Det Danske Sentimentleksikon (DDS) 13,859 headwords assigned with polarity values
NST-lexical-database			A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service
DanNet, Danish Wordnet (v 2.2) - owl format			DanNet - Danish wordnet with three-clause BSD-like license
Retskrivningsordbogen			The official Danish spelling dictionary, digitally available under its own special license
Awesome Danish / Data / Dictionaries and ontologies / Retskrivningsordbogen
Opslagsord og ordklasser			in CSV format
Excerpt			Lexemes, word classes and inflections. An excerpt is available in the CSF format. The full list is presumably available upon request
Awesome Danish / Data / Dictionaries and ontologies
Stavekontrolden			Word list with 160,132 Danish words. Used, e.g., for spelling suggestion in LibreOffice. Licensed under GPL, LGPL, and MPL
The Concise Danish Dictionary			/The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license
Interactive Terminology for Europe			(IATE) - European Union terminology database. October 2020 version contains over 500,000 Danish terms
The Danish FrameNet Lexicon			40,267-line resource containing 5,300 verbs and 6,490 verbal nouns
1,290,000 lexemes			Wikidata lexemes - structured database with metadata about lexemes, their forms and their senses. Over 1,290,000 lexemes, including Danish ones, in April 2024
Awesome Danish / Data / Dictionaries and ontologies / 1,290,000 lexemes
Overview over Danish lexemes in Ordia			webapp with overview of content of Wikidata lexemes based on SPARQL queries
Wikidata lexemes latest lexemes dump in ttl			official dump of lexeme-only part of Wikidata
Awesome Danish / Data / Dictionaries and ontologies
NST-ngrams			An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM
AFINN	453	over 4 years ago	Danish lexicons annotated for sentiment
concreteness-estimates-da	0	over 8 years ago	Bill D. Thompson's concreteness estimates for Danish words, as detailed in ( )
SAM lexicon	7	about 6 years ago	sentiment analysis word list extended from AFINN to 4275 lines. Described in
Danish Swadesh List			List of Danish words of basic concepts from The Rosetta Project
Sketch Engine			Cloud service with wordlists, thesaurus, collocations, n-grams etc. Free for academic use in the European Union and a paid service for commercial use
Awesome Danish / Data / Word sets
Danish-Similarity-Dataset	8	about 6 years ago	Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen. Also available in
Wordsim353-da	18	almost 6 years ago	Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in
Clinical similarity dataset	0	over 3 years ago	289 word pairs scored for similarity
Four words	18	almost 6 years ago	100 odd-one-out sets of 4 words or phrases
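An odd-one-out set such as those in Four words can be used to probe a word embedding: the word whose vector is least similar to the other three is taken as the outlier. The following is a minimal sketch of that idea, not the evaluation code of any listed dataset. The function names and the two-dimensional `toy_vectors` are hypothetical stand-ins for a real Danish embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def odd_one_out(words, vectors):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    return min(words, key=mean_similarity)


# Hypothetical toy vectors (not taken from any real embedding):
# three animals point in one direction, "bil" (car) in another.
toy_vectors = {
    "hund": [1.0, 0.1],
    "kat": [0.9, 0.2],
    "hest": [1.0, 0.3],
    "bil": [0.1, 1.0],
}

print(odd_one_out(["hund", "kat", "hest", "bil"], toy_vectors))
```

With a real embedding, accuracy over all sets would be the fraction of sets where the returned word matches the annotated outlier.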
Awesome Danish / Data / Embeddings
cc.da.300			( ) - fastText-trained embedding on Danish part of and Danish Wikipedia. Read more about the method in ( )
wiki.da			( ) - fastText-trained embedding on Danish Wikipedia. Read more about the method in ( )
Byte-Pair Encoding embedding	1,189	almost 2 years ago	Gensim-based subword embedding. A large number of Danish embeddings are available. They differ in the size of the vocabulary (from 1000 to 200000) and subspace dimensions (from 25 to 300)
NLPL word embeddings repository			NLPL word embeddings repository by Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020
Awesome Danish / Data / Embeddings / NLPL word embeddings repository
Danish NLPL word embedding			100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus
Awesome Danish / Data / Embeddings
Danish DSL and Reddit word2vec word embeddings			300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie trained on Danish DSL corpus and Reddit
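Several of the embeddings above are distributed as plain-text vector files. As a sketch, the reader below assumes the word2vec/fastText text layout (a header line with the word count and dimension, then one word and its values per line). The function name `load_vec` and the sample data are hypothetical, and the file name in the usage note is an assumption about how the fastText download is named.

```python
import io


def load_vec(handle, limit=None):
    """Read word vectors from a text-format handle.

    Assumes a header line "<count> <dim>" followed by lines of the form
    "<word> <v1> ... <vdim>". Returns (vectors, dim).
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue  # skip malformed or empty lines
        # The last `dim` fields are the values; anything before is the word.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return vectors, dim


# Tiny in-memory sample in the assumed layout (values are made up).
sample = "2 3\nhund 0.1 0.2 0.3\nkat 0.4 0.5 0.6\n"
vectors, dim = load_vec(io.StringIO(sample))
print(dim, sorted(vectors))
```

For a downloaded file, something like `load_vec(open("cc.da.300.vec", encoding="utf-8"), limit=50000)` would read only the first 50,000 words, which keeps memory use manageable.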
Awesome Danish / Data / Neural text models
A-ttack	6	about 3 years ago	Ælæctra-based model for detection of "textual attacks" developed by . Related to the Ha-te model
Danish BERT	164	over 4 years ago	Certainly's (Botxo/Møllerhøj) weights for a BERT model trained on a large Danish corpus
Danish ELECTRA	30	almost 5 years ago	Philip Tamimi-Sarnikowski's Danish ELECTRA model. Available in the transformers library
daT5-summariser			Danish abstractive summarisation of news articles based on mT5-base
ConvBERT	30	almost 5 years ago	Philip Tamimi-Sarnikowski's model
Danish ELMo on OSCAR			(Link does not work as of December 2020)
Ha-te	6	over 4 years ago	Hate speech detection based on Ælæctra developed by . Related to the A-ttack model
mfaq			Multilingual FAQ retrieval model. Described in ( )
Ælæctra			Malte Højmark-Bertelsen's Danish Gigaword-trained Electra-based model
Multilingual sentence transformers			Pre-trained multilingual sentence transformers
wiki40b-lm-da			language model trained on Danish from Wiki40B dataset
WikiBERT	34	about 6 years ago	BERT model for many languages, including Danish. Described in ( )
Awesome Danish / Data / Neural speech models
Hugging Face			List of models for Danish automatic speech recognition
Alvenir Wav2vec2			Pretrained Danish neural model
Whisper			Multilingual neural model from OpenAI
xls-r-300m-danish-nst-cv9			Pretrained Danish neural model
Awesome Danish / Tools / Lemmatization
Lemmy	76	almost 5 years ago	Lemmatizer for Danish in Python
cstlemma	36	about 2 years ago	lemmatiser
spaCy			Python-based package with lemmatization
Awesome Danish / Tools / Punctuation
punctfix	23	over 2 years ago	"Adds punctuation and capitalization for a given text."
Awesome Danish / Tools / Named entity recognition
ScandiNER			Scandinavian named entity recognition, achieving state-of-the-art performance in Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese
DaLUKE	9	over 4 years ago	Danish named entity recognition based on LUKE. Described in
spaCy			Python-based named entity extraction
daner	17	about 7 years ago	Named entity extraction from ITU NLP. Described in ( )
flair+danlp ner-tagger	199	over 2 years ago	Flair NER tagger trained by the Alexandra Institute
Polyglot named entity extraction			-
Awesome Danish / Tools / Entity linking
Babelfy			Web app and service for linking words and entities
DBpedia Spotlight			DBpedia-based entity linker. Described in ( )
Awesome Danish / Tools / Sentiment analysis
afinn	453	over 4 years ago	Python package with AFINN Danish lexicon annotated for sentiment, also installable with
Hisia	13	about 2 years ago	Python package with pre-trained machine-learning based Danish sentiment analysis by Prayson Wilfred Daniel
senda	19	about 5 years ago	Python package with transformer-based sentiment analysis from Ekstra Bladet Analyse with as of 2021 on one dataset
Sentida	20	over 4 years ago	R package with Danish sentiment lexicon and handling of, e.g., negation. Detailed in ( )
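Lexicon-based tools such as afinn score a text by summing the valences of the words it contains. The sketch below illustrates that general approach only; it does not use the afinn package, and both `sentiment_score` and the valences in `TOY_LEXICON` are hypothetical rather than taken from AFINN or any other listed lexicon.

```python
import re

# Hypothetical mini lexicon: word -> valence. Real lexicons are far larger.
TOY_LEXICON = {
    "god": 3,
    "fantastisk": 4,
    "dårlig": -3,
    "elendig": -3,
}


def sentiment_score(text, lexicon):
    """Sum the valences of all lexicon words found in the lower-cased text."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


for sentence in ["En fantastisk god dag", "Dårlig service"]:
    print(sentence, sentiment_score(sentence, TOY_LEXICON))
```

A plain sum ignores negation ("ikke god") and intensifiers, which is the kind of handling the Sentida entry mentions.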
Awesome Danish / Tools / Automatic Speech Recognition
danspeech	29	over 3 years ago	DeepSpeech2-based Danish speech recognition in Python
kaldi-sprakbanken	14,362	over 1 year ago	A recipe for training a state-of-the-art (2017) speech recogniser for Danish based on the 16kHz NST database
Awesome Danish / Tools / Speech Synthesis (text-to-speech)
espeak			An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe
ResponsiveVoice			Commercial Web-based (Javascript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use
Google Cloud Text-to-Speech			Commercial Web-based text-to-speech synthesis for a number of languages, including Danish
Amazon Polly			Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at
Awesome Danish / Tools / Fundamental processing
DaNLP	199	over 2 years ago	"a repository for Natural Language Processing resources for the Danish Language."
dapipe	7	about 8 years ago	Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies
UDPipe			Non-language specific version of dapipe. Newer version of the Danish-DDT model than that which is offered by dapipe is available at
DKIE			GATE pipeline including wrapped Danish models for Stanford CoreNLP
StanfordNLP			Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available
bornholmsk	2	about 4 years ago	Datasets and embeddings for the Bornholmsk dialect
spaCy			Python-based natural language processing package
dacy	93	over 1 year ago	Danish spaCy pipeline
Awesome Danish / Competitions
ELEXIS Monolingual Word Sense Alignment Task			Predicting the relationship between two senses in each of several languages, including Danish
OffensEval 2020 - Danish			Offensive Language Identification in Social Media competition. Described in ( )
Awesome Danish / Benchmarks
Danoliterate			Overview of the performance of language models on a range of individual benchmarks
ScandEval			Overview of the performance of language models on a range of individual benchmarks, for Danish as well as other Germanic languages
Awesome Danish / Resources about resources
Danish resources			Finn Årup Nielsen's PDF with pointers to Danish resources
Scholia's topic aspect for Danish			Works (mostly scientific articles) about "Danish" as listed in Wikidata
DaNLP	199	over 2 years ago	Alexandra Institute's list of Danish resources
Language Technology Resources for Danish			List from Det Danske Sprog- og Litteraturselskab
European Language Resources Association (ELRA) list for Danish			List of various annotated corpora available for purchase with both commercial and non-commercial licenses
sprogteknologi.dk			List of Danish language resources. Compiled by the Agency for Digitisation
