awesome-linguistics

Language toolkit

A curated collection of resources and tools for linguistics and natural language processing

A curated list of anything remotely related to linguistics


awesome-list, language, linguistics, resources

Platforms and toolkits

CLARIN-D web tools Tools for Analysing Research Data
CorpusExplorer Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 50 interactive visualizations under a user-friendly interface
Haxe-linguistics 26 over 3 years ago Early linguistic analysis and natural language processing library for Haxe
Natural 10,625 3 months ago General natural language tools for Node.js
Natural Language ToolKit (NLTK) The most complete platform for building Python programs to work with human language data
Snowball Snowball is a language in which stemming algorithms can be easily represented
spaCy Industrial-strength Natural Language Processing in Python
Mate Tools , webservice via
UBIAI Easy-to-use text annotation tool for teams with comprehensive auto-annotation features. Supports NER, relations, and document classification, as well as OCR annotation for invoice labeling
textblob-de 104 over 3 years ago Nice alternative to spaCy (see above)
UralicNLP 70 16 days ago An open source Python library for processing morphologically rich and, for the most part, endangered Uralic languages. It can do morphological analysis, generation, lemmatization, disambiguation and lexical lookup for a great many Uralic languages

Algorithms

Stemming algorithms for various European languages Various stemming algorithms from Snowball
The Porter Stemmer Algorithm The ‘official’ home page for distribution of the Porter Stemming Algorithm, written and maintained by its author, Martin Porter

Data sets

EuroRomCom Data 20 about 7 years ago JSON formatted Pan-Romance word lists
Araneum Germanicum
CEHugeWebCorpus German corpus based on CommonCrawl
Digitales Wörterbuch der deutschen Sprache (DWDS)
GC4 Corpus (CommonCrawl)
IDS Corpora German Reference Corpus
Leipzig Corpora Collection Sampled sentences in different languages
SdeWaC Large German web corpus
C-WEP
DysList (list of dyslexic errors) 5 almost 6 years ago
Falko
Litkey
OpinionSpam 2 about 7 years ago

Resources

Low Resource Languages 390 7 months ago A list of resources for conservation, development, and documentation of low resource (human) languages
Language Science Press Language Science Press is a born-digital scholar-led open access publisher in linguistics

Deep learning models and transformers

dbmdz BERT models 155 almost 2 years ago
Deepset German BERT model
Evaluating German Transformer Language Models with Syntactic Agreement Tests 7 over 1 year ago
German ELMo Model 28 almost 5 years ago
german-transformer-training 23 over 3 years ago
GermLM 14 over 5 years ago (NER exploration)
GerPT2 20 over 2 years ago
Sentence Transformers 15,329 6 days ago

On Wikipedia

Bag of words model
Document classification
Language models
Naive Bayes classification
Natural language processing
Outline of natural language processing
Part-of-speech tagging
Sentiment analysis
Term frequency - inverse document frequency
Vector space model

On YouTube

Computational Linguistics Lecture Playlist (YouTube) Lectures for University of Maryland class on computational linguistics
The Virtual Linguistics Campus CC-licensed educational videos interconnected with Marburg University's e-learning platform of the same name

Books

Essentials of Linguistics, 2nd edition An introductory book
Introduction to Linguistics
Natural Language Processing with Python The book from the NLTK package
Text Mining with R
Foundations of Computational Linguistics
Foundations of Statistical Natural Language Processing
Semisupervised Learning for Computational Linguistics
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition
The Oxford Handbook of Computational Linguistics

Standards

DTA Basisformat
ISO TC 37 SC 4
UIMA

Lists

15 most popular books on Goodreads
corpus-linguistics GitHub topics &
nlp-datasets 5,775 almost 2 years ago
NLP-progress 22,715 4 months ago
/r/LanguageTechnology/
awesome-nlp 16,768 about 1 year ago
Awesome Community-Curated NLP List 196 over 2 years ago
awesome-chinese-nlp 7,808 over 1 year ago
awesome-danish 165 17 days ago
awesome-hungarian-nlp 224 about 1 year ago
awesome Information Retrieval 1,069 over 1 year ago
Indonesian NLP 279 almost 3 years ago
Norwegian NLP resources 177 over 3 years ago
German NLP resources 451 22 days ago
awesome-nlp-polish 294 over 3 years ago
awesome-spanish-nlp 330 11 months ago
M. Weisser's list of NLP/Computational Linguistics Resources

Communities

Linguistics Stack Exchange
Untranslatable.co Multilingual urban dictionary
