low-resource-languages

Resources for conservation, development, and documentation of low-resource (human) languages.

GitHub

388 stars
35 watching
56 forks
Language: TeX
last commit: 5 months ago
Linked from 3 awesome lists

Tags: awesome, awesome-list, endangered-languages, human-language, language-documentation, language-learning, language-resources, list, low-resource-languages, lrls, minority-language, natural-language, natural-language-processing, nlp, resourced-languages

Generic Repositories / Single language lexicography projects and utilities / Utilities

Project for Free Electronic Dictionaries A project for a Java MIDlet for mobile phones, for indigenous language dictionaries
Webonary Site which hosts digital dictionaries for single languages
WeSay 18 about 1 month ago Allows language communities to build their own dictionaries. (by SIL International)

Generic Repositories / Software

4lang 37 6 months ago Concept dictionary using Eilenberg machines
accentuate.us a.k.a. "charlifter". Statistical Unicodification of plain text for many languages
alignment-with-openfst 21 almost 8 years ago This is an implementation of the CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, dependency parsing
Apertium Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs
ark-tweet-nlp 0 about 12 years ago CMU ARK Twitter Part-of-Speech Tagger
ArtOfReading 1 over 4 years ago Index and processing scripts related to the Art Of Reading illustration collection
bayesline 0 over 7 years ago A Multinomial Bayesian Classification for Language Identification
bible-corpus-tools 14 almost 2 years ago A collection of tools for reading/processing the multilingual Bible corpus
BloomDesktop 38 15 days ago Bloom Desktop is a hybrid c#/javascript/html/css Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external materia…
BloomLibrary 4 over 3 years ago Bloom Library Single Page App, using AngularJS & Bootstrap, Parse.com backend.
brain 1 over 10 years ago Neural networks in JavaScript
Bristol Uni MT Morphology tools 2 almost 9 years ago This repo is a mirror of scripts previously available on . Included: Ukwabelana - An open-source morphological Zulu corpus and EMMA: A Novel Evaluation Metric for Morphological Analysis
brown-cluster 423 about 1 year ago C++ implementation of the Brown word clustering algorithm
CasualConc CasualConc is a concordance program that runs natively on Mac OS X 10.5 Leopard or later. It was originally designed for casual use (preliminary analysis or non-research purposes), though [the maintainer] has been using it for his own research (and many others have). It can generate KWIC concordance lines, word clusters, collocation analysis, and word count
cdec 183 over 4 years ago Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms
charlint Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test platform for Early Uniform Normalization in the W3C Character Model
chorus 6 12 days ago A version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed
clam 129 7 months ago Computational Linguistics Application Mediator -- Quickly turn NLP applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice
CMU Sphinx CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under BSD style license. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems
cnminlangwebcollect 1 about 4 years ago Chinese minorities website languages detection and websites collection
Cog 22 12 months ago Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties.
convertextract 11 about 1 year ago Convert Excel, Word and PowerPoint files with non-Unicode text (like text requiring SIL fonts) into Unicode, while preserving original file's formatting
CorpusTools 111 3 months ago Phonological CorpusTools
CTK 18 over 8 years ago Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (Original project is on SourceForge)
DataTags 0 almost 10 years ago A system to assess the sensitivity and privacy risk of a dataset, and assign a tag to describe how the dataset must be transferred, stored and accessed.
dataverse 879 9 days ago A data repository framework to share and publish research data
Dative 14 over 1 year ago Dative: software for linguistic fieldwork
dative 14 over 1 year ago A single-page application that interacts with multiple linguistic fieldwork web service databases.
DeepLearnToolbox 0 over 10 years ago Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started
Desmeme 4 6 months ago Database and tools for exploring linguistic templates
dictdb dictionary database for language translation
discoursegraphs 50 over 1 year ago Python-based tool to convert and merge multilayer annotated linguistic data
divvun-gramcheck 9 16 days ago This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error-tags in an XML file with human-readable messages. It is meant to be used as a late stage of a grammar checker pipeline
divvun-keyboard 6 5 months ago keyboard apps for iOS and Android with keyboard layouts for indigenous and minority languages
divvunspell 14 4 months ago (below) rewritten in Rust, for robust concurrency and memory management. Is in practical use about 10x faster than . It uses the same zhfst files as , which are available for all languages in the GitHub org (see below)
DLTK 12 about 9 years ago Deutsch Language Tool Kit.
epitran 631 about 1 month ago Grapheme to Phoneme conversion (G2P) for many low-resource languages
ELDER: Endangered Language Data Electronic Repository 4 almost 13 years ago Endangered Language Data Electronic Repository: A web-based ontologically-compliant collaborative linguistic data cataloguing tool
enchant 342 6 days ago enchant spellchecking library
exsite9 7 9 months ago ExSite9 is a desktop application built to help researchers easily and quickly tag their data files with descriptive metadata and then package the files and associated metadata for submission to a repository. ExSite9 also allows the structural organisation of those files without actually moving their physical location on your local file storage, so you can organise your files and metadata correctly before packaging
fast_align 732 about 2 years ago Simple, fast unsupervised word aligner
fastText 25,869 7 months ago Library for fast text representation and classification
FieldWorks 82 9 days ago FieldWorks is a suite of software tools for language and cultural data, with support for complex scripts. FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you: elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, study morphology
Franc 4,116 4 months ago Natural language detection
FwDocumentation 8 about 2 months ago Developer documentation for FieldWorks (software tools for language and cultural data, with support for complex scripts)
FwLocalizations 0 4 months ago Localizations for FieldWorks
FwSupportTools 2 8 months ago Additional tools for FieldWorks development
Gaia 2,096 over 3 years ago Gaia is a HTML5-based Phone UI for the Boot 2 Gecko Project. NOTE: For details of what branches are used for what releases, see . If you're interested in setting up a keyboard in new language, see
giellakbd-android 11 about 1 month ago A fork of LatinIME (by Google for Android), targeting marginalised languages that also deserve first-class status on mobile operating systems. Used by (see elsewhere on this page)
giellakbd-ios 29 about 2 months ago An open source reimplementation of Apple's native iOS keyboard with a specific focus on support for localised keyboards. Used by (see elsewhere on this page)
giza-pp 264 over 1 year ago GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model. This package also contains the source for the mkcls tool which generates the word classes necessary for training some of the alignment models
gv-crawl 9 almost 10 years ago Global Voices bitext crawler for creating parallel corpora
GlotLID 85 3 months ago Fasttext language identification with support for more than 2000 labels
Glottolog data 12 almost 7 years ago Glottolog provides comprehensive reference information for the world's languages
Gramadóir 13 about 1 year ago Grammar checking engine that is designed for the rapid development of grammar checkers for minority languages and other languages with limited computational resources
grind 5 about 4 years ago An InDesign 5.5 plug-in designed to allow Graphite-enabled smart fonts to be used in Adobe InDesign. This project integrates SIL's Graphite 2 smart font technology with our own implementation of a paragraph composer plugin
hermitcrab 1 over 2 years ago HermitCrab.NET is a flexible morphological/phonological parser that takes an item-and-process approach
hfst-ospell 13 8 months ago HFST spell checker library and command line tool
hfst-ospell-js 0 almost 8 years ago Node bindings for hfst-ospell
hfst-optimized-lookup 12 over 6 years ago HFST optimized-lookup standalone library and command line tool
hundict 21 over 10 years ago bilingual dictionary extractor from parallel corpora
hunspell 2,106 about 2 months ago Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding
huntag 22 over 8 years ago a sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
icu-dotnet 62 about 1 month ago C# wrapper for ICU4C
icu4c 6 about 6 years ago Mirror of svn project at . The FieldWorks branch has some FieldWorks specific enhancements
iLanguage 21 almost 7 years ago A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. Input: a corpus. Uses compression, maximum entropy and fieldlinguistics
ipa-help 0 almost 7 years ago IPA Helps
itweets-geodata 0 over 3 years ago Geodata from Indigenous Tweets
jQuery.ime 173 10 days ago jQuery based input methods library
kbdgen 13 6 months ago Generate keyboards and keyboard layouts for various operating systems
koreksyon 3 about 9 years ago Tools for developing and implementing spell-checking and grammar-checking capabilities in low-resource languages
l20n.js 901 over 5 years ago L20n reinvents software localization. Users should be able to benefit from the entire expressive power of natural languages. L20n keeps simple things simple, and at the same time makes complex things possible. This is the JavaScript implementation of L20n.
langid.py 2,302 almost 5 years ago Stand-alone language identification system
langtech A host of resources provided in SVN by the University of Tromsø. Details are and in English
LEGO Unified Concepticon 0 about 11 years ago Material relating to the LEGO Unified Concepticon
Lex4All 21 about 4 years ago pronunciation LEXicons for Any Low-resource Language
lexdb LexDB is a lexical cognate tracking database. It stores the full provenance of all lexemes and cognate judgements, and allows export into a number of nexus dialects. The database is written in the flexible python/django web framework
LfMerge 2 9 days ago Send/Receive for languageforge.org
liblevenshtein 67 almost 4 years ago A library for generating Finite State Transducers based on Levenshtein Automata
libpalaso 44 23 days ago Palaso Library: A set of .Net libraries useful for developers of Language Software
LinGO Grammar Matrix The LinGO Grammar Matrix is a framework for the development of broad-coverage, precision, implemented grammars for diverse languages
Lingpy 122 10 months ago LingPy: Python library for quantitative tasks in historical linguistics
Linguistica Linguistica is a program designed to explore the unsupervised learning of natural language, with primary focus on morphology (word-structure). It runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed
long-press 306 almost 5 years ago jQuery plugin to ease the writing of accented or rare characters.
low-resource-pos-tagging-2014 9 over 8 years ago Low-Resource POS-Tagging: 2014
lrl 2 over 11 years ago For work concerning low resource languages
MacVoikko 6 over 9 years ago An OS X spelling server based on Voikko
Machine 26 8 days ago Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages (used by FLEx)
Make-extensions 6 almost 7 years ago Scripts for generating hunspell spellchecking extensions
mgiza 161 over 3 years ago A word alignment tool based on famous GIZA++, extended to support multi-threading, resume training and incremental training
Minority Translate Minority Translate is a simple program for helping content generation on smaller sized Wikipedias (actually any sized) by giving pointers to existing articles in other language Wikipedias, so that the user can easily translate or adapt existing texts and thus increase the size and useability of their Wikipedia editions
morfessor 181 almost 4 years ago Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
morpholm 3 over 11 years ago Morphology-aware language models
morph-test 2 over 3 years ago A python script to run tests for generation and analysis of a morphological transducer built using the Giella infrastructure. Works with Hfst, Xerox' fst tools, and with Foma
mosesdecoder 1,577 4 months ago Moses, the machine translation system
moz-l10n-tiers 0 almost 11 years ago Creates a pseudo-locale to evaluate string prioritization for l10n
mukurtucms 82 about 2 months ago The Mukurtu Content Management System (CMS) is an Internet-based platform designed to enable archiving of digital cultural resources
mythes 39 over 1 year ago MyThes is a simple thesaurus that uses a structured text data file and an index file with binary search to lookup words and phrases and return information on part of speech, meanings, and synonyms
myWorkSafe 1 about 6 years ago Smart & Simple Backup for Language Development Workers.
nabu 19 9 days ago nabu is a digital media item management system that provides a catalog of audio and video items, metadata for these items, and information about the workflow status of the items
Natural 10,583 about 2 months ago general natural language facilities for node
NIST 2008 Open Machine Translation Evaluation
NLTK 13,471 10 days ago Natural Language Tool Kit. NLTK Source
node-panlex 6 over 5 years ago node.js client for PanLex
norma 20 over 3 years ago A tool for automatic spelling normalization
nplm 14 about 9 years ago Fork of with some efficiency tweaks and adaptation for use in mosesdecoder
octothorpe 0 over 11 years ago CouchDB-powered wiki thing
OdtXslt 2 over 7 years ago Perform XSLT transform on contents of a package (such as ODT, Docx, etc.)
old-webapp 4 almost 10 years ago Online Linguistic Database --- software for creating web applications to collaboratively document languages.
old 1 about 4 years ago The Online Linguistic Database (OLD): software for linguistic fieldwork.
old-pyramid 8 over 1 year ago Online Linguistic Database migrated to the Pyramid framework
OmegaT-hfst-tokenizer 2 over 4 years ago OmegaT-hfst-tokenizer provides fst-based tokenisation in OmegaT
OpenDataKit Open Data Kit (ODK) is an open-source suite of tools that helps organizations author, field, and manage mobile data collection solutions
OpenNLP 1,425 3 days ago The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
ops-devbox 8 over 1 year ago Ansible playbook for a (linux) developer machine
panlex-tools 8 almost 2 years ago This package contains scripts to transform lexical resources into a format suitable for importing into PanLex. Documentation may be found at
pdsc-collection-viewer 4 about 2 years ago Paradisec Collection Browser
paradigm 1 almost 4 years ago PARADIGM is a .Net (C#) implementation of Joseph E. Grimes' 1983 work entitled "Affix Positions and Cooccurrences: The PARADIGM Program"
pathway 7 4 months ago Preparing language data for publication
pdfdroplet 6 about 2 months ago Library and GUI for imposition of PDF pages (e.g. 2-up)
pepper 23 over 1 year ago Pepper is a pluggable, Java-based, open source converter framework for linguistic data
phonology-assistant 9 almost 2 years ago Phonology Assistant is a discovery tool. Provided with a corpus of phonetic data, it automatically charts the sounds and through its searching capabilities, helps a user discover and test the rules of sound in a language
pressagio 19 almost 5 years ago Pressagio is a library that predicts text based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string
PrimerPro 1 almost 6 years ago The purpose of PrimerPro is to assist the literacy worker in the development of primers for a given language
pyDelphin 79 26 days ago Python libraries for DELPH-IN (Friendly Fork)
RBGParser 46 over 8 years ago Graph-based Dependency Parser
Rosetta Pangloss 0 over 9 years ago The Rosetta Project's Pangloss system
salm 11 almost 7 years ago SALM: Suffix Array and its Applications in Empirical Language Processing by Joy
Salt 15 over 1 year ago A graph-based model to store and manipulate linguistic data
saymore 6 about 2 months ago A tool for making common Language Documentation tasks such as keeping all the resulting files and meta data organized, converting files to archive formats, and transcription
Secwepemc-Facebook 13 over 9 years ago Translate Facebook into unsupported languages
SegParser 9 almost 9 years ago Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing
SeedLing 11 over 6 years ago Building and Using A Seed Corpus for the Human Language Project
Skype in your language 3 almost 9 years ago Translate Skype into unsupported languages
solid 1 about 1 month ago Solid is a software tool that can be used to check, clean up, and convert Standard Format (e.g. Toolbox) lexicon data
SPHERE Conversion Tools Many LDC corpora contain speech files in NIST SPHERE format. The programs below convert SPHERE files to other formats
StandardFormatLib 0 over 9 years ago Standard Format Library
Stanford CoreNLP 9,658 11 days ago Stanford CoreNLP: A Java suite of core NLP tools.
Stanford CoreNLP Python 611 over 6 years ago Python wrapper for Stanford CoreNLP tools
stanza 7,249 13 days ago Stanford NLP group's shared Python tools
str2ipa 10 almost 9 years ago Pronunciation dictionaries for languages with close-to-phonetic writing systems
sugali 2 about 2 years ago Legacy repository of a language identification project for many (many) languages, written for the software project course on NLP projects for low-resource languages
SuGarLike 1 about 10 years ago Language Identification for Low Resource Languages (by Susanne, Guy and Liling)
SyllabiPy 43 almost 2 years ago Python interface for universal syllabification algorithms
tasty-imitation-keyboard A custom keyboard for iOS8+ that serves as a tasty imitation of the default Apple keyboard. Built using Swift and the latest Apple technologies!
TECkit 17 8 months ago A Text Encoding Conversion toolkit
teny 3 almost 12 years ago Tools for low-resource machine translation
TeraDict 6 over 5 years ago Translate English words into hundreds of languages!
Tesseract.js 34,840 11 days ago Pure Javascript OCR for 62 Languages 📖🎉🖥
TexNLP 14 over 12 years ago TexNLP: Texas Natural Language Processing tools
TiMBL TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases
Toney 5 about 10 years ago Tone Classification Software
Field Linguist's Toolbox Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data
Toolbox Scripts for ELAN 0 over 9 years ago Mirror of Alexander Koenig's Toolbox Scripts
ToolsForFieldLinguistics 9 over 5 years ago A collection of scripts and recipes for linguistics
transcriber 1 over 9 years ago An HTML5 transcription tool for Aikuma
translitit-engine 2 over 6 years ago A transliteration engine written in JavaScript
Tsammalex data 6 over 6 years ago Tsammalex is a multilingual lexical database on plants and animals
tweet2learn 3 over 5 years ago An app to make it easier to use your native language on Twitter
twitter_langid 15 over 7 years ago A hierarchical character-word neural network for language identification
UniversalDependencies docs 269 8 days ago Universal Dependencies online documentation
UniversalDependencies tools 203 8 days ago Various utilities for processing the data
VocBench VocBench is a web-based, multilingual, editing and workflow tool that manages thesauri, authority lists and glossaries using SKOS-XL
wavesurfer.js 8,667 25 days ago Navigable waveform built on Web Audio and Canvas (Also has an ELAN plugin)
web-template 3 over 9 years ago This is a web-based template that may be used to present language learning resources to aid language revitalization efforts. It includes a talking dictionary, and a phrasicon, containing sentences and phrases
webcorpus 8 over 9 years ago This project is a collection of scripts and programs for creating a webcorpus from crawled data
wikt2dict 53 about 2 years ago Wiktionary parser tool for many language editions
wikipron 315 14 days ago Retrieves IPA pronunciations for Wiktionary entries
Word Generator WordGenerator generates hypothetical words from specifications of their syllable structure
WordBoundary An experiment in the detection and segmentation of word boundaries
wordbyword 1 about 10 years ago WordByWord is a free, open source, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe at CIDLeS with the support of the Foundation for Endangered Languages
WSI4URLang 0 almost 4 years ago Word Sense Induction (WSI) for Under-resourced Languages (URLang)
XDXF_Makedict 226 5 months ago XDXF dictionary format and "makedict" dictionary converting software (official repository)
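Several of the tools listed above deal with Unicode normalization; charlint, for instance, implements Normalization Form C (NFC). As a rough illustration of what NFC does, the sketch below uses Python's standard `unicodedata` module. It is not charlint's code, and the helper names `to_nfc` and `is_nfc` are made up for this example.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Check whether text is already in NFC by comparing it to its normalized form."""
    return text == to_nfc(text)


if __name__ == "__main__":
    # "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible letter.
    decomposed = "e\u0301"
    composed = to_nfc(decomposed)
    print(len(decomposed), len(composed), is_nfc(decomposed), is_nfc(composed))
```

Normalizing word lists and corpora to a single form before comparing, sorting, or spell-checking avoids treating visually identical strings as different entries, which matters for orthographies that rely heavily on combining diacritics.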

Keyboard Layout Configuration Helpers

jQuery.IME 173 10 days ago jQuery Input Method Editor used on Wikipedia
kbdgen 13 6 months ago Generate keyboards and keyboard layouts for Windows, macOS, X11, iOS, Android and Chrome, from a single, simple yaml file. Also registers languages unknown to Windows, so that after installation, there is a correct and robust association between the designated BCP 47 code (including full support for ISO 639-3) and installed language tools such as keyboards, spelling checkers and other tools
Keyboard 1,775 about 2 years ago Virtual Keyboard using jQuery ~
Keyboards 149 10 days ago Open Source Keyman keyboards
Keyman 390 8 days ago Keyman cross platform input methods. Keyman makes it possible for you to type in over 1,000 languages on Windows, iPhone, iPad, Android tablets and phones, and even instantly in your web browser.
keyboardlayouteditor 244 over 2 years ago Keyboard Layout Editor
Keyboard layout editor 1,297 19 days ago Keyboard Layout Editor
lipika-ime 116 4 months ago Input Method Engine (IME) for Mac OS X with built-in support for all Indic Languages
XKeyboardConfig The non-arch keyboard configuration database for X Window. The goal is to provide the consistent, well-structured, frequently released open source of X keyboard configuration data for X Window System implementations (free, open source and commercial). The project is targeted to XKB-based systems

Annotation

AGTK 0 over 8 years ago AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs. (Original project is on SourceForge)
brendano 8 over 9 years ago Graph Fragment Language for Easy Syntactic Annotation
ELAN ELAN is a professional tool for the creation of complex annotations on video and audio resources
eopas 9 over 1 year ago ETHNOER Online Presentation and Annotation System
FLAT - FoLia Linguistic Annotation Tool 110 3 months ago FLAT is a web-based linguistic annotation environment based around the FoLiA format, a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure
gfl_syntax 8 over 9 years ago Graph Fragment Language for Easy Syntactic Annotation
graf-python 21 about 10 years ago The library graf-python is an open source Python implementation to parse and write GrAF/XML files as described in ISO 24612. The parser of the library creates an annotation graph from the files. The user may then query the annotation graph via the API of graf-python
kwaras 7 11 months ago Tools for ELAN corpus management
LDC Word Aligner 2 over 6 years ago LDC Word Aligner is a software tool used for manual annotation of word alignment developed to support Arabic-English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to generate over 1,000,000 tokens of annotated word alignment data from a variety of genres including broadcast, newswire and web-based sources.
poio-analyzer 13 about 11 years ago Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor lets users add morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt
poio-api 18 over 6 years ago Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F…
pyannotation 16 about 12 years ago PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files
XTrans XTrans is a next generation multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics
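Several entries above (AGTK, graf-python, poio-api) are built on annotation graphs. A minimal sketch of the idea, assuming a simplified model in which nodes are anchored to time offsets in a recording and labeled arcs between nodes carry the annotations. The class and method names here are hypothetical and do not reflect the API of any listed tool or the full ISO 24612 model.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationGraph:
    # node id -> time offset in seconds (simplified, hypothetical representation)
    anchors: dict = field(default_factory=dict)
    # each arc is a tuple: (start node, end node, tier name, label)
    arcs: list = field(default_factory=list)

    def add_anchor(self, node_id, offset):
        """Anchor a node to a point in the signal."""
        self.anchors[node_id] = offset

    def annotate(self, start, end, tier, label):
        """Add a labeled arc spanning two existing anchors."""
        if start not in self.anchors or end not in self.anchors:
            raise KeyError("both anchors must exist before annotating")
        self.arcs.append((start, end, tier, label))

    def labels_at(self, time, tier):
        """Return labels on `tier` whose span covers `time` (start inclusive, end exclusive)."""
        return [
            label
            for start, end, arc_tier, label in self.arcs
            if arc_tier == tier and self.anchors[start] <= time < self.anchors[end]
        ]


if __name__ == "__main__":
    graph = AnnotationGraph()
    for node_id, offset in [(0, 0.0), (1, 0.4), (2, 0.9)]:
        graph.add_anchor(node_id, offset)
    graph.annotate(0, 1, "word", "hello")
    graph.annotate(1, 2, "word", "world")
    graph.annotate(0, 2, "phrase", "greeting")
    print(graph.labels_at(0.5, "word"), graph.labels_at(0.5, "phrase"))
```

Because tiers share the same anchors, word-level and phrase-level annotations stay aligned to the signal without duplicating timing information.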

Format Specifications

spec 21 over 1 year ago The official specification for the DLx linguistic data format.
FoLiA 60 5 months ago FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange
xdxf_makedict 226 5 months ago XDXF dictionary format and "makedict" dictionary converting software (official repository)
Express-Lingua 66 over 10 years ago An i18n middleware for the Express.js framework
Polyglot.js Give your JavaScript the ability to speak many languages
Transifex System for providing a nice, user-friendly, project-oriented approach to translating files. Great for non-technical users, free for open-source projects, decent for minority languages; however, it can take a while to get a new language added to the Transifex system because the ticketing system Transifex uses results in them losing tickets sometimes. Provides translation memory, ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared

Audio automation

arctic-prompts 1 over 8 years ago Generate prompts PDF for CMU ARCTIC dataset
AudioWebService 4 over 1 year ago a simple nodejs server which accepts upload of audio and runs it through praat
AuToBI 56 over 5 years ago Automatic prosodic annotation tool written in Java
BashScriptsForPhonetics 0 almost 11 years ago ( of a dormant project)
esv-text-audio-aligner 93 over 11 years ago ESV Text/Audio Aligner to programmatically obtain the timings for each word in the corresponding audio
html5-audio-read-along 191 almost 7 years ago HTML5 Audio Read-Along
ipa-chart 129 over 3 years ago International Phonetic Alphabet (IPA) Unicode Chart and Character Picker
kaldi-svn-archive 16 about 9 years ago A read-only archive of the original Kaldi SVN repository (mainly to keep sandboxes available)
lex4all 1 over 10 years ago pronunciation LEXicons for Any Low-resource Language ( of a student project)
Montreal-Forced-Aligner 1,306 15 days ago Python interface for forced text/speech alignment
node-pocketsphinx 242 over 5 years ago
opensauce 5 over 7 years ago GNU Octave-compatible version of VoiceSauce
pocketsphinx 3,907 12 days ago PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop
pocketsphinx-ios-demo 76 about 6 years ago Simple demo for iOS
pocketsphinx-python 340 over 2 years ago Python module installed with setup.py
pocketsphinx-ruby 13 over 9 years ago Ruby speech recognition with Pocketsphinx
pocketsphinx-wp-demo 21 over 8 years ago Demo to run pocketsphinx on WP8 platform
pocketsphinx.js 1,492 over 4 years ago Speech recognition in JavaScript
praat-py 0 almost 12 years ago From my PhD days: Praat-Py is a custom build of Praat, the computer program used by linguists for doing phonetic analysis on sound files, to allow for scripts to be written in the Python programming language, rather than in Praat's built-in language. ( of a dormant project)
Praat-Scripts 52 almost 3 years ago Mietta's Scripts
PraatTextGridJS 11 almost 3 years ago A small library which can parse TextGrid into json and json into TextGrid
PraatontheWeb 37 about 3 years ago Web implementation of Praat. Source code, running demo scripts on web, samples and documentation
prosodicParsing 2 over 12 years ago different kinds of HMMs to use for incorporating prosody into basic parsing
Prosodylab-Aligner 331 over 4 years ago Python interface for forced audio alignment using HTK and SoX
prosodylab.alignertools 12 over 9 years ago
Recordmp3js 2 about 9 years ago Record MP3 files directly from the browser using JS and HTML
sphinx4 1,404 almost 2 years ago Pure Java speech recognition library
sphinxbase 530 over 2 years ago
sphinxtrain 181 about 1 month ago
TLSphinx 15 almost 6 years ago Swift wrapper around Pocketsphinx

Text-to-Speech (TTS)

espeak eSpeak is a compact open source software speech synthesizer for English and other languages, for Linux and Windows.
MARY TTS 2,331 over 1 year ago MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java
Ossian Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision

Automatic Speech Recognition (ASR)

Elpis 152 4 months ago Elpis is software for creating speech recognition models and applying them to the transcription of audio. As of 2022, it gives access to Kaldi and Huggingface Transformers
kaldi 14,154 19 days ago This is now the official location of the Kaldi project
Persephone 155 over 1 year ago Persephone aims to make state-of-the-art phonemic transcription accessible to people involved in language documentation, who have a training corpus of about one to four hours of transcribed speech. As of 2022, Persephone is superseded by Elpis

Text automation

clld 54 12 days ago Cross Linguistic Linked Data python library
LaTeX2HTML5 62 over 2 years ago LaTeX web components
MultilingualCorporaExtractor 0 over 11 years ago Node io Spider for extracting multilingual corpora ( of a student project)
SeedLing 2 over 10 years ago Building and Using A Seed Corpus for the Human Language Project ( of a student project)

Experimentation

experigen 34 about 4 years ago A framework for creating linguistic experiments
GamifyPsycholinguisticsExperiments 0 over 12 years ago A simple node server to gamify linguistics experiments, runs offline on a laptop for small scale experiments and online on a server for large scale experiments. Data is sent to a Google spreadsheet. ( of a dormant project)
OpenSesame 236 3 months ago Graphical experiment builder for the social sciences
OPrime 0 almost 10 years ago Open Source Experimentation Libraries - Online and Offline for Android and HTML5
psychopyMegProsody 0 almost 12 years ago Runs MegProsody using PsychoPy
PsychScript 4 almost 10 years ago A HTML5/Javascript library for running behavioural experiments online

Flashcards

Anki 18,389 12 days ago Anki is a program to make and share flashcard decks (including audio) for any language or writing system.
awesome-anki 1,584 3 months ago A curated list of awesome Anki add-ons, decks and resources
VocabLift 3 over 10 years ago Language-learning tool that uses vocabulary from LIFT-format dictionaries produced by programs such as Fieldworks Language Explorer and WeSay

Natural language generation

OpenCCG 204 over 3 years ago OpenCCG library for parsing and realization with CCG. Includes mini-grammars for Inuit, Nez Perce, Basque and others

Computing systems

Common Language Resources and Technology Infrastructure Norway / Clarino One of their projects (not clearly listed here) is about providing an online system for language analysis, so users can connect resources visually, dump in text, and get a result. Kind of like the Yahoo! Pipes but for language processing. Uses the cluster

Android Applications

Aikuma 30 over 8 years ago Android software for recording and translation
Android Speech Recognition Trainer 3 almost 6 years ago Speech recognition training app for low resource languages which interfaces with FieldDB corpora
android-template 0 over 9 years ago This is a template of an Android word-learning app that may be used as a way to introduce a language. It includes a quiz. For the documentation, go to
AndroidFieldDB 3 over 5 years ago An Android app which lets the user build a custom visual and auditory vocabulary, useful for guided anomia treatment and self designed language lessons by heritage speakers
AndroidFieldDBElicitationRecorder 2 almost 11 years ago A general purpose video recording tool
AndroidLanguageLessons 2 almost 6 years ago Lets heritage speakers create self designed language lessons
AndroidProductionExperiment 0 about 11 years ago Android App to run perception experiments
Bevara 3 almost 11 years ago Android Phone Application designed for Linguistic Fieldwork to help preserve, maintain, and save endangered languages
ojoVoz A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. For more information, please go to
pocketsphinx-android 234 over 4 years ago pocketsphinx build for Android
pocketsphinx-android-demo 547 almost 6 years ago

Chrome Extensions

babelfrog 16 over 5 years ago Chrome extension to help learn languages as you browse
DictionaryChromeExtension 6 over 9 years ago Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries)

FieldDB

FieldDB 79 almost 2 years ago An offline/online field database which adapts to its user's terminology and I-Language, has plugins for various data automation routines along the process of primary data collection to cleaning to publication and archival.

FieldDB / FieldDB Webservices/Components/Plugins

AndroidLanguageLearningClientForFieldDB-sikuli 0 about 10 years ago Sikuli tests for AndroidLanguageLearningClientForFieldDB
AuthenticationWebService 0 over 1 year ago A node.js web service which manages users and corpora creation and authentication
bower-fielddb-angular 0 about 9 years ago A bower repository which hosts fielddb-angular components, bower install fielddb-angular --save
bower-fielddb 0 about 4 years ago A bower repository which hosts fielddb core components, bower install fielddb --save
fielddb-spreadsheet-sikuli 1 over 9 years ago sikuli tests for the spreadsheet module
FieldDBActivityFeed 0 over 9 years ago A fielddb activity feed widget which can be embedded in other codebases, websites etc
FieldDBGlosser 0 over 7 years ago A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. bower install fielddb-glosser --save
FieldDBLexicon 0 almost 7 years ago A lexicon browser/editor web widget for FieldDB databases
LanguageClassDashboard 0 about 10 years ago App which provides a view of FieldDB corpora for language teachers
LexiconWebService 0 about 4 years ago A node.js ElasticSearch wrapper for indexing/training lexicons from corpora
LexiconWebServiceSample 1 over 12 years ago A node.js web server which implements the fieldlinguist's lexicon API for the FieldDB project

Academic Research Paper-Specific Repositories

Gargantua 12 almost 9 years ago Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010
ldc-kiy 0 over 11 years ago Materials for: The experimental state of mind in elicitation: illustrations from tonal fieldwork. Submitted to Language Documentation & Conservation,
Learning to map into a Universal POS tagset Yuan Zhang, Roi Reichart, Regina Barzilay and Amir Globerson
low-resource-pos-tagging-2014 9 over 8 years ago Published in: Learning a Part-of-Speech Tagger from Two Hours of Annotation, in Proceedings of NAACL 2013; and in: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages, in Proceedings of ACL 2013
orthotree 10 over 9 years ago Linguistic family tree based on orthographic distance
type-supervised-tagging-2012emnlp 1 over 8 years ago This repository contains the code, scripts, and instructions needed to reproduce the results in the paper: Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. . In Proceedings of EMNLP 2012. This code is frozen as of the version used to obtain the results in the paper. It will not be maintained. To see the updated code, visit
visualizing-language 1 over 12 years ago For visualizations of WALS and other typological databases
WALS-APiCS 0 over 9 years ago Code for working with WALS-APiCS (Atlas of Pidgin and Creole Language Structures) complexity metrics

Example Repositories

CorpusWebService 0 over 2 years ago Über-simple Node.js proxy to enable CORS requests for CouchDB
CorporaForFieldLinguistics 3 about 7 years ago Small corpora from diverse language typologies, useful for testing scripts
startR 0 almost 12 years ago
lucenerevolution-2013 0 over 11 years ago Demo examples for linguistics in Lucene and Solr
berlin-buzzwords-2013 0 over 11 years ago Demo examples for Lucene, Solr, ElasticSearch and OpenNLP from Berlin Buzzwords 2013 talk

Fonts

fontinline 4 about 6 years ago Make inline stroke paths from an outline font
Noto Fonts 2,454 over 1 year ago Noto is Google’s free font family that aims to support all the world’s scripts. Its design goal is to achieve visual harmonization across languages. Noto fonts are under Apache License 2.0
Unicodify Unicodify is a suite of programs for converting text in a variety of 8-bit encodings to Unicode (using the UTF-16 encoding). Unicodify was particularly designed to handle HTML-based text using non-ISCII 8-bit fonts to render South Asian scripts. However, elements of the suite can map other types of non-ASCII 8-bit encodings, such as Latin-2, ISCII and PASCII

Corpora

bible-corpus 172 15 days ago A multilingual parallel corpus created from translations of the Bible
poio-corpus 7 9 months ago The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others

Organizations / On GitHub

batumi Speech recognition and natural language processing for low-resource languages
BloomBooks
unicode-cldr Unicode Common Locale Data Repository (CLDR) Project
cmusphinx Mirror of the SourceForge repositories
dativebase Tools for working with OLD
divvun The Divvun group at UiT develops proofing tools, keyboard apps and other language technology solutions for indigenous and minority languages, especially the Sámi languages.
FieldDB
GiellaLT Home for keyboard layouts, lexicons and morphologies for indigenous and minority languages, especially for morphologically complex languages, using mainly rule-based technologies. The resources are used by Divvun (above) and Giellatekno (below) to build a number of tools for the language communities. Almost everything is open source
HFST Helsinki Finite-State Technology.
hunspell
keymanapp
langtech Language Technology Group, University of Melbourne
lex4all
longnow
MontrealCorpusTools
moses-smt Statistical Machine Translation
mukurtucms
NLTK Natural Language Toolkit
PhonologicalCorpusTools
Projet de recherche sur l'écriture Crowdsourcing or conducting large scale psycholinguistics experiments (or statistically significant field linguistics)
prosodylab Prosodylab at McGill University, Canada
SIL International (Dev) Another SIL organization, with many repositories
SIL International SIL (originally known as the Summer Institute of Linguistics, Inc.) is probably the leading organization which provides software and tools tailored for use by field linguists and lexicographers working on endangered languages. A little-known fact is that much of its code is open sourced on GitHub and SIL is happy to receive open source contributions and collaborate on open source projects
SIL NRSI SIL Non-Roman Script Initiative. The NRSI is a department of SIL International, whose task is to provide assistance, research and development for SIL International and its partners to support the use of non-Roman and complex scripts in language development
StanfordNLP
ucsd-field-lab University of California, San Diego
UniversalDependencies Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on an evolution of (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary
utcompling The University of Texas at Austin's Computational Linguistics Lab.

Organizations / Other OSS Organizations

Giellatekno Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. We focus on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. They use svn for their code: all of it can be found , sorted by language
LOWLANDS LOWLANDS – Parsing low-resource languages and domains
LTRC: Language Technologies Research Center IIIT Hyderabad LTRC addresses the complex problem of understanding and processing natural languages in both speech and text mode. LTRC conducts research on both basic and applied aspects of language technology. It is the largest academic centre of speech and language technology in South Asia. LTRC carries out its work through four labs, which work in synergy with each other, as listed above
The Language Archive Part of the MPI

Tutorials

How to Write a Spelling Corrector by

Language Specific Projects / Afrikaans

Afrikaanse rekenaarlinguïstiek (Afrikaans computational linguistics) — wordlists, corpora, morphological analyser, tagger, word decompounder. Available upon email

Language Specific Projects / Albanian

Apertium rules for Albanian Machine Translation rules
out-of-copyright-albanian-authors Out-of-copyright authors scraped from the Albanian-language Wikipedia
Plis keyboard The Plis keyboard is a keyboard or computer keyboard layout for the Albanian language
spell checking Here you find a collection of Albanian words and information about them. Aspell, Ispell, and MySpell are included

Language Specific Projects / Alutiiq

wiinaq 2 over 1 year ago Word Wiinaq is a dictionary web application with automatically generated ending tables and souped-up search capabilities. It is written in Python using Django

Language Specific Projects / Amharic

HornMorpho 5 over 9 years ago Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

Language Specific Projects / Basque

Matxin An open-source transfer machine translation engine. Linguistic information for the translation from Spanish and Basque (es-eu) is included

Language Specific Projects / Bengali

Bangla-অঙ্কুর for Mac This project aims to develop a phonetic based Bangla typing system for Macintosh computer which can be developed into a transliteration technique in the future
Bengali Writer 1 over 8 years ago 'Bengali Writer' is a set of utilities for computerized editing and typesetting in Bengali, a language of India and Bangladesh. It comprises a set of fonts for Bengali in several formats (METAFONT, BDF, PS), a text editor with spell-checking, export, and more. (Original project is on SourceForge: )
Ekushey Bangla Computing and Localization Project for the Bangla speaking people
Lekho 0 over 8 years ago A collection of tools and resources for using Bangla on computers (Original project is on SourceForge: )

Language Specific Projects / Chichewa

Chichewa 8 over 3 years ago NLP resources for Chichewa

Language Specific Projects / Galician

an-metri-gal 3 about 2 years ago Metrical analysis of verse text in the Galician language (gl-ES)
android_gl_dict 2 over 11 years ago Android Galician (gl_ES) Keyboard Dictionary
aspell-gl 1 about 12 years ago Galician dictionary for aspell
CitiusSentiment 7 over 8 years ago Sentiment analysis (opinion mining) for Portuguese, English, Spanish, and Galician
CitiusTagger A PoS-Tagger and Named Entity Classification tool for Portuguese, English, Galician, and Spanish
Conshuga Galician verb conjugator
corpora 2 over 8 years ago A collection of corpora of Galician (or Galicia-related) words
DepPattern 10 over 6 years ago Dependency Syntactic Parsing for Portuguese, Spanish, English, and Galician, including MetaRomance parser
DOGA_scraper 0 about 10 years ago Galician Official journal scraper
elFinder-language 1 about 8 years ago Galician - Gallego / language for elFinder
EuroWordNetLemon 1 about 9 years ago EuroWordNet lemon lexicons generated from the LMF versions of the Multilingual Central Repository (MCR) EuroWordNet lexicons. It includes lexicons for Spanish, Catalan, Basque & Galician
GalegoDroid Galician Translator for Android
galeXtra 2 over 8 years ago Multiword Extractor for Portuguese, English, Spanish, Galician, French
Galician-Dependency-Treebank 1 almost 8 years ago This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006
Galician-Fuzzy-Text-watch 1 almost 9 years ago Based on Fuzzy Text International by Jesse Hallett, uses the Galician language to display time
galician-locale-for-mac 1 over 8 years ago Galician locale for Mac OS X
gl-syllabler 1 over 8 years ago Split Galician language words into syllables
gl 1 about 2 months ago Galician OmegaT Localisation
hunspell-gl-ciencias 0 about 11 years ago Project aimed at developing a science and maths Hunspell dictionary for the Galician language
hunspell-gl 1 over 11 years ago Galician hunspell dictionaries
hyphen-gl 1 over 12 years ago Galician hyphenation rules
javagalician-java6 3 over 12 years ago The Java Galician Locale is an implementation of Java localization SPIs which will allow the Java VM to use the Galician Language (locales "gl" and "gl_ES"), one of the official languages of Spain, which is not included in Sun's JVM distribution
Linguakit 64 7 months ago Multilingual toolkit for NLP: dependency parser, PoS tagger, NERC, multiword extractor, sentiment analysis, etc
ParlamentoGalicia 0 over 11 years ago Project based on the information extracted from the transcriptions of the sessions held in the Galician Parliament
poss-gl 1 about 13 years ago Galician translation of Producing Open Source Software, by Karl Fogel
rima 1 over 8 years ago Find rhyming words in the Galician language
stopwords-gl 1 almost 8 years ago Galician stopwords collection
texlive-babel-galician 1 about 1 year ago TeXLive babel-galician package
UD_Galician-CTG 1 5 months ago The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus created at the University of Vigo by the TALG NLP research group
UD_Galician-TreeGal 7 5 months ago The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña)
UL_Galician-TreeGal 0 over 6 years ago CoNLL-UL Repository for UD_Galician-TreeGal

Language Specific Projects / Galician / Apertium

apertium-cat-glg 1 about 2 years ago Apertium translation pair for Catalan and Galician
apertium-dict-en-gl 1 almost 9 years ago English-Galician language pair for Apertium
apertium-dict-es-gl 1 almost 9 years ago Spanish-Galician language pair for Apertium
apertium-dict-pt-gl 1 over 11 years ago Portuguese-Galician language pair for Apertium
apertium-en-gl 0 over 2 years ago Apertium translation pair for English and Galician
apertium-es-gl 1 about 3 years ago Apertium translation pair for Spanish and Galician
apertium-glg 0 about 2 years ago Apertium linguistic data for Galician
Apertium-pt-gl.pt-gl-LMF 0 about 10 years ago This is the LMF version of the Apertium bilingual dictionary for the Portuguese and Galician languages
apertium-pt-gl 0 about 3 years ago Apertium translation pair for Portuguese and Galician

Language Specific Projects / Georgian

awesome-georgia 88 about 1 year ago A curated list of awesome libraries and packages specific/related to Georgia (country)
Gadatsqvetilebebi 1 over 7 years ago გადაწყვეტილებები; Web spider and corpora importer for public legal decisions
GeoWordsDatabase 69 almost 7 years ago Around 310 000 unique Georgian words
Kartuli Speech Recognition 4 almost 7 years ago Creating a speech recognition system for Georgian Android users. Codebase to turn any webpage from any alphabet into another alphabet, the default is to turn latin letters into Kartuli. "Do your friends keep commenting on Facebook with English keyboards (either because they forgot to switch, or because they didn't/can't install a Georgian keyboard)? Now you can read the web through კართული eyes."
KartuliChromeExtension 1 over 10 years ago Chrome application that displays every English letter as a Georgian letter
QartuliDaBunebismetkveleba 1 almost 11 years ago Interactive textbook of mathematics and natural science for 2nd and 3rd grade pupils
SakartvelosUzenaesiSasamartloSarke 0 over 10 years ago Mirror of the Supreme Court of Georgia
SamartlosSakonstitutsioSasamartdoSarke 0 over 7 years ago Mirror of the Constitutional Court
translitit-latin-to-mkhedruli-georgian 4 over 7 years ago A Latin to ქართული (Mkhedruli Georgian) transliteration function written in JavaScript
translitit-mkhedruli-georgian-to-ipa 0 over 7 years ago A ქართული (Mkhedruli Georgian) to IPA transliteration function written in JavaScript
Declensions 2 almost 2 years ago Methods to generate declensions for Georgian language

Language Specific Projects / Georgian / Fonts

Stichoza/font-larisome 39 over 3 years ago Iconic font for Georgian currency inspired by Font-Awesome (CSS)
Lotuashvili/BPGNateli 0 about 9 years ago Bower package for BPG Nateli font (CSS)
thecotne/georgian-webfonts 17 about 7 years ago Package for Georgian fonts (CSS)

Language Specific Projects / Georgian / Internationalization and Localization (i18n/l10n)

Stichoza/money-num-to-string 7 9 months ago Convert a number/money to localized string (PHP, JavaScript)
natchkebiailia/NumberToWord 3 almost 7 years ago Convert numbers to localized strings (JavaScript)
d0ragon/number-to-words-ka 3 over 10 years ago Convert numbers to localized strings (PHP)
dimakura/ka 0 almost 11 years ago Common functionality for Georgian projects (Ruby)
dimakura/ka.js 5 about 10 years ago Georgian language support for node and browser (JavaScript)
akalongman/kautilities 4 over 8 years ago Convert Georgian letters to Latin and vice-versa (PHP)
Landish/Laravel-Ka 8 over 3 years ago Georgian Language Pack
Landish/RedactorJS-GE Redactor WYSIWYG HTML Editor Georgian Language Pack (JavaScript)
wenzhixin/bootstrap-table 11,730 2 days ago Bootstrap table with extra features. l10n by and
moment/moment 47,963 about 2 months ago A lightweight date library (JavaScript)
ioseb/geokbd 58 almost 15 years ago Georgian keyboard library (JavaScript)

Language Specific Projects / Guarani

ParaMorfo 5 over 9 years ago morphological analysis and generation of Spanish and Guarani verbs, nouns, and adjectives

Language Specific Projects / Hausa

Hausa 6 about 9 years ago Repository for Hausa NLP tools

Language Specific Projects / Hindi

hindi-morph 0 over 11 years ago An open source morphological analyzer for Hindi

Language Specific Projects / Høgnorsk

hunspell-hn_NO A beginning to a spellchecking tool for Høgnorsk, a conservative variant of Norwegian Nynorsk, based on a set of corpuses

Language Specific Projects / Icelandic

IceNLP 20 8 months ago IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java

Language Specific Projects / Inuktitut

InuktitutAlignerData 3 over 12 years ago Scripts for alignment of laboratory speech production data
InuktitutComputing 10 about 9 years ago Inuktitut Morphological Analyser, transcoder, transliterator, corpus tools, and lexical lists for working with Inuktitut. Usable online at

Language Specific Projects / Irish

aimsigh 1 about 1 year ago Source for the now-defunct aimsigh.com Irish search engine
caighdean 18 22 days ago Code for standardizing Irish language text
fleiscin 1 almost 4 years ago Irish hyphenation patterns for TeX
GaelSpell 17 18 days ago Sources for an Irish language spell checker
tesseract-gle-uncial 3 over 9 years ago OCR for old Irish fonts

Language Specific Projects / Kinyarwanda

kin-morph-fst 6 about 11 years ago Kinyarwanda morphological analyzer
TurboTagger & TurboParser for Kinyarwanda (download) TurboTagger & TurboParser for Kinyarwanda

Language Specific Projects / Kurdish

Kurlex Morphological analyser and lexicon, written in the Alexina framework, licensed under the LGPL-LR
kurmanji-stemmer 1 about 9 years ago NLTK based kurmanji stemmer

Language Specific Projects / Lingala

Lingala NLP NLP tools and resources for Lingala

Language Specific Projects / Lushootseed

Lushootseed 0 over 8 years ago Joshua Crowgey's work on Lushootseed

Language Specific Projects / Malay

MorfoMalayu 5 over 9 years ago morphological analysis of Malay words

Language Specific Projects / Malagasy

Global Voices Malagasy Project This page provides a link to a corpus of parallel news articles in Malagasy and English from the Global Voices project. This corpus was collected and aligned at the sentence level by Victor Chahuneau

Language Specific Projects / Manx

aspell-gv 1 about 12 years ago Manx Gaelic dictionary for aspell
gaelg 3 about 1 month ago NLP resources for Manx Gaelic, mainly in support of the gv2ga MT engine

Language Specific Projects / Migmaq

migmaq-lessons 1 over 9 years ago Repository for website building Mi'gmaq language lessons

Language Specific Projects / Minderico

fredericajordarzambarino 0 about 10 years ago A web-based game for mobile devices in Minderico, based on the "Who Wants to be a Millionaire" TV show

Language Specific Projects / Nishnaabe

Ojibway-iphone-app 0 about 9 years ago An iPhone app with audio and images for learning the Ojibway language
OjibwayMap 1 about 9 years ago An iPhone app with audio and images for learning Ojibway language and culture
nishanimate 1 about 9 years ago A desktop app to facilitate Nishnaabe-language acquisition via animations produced by the natural language processing of audio-accompanied text

Language Specific Projects / Oromo

hornmorpho 5 over 9 years ago Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

Language Specific Projects / Quechua

AntiMorfo 5 over 9 years ago morphological analysis and generation of Quechua nouns, adjectives, and verbs and Spanish verbs
Morphology, spellchecker XFST and FOMA, plus OpenOffice plugin

Language Specific Projects / Sami

divvun-webdemo 2 about 1 year ago simple webdemo for divvun grammar checker.
Giellatekno A host of Sámi tools
Oahpa! A learning portal for Saami languages. Includes WordPress-based, media-rich, lesson-based learning, and morphological and syntactic exercises generated from the morphological and syntactic tools
Neahttadigisánit A morphologically sensitive dictionary, with modes for 'social media input' (which allows users to type a 'relaxed' version of the orthography ( will be recognized also as ), and also includes a JavaScript bookmarklet to offer click-to-read dictionary lookup functionality. Also available for . Giellatekno does a lot for other minority Uralic languages. Following are some keywords for CTRL+F friendliness:

Language Specific Projects / Scottish Gaelic

aspell-gd 1 about 12 years ago Scottish Gaelic dictionary for aspell
briathrachan 2 almost 8 years ago This is the source code to Briathrachan, a Gaelic-English dictionary app for iOS
gaidhlig 3 about 1 year ago NLP resources for Scottish Gaelic, mainly in support of gd2ga/ga2gd MT engines
gd-fcfg 3 over 12 years ago Context-free feature-based grammar of Scottish Gaelic in the NLTK format
gdbank 4 8 months ago Some tools and resources for natural language processing of Scottish Gaelic.
hunspell-gd 10 over 1 year ago Files for building Scottish Gaelic spell checkers

Language Specific Projects / Secwepemctsín

secwepemctsnem 2 almost 14 years ago A project to help people learn Secwepemctsín

Language Specific Projects / Somali

somorph Somali morphological and syntactic analyzers and generators built on XFST and VISL-CG Constraint Grammar. Up to date version checked in on repository
qaamuus.net Morphologically aware dictionary based on lexical resources found online, and the Somali morphology

Language Specific Projects / Tigrinya

HornMorpho 5 over 9 years ago morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

Language Specific Projects / Uralic

UralicNLP 70 about 2 months ago A Python library for processing Uralic languages (Finnish, Skolt Sami, Erzya, Moksha, Komi-Zyrian and so on). The library provides an easy programmatic access to Giellatekno resources such as FST morphology and CG disambiguators. Other functionalities include UD parser, API for the and interface to SemFi and SemUr semantic databases. The library is under active development and new features are added from time to time

Language Specific Projects / Zulu

Ukwabelana An open-source morphological Zulu corpus

Backlinks from these awesome lists: