low-resource-languages
Language preservation toolkit
A repository of tools and resources to support the documentation, conservation, and development of endangered languages.
Resources for conservation, development, and documentation of low resource (human) languages.
393 stars
35 watching
56 forks
Language: TeX
last commit: 7 months ago
Linked from 3 awesome lists
awesome, awesome-list, endangered-languages, human-language, language-documentation, language-learning, language-resources, list, low-resource-languages, lrls, minority-language, natural-language, natural-language-processing, nlp, resourced-languages
Generic Repositories / Single language lexicography projects and utilities / Utilities | |||
Project for Free Electronic Dictionaries | A project for a Java MIDlet for mobile phones, for indigenous language dictionaries | ||
Webonary | Site which hosts digital dictionaries for single languages | ||
WeSay | 18 | 8 days ago | Allows language communities to build their own dictionaries. (by SIL International) |
Generic Repositories / Software | |||
4lang | 37 | 9 months ago | Concept dictionary using Eilenberg machines |
accentuate.us | a.k.a. "charlifter". Statistical Unicodification of plain text for many languages | ||
alignment-with-openfst | 21 | about 8 years ago | This is an implementation of the CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, dependency parsing |
Apertium | Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs | ||
ark-tweet-nlp | 0 | over 12 years ago | CMU ARK Twitter Part-of-Speech Tagger |
ArtOfReading | 1 | almost 5 years ago | Index and processing scripts related to the Art Of Reading illustration collection |
bayesline | 0 | over 7 years ago | A Multinomial Bayesian Classification for Language Identification |
bible-corpus-tools | 15 | about 2 years ago | A collection of tools for reading/processing the multilingual Bible corpus |
BloomDesktop | 39 | 1 day ago | Bloom Desktop is a hybrid c#/javascript/html/css Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external materia… |
BloomLibrary | 4 | almost 4 years ago | Bloom Library Single Page App, using AngularJS & Bootstrap, Parse.com backend. |
brain | 1 | almost 11 years ago | Neural networks in JavaScript |
Bristol Uni MT Morphology tools | 2 | about 9 years ago | This repo is a mirror of scripts previously available on . Included: Ukwabelana - An open-source morphological Zulu corpus and EMMA: A Novel Evaluation Metric for Morphological Analysis |
brown-cluster | 425 | over 1 year ago | C++ implementation of the Brown word clustering algorithm |
CasualConc | CasualConc is a concordance program that runs natively on Mac OS X 10.5 Leopard or later. It was originally designed for casual use (preliminary analysis or non-research purposes), though [the maintainer] has been using it for his own research (and many others have). It can generate KWIC concordance lines, word clusters, collocation analysis, and word counts | ||
cdec | 183 | over 4 years ago | Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms |
charlint | Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test platform for Early Uniform Normalization in the W3C Character Model | ||
chorus | 7 | 7 days ago | A version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed |
clam | 129 | 9 months ago | Computational Linguistics Application Mediator -- Quickly turn NLP applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice |
CMU Sphinx | CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under BSD style license. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems | ||
cnminlangwebcollect | 1 | about 4 years ago | Chinese minorities website languages detection and websites collection |
Cog | 23 | about 1 year ago | Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties. |
convertextract | 11 | over 1 year ago | Convert Excel, Word and PowerPoint files with non-Unicode text (like text requiring SIL fonts) into Unicode, while preserving original file's formatting |
CorpusTools | 115 | about 2 months ago | Phonological CorpusTools |
CTK | 18 | almost 9 years ago | Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (The original project is on SourceForge.) |
DataTags | 0 | about 10 years ago | A system to assess the sensitivity and privacy risk of a dataset, and assign a tag to describe how the dataset must be transferred, stored and accessed. |
dataverse | 894 | 1 day ago | A data repository framework to share and publish research data |
Dative | 14 | over 1 year ago | Dative: software for linguistic fieldwork |
dative | 14 | over 1 year ago | A single-page application that interacts with multiple linguistic fieldwork web service databases. |
DeepLearnToolbox | 0 | over 10 years ago | Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started |
Desmeme | 4 | 8 months ago | Database and tools for exploring linguistic templates |
dictdb | dictionary database for language translation | ||
discoursegraphs | 50 | over 1 year ago | Python-based tool to convert and merge multilayer annotated linguistic data |
divvun-gramcheck | 9 | 6 days ago | This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error-tags in an XML file with human-readable messages. It is meant to be used as a late stage of a grammar checker pipeline |
divvun-keyboard | 6 | 3 months ago | keyboard apps for iOS and Android with keyboard layouts for indigenous and minority languages |
divvunspell | 14 | 29 days ago | (below) rewritten in Rust, for robust concurrency and memory management. Is in practical use about 10x faster than . It uses the same zhfst files as , which are available for all languages in the GitHub org (see below) |
DLTK | 12 | over 9 years ago | Deutsch Language Tool Kit. |
epitran | 668 | 4 months ago | Grapheme to Phoneme conversion (G2P) for many low-resource languages |
ELDER: Endangered Language Data Electronic Repository | 4 | about 13 years ago | Endangered Language Data Electronic Repository: A web-based ontologically-compliant collaborative linguistic data cataloguing tool |
enchant | 1 | 2 months ago | enchant spellchecking library |
exsite9 | 7 | 11 months ago | ExSite9 is a desktop application built to help researchers easily and quickly tag their data files with descriptive metadata and then package those files and the associated metadata for submission to a repository. ExSite9 also allows for the structural organisation of said files without actually moving their physical location on your local file storage, allowing you to correctly organise your files and metadata ready for packaging |
fast_align | 740 | over 2 years ago | Simple, fast unsupervised word aligner |
fastText | 25,979 | 9 months ago | Library for fast text representation and classification |
FieldWorks | 84 | 7 days ago | FieldWorks is a suite of software tools for language and cultural data, with support for complex scripts. FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you: elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, study morphology |
Franc | 4,155 | 6 months ago | Natural language detection |
FwDocumentation | 8 | 25 days ago | Developer documentation for FieldWorks (software tools for language and cultural data, with support for complex scripts) |
FwLocalizations | 0 | 7 months ago | Localizations for FieldWorks |
FwSupportTools | 2 | 11 months ago | Additional tools for FieldWorks development |
Gaia | 2,096 | over 3 years ago | Gaia is a HTML5-based Phone UI for the Boot 2 Gecko Project. NOTE: For details of what branches are used for what releases, see . If you're interested in setting up a keyboard in new language, see |
giellakbd-android | 12 | 4 months ago | A fork of LatinIME (by Google for Android), targeting marginalised languages that also deserve first-class status on mobile operating systems. Used by (see elsewhere on this page) |
giellakbd-ios | 30 | 22 days ago | An open source reimplementation of Apple's native iOS keyboard with a specific focus on support for localised keyboards. Used by (see elsewhere on this page) |
giza-pp | 264 | over 1 year ago | GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model. This package also contains the source for the mkcls tool which generates the word classes necessary for training some of the alignment models |
gv-crawl | 9 | about 10 years ago | Global Voices bitext crawler for creating parallel corpora |
GlotLID | 106 | 19 days ago | Fasttext language identification with support for more than 2000 labels |
Glottolog data | 12 | about 7 years ago | provides comprehensive reference information for the world's languages |
Gramadóir | 13 | over 1 year ago | Grammar checking engine that is designed for the rapid development of grammar checkers for minority languages and other languages with limited computational resources |
grind | 5 | over 4 years ago | An InDesign 5.5 plug-in designed to allow Graphite-enabled smart fonts to be used in Adobe InDesign. This project integrates SIL's Graphite 2 smart font technology with our own implementation of a paragraph composer plugin |
hermitcrab | 1 | over 2 years ago | HermitCrab.NET is a flexible morphological/phonological parser that takes an item-and-process approach |
hfst-ospell | 13 | 10 months ago | HFST spell checker library and command line tool |
hfst-ospell-js | 0 | about 8 years ago | Node bindings for hfst-ospell |
hfst-optimized-lookup | 12 | almost 7 years ago | HFST optimized-lookup standalone library and command line tool |
hundict | 22 | over 10 years ago | bilingual dictionary extractor from parallel corpora |
hunspell | 2,171 | 17 days ago | Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding |
huntag | 22 | almost 9 years ago | a sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models |
icu-dotnet | 62 | 7 days ago | C# wrapper for ICU4C |
icu4c | 6 | over 6 years ago | Mirror of svn project at . The FieldWorks branch has some FieldWorks specific enhancements |
iLanguage | 21 | about 7 years ago | A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. Input: a corpus. Uses compression, maximum entropy and field linguistics |
ipa-help | 0 | almost 7 years ago | IPA Helps |
itweets-geodata | 0 | almost 4 years ago | Geodata from Indigenous Tweets |
jQuery.ime | 174 | 21 days ago | jQuery based input methods library |
kbdgen | 16 | 8 months ago | Generate keyboards and keyboard layouts for various operating systems |
koreksyon | 3 | over 9 years ago | Tools for developing and implementing spell-checking and grammar-checking capabilities in low-resource languages |
l20n.js | 902 | over 5 years ago | L20n reinvents software localization. Users should be able to benefit from the entire expressive power of natural languages. L20n keeps simple things simple, and at the same time makes complex things possible. This is the JavaScript implementation of L20n. |
langid.py | 2,328 | almost 5 years ago | Stand-alone language identification system |
langtech | A host of resources provided in SVN by the University of Tromsø. Details are and in English | ||
LEGO Unified Concepticon | 0 | over 11 years ago | Material relating to the LEGO Unified Concepticon |
Lex4All | 21 | over 4 years ago | pronunciation LEXicons for Any Low-resource Language |
lexdb | LexDB is a lexical cognate tracking database. It stores the full provenance of all lexemes and cognate judgements, and allows export into a number of nexus dialects. The database is written in the flexible python/django web framework | ||
LfMerge | 2 | 12 days ago | Send/Receive for languageforge.org |
liblevenshtein | 67 | about 4 years ago | A library for generating Finite State Transducers based on Levenshtein Automata |
libpalaso | 44 | 7 days ago | Palaso Library: A set of .Net libraries useful for developers of Language Software |
LinGO Grammar Matrix | The LinGO Grammar Matrix is a framework for the development of broad-coverage, precision, implemented grammars for diverse languages | ||
Lingpy | 126 | about 1 year ago | LingPy: Python library for quantitative tasks in historical linguistics |
Linguistica | Linguistica is a program designed to explore the unsupervised learning of natural language, with primary focus on morphology (word-structure). It runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed | ||
long-press | 305 | about 5 years ago | jQuery plugin to ease the writing of accented or rare characters. |
low-resource-pos-tagging-2014 | 9 | almost 9 years ago | Low-Resource POS-Tagging: 2014 |
lrl | 2 | over 11 years ago | For work concerning low resource languages |
MacVoikko | 6 | almost 10 years ago | An OS X spelling server based on Voikko |
Machine | 28 | 8 days ago | Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages (used by FLEx) |
Make-extensions | 6 | about 7 years ago | Scripts for generating hunspell spellchecking extensions |
mgiza | 161 | over 3 years ago | A word alignment tool based on the famous GIZA++, extended to support multi-threading, resume training and incremental training |
Minority Translate | Minority Translate is a simple program for helping content generation on smaller sized Wikipedias (actually any sized) by giving pointers to existing articles in other language Wikipedias, so that the user can easily translate or adapt existing texts and thus increase the size and usability of their Wikipedia editions | ||
morfessor | 186 | about 4 years ago | Morfessor is a tool for unsupervised and semi-supervised morphological segmentation |
morpholm | 3 | over 11 years ago | Morphology-aware language models |
morph-test | 2 | almost 4 years ago | A python script to run tests for generation and analysis of a morphological transducer built using the Giella infrastructure. Works with Hfst, Xerox' fst tools, and with Foma |
mosesdecoder | 1,585 | 6 months ago | Moses, the machine translation system |
moz-l10n-tiers | 0 | about 11 years ago | Creates a pseudo-locale to evaluate string prioritization for l10n |
mukurtucms | 84 | 13 days ago | The Mukurtu Content Management System (CMS) is an Internet-based platform designed to enable archiving of digital cultural resources |
mythes | 40 | over 1 year ago | MyThes is a simple thesaurus that uses a structured text data file and an index file with binary search to lookup words and phrases and return information on part of speech, meanings, and synonyms |
myWorkSafe | 1 | over 6 years ago | Smart & Simple Backup for Language Development Workers. |
nabu | 19 | about 6 hours ago | nabu is a digital media item management system that provides a catalog of audio and video items, metadata for these items, and information about the workflow status of the items |
Natural | 10,670 | 4 months ago | general natural language facilities for node |
NIST 2008 Open Machine Translation Evaluation | |||
NLTK | 13,694 | about 1 month ago | Natural Language Tool Kit. NLTK Source |
node-panlex | 6 | over 5 years ago | node.js client for PanLex |
norma | 20 | almost 4 years ago | A tool for automatic spelling normalization |
nplm | 14 | over 9 years ago | Fork of with some efficiency tweaks and adaptation for use in mosesdecoder |
octothorpe | 0 | over 11 years ago | CouchDB-powered wiki thing |
OdtXslt | 2 | over 7 years ago | Perform XSLT transform on contents of a package (such as ODT, Docx, etc.) |
old-webapp | 4 | almost 10 years ago | Online Linguistic Database --- software for creating web applications to collaboratively document languages. |
old | 1 | over 4 years ago | The Online Linguistic Database (OLD): software for linguistic fieldwork. |
old-pyramid | 8 | over 1 year ago | Online Linguistic Database migrated to the Pyramid framework |
OmegaT-hfst-tokenizer | 2 | over 4 years ago | OmegaT-hfst-tokenizer provides fst-based tokenisation in OmegaT |
OpenDataKit | Open Data Kit (ODK) is an open-source suite of tools that helps organizations author, field, and manage mobile data collection solutions | ||
OpenNLP | 1,449 | 4 days ago | The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. |
ops-devbox | 8 | over 1 year ago | Ansible playbook for a (linux) developer machine |
panlex-tools | 8 | about 2 years ago | This package contains scripts to transform lexical resources into a format suitable for importing into PanLex. Documentation may be found at |
pdsc-collection-viewer | 4 | about 2 years ago | Paradisec Collection Browser |
paradigm | 1 | about 4 years ago | PARADIGM is a .Net (C#) implementation of Joseph E. Grimes' 1983 work entitled "Affix Positions and Cooccurrences: The PARADIGM Program" |
pathway | 7 | 7 months ago | Preparing language data for publication |
pdfdroplet | 7 | 9 days ago | Library and GUI for imposition of PDF pages (e.g. 2-up) |
pepper | 23 | 22 days ago | Pepper is a pluggable, Java-based, open source converter framework for linguistic data |
phonology-assistant | 10 | about 2 years ago | Phonology Assistant is a discovery tool. Provided with a corpus of phonetic data, it automatically charts the sounds and through its searching capabilities, helps a user discover and test the rules of sound in a language |
pressagio | 19 | about 5 years ago | Pressagio is a library that predicts text based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string |
PrimerPro | 1 | about 6 years ago | The purpose of PrimerPro is to assist the literacy worker in the development of primers for a given language |
pyDelphin | 80 | 3 months ago | Python libraries for DELPH-IN (Friendly Fork) |
RBGParser | 46 | almost 9 years ago | Graph-based Dependency Parser |
Rosetta Pangloss | 0 | almost 10 years ago | The Rosetta Project's Pangloss system |
salm | 11 | almost 7 years ago | SALM: Suffix Array and its Applications in Empirical Language Processing by Joy |
Salt | 15 | over 1 year ago | A graph-based model to store and manipulate linguistic data |
saymore | 6 | 8 days ago | A tool that helps with common language documentation tasks such as keeping all the resulting files and metadata organized, converting files to archive formats, and transcription |
Secwepemc-Facebook | 13 | almost 10 years ago | Translate Facebook into unsupported languages |
SegParser | 9 | about 9 years ago | Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing |
SeedLing | 11 | almost 7 years ago | Building and Using A Seed Corpus for the Human Language Project |
Skype in your language | 3 | about 9 years ago | Translate Skype into unsupported languages |
solid | 1 | 21 days ago | Solid is a software tool that can be used to check, clean up, and convert Standard Format (e.g. Toolbox) lexicon data |
SPHERE Conversion Tools | Many LDC corpora contain speech files in NIST SPHERE format. The programs below convert SPHERE files to other formats | ||
StandardFormatLib | 0 | over 9 years ago | Standard Format Library |
Stanford CoreNLP | 9,719 | 14 days ago | Stanford CoreNLP: A Java suite of core NLP tools. |
Stanford CoreNLP Python | 612 | almost 7 years ago | Python wrapper for Stanford CoreNLP tools |
stanza | 7,315 | about 18 hours ago | Stanford NLP group's shared Python tools |
str2ipa | 10 | about 9 years ago | Pronunciation dictionaries for languages with close-to-phonetic writing systems |
sugali | 2 | over 2 years ago | Legacy repository of a language identification project for many (many) languages, from the software project course, NLP projects for low-resource languages |
SuGarLike | 1 | over 10 years ago | Language Identification for Low Resource Languages (by Susanne, Guy and Liling) |
SyllabiPy | 44 | almost 2 years ago | Python interface for universal syllabification algorithms |
tasty-imitation-keyboard | A custom keyboard for iOS8+ that serves as a tasty imitation of the default Apple keyboard. Built using Swift and the latest Apple technologies! | ||
TECkit | 18 | 10 months ago | A Text Encoding Conversion toolkit |
teny | 3 | about 12 years ago | Tools for low-resource machine translation |
TeraDict | 6 | over 5 years ago | Translate English words into hundreds of languages! |
Tesseract.js | 35,553 | 9 days ago | Pure Javascript OCR for 62 Languages 📖🎉🖥 |
TexNLP | 14 | almost 13 years ago | TexNLP: Texas Natural Language Processing tools |
TiMBL | TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases | ||
Toney | 5 | about 10 years ago | Tone Classification Software |
Field Linguist's Toolbox | Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data | ||
Toolbox Scripts for ELAN | 0 | almost 10 years ago | Mirror of Alexander Koenig's Toolbox Scripts |
ToolsForFieldLinguistics | 9 | almost 6 years ago | A collection of scripts and recipes for linguistics |
transcriber | 2 | over 9 years ago | An HTML5 transcription tool for Aikuma |
translitit-engine | 2 | over 6 years ago | A transliteration engine written in JavaScript |
Tsammalex data | 6 | over 6 years ago | is a multilingual lexical database on plants and animals |
tweet2learn | 3 | almost 6 years ago | An app to make it easier to use your native language on Twitter |
twitter_langid | 15 | almost 8 years ago | A hierarchical character-word neural network for language identification |
UniversalDependencies docs | 275 | 7 days ago | Universal Dependencies online documentation |
UniversalDependencies tools | 207 | 8 days ago | Various utilities for processing the data |
VocBench | VocBench is a web-based, multilingual, editing and workflow tool that manages thesauri, authority lists and glossaries using SKOS-XL | ||
wavesurfer.js | 8,890 | 3 days ago | Navigable waveform built on Web Audio and Canvas (Also has an ELAN plugin) |
web-template | 3 | almost 10 years ago | This is a web-based template that may be used to present language learning resources to aid language revitalization efforts. It includes a talking dictionary, and a phrasicon, containing sentences and phrases |
webcorpus | 8 | over 9 years ago | This project is a collection of scripts and programs for creating a webcorpus from crawled data |
wikt2dict | 53 | over 2 years ago | Wiktionary parser tool for many language editions |
wikipron | 323 | 28 days ago | Retrieves IPA pronunciations for Wiktionary entries |
Word Generator | WordGenerator generates hypothetical words from specifications of their syllable structure | ||
WordBoundary | An experiment in the detection and segmentation of word boundaries | ||
wordbyword | 1 | about 10 years ago | WordByWord is a free, open source, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe at CIDLeS with the support of the Foundation for Endangered Languages |
WSI4URLang | 0 | about 4 years ago | Word Sense Induction (WSI) for Under-resourced Languages (URLang) |
XDXF_Makedict | 227 | 7 months ago | XDXF dictionary format and "makedict" dictionary converting software (official repository) |
Keyboard Layout Configuration Helpers | |||
jQuery.IME | 174 | 21 days ago | jQuery Input Method Editor used on Wikipedia |
kbdgen | 16 | 8 months ago | Generate keyboards and keyboard layouts for Windows, macOS, X11, iOS, Android and Chrome, from a single, simple yaml file. Also registers languages unknown to Windows, so that after installation, there is a correct and robust association between the designated BCP 47 code (including full support for ISO 639-3) and installed language tools such as keyboards, spelling checkers and other tools |
Keyboard | 1,780 | over 2 years ago | Virtual Keyboard using jQuery ~ |
Keyboards | 153 | 5 days ago | Open Source Keyman keyboards |
Keyman | 405 | about 17 hours ago | Keyman cross platform input methods. Keyman makes it possible for you to type in over 1,000 languages on Windows, iPhone, iPad, Android tablets and phones, and even instantly in your web browser. |
keyboardlayouteditor | 248 | over 2 years ago | Keyboard Layout Editor |
Keyboard layout editor | 1,332 | 3 months ago | Keyboard Layout Editor |
lipika-ime | 117 | 7 months ago | Input Method Engine (IME) for Mac OS X with built-in support for all Indic Languages |
XKeyboardConfig | The non-arch keyboard configuration database for X Window. The goal is to provide the consistent, well-structured, frequently released open source of X keyboard configuration data for X Window System implementations (free, open source and commercial). The project is targeted to XKB-based systems | ||
Annotation | |||
AGTK | 0 | almost 9 years ago | AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs. (The original project is on SourceForge.) |
brendano | 8 | over 9 years ago | Graph Fragment Language for Easy Syntactic Annotation |
ELAN | ELAN is a professional tool for the creation of complex annotations on video and audio resources | ||
eopas | 9 | over 1 year ago | ETHNOER Online Presentation and Annotation System |
FLAT - FoLia Linguistic Annotation Tool | 111 | 6 months ago | FLAT is a web-based linguistic annotation environment based around the FoLiA format, a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure |
gfl_syntax | 8 | over 9 years ago | Graph Fragment Language for Easy Syntactic Annotation |
graf-python | 21 | over 10 years ago | The library graf-python is an open source Python implementation to parse and write GrAF/XML files as described in ISO 24612. The parser of the library creates an annotation graph from the files. The user may then query the annotation graph via the API of graf-python |
kwaras | 8 | about 1 year ago | Tools for ELAN corpus management |
LDC Word Aligner | 2 | over 6 years ago | LDC Word Aligner is a software tool used for manual annotation of word alignment developed to support Arabic-English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to generate over 1,000,000 tokens of annotated word alignment data from a variety of genres including broadcast, newswire and web-based sources. |
poio-analyzer | 13 | about 11 years ago | Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor allows users to add morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt |
poio-api | 18 | over 6 years ago | Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F… |
pyannotation | 16 | over 12 years ago | PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files |
XTrans | XTrans is a next generation multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics | ||
Format Specifications | |||
spec | 22 | almost 2 years ago | The official specification for the DLx linguistic data format. |
FoLiA | 61 | 7 months ago | FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange |
xdxf_makedict | 227 | 7 months ago | XDXF dictionary format and "makedict" dictionary converting software (official repository) |
i18n-related Repositories | |||
Express-Lingua | 66 | almost 11 years ago | An i18n middleware for the Express.js framework |
Polyglot.js | Give your JavaScript the ability to speak many languages | ||
Transifex | System providing a user-friendly, project-oriented approach to translating files. Great for non-technical users, free for open-source projects, decent for minority languages; it can take a while to get a new language added to the Transifex system because the ticketing system Transifex uses results in them losing tickets sometimes. Provides translation memory, ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared | ||
Audio automation | |||
arctic-prompts | 1 | over 8 years ago | Generate prompts PDF for CMU ARCTIC dataset |
AudioWebService | 4 | almost 2 years ago | a simple nodejs server which accepts upload of audio and runs it through praat |
AuToBI | 58 | over 5 years ago | Automatic prosodic annotation tool written in Java |
BashScriptsForPhonetics | 0 | about 11 years ago | ( of a dormant project) |
esv-text-audio-aligner | 93 | over 11 years ago | ESV Text/Audio Aligner to programmatically obtain the timings for each word in the corresponding audio |
html5-audio-read-along | 192 | about 7 years ago | HTML5 Audio Read-Along |
ipa-chart | 131 | over 3 years ago | International Phonetic Alphabet (IPA) Unicode Chart and Character Picker |
kaldi-svn-archive | 16 | over 9 years ago | A read-only archive of the original Kaldi SVN repository (mainly to keep sandboxes available) |
lex4all | 1 | over 10 years ago | pronunciation LEXicons for Any Low-resource Language ( of a student project) |
Montreal-Forced-Aligner | 1,364 | 15 days ago | Python interface for forced text/speech alignment |
node-pocketsphinx | 243 | almost 6 years ago | |
opensauce | 5 | over 7 years ago | GNU Octave-compatible version of VoiceSauce |
pocketsphinx | 3,981 | 6 days ago | PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop |
pocketsphinx-ios-demo | 75 | over 6 years ago | Simple demo for iOS |
pocketsphinx-python | 338 | over 2 years ago | Python module installed with setup.py |
pocketsphinx-ruby | 13 | over 9 years ago | Ruby speech recognition with Pocketsphinx |
pocketsphinx-wp-demo | 21 | almost 9 years ago | Demo to run pocketsphinx on WP8 platform |
pocketsphinx.js | 1,493 | over 4 years ago | Speech recognition in JavaScript |
praat-py | 0 | about 12 years ago | From my PhD days: Praat-Py is a custom build of Praat, the computer program used by linguists for doing phonetic analysis on sound files, to allow for scripts to be written in the Python programming language, rather than in Praat's built-in language. ( of a dormant project) |
Praat-Scripts | 53 | about 3 years ago | Mietta's Scripts |
PraatTextGridJS | 12 | about 3 years ago | A small library which can parse TextGrid into json and json into TextGrid |
PraatontheWeb | 39 | about 3 years ago | Web implementation of Praat. Source code, running demo scripts on web, samples and documentation |
prosodicParsing | 2 | over 12 years ago | different kinds of HMMs to use for incorporating prosody into basic parsing |
Prosodylab-Aligner | 333 | over 4 years ago | Python interface for forced audio alignment using HTK and SoX |
prosodylab.alignertools | 12 | over 9 years ago | |
Recordmp3js | 2 | over 9 years ago | Record MP3 files directly from the browser using JS and HTML |
sphinx4 | 1,411 | about 2 years ago | Pure Java speech recognition library |
sphinxbase | 527 | over 2 years ago | |
sphinxtrain | 183 | 7 days ago | |
TLSphinx | 15 | almost 6 years ago | Swift wrapper around Pocketsphinx |
Text-to-Speech (TTS) | |||
espeak | eSpeak is a compact open source software speech synthesizer for English and other languages, for Linux and Windows. | ||
MARY TTS | 2,385 | 2 months ago | MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java |
Ossian | Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision | ||
Automatic Speech Recognition (ASR) | |||
Elpis | 152 | 7 months ago | Elpis is software for creating speech recognition models and applying them to the transcription of audio. As of 2022, it gives access to Kaldi and Huggingface Transformers |
kaldi | 14,362 | 19 days ago | This is now the official location of the Kaldi project |
Persephone | 157 | over 1 year ago | Persephone aims to make state-of-the-art phonemic transcription accessible to people involved in language documentation, who have a training corpus of about one to four hours of transcribed speech. As of 2022, Persephone is superseded by Elpis |
Text automation | |||
clld | 54 | about 2 months ago | Cross Linguistic Linked Data python library |
LaTeX2HTML5 | 61 | over 2 years ago | LaTeX web components |
MultilingualCorporaExtractor | 0 | over 11 years ago | Node io Spider for extracting multilingual corpora ( of a student project) |
SeedLing | 2 | over 10 years ago | Building and Using A Seed Corpus for the Human Language Project ( of a student project) |
Experimentation | |||
experigen | 35 | over 4 years ago | A framework for creating linguistic experiments |
GamifyPsycholinguisticsExperiments | 0 | over 12 years ago | A simple node server to gamify linguistics experiments, runs offline on a laptop for small scale experiments and online on a server for large scale experiments. Data is sent to a Google spreadsheet. ( of a dormant project) |
OpenSesame | 242 | about 1 month ago | Graphical experiment builder for the social sciences |
OPrime | 0 | about 10 years ago | Open Source Experimentation Libraries - Online and Offline for Android and HTML5 |
psychopyMegProsody | 0 | almost 12 years ago | Runs MegProsody using PsychoPy |
PsychScript | 4 | about 10 years ago | An HTML5/JavaScript library for running behavioural experiments online |
Flashcards | |||
Anki | 19,289 | 1 day ago | Anki is a program to make and share flashcard decks (including audio) for any language or writing system. |
awesome-anki | 1,649 | 2 days ago | A curated list of awesome Anki add-ons, decks and resources |
VocabLift | 3 | over 10 years ago | Language-learning tool that uses vocabulary from LIFT-format dictionaries produced by programs such as Fieldworks Language Explorer and WeSay |
Natural language generation | |||
OpenCCG | 206 | almost 4 years ago | OpenCCG library for parsing and realization with CCG. Includes mini-grammars for Inuit, Nezperce, Basque and others |
Computing systems | |||
Common Language Resources and Technology Infrastructure Norway / Clarino | One of their projects (not clearly listed here) is about providing an online system for language analysis, so users can connect resources visually, dump in text, and get a result. Kind of like the Yahoo! Pipes but for language processing. Uses the cluster | ||
Android Applications | |||
Aikuma | 30 | almost 9 years ago | Android software for recording and translation |
Android Speech Recognition Trainer | 3 | about 6 years ago | Speech recognition training app for low resource languages which interfaces with FieldDB corpora |
android-template | 0 | almost 10 years ago | This is a template of an Android word-learning app that may be used as a way to introduce a language. It includes a quiz. For the documentation, go to |
AndroidFieldDB | 3 | over 5 years ago | An Android app which lets the user build a custom visual and auditory vocabulary, useful for guided anomia treatment and self designed language lessons by heritage speakers |
AndroidFieldDBElicitationRecorder | 2 | about 11 years ago | A general purpose video recording tool |
AndroidLanguageLessons | 2 | about 6 years ago | Lets heritage speakers create self designed language lessons |
AndroidProductionExperiment | 0 | about 11 years ago | Android App to run perception experiments |
Bevara | 3 | about 11 years ago | Android Phone Application designed for Linguistic Fieldwork to help preserve, maintain, and save endangered languages |
ojoVoz | A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. For more information, please go to | ||
pocketsphinx-android | 235 | almost 5 years ago | pocketsphinx build for Android |
pocketsphinx-android-demo | 549 | about 6 years ago | |
Chrome Extensions | |||
babelfrog | 16 | over 5 years ago | Chrome extension to help learn languages as you browse |
DictionaryChromeExtension | 6 | almost 10 years ago | Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries) |
FieldDB | |||
FieldDB | 79 | about 2 years ago | An offline/online field database which adapts to its user's terminology and I-Language, has plugins for various data automation routines along the process of primary data collection to cleaning to publication and archival. |
FieldDB / FieldDB Webservices/Components/Plugins | |||
AndroidLanguageLearningClientForFieldDB-sikuli | 0 | about 10 years ago | Sikuli tests for AndroidLanguageLearningClientForFieldDB |
AuthenticationWebService | 0 | almost 2 years ago | A node.js web service which manages users and corpora creation and authentication |
bower-fielddb-angular | 0 | over 9 years ago | A bower repository which hosts fielddb-angular components, bower install fielddb-angular --save |
bower-fielddb | 0 | over 4 years ago | A bower repository which hosts fielddb core components, bower install fielddb --save |
fielddb-spreadsheet-sikuli | 1 | almost 10 years ago | sikuli tests for the spreadsheet module |
FieldDBActivityFeed | 0 | almost 10 years ago | A fielddb activity feed widget which can be embedded in other codebases, websites etc |
FieldDBGlosser | 0 | almost 8 years ago | A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. bower install fielddb-glosser --save |
FieldDBLexicon | 0 | about 7 years ago | A lexicon browser/editor web widget for FieldDB databases |
LanguageClassDashboard | 0 | over 10 years ago | App which provides a view of FieldDB corpora for language teachers |
LexiconWebService | 0 | over 4 years ago | A node.js ElasticSearch wrapper for indexing/training lexicons from corpora |
LexiconWebServiceSample | 1 | over 12 years ago | A node.js web server which implements the fieldlinguist's lexicon API for the FieldDB project |
Academic Research Paper-Specific Repositories | |||
Gargantua | 12 | about 9 years ago | Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010 |
ldc-kiy | 0 | over 11 years ago | Materials for: The experimental state of mind in elicitation: illustrations from tonal fieldwork. Submitted to Language Documentation & Conservation, |
Learning to map into a Universal POS tagset | Yuan Zhang, Roi Reichart, Regina Barzilay and Amir Globerson | ||
low-resource-pos-tagging-2014 | 9 | almost 9 years ago | and Published in: Learning a Part-of-Speech Tagger from Two Hours of Annotation. . In Proceedings of NAACL 2013. And in: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. . In Proceedings of ACL 2013 |
orthotree | 10 | almost 10 years ago | Linguistic family tree based on orthographic distance |
type-supervised-tagging-2012emnlp | 1 | almost 9 years ago | This repository contains the code, scripts, and instructions needed to reproduce the results in the paper: Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. . In Proceedings of EMNLP 2012. This code is frozen as of the version used to obtain the results in the paper. It will not be maintained. To see the updated code, visit |
visualizing-language | 1 | almost 13 years ago | For visualizations of WALS and other typological databases |
WALS-APiCS | 0 | almost 10 years ago | Code for working with WALS-APiCS (Atlas of Pidgin and Creole Language Structures) complexity metrics |
Example Repositories | |||
CorpusWebService | 0 | almost 3 years ago | über-simple node.js-Proxy to enable CORS request for couchdb |
CorporaForFieldLinguistics | 3 | over 7 years ago | Small corpora from diverse language typologies, useful for testing scripts |
startR | 0 | about 12 years ago | |
lucenerevolution-2013 | 0 | over 11 years ago | Demo examples for linguistics in Lucene and Solr |
berlin-buzzwords-2013 | 0 | over 11 years ago | Demo examples for Lucene, Solr, ElasticSearch and OpenNLP from Berlin Buzzwords 2013 talk |
Fonts | |||
fontinline | 4 | over 6 years ago | Make inline stroke paths from an outline font |
Noto Fonts | 2,466 | almost 2 years ago | Noto is Google’s free font family that aims to support all the world’s scripts. Its design goal is to achieve visual harmonization across languages. Noto fonts are under Apache License 2.0 |
Unicodify | Unicodify is a suite of programs for converting text in a variety of 8-bit encodings to Unicode (using the UTF-16 encoding). Unicodify was particularly designed to handle HTML-based text using non-ISCII 8-bit fonts to render South Asian scripts. However, elements of the suite can map other types of non-ASCII 8-bit encodings, such as Latin-2, ISCII and PASCII | ||
Corpora | |||
bible-corpus | 177 | 3 months ago | A multilingual parallel corpus created from translations of the Bible |
poio-corpus | 7 | 7 days ago | The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others |
Organizations / On GitHub | |||
batumi | Speech recognition and natural language processing for low-resource languages | ||
BloomBooks | |||
unicode-cldr | Unicode Common Locale Data Repository (CLDR) Project | ||
cmusphinx | Mirror of the SourceForge repositories | ||
dativebase | Tools for working with OLD | ||
divvun | The Divvun group at UiT develops proofing tools, keyboard apps and other language technology solutions for indigenous and minority languages, especially the Sámi languages. | ||
FieldDB | |||
GiellaLT | Home for keyboard layouts, lexicons and morphologies for indigenous and minority languages, especially for morphologically complex languages, using mainly rule-based technologies. The resources are used by Divvun (above) and Giellatekno (below) to build a number of tools for the language communities. Almost everything is open source | ||
HFST | Helsinki Finite-State Technology. | ||
hunspell | |||
keymanapp | |||
langtech | Language Technology Group, University of Melbourne | ||
lex4all | |||
longnow | |||
MontrealCorpusTools | |||
moses-smt | Statistical Machine Translation | ||
mukurtucms | |||
NLTK | Natural Language Toolkit | ||
PhonologicalCorpusTools | |||
Projet de recherche sur l'écriture | Crowdsourcing or conducting large scale psycholinguistics experiments (or statistically significant field linguistics) | ||
prosodylab | Prosodylab at McGill University, Canada | ||
SIL International (Dev) | Another SIL organization, with many repositories | ||
SIL International | SIL (originally known as the Summer Institute of Linguistics, Inc.) is probably the leading organization which provides software and tools tailored for use by field linguists and lexicographers working on endangered languages. A little-known fact is that much of its code is open sourced on GitHub and SIL is happy to receive open source contributions and collaborate on open source projects | ||
SIL NRSI | SIL Non-Roman Script Initiative. The NRSI is a department of SIL International, whose task is to provide assistance, research and development for SIL International and its partners to support the use of non-Roman and complex scripts in language development | ||
StanfordNLP | |||
ucsd-field-lab | University of California, San Diego | ||
UniversalDependencies | Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on an evolution of (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary | ||
utcompling | The University of Texas at Austin's Computational Linguistics Lab. | ||
Organizations / Other OSS Organizations | |||
Giellatekno | Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. We focus on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. They use svn for their code: all of it can be found , sorted by language | ||
LOWLANDS | LOWLANDS – Parsing low-resource languages and domains | ||
LTRC: Language Technologies Research Center IIIT Hyderabad | LTRC addresses the complex problem of understanding and processing natural languages in both speech and text mode. LTRC conducts research on both basic and applied aspects of language technology. It is the largest academic centre of speech and language technology in South Asia. LTRC carries out its work through four labs, which work in synergy with each other, as listed above | ||
The Language Archive | Part of the MPI | ||
Tutorials | |||
How to Write a Spelling Corrector | by | ||
Language Specific Projects / Afrikaans | |||
Afrikaanse rekenaarlinguïstiek (Afrikaans computational linguistics) | — wordlists, corpora, morphological analyser, tagger, word decompounder. Available upon email | ||
Language Specific Projects / Albanian | |||
Apertium rules for Albanian | Machine Translation rules | ||
out-of-copyright-albanian-authors | Out-of-copyright authors scraped from the Albanian-language Wikipedia | ||
Plis keyboard | The Plis keyboard is a keyboard or computer keyboard layout for the Albanian language | ||
spell checking | Here you find a collection of Albanian words and information about them. Aspell, Ispell, and MySpell are included | ||
Language Specific Projects / Alutiiq | |||
wiinaq | 2 | over 1 year ago | Word Wiinaq is a dictionary web application with automatically generated ending tables and souped-up search capabilities. It is written in Python using Django |
Language Specific Projects / Amharic | |||
HornMorpho | 5 | over 9 years ago | Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs |
Language Specific Projects / Basque | |||
Matxin | An open-source transfer machine translation engine. Linguistic information for the translation from Spanish and Basque (es-eu) is included | ||
Language Specific Projects / Bengali | |||
Bangla-অঙ্কুর for Mac | This project aims to develop a phonetic based Bangla typing system for Macintosh computer which can be developed into a transliteration technique in the future | ||
Bengali Writer | 1 | almost 9 years ago | "Bengali Writer" is a set of utilities for computerized editing and typesetting in Bengali, a language of India and Bangladesh. It comprises a set of fonts for Bengali in several formats (METAFONT, BDF, PS), a text editor with spell-checking, export, and more. (Original project is on SourceForge: ) |
Ekushey | Bangla Computing and Localization Project for the Bangla speaking people | ||
Lekho | 0 | almost 9 years ago | A collection of tools and resources for using bangla on computers (Original project is on SourceForge: ) |
Language Specific Projects / Chichewa | |||
Chichewa | 9 | over 3 years ago | NLP resources for Chichewa |
Language Specific Projects / Galician | |||
an-metri-gal | 3 | 12 days ago | Metrical analysis of verse text in the Galician language (gl-ES) |
android_gl_dict | 2 | almost 12 years ago | Android Galician (gl_ES) Keyboard Dictionary |
aspell-gl | 1 | over 12 years ago | Galician dictionary for aspell |
CitiusSentiment | 7 | over 8 years ago | Sentiment analysis (opinion mining) for Portuguese, English, Spanish, and Galician |
CitiusTagger | A PoS-Tagger and Named Entity Classification tool for Portuguese, English, Galician, and Spanish | ||
Conshuga | Galician verb conjugator | ||
corpora | 2 | almost 9 years ago | A collection of corpora of Galician (or Galicia-related) words |
DepPattern | 10 | over 6 years ago | Dependency Syntactic Parsing for Portuguese, Spanish, English, and Galician, including MetaRomance parser |
DOGA_scraper | 0 | over 10 years ago | Galician Official journal scraper |
elFinder-language | 1 | about 8 years ago | Galician - Gallego / language for elFinder |
EuroWordNetLemon | 1 | over 9 years ago | EuroWordNet lemon lexicons generated from the LMF versions of the Multilingual Central Repository (MCR) EuroWordNet lexicons. It includes lexicons for Spanish, Catalan, Basque & Galician |
GalegoDroid | Galician Translator for Android | ||
galeXtra | 2 | over 8 years ago | Multiword Extractor for Portuguese, English, Spanish, Galician, French |
Galician-Dependency-Treebank | 1 | about 8 years ago | This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006 |
Galician-Fuzzy-Text-watch | 1 | almost 9 years ago | Based on Fuzzy Text International by Jesse Hallett, uses the Galician language to display time |
galician-locale-for-mac | 1 | almost 9 years ago | Galician locale for Mac OS X |
gl-syllabler | 1 | almost 9 years ago | Split Galician language words into syllables |
gl | 1 | 4 months ago | Galician OmegaT Localisation |
hunspell-gl-ciencias | 0 | over 11 years ago | Project aimed at developing a science and maths Galician language Hunspell dictionary |
hunspell-gl | 1 | over 11 years ago | Galician hunspell dictionaries |
hyphen-gl | 1 | over 12 years ago | Galician hyphenation rules |
javagalician-java6 | 3 | over 12 years ago | The Java Galician Locale is an implementation of Java localization SPIs which will allow the Java VM to use the Galician Language (locales "gl" and "gl_ES"), one of the official languages of Spain, which is not included in Sun's JVM distribution |
Linguakit | 65 | 10 months ago | Multilingual toolkit for NLP: dependency parser, PoS tagger, NERC, multiword extractor, sentiment analysis, etc |
ParlamentoGalicia | 0 | over 11 years ago | Project based on the information extracted from the transcriptions of the sessions held in the Galician Parliament |
poss-gl | 1 | over 13 years ago | Galician translation of Producing Open Source Software, by Karl Fogel |
rima | 1 | almost 9 years ago | Find rhyming words in the Galician language |
stopwords-gl | 1 | about 8 years ago | Galician stopwords collection |
texlive-babel-galician | 1 | 24 days ago | TeXLive babel-galician package |
UD_Galician-CTG | 1 | about 1 month ago | The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus created at the University of Vigo by the TALG NLP research group |
UD_Galician-TreeGal | 6 | about 1 month ago | The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña) |
UL_Galician-TreeGal | 0 | over 6 years ago | CoNLL-UL Repository for UD_Galician-TreeGal |
Language Specific Projects / Galician / Apertium | |||
apertium-cat-glg | 1 | over 2 years ago | Apertium translation pair for Catalan and Galician |
apertium-dict-en-gl | 1 | almost 9 years ago | English-Galician language pair for Apertium |
apertium-dict-es-gl | 1 | almost 9 years ago | Spanish-Galician language pair for Apertium |
apertium-dict-pt-gl | 1 | over 11 years ago | Portuguese-Galician language pair for Apertium |
apertium-en-gl | 0 | over 2 years ago | Apertium translation pair for English and Galician |
apertium-es-gl | 1 | over 3 years ago | Apertium translation pair for Spanish and Galician |
apertium-glg | 0 | over 2 years ago | Apertium linguistic data for Galician |
Apertium-pt-gl.pt-gl-LMF | 0 | over 10 years ago | This is the LMF version of the Apertium bilingual dictionary for the Portuguese and Galician languages |
apertium-pt-gl | 0 | over 3 years ago | Apertium translation pair for Portuguese and Galician |
Language Specific Projects / Georgian | |||
awesome-georgia | 89 | over 1 year ago | A curated list of awesome libraries and packages specific/related to Georgia (country) |
Gadatsqvetilebebi | 1 | over 7 years ago | გადაწყვეტილებები; Web spider and corpora importer for public legal decisions |
GeoWordsDatabase | 70 | about 7 years ago | Around 310 000 unique Georgian words |
Kartuli Speech Recognition | 4 | almost 7 years ago | Building a speech recognition system for Georgian Android users. Codebase to turn any webpage from any alphabet into another alphabet; the default is to turn Latin letters into Kartuli. "Do your friends keep commenting on Facebook with English keyboards (either because they forgot to switch, or because they didn't/can't install a Georgian keyboard)? Now you can read the web through კართული eyes." |
KartuliChromeExtension | 1 | over 10 years ago | Chrome application which displays every English letter as a Georgian letter |
QartuliDaBunebismetkveleba | 1 | about 11 years ago | Interactive mathematics and natural science textbook for 2nd and 3rd grade students |
SakartvelosUzenaesiSasamartloSarke | 0 | over 10 years ago | საქართველოს უზენაესი სასამართლო სარკე |
SamartlosSakonstitutsioSasamartdoSarke | 0 | over 7 years ago | სამართლოს საკონსტიტუციო სასამართდო სარკე |
translitit-latin-to-mkhedruli-georgian | 4 | over 7 years ago | A Latin to ქართული (Mkhedruli Georgian) transliteration function written in JavaScript |
translitit-mkhedruli-georgian-to-ipa | 0 | over 7 years ago | A ქართული (Mkhedruli Georgian) to IPA transliteration function written in JavaScript |
Declensions | 2 | about 2 years ago | Methods to generate declensions for the Georgian language |
Language Specific Projects / Georgian / Fonts | |||
Stichoza/font-larisome | 39 | over 3 years ago | Iconic font for Georgian currency inspired by Font-Awesome (CSS) |
Lotuashvili/BPGNateli | 0 | over 9 years ago | Bower package for BPG Nateli font (CSS) |
thecotne/georgian-webfonts | 17 | over 7 years ago | Package for Georgian fonts (CSS) |
Language Specific Projects / Georgian / Internationalization and Localization (i18n/l10n) | |||
Stichoza/money-num-to-string | 8 | 12 months ago | Convert a number/money to localized string (PHP, JavaScript) |
natchkebiailia/NumberToWord | 3 | about 7 years ago | Convert numbers to localized strings (JavaScript) |
d0ragon/number-to-words-ka | 3 | over 10 years ago | Convert numbers to localized strings (PHP) |
dimakura/ka | 0 | about 11 years ago | Common functionality for Georgian projects (Ruby) |
dimakura/ka.js | 5 | over 10 years ago | Georgian language support for node and browser (JavaScript) |
akalongman/kautilities | 4 | over 8 years ago | Convert Georgian letters to Latin and vice-versa (PHP) |
Landish/Laravel-Ka | 8 | over 3 years ago | Georgian Language Pack |
Landish/RedactorJS-GE | Redactor WYSIWYG HTML Editor Georgian Language Pack (JavaScript) | ||
wenzhixin/bootstrap-table | 11,746 | 4 days ago | Bootstrap table with extra features. l10n by and |
moment/moment | 48,013 | 4 months ago | A lightweight date library (JavaScript) |
ioseb/geokbd | 57 | about 15 years ago | Georgian keyboard library (JavaScript) |
Language Specific Projects / Guarani | |||
ParaMorfo | 5 | over 9 years ago | morphological analysis and generation of Spanish and Guarani verbs, nouns, and adjectives |
Language Specific Projects / Hausa | |||
Hausa | 6 | over 9 years ago | Repository for Hausa NLP tools |
Language Specific Projects / Hindi | |||
hindi-morph | 0 | over 11 years ago | An open source morphological analyzer for Hindi |
Language Specific Projects / Høgnorsk | |||
hunspell-hn_NO | A beginning to a spellchecking tool for Høgnorsk, a conservative variant of Norwegian Nynorsk, based on a set of corpuses | ||
Language Specific Projects / Icelandic | |||
IceNLP | 21 | 10 months ago | IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java |
Language Specific Projects / Inuktitut | |||
InuktitutAlignerData | 3 | over 12 years ago | Scripts for alignment of laboratory speech production data |
InuktitutComputing | 10 | over 9 years ago | Inuktitut Morphological Analyser, transcoder, transliterator, corpus tools, and lexical lists for working with Inuktitut. Usable online at |
Language Specific Projects / Irish | |||
aimsigh | 1 | over 1 year ago | Source for the now-defunct aimsigh.com Irish search engine |
caighdean | 18 | 3 months ago | Code for standardizing Irish language text |
fleiscin | 1 | about 4 years ago | Irish hyphenation patterns for TeX |
GaelSpell | 17 | 25 days ago | Sources for an Irish language spell checker |
tesseract-gle-uncial | 4 | over 9 years ago | OCR for old Irish fonts |
Language Specific Projects / Kinyarwanda | |||
kin-morph-fst | 6 | over 11 years ago | Kinyarwanda morphological analyzer |
TurboTagger & TurboParser for Kinyarwanda (download) | TurboTagger & TurboParser for Kinyarwanda | ||
Language Specific Projects / Kurdish | |||
Kurlex | Morphological analyser and lexicon, written in the Alexina framework, licensed under the LGPL-LR | ||
kurmanji-stemmer | 1 | over 9 years ago | NLTK-based Kurmanji stemmer |
Language Specific Projects / Lingala | |||
Lingala NLP | NLP tools and resources for Lingala | ||
Language Specific Projects / Lushootseed | |||
Lushootseed | 0 | over 8 years ago | Joshua Crowgey's work on Lushootseed |
Language Specific Projects / Malay | |||
MorfoMalayu | 5 | over 9 years ago | morphological analysis of Malay words |
Language Specific Projects / Malagasy | |||
Global Voices Malagasy Project | This page provides a link to a corpus of parallel news articles in Malagasy and English from the Global Voices project. This corpus was collected and aligned at the sentence level by Victor Chahuneau | ||
Language Specific Projects / Manx | |||
aspell-gv | 1 | over 12 years ago | Manx Gaelic dictionary for aspell |
gaelg | 3 | 4 months ago | NLP resources for Manx Gaelic, mainly in support of the gv2ga MT engine |
Language Specific Projects / Migmaq | |||
migmaq-lessons | 1 | over 9 years ago | Repository for website building Mi'gmaq language lessons |
Language Specific Projects / Minderico | |||
fredericajordarzambarino | 0 | over 10 years ago | A web-based game for mobile devices in Minderico, based on the "Who Wants to be a Millionaire" TV show |
Language Specific Projects / Nishnaabe | |||
Ojibway-iphone-app | 0 | over 9 years ago | An iPhone app with audio and images for learning the Ojibway language |
OjibwayMap | 1 | over 9 years ago | An iPhone app with audio and images for learning Ojibway language and culture |
nishanimate | 1 | over 9 years ago | A desktop app to facilitate Nishnaabe-language acquisition via animations produced by the natural language processing of audio-accompanied text |
Language Specific Projects / Oromo | |||
hornmorpho | 5 | over 9 years ago | Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs |
Language Specific Projects / Quechua | |||
AntiMorfo | 5 | over 9 years ago | morphological analysis and generation of Quechua nouns, adjectives, and verbs and Spanish verbs |
Morphology, spellchecker | XFST and FOMA, plus OpenOffice plugin | ||
Language Specific Projects / Sami | |||
divvun-webdemo | 2 | over 1 year ago | simple webdemo for divvun grammar checker. |
Giellatekno | A host of Sámi tools | ||
Oahpa! | A learning portal for Saami languages. Includes WordPress-based, media-rich, lesson-based learning, and morphological and syntactic exercises generated from the morphological and syntactic tools | ||
Neahttadigisánit | A morphologically sensitive dictionary with a 'social media input' mode, which lets users type a 'relaxed' version of the orthography, and a JavaScript bookmarklet that offers click-to-read dictionary lookup. Giellatekno also does a lot for other minority Uralic languages | ||
Language Specific Projects / Scottish Gaelic | |||
aspell-gd | 1 | over 12 years ago | Scottish Gaelic dictionary for aspell |
briathrachan | 2 | about 8 years ago | Source code for Briathrachan, a Gaelic-English dictionary app for iOS |
gaidhlig | 3 | over 1 year ago | NLP resources for Scottish Gaelic, mainly in support of gd2ga/ga2gd MT engines |
gd-fcfg | 3 | over 12 years ago | Context-free feature-based grammar of Scottish Gaelic in the NLTK format |
gdbank | 4 | 2 days ago | Some tools and resources for natural language processing of Scottish Gaelic. |
hunspell-gd | 10 | almost 2 years ago | Files for building Scottish Gaelic spell checkers |
Language Specific Projects / Secwepemctsín | |||
secwepemctsnem | 2 | about 14 years ago | A project to help people learn Secwepemctsín |
Language Specific Projects / Somali | |||
somorph | Somali morphological and syntactic analyzers and generators built on XFST and VISL-CG Constraint Grammar. An up-to-date version is checked in to the repository | ||
qaamuus.net | Morphologically aware dictionary based on lexical resources found online and on the Somali morphology | ||
Language Specific Projects / Tigrinya | |||
HornMorpho | 5 | over 9 years ago | morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs |
Language Specific Projects / Uralic | |||
UralicNLP | 71 | 17 days ago | A Python library for processing Uralic languages (Finnish, Skolt Sami, Erzya, Moksha, Komi-Zyrian and so on). The library provides easy programmatic access to Giellatekno resources such as FST morphology and CG disambiguators. Other functionality includes a UD parser, API for the and interface to SemFi and SemUr semantic databases. The library is under active development, and new features are added from time to time |
Language Specific Projects / Zulu | |||
Ukwabelana | An open-source morphological Zulu corpus |