low-resource-languages

Language preservation toolkit

A repository of tools and resources to support the documentation, conservation, and development of endangered languages.

Resources for conservation, development, and documentation of low resource (human) languages.

GitHub

393 stars

35 watching

56 forks

Language: TeX

last commit: about 2 years ago

Linked from 3 awesome lists

awesome, awesome-list, endangered-languages, human-language, language-documentation, language-learning, language-resources, list, low-resource-languages, lrls, minority-language, natural-language, natural-language-processing, nlp, resourced-languages

Generic Repositories / Single language lexicography projects and utilities / Utilities
Project for Free Electronic Dictionaries			A project for a Java MIDlet for mobile phones, aimed at indigenous-language dictionaries
Webonary			Site which hosts digital dictionaries for single languages
WeSay	18	over 1 year ago	Allows language communities to build their own dictionaries. (by SIL International)
Generic Repositories / Software
4lang	37	over 2 years ago	Concept dictionary using Eilenberg machines
accentuate.us			a.k.a. "charlifter". Statistical Unicodification of plain text for many languages
alignment-with-openfst	21	over 9 years ago	This is an implementation of the CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, dependency parsing
Apertium			Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs
ark-tweet-nlp	0	almost 14 years ago	CMU ARK Twitter Part-of-Speech Tagger
ArtOfReading	1	over 6 years ago	Index and processing scripts related to the Art Of Reading illustration collection
bayesline	0	about 9 years ago	A Multinomial Bayesian Classification for Language Identification
bible-corpus-tools	15	almost 4 years ago	A collection of tools for reading/processing the multilingual Bible corpus
BloomDesktop	39	over 1 year ago	Bloom Desktop is a hybrid c#/javascript/html/css Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external materia…
BloomLibrary	4	over 5 years ago	Bloom Library Single Page App, using AngularJS & Bootstrap, Parse.com backend.
brain	1	over 12 years ago	Neural networks in JavaScript
Bristol Uni MT Morphology tools	2	over 10 years ago	This repo is a mirror of scripts previously available on . Included: Ukwabelana - An open-source morphological Zulu corpus and EMMA: A Novel Evaluation Metric for Morphological Analysis
brown-cluster	425	almost 3 years ago	C++ implementation of the Brown word clustering algorithm
CasualConc			CasualConc is a concordance program that runs natively on Mac OS X 10.5 Leopard or later. It was originally designed for casual use (preliminary analysis or non-research purposes), though [the maintainer] has been using it for his own research (and many others have). It can generate KWIC concordance lines, word clusters, collocation analysis, and word counts
cdec	183	about 6 years ago	Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms
charlint			Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test platform for Early Uniform Normalization in the W3C Character Model
chorus	7	over 1 year ago	A version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed
clam	130	over 2 years ago	Computational Linguistics Application Mediator -- Quickly turn NLP applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice
CMU Sphinx			CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under BSD style license. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems
cnminlangwebcollect	1	almost 6 years ago	Chinese minorities website languages detection and websites collection
Cog	23	almost 3 years ago	Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties.
convertextract	11	almost 3 years ago	Convert Excel, Word and PowerPoint files with non-Unicode text (like text requiring SIL fonts) into Unicode, while preserving original file's formatting
CorpusTools	115	almost 2 years ago	Phonological CorpusTools
CTK	18	over 10 years ago	Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (Original project is on SourceForge.)
DataTags	0	over 11 years ago	A system to assess the sensitivity and privacy risk of a dataset, and assign a tag to describe how the dataset must be transferred, stored and accessed.
dataverse	894	over 1 year ago	A data repository framework to share and publish research data
Dative	14	over 3 years ago	Dative: software for linguistic fieldwork
dative	14	over 3 years ago	A single-page application that interacts with multiple linguistic fieldwork web service databases.
DeepLearnToolbox	0	about 12 years ago	Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started
Desmeme	4	over 2 years ago	Database and tools for exploring linguistic templates
dictdb			dictionary database for language translation
discoursegraphs	50	over 3 years ago	Python-based tool to convert and merge multilayer annotated linguistic data
divvun-gramcheck	9	over 1 year ago	This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error-tags in an XML file with human-readable messages. It is meant to be used as a late stage of a grammar checker pipeline
divvun-keyboard	6	almost 2 years ago	keyboard apps for iOS and Android with keyboard layouts for indigenous and minority languages
divvunspell	14	over 1 year ago	(below) rewritten in Rust, for robust concurrency and memory management. Is in practical use about 10x faster than . It uses the same zhfst files as , which are available for all languages in the GitHub org (see below)
DLTK	12	almost 11 years ago	Deutsch Language Tool Kit.
epitran	668	almost 2 years ago	Grapheme to Phoneme conversion (G2P) for many low-resource languages
ELDER: Endangered Language Data Electronic Repository	4	over 14 years ago	Endangered Language Data Electronic Repository: A web-based ontologically-compliant collaborative linguistic data cataloguing tool
enchant	1	almost 2 years ago	enchant spellchecking library
exsite9	7	over 2 years ago	ExSite9 is a desktop application built to help researchers easily and quickly tag their data files with descriptive metadata and then package those files and their metadata ready for submission to a repository. ExSite9 also allows for the structural organisation of said files without actually moving their physical location on your local file storage, allowing you to correctly organise your files and metadata ready for packaging
fast_align	740	about 4 years ago	Simple, fast unsupervised word aligner
fastText	25,979	over 2 years ago	Library for fast text representation and classification
FieldWorks	86	over 1 year ago	FieldWorks is a suite of software tools for language and cultural data, with support for complex scripts. FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you: elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, study morphology
Franc	4,158	about 2 years ago	Natural language detection
FwDocumentation	8	over 1 year ago	Developer documentation for FieldWorks (software tools for language and cultural data, with support for complex scripts)
FwLocalizations	0	about 2 years ago	Localizations for FieldWorks
FwSupportTools	2	over 2 years ago	Additional tools for FieldWorks development
Gaia	2,096	about 5 years ago	Gaia is a HTML5-based Phone UI for the Boot 2 Gecko Project. NOTE: For details of what branches are used for what releases, see . If you're interested in setting up a keyboard in new language, see
giellakbd-android	12	almost 2 years ago	A fork of LatinIME (by Google for Android), targeting marginalised languages that also deserve first-class status on mobile operating systems. Used by (see elsewhere on this page)
giellakbd-ios	30	over 1 year ago	An open source reimplementation of Apple's native iOS keyboard with a specific focus on support for localised keyboards. Used by (see elsewhere on this page)
giza-pp	264	over 3 years ago	GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model. This package also contains the source for the mkcls tool which generates the word classes necessary for training some of the alignment models
gv-crawl	9	almost 12 years ago	Global Voices bitext crawler for creating parallel corpora
GlotLID	106	over 1 year ago	Fasttext language identification with support for more than 2000 labels
Glottolog data	12	over 8 years ago	Glottolog provides comprehensive reference information for the world's languages
Gramadóir	13	almost 3 years ago	Grammar checking engine that is designed for the rapid development of grammar checkers for minority languages and other languages with limited computational resources
grind	5	about 6 years ago	An InDesign 5.5 plug-in designed to allow Graphite-enabled smart fonts to be used in Adobe InDesign. This project integrates SIL's Graphite 2 smart font technology with our own implementation of a paragraph composer plugin
hermitcrab	1	about 4 years ago	HermitCrab.NET is a flexible morphological/phonological parser that takes an item-and-process approach
hfst-ospell	13	over 2 years ago	HFST spell checker library and command line tool
hfst-ospell-js	0	over 9 years ago	Node bindings for hfst-ospell
hfst-optimized-lookup	12	over 8 years ago	HFST optimized-lookup standalone library and command line tool
hundict	22	about 12 years ago	bilingual dictionary extractor from parallel corpora
hunspell	2,171	over 1 year ago	Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding
huntag	22	over 10 years ago	a sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
icu-dotnet	62	over 1 year ago	C# wrapper for ICU4C
icu4c	6	about 8 years ago	Mirror of svn project at . The FieldWorks branch has some FieldWorks specific enhancements
iLanguage	21	over 8 years ago	A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. Input: a corpus. Uses compression, maximum entropy and field linguistics
ipa-help	0	over 8 years ago	IPA Helps
itweets-geodata	0	over 5 years ago	Geodata from Indigenous Tweets
jQuery.ime	175	over 1 year ago	jQuery based input methods library
kbdgen	16	over 2 years ago	Generate keyboards and keyboard layouts for various operating systems
koreksyon	3	almost 11 years ago	Tools for developing and implementing spell-checking and grammar-checking capabilities in low-resource languages
l20n.js	902	over 7 years ago	L20n reinvents software localization. Users should be able to benefit from the entire expressive power of natural languages. L20n keeps simple things simple, and at the same time makes complex things possible. This is the JavaScript implementation of L20n.
langid.py	2,328	over 6 years ago	Stand-alone language identification system
langtech			A host of resources provided in SVN by the University of Tromsø. Details are and in English
LEGO Unified Concepticon	0	about 13 years ago	Material relating to the LEGO Unified Concepticon
Lex4All	21	about 6 years ago	pronunciation LEXicons for Any Low-resource Language
lexdb			LexDB is a lexical cognate tracking database. It stores the full provenance of all lexemes and cognate judgements, and allows export into a number of nexus dialects. The database is written in the flexible python/django web framework
LfMerge	2	over 1 year ago	Send/Receive for languageforge.org
liblevenshtein	67	almost 6 years ago	A library for generating Finite State Transducers based on Levenshtein Automata
libpalaso	44	over 1 year ago	Palaso Library: A set of .Net libraries useful for developers of Language Software
LinGO Grammar Matrix			The LinGO Grammar Matrix is a framework for the development of broad-coverage, precision, implemented grammars for diverse languages
Lingpy	126	over 2 years ago	LingPy: Python library for quantitative tasks in historical linguistics
Linguistica			Linguistica is a program designed to explore the unsupervised learning of natural language, with primary focus on morphology (word-structure). It runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed
long-press	305	almost 7 years ago	jQuery plugin to ease the writing of accented or rare characters.
low-resource-pos-tagging-2014	9	over 10 years ago	Low-Resource POS-Tagging: 2014
lrl	2	about 13 years ago	For work concerning low resource languages
MacVoikko	6	over 11 years ago	An OS X spelling server based on Voikko
Machine	28	over 1 year ago	Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages (used by FLEx)
Make-extensions	6	over 8 years ago	Scripts for generating hunspell spellchecking extensions
mgiza	161	about 5 years ago	A word alignment tool based on the famous GIZA++, extended to support multi-threading, resume training and incremental training
Minority Translate			Minority Translate is a simple program for helping content generation on smaller sized Wikipedias (actually any sized) by giving pointers to existing articles in other language Wikipedias, so that the user can easily translate or adapt existing texts and thus increase the size and usability of their Wikipedia editions
morfessor	186	almost 6 years ago	Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
morpholm	3	about 13 years ago	Morphology-aware language models
morph-test	2	over 5 years ago	A python script to run tests for generation and analysis of a morphological transducer built using the Giella infrastructure. Works with Hfst, Xerox' fst tools, and with Foma
mosesdecoder	1,585	about 2 years ago	Moses, the machine translation system
moz-l10n-tiers	0	over 12 years ago	Creates a pseudo-locale to evaluate string prioritization for l10n
mukurtucms	84	over 1 year ago	The Mukurtu Content Management System (CMS) is an Internet-based platform designed to enable archiving of digital cultural resources
mythes	40	about 3 years ago	MyThes is a simple thesaurus that uses a structured text data file and an index file with binary search to lookup words and phrases and return information on part of speech, meanings, and synonyms
myWorkSafe	1	almost 8 years ago	Smart & Simple Backup for Language Development Workers.
nabu	19	over 1 year ago	nabu is a digital media item management system that provides a catalog of audio and video items, metadata for these items, and information about the workflow status of the items
Natural	10,670	almost 2 years ago	general natural language facilities for node
NIST 2008 Open Machine Translation Evaluation
NLTK	13,694	over 1 year ago	Natural Language Tool Kit. NLTK Source
node-panlex	6	over 7 years ago	node.js client for PanLex
norma	20	over 5 years ago	A tool for automatic spelling normalization
nplm	14	almost 11 years ago	Fork of with some efficiency tweaks and adaptation for use in mosesdecoder
octothorpe	0	about 13 years ago	CouchDB-powered wiki thing
OdtXslt	2	about 9 years ago	Perform XSLT transform on contents of a package (such as ODT, Docx, etc.)
old-webapp	4	over 11 years ago	Online Linguistic Database --- software for creating web applications to collaboratively document languages.
old	1	almost 6 years ago	The Online Linguistic Database (OLD): software for linguistic fieldwork.
old-pyramid	8	about 3 years ago	Online Linguistic Database migrated to the Pyramid framework
OmegaT-hfst-tokenizer	2	over 6 years ago	OmegaT-hfst-tokenizer provides fst-based tokenisation in OmegaT
OpenDataKit			Open Data Kit (ODK) is an open-source suite of tools that helps organizations author, field, and manage mobile data collection solutions
OpenNLP	1,449	over 1 year ago	The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
ops-devbox	8	about 3 years ago	Ansible playbook for a (linux) developer machine
panlex-tools	8	over 3 years ago	This package contains scripts to transform lexical resources into a format suitable for importing into PanLex. Documentation may be found at
pdsc-collection-viewer	4	almost 4 years ago	Paradisec Collection Browser
paradigm	1	over 5 years ago	PARADIGM is a .Net (C#) implementation of Joseph E. Grimes' 1983 work entitled "Affix Positions and Cooccurrences: The PARADIGM Program"
pathway	7	about 2 years ago	Preparing language data for publication
pdfdroplet	7	over 1 year ago	Library and GUI for imposition of PDF pages (e.g. 2-up)
pepper	23	over 1 year ago	Pepper is a pluggable, Java-based, open source converter framework for linguistic data
phonology-assistant	10	over 3 years ago	Phonology Assistant is a discovery tool. Provided with a corpus of phonetic data, it automatically charts the sounds and through its searching capabilities, helps a user discover and test the rules of sound in a language
pressagio	19	over 6 years ago	Pressagio is a library that predicts text based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string
PrimerPro	1	almost 8 years ago	The purpose of PrimerPro is to assist the literacy worker in the development of primers for a given language
pyDelphin	80	almost 2 years ago	Python libraries for DELPH-IN (Friendly Fork)
RBGParser	46	over 10 years ago	Graph-based Dependency Parser
Rosetta Pangloss	0	over 11 years ago	The Rosetta Project's Pangloss system
salm	11	over 8 years ago	SALM: Suffix Array and its Applications in Empirical Language Processing by Joy
Salt	15	over 3 years ago	A graph-based model to store and manipulate linguistic data
saymore	6	over 1 year ago	A tool for common language documentation tasks such as keeping all the resulting files and metadata organized, converting files to archive formats, and transcription
Secwepemc-Facebook	13	over 11 years ago	Translate Facebook into unsupported languages
SegParser	9	over 10 years ago	Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing
SeedLing	11	over 8 years ago	Building and Using A Seed Corpus for the Human Language Project
Skype in your language	3	over 10 years ago	Translate Skype into unsupported languages
solid	1	over 1 year ago	Solid is a software tool that can be used to check, clean up, and convert Standard Format (e.g. Toolbox) lexicon data
SPHERE Conversion Tools			Many LDC corpora contain speech files in NIST SPHERE format. The programs below convert SPHERE files to other formats
StandardFormatLib	0	over 11 years ago	Standard Format Library
Stanford CoreNLP	9,727	over 1 year ago	Stanford CoreNLP: A Java suite of core NLP tools.
Stanford CoreNLP Python	612	over 8 years ago	Python wrapper for Stanford CoreNLP tools
stanza	7,315	over 1 year ago	Stanford NLP group's shared Python tools
str2ipa	10	almost 11 years ago	Pronunciation dictionaries for languages with close-to-phonetic writing systems
sugali	2	about 4 years ago	This is a legacy repository of the language identification project for many (many) languages project for the software project course, NLP projects for low-resource languages
SuGarLike	1	about 12 years ago	Language Identification for Low Resource Languages (by Susanne, Guy and Liling)
SyllabiPy	44	over 3 years ago	Python interface for universal syllabification algorithms
tasty-imitation-keyboard			A custom keyboard for iOS8+ that serves as a tasty imitation of the default Apple keyboard. Built using Swift and the latest Apple technologies!
TECkit	18	over 2 years ago	A Text Encoding Conversion toolkit
teny	3	almost 14 years ago	Tools for low-resource machine translation
TeraDict	6	about 7 years ago	Translate English words into hundreds of languages!
Tesseract.js	35,553	over 1 year ago	Pure Javascript OCR for 62 Languages 📖🎉🖥
TexNLP	14	over 14 years ago	TexNLP: Texas Natural Language Processing tools
TiMBL			TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases
Toney	5	almost 12 years ago	Tone Classification Software
Field Linguist's Toolbox			Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data
Toolbox Scripts for ELAN	0	over 11 years ago	Mirror of Alexander Koenig's Toolbox Scripts
ToolsForFieldLinguistics	9	over 7 years ago	A collection of scripts and recipes for linguistics
transcriber	2	about 11 years ago	An HTML5 transcription tool for Aikuma
translitit-engine	2	over 8 years ago	A transliteration engine written in JavaScript
Tsammalex data	6	about 8 years ago	Tsammalex is a multilingual lexical database on plants and animals
tweet2learn	3	over 7 years ago	An app to make it easier to use your native language on Twitter
twitter_langid	15	over 9 years ago	A hierarchical character-word neural network for language identification
UniversalDependencies docs	275	over 1 year ago	Universal Dependencies online documentation
UniversalDependencies tools	207	over 1 year ago	Various utilities for processing the data
VocBench			VocBench is a web-based, multilingual, editing and workflow tool that manages thesauri, authority lists and glossaries using SKOS-XL
wavesurfer.js	8,890	over 1 year ago	Navigable waveform built on Web Audio and Canvas (Also has an ELAN plugin)
web-template	3	over 11 years ago	This is a web-based template that may be used to present language learning resources to aid language revitalization efforts. It includes a talking dictionary, and a phrasicon, containing sentences and phrases
webcorpus	8	over 11 years ago	This project is a collection of scripts and programs for creating a webcorpus from crawled data
wikt2dict	53	almost 4 years ago	Wiktionary parser tool for many language editions
wikipron	323	over 1 year ago	Retrieves IPA pronunciations for Wiktionary entries
Word Generator			WordGenerator generates hypothetical words from specifications of their syllable structure
WordBoundary			An experiment in the detection and segmentation of word boundaries
wordbyword	1	almost 12 years ago	WordByWord is a free, open source, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe at CIDLeS with the support of the Foundation for Endangered Languages
WSI4URLang	0	almost 6 years ago	Word Sense Induction (WSI) for Under-resourced Languages (URLang)
XDXF_Makedict	228	about 2 years ago	XDXF dictionary format and "makedict" dictionary converting software (official repository)
Keyboard Layout Configuration Helpers
jQuery.IME	175	over 1 year ago	jQuery Input Method Editor used on Wikipedia
kbdgen	16	over 2 years ago	Generate keyboards and keyboard layouts for Windows, macOS, X11, iOS, Android and Chrome, from a single, simple yaml file. Also registers languages unknown to Windows, so that after installation, there is a correct and robust association between the designated BCP 47 code (including full support for ISO 639-3) and installed language tools such as keyboards, spelling checkers and other tools
Keyboard	1,780	almost 4 years ago	Virtual Keyboard using jQuery
Keyboards	153	over 1 year ago	Open Source Keyman keyboards
Keyman	405	over 1 year ago	Keyman cross platform input methods. Keyman makes it possible for you to type in over 1,000 languages on Windows, iPhone, iPad, Android tablets and phones, and even instantly in your web browser.
keyboardlayouteditor	248	about 4 years ago	Keyboard Layout Editor
Keyboard layout editor	1,332	almost 2 years ago	Keyboard Layout Editor
lipika-ime	117	about 2 years ago	Input Method Engine (IME) for Mac OS X with built-in support for all Indic Languages
XKeyboardConfig			The non-arch keyboard configuration database for X Window. The goal is to provide the consistent, well-structured, frequently released open source of X keyboard configuration data for X Window System implementations (free, open source and commercial). The project is targeted to XKB-based systems
Annotation
AGTK	0	over 10 years ago	AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs. (Original project is on SourceForge.)
brendano	8	about 11 years ago	Graph Fragment Language for Easy Syntactic Annotation
ELAN			ELAN is a professional tool for the creation of complex annotations on video and audio resources
eopas	9	about 3 years ago	ETHNOER Online Presentation and Annotation System
FLAT - FoLia Linguistic Annotation Tool	111	about 2 years ago	FLAT is a web-based linguistic annotation environment based around the FoLiA format, a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure
gfl_syntax	8	about 11 years ago	Graph Fragment Language for Easy Syntactic Annotation
graf-python	21	about 12 years ago	The library graf-python is an open source Python implementation to parse and write GrAF/XML files as described in ISO 24612. The parser of the library creates an annotation graph from the files. The user may then query the annotation graph via the API of graf-python
kwaras	8	over 2 years ago	Tools for ELAN corpus management
LDC Word Aligner	2	over 8 years ago	LDC Word Aligner is a software tool used for manual annotation of word alignment developed to support Arabic-English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to generate over 1,000,000 tokens of annotated word alignment data from a variety of genres including broadcast, newswire and web-based sources.
poio-analyzer	13	almost 13 years ago	Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor lets users add morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt
poio-api	18	about 8 years ago	Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F…
pyannotation	16	almost 14 years ago	PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files
XTrans			XTrans is a next generation multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics
Format Specifications
spec	22	over 3 years ago	The official specification for the DLx linguistic data format.
FoLiA	61	about 2 years ago	FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange
xdxf_makedict	228	about 2 years ago	XDXF dictionary format and "makedict" dictionary converting software (official repository)
i18n-related Repositories
Express-Lingua	66	over 12 years ago	An i18n middleware for the Express.js framework
Polyglot.js			Give your JavaScript the ability to speak many languages
Transifex			System for providing a nice, user-friendly, project-oriented approach to translating files. Great for non-technical users, free for open-source projects, decent for minority languages, though it can take a while to get a new language added to the Transifex system because the ticketing system Transifex uses results in them losing tickets sometimes. Provides translation memory, ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared
Audio automation
arctic-prompts	1	over 10 years ago	Generate prompts PDF for CMU ARCTIC dataset
AudioWebService	4	over 3 years ago	A simple Node.js server which accepts uploads of audio and runs them through Praat
AuToBI	58	over 7 years ago	Automatic prosodic annotation tool written in Java
BashScriptsForPhonetics	0	almost 13 years ago	( of a dormant project)
esv-text-audio-aligner	93	over 13 years ago	ESV Text/Audio Aligner to programmatically obtain the timings for each word in the corresponding audio
html5-audio-read-along	192	almost 9 years ago	HTML5 Audio Read-Along
ipa-chart	131	over 5 years ago	International Phonetic Alphabet (IPA) Unicode Chart and Character Picker
kaldi-svn-archive	16	about 11 years ago	A read-only archive of the original Kaldi SVN repository (mainly to keep sandboxes available)
lex4all	1	about 12 years ago	pronunciation LEXicons for Any Low-resource Language ( of a student project)
Montreal-Forced-Aligner	1,364	over 1 year ago	Python interface for forced text/speech alignment
node-pocketsphinx	243	over 7 years ago
opensauce	5	about 9 years ago	GNU Octave-compatible version of VoiceSauce
pocketsphinx	3,981	over 1 year ago	PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop
pocketsphinx-ios-demo	75	about 8 years ago	Simple demo for iOS
pocketsphinx-python	338	about 4 years ago	Python module installed with setup.py
pocketsphinx-ruby	13	about 11 years ago	Ruby speech recognition with Pocketsphinx
pocketsphinx-wp-demo	21	over 10 years ago	Demo to run pocketsphinx on WP8 platform
pocketsphinx.js	1,493	over 6 years ago	Speech recognition in JavaScript
praat-py	0	over 13 years ago	From my PhD days: Praat-Py is a custom build of Praat, the computer program used by linguists for doing phonetic analysis on sound files, to allow for scripts to be written in the Python programming language, rather than in Praat's built-in language. ( of a dormant project)
Praat-Scripts	53	over 4 years ago	Mietta's Scripts
PraatTextGridJS	12	over 4 years ago	A small library which can parse TextGrid into json and json into TextGrid
PraatontheWeb	39	almost 5 years ago	Web implementation of Praat. Source code, running demo scripts on web, samples and documentation
prosodicParsing	2	about 14 years ago	different kinds of HMMs to use for incorporating prosody into basic parsing
Prosodylab-Aligner	333	about 6 years ago	Python interface for forced audio alignment using HTK and SoX
prosodylab.alignertools	12	over 11 years ago
Recordmp3js	2	almost 11 years ago	Record MP3 files directly from the browser using JS and HTML
sphinx4	1,411	almost 4 years ago	Pure Java speech recognition library
sphinxbase	527	about 4 years ago
sphinxtrain	183	over 1 year ago
TLSphinx	15	over 7 years ago	Swift wrapper around Pocketsphinx
Text-to-Speech (TTS)
espeak			eSpeak is a compact open source software speech synthesizer for English and other languages, for Linux and Windows.
MARY TTS	2,385	almost 2 years ago	MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure Java
Ossian			Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision
Automatic Speech Recognition (ASR)
Elpis	152	about 2 years ago	Elpis is software for creating speech recognition models and applying them to the transcription of audio. As of 2022, it gives access to Kaldi and Huggingface Transformers
kaldi	14,362	over 1 year ago	This is now the official location of the Kaldi project
Persephone	157	over 3 years ago	Persephone aims to make state-of-the-art phonemic transcription accessible to people involved in language documentation, who have a training corpus of about one to four hours of transcribed speech. As of 2022, Persephone is superseded by Elpis
Text automation
clld	54	almost 2 years ago	Cross Linguistic Linked Data python library
LaTeX2HTML5	61	about 4 years ago	LaTeX web components
MultilingualCorporaExtractor	0	about 13 years ago	Node io Spider for extracting multilingual corpora ( of a student project)
SeedLing	2	about 12 years ago	Building and Using A Seed Corpus for the Human Language Project ( of a student project)
Experimentation
experigen	35	almost 6 years ago	A framework for creating linguistic experiments
GamifyPsycholinguisticsExperiments	0	about 14 years ago	A simple node server to gamify linguistics experiments; runs offline on a laptop for small-scale experiments and online on a server for large-scale experiments. Data is sent to a Google spreadsheet. ( of a dormant project)
OpenSesame	242	over 1 year ago	Graphical experiment builder for the social sciences
OPrime	0	almost 12 years ago	Open Source Experimentation Libraries - Online and Offline for Android and HTML5
psychopyMegProsody	0	over 13 years ago	Runs MegProsody using PsychoPy
PsychScript	4	over 11 years ago	A HTML5/Javascript library for running behavioural experiments online
Flashcards
Anki	19,289	over 1 year ago	Anki is a program to make and share flashcard decks (including audio) for any language or writing system.
awesome-anki	1,649	over 1 year ago	A curated list of awesome Anki add-ons, decks and resources
VocabLift	3	about 12 years ago	Language-learning tool that uses vocabulary from LIFT-format dictionaries produced by programs such as Fieldworks Language Explorer and WeSay
Natural language generation
OpenCCG	206	over 5 years ago	OpenCCG library for parsing and realization with CCG. Includes mini-grammars for Inuit, Nezperce, Basque and others
Computing systems
Common Language Resources and Technology Infrastructure Norway / Clarino			One of their projects (not clearly listed here) provides an online system for language analysis, so users can connect resources visually, dump in text, and get a result. Kind of like Yahoo! Pipes, but for language processing. Uses the cluster
Android Applications
Aikuma	30	over 10 years ago	Android software for recording and translation
Android Speech Recognition Trainer	3	over 7 years ago	Speech recognition training app for low resource languages which interfaces with FieldDB corpora
android-template	0	over 11 years ago	This is a template of an Android word-learning app that may be used as a way to introduce a language. It includes a quiz. For the documentation, go to
AndroidFieldDB	3	about 7 years ago	An Android app which lets the user build a custom visual and auditory vocabulary, useful for guided anomia treatment and self designed language lessons by heritage speakers
AndroidFieldDBElicitationRecorder	2	over 12 years ago	A general purpose video recording tool
AndroidLanguageLessons	2	over 7 years ago	Lets heritage speakers create self designed language lessons
AndroidProductionExperiment	0	almost 13 years ago	Android App to run perception experiments
Bevara	3	over 12 years ago	Android Phone Application designed for Linguistic Fieldwork to help preserve, maintain, and save endangered languages
ojoVoz			A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. For more information, please go to
pocketsphinx-android	235	over 6 years ago	pocketsphinx build for Android
pocketsphinx-android-demo	549	over 7 years ago
Chrome Extensions
babelfrog	16	over 7 years ago	Chrome extension to help learn languages as you browse
DictionaryChromeExtension	6	over 11 years ago	Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries)
FieldDB
FieldDB	79	over 3 years ago	An offline/online field database which adapts to its user's terminology and I-Language, has plugins for various data automation routines along the process of primary data collection to cleaning to publication and archival.
FieldDB / FieldDB Webservices/Components/Plugins
AndroidLanguageLearningClientForFieldDB-sikuli	0	almost 12 years ago	Sikuli tests for AndroidLanguageLearningClientForFieldDB
AuthenticationWebService	0	over 3 years ago	A node.js web service which manages users and corpora creation and authentication
bower-fielddb-angular	0	almost 11 years ago	A bower repository which hosts fielddb-angular components, bower install fielddb-angular --save
bower-fielddb	0	about 6 years ago	A bower repository which hosts fielddb core components, bower install fielddb --save
fielddb-spreadsheet-sikuli	1	over 11 years ago	sikuli tests for the spreadsheet module
FieldDBActivityFeed	0	over 11 years ago	A fielddb activity feed widget which can be embedded in other codebases, websites etc
FieldDBGlosser	0	over 9 years ago	A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. bower install fielddb-glosser --save
FieldDBLexicon	0	over 8 years ago	A lexicon browser/editor web widget for FieldDB databases
LanguageClassDashboard	0	almost 12 years ago	App which provides a view of FieldDB corpora for language teachers
LexiconWebService	0	about 6 years ago	A node.js ElasticSearch wrapper for indexing/training lexicons from corpora
LexiconWebServiceSample	1	about 14 years ago	A node.js web server which implements the fieldlinguist's lexicon API for the FieldDB project
Academic Research Paper-Specific Repositories
Gargantua	12	over 10 years ago	Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010
ldc-kiy	0	about 13 years ago	Materials for: The experimental state of mind in elicitation: illustrations from tonal fieldwork. Submitted to Language Documentation & Conservation,
Learning to map into a Universal POS tagset			Yuan Zhang, Roi Reichart, Regina Barzilay and Amir Globerson
low-resource-pos-tagging-2014	9	over 10 years ago	and Published in: Learning a Part-of-Speech Tagger from Two Hours of Annotation. . In Proceedings of NAACL 2013. And in: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. . In Proceedings of ACL 2013
orthotree	10	over 11 years ago	Linguistic family tree based on orthographic distance
type-supervised-tagging-2012emnlp	1	over 10 years ago	This repository contains the code, scripts, and instructions needed to reproduce the results in the paper: Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. . In Proceedings of EMNLP 2012. This code is frozen as of the version used to obtain the results in the paper. It will not be maintained. To see the updated code, visit
visualizing-language	1	over 14 years ago	For visualizations of WALS and other typological databases
WALS-APiCS	0	over 11 years ago	Code for working with WALS-APiCS (Atlas of Pidgin and Creole Language Structures) complexity metrics
Example Repositories
CorpusWebService	0	over 4 years ago	über-simple node.js proxy to enable CORS requests for CouchDB
CorporaForFieldLinguistics	3	about 9 years ago	Small corpora from diverse language typologies, useful for testing scripts
startR	0	over 13 years ago
lucenerevolution-2013	0	over 13 years ago	Demo examples for linguistics in Lucene and Solr
berlin-buzzwords-2013	0	about 13 years ago	Demo examples for Lucene, Solr, ElasticSearch and OpenNLP from Berlin Buzzwords 2013 talk
Fonts
fontinline	4	almost 8 years ago	Make inline stroke paths from an outline font
Noto Fonts	2,466	over 3 years ago	Noto is Google’s free font family that aims to support all the world’s scripts. Its design goal is to achieve visual harmonization across languages. Noto fonts are under Apache License 2.0
Unicodify			Unicodify is a suite of programs for converting text in a variety of 8-bit encodings to Unicode (using the UTF-16 encoding). Unicodify was particularly designed to handle HTML-based text using non-ISCII 8-bit fonts to render South Asian scripts. However, elements of the suite can map other types of non-ASCII 8-bit encodings, such as Latin-2, ISCII and PASCII
Corpora
bible-corpus	177	almost 2 years ago	A multilingual parallel corpus created from translations of the Bible
poio-corpus	7	over 1 year ago	The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others
Organizations / On GitHub
batumi			Speech recognition and natural language processing for low-resource languages
BloomBooks
unicode-cldr			Unicode Common Locale Data Repository (CLDR) Project
cmusphinx			Mirror of the SourceForge repositories
dativebase			Tools for working with OLD
divvun			The Divvun group at UiT develops proofing tools, keyboard apps and other language technology solutions for indigenous and minority languages, especially the Sámi languages.
FieldDB
GiellaLT			Home for keyboard layouts, lexicons and morphologies for indigenous and minority languages, especially for morphologically complex languages, using mainly rule-based technologies. The resources are used by Divvun (above) and Giellatekno (below) to build a number of tools for the language communities. Almost everything is open source
HFST			Helsinki Finite-State Technology.
hunspell
keymanapp
langtech			Language Technology Group, University of Melbourne
lex4all
longnow
MontrealCorpusTools
moses-smt			Statistical Machine Translation
mukurtucms
NLTK			Natural Language Toolkit
PhonologicalCorpusTools
Projet de recherche sur l'écriture			Crowdsourcing or conducting large scale psycholinguistics experiments (or statistically significant field linguistics)
prosodylab			Prosodylab at McGill University, Canada
SIL International (Dev)			Another SIL organization, with many repositories
SIL International			SIL (originally known as the Summer Institute of Linguistics, Inc.) is probably the leading organization which provides software and tools tailored for use by field linguists and lexicographers working on endangered languages. A little-known fact is that much of its code is open sourced on GitHub and SIL is happy to receive open source contributions and collaborate on open source projects
SIL NRSI			SIL Non-Roman Script Initiative. The NRSI is a department of SIL International, whose task is to provide assistance, research and development for SIL International and its partners to support the use of non-Roman and complex scripts in language development
StanfordNLP
ucsd-field-lab			University of California, San Diego
UniversalDependencies			Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on an evolution of (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary
utcompling			The University of Texas at Austin's Computational Linguistics Lab.
Organizations / Other OSS Organizations
Giellatekno			Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. We focus on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. They use svn for their code: all of it can be found , sorted by language
LOWLANDS			LOWLANDS – Parsing low-resource languages and domains
LTRC: Language Technologies Research Center IIIT Hyderabad			LTRC addresses the complex problem of understanding and processing natural languages in both speech and text mode. LTRC conducts research on both basic and applied aspects of language technology. It is the largest academic centre of speech and language technology in South Asia. LTRC carries out its work through four labs, which work in synergy with each other, as listed above
The Language Archive			Part of the MPI
Tutorials
How to Write a Spelling Corrector			by
Language Specific Projects / Afrikaans
Afrikaanse rekenaarlinguïstiek (Afrikaans computational linguistics)			— wordlists, corpora, morphological analyser, tagger, word decompounder. Available upon email
Language Specific Projects / Albanian
Apertium rules for Albanian			Machine Translation rules
out-of-copyright-albanian-authors			Out-of-copyright authors scraped from the Albanian-language Wikipedia
Plis keyboard			The Plis keyboard is a keyboard or computer keyboard layout for the Albanian language
spell checking			Here you find a collection of Albanian words and information about them. Aspell, Ispell, and MySpell are included
Language Specific Projects / Alutiiq
wiinaq	2	over 3 years ago	Word Wiinaq is a dictionary web application with automatically generated ending tables and souped-up search capabilities. It is written in Python using Django
Language Specific Projects / Amharic
HornMorpho	5	over 11 years ago	Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs
Language Specific Projects / Basque
Matxin			An open-source transfer machine translation engine. Linguistic information for the translation from Spanish and Basque (es-eu) is included
Language Specific Projects / Bengali
Bangla-অঙ্কুর for Mac			This project aims to develop a phonetic based Bangla typing system for Macintosh computer which can be developed into a transliteration technique in the future
Bengali Writer	1	over 10 years ago	'Bengali Writer' is a set of utilities for computerized editing and typesetting in Bengali, a language of India and Bangladesh. It comprises a set of fonts for Bengali in several formats (METAFONT, BDF, PS), a text editor with spell-checking, export, and more. (Original project is on SourceForge: )
Ekushey			Bangla Computing and Localization Project for the Bangla speaking people
Lekho	0	over 10 years ago	A collection of tools and resources for using Bangla on computers (Original project is on SourceForge: )
Language Specific Projects / Chichewa
Chichewa	9	over 5 years ago	NLP resources for Chichewa
Language Specific Projects / Galician
an-metri-gal	3	over 1 year ago	Metrical analysis of verse text in the Galician language (gl-ES)
android_gl_dict	2	over 13 years ago	Android Galician (gl_ES) Keyboard Dictionary
aspell-gl	1	about 14 years ago	Galician dictionary for aspell
CitiusSentiment	7	about 10 years ago	Sentiment analysis (opinion mining) for Portuguese, English, Spanish, and Galician
CitiusTagger			A PoS-Tagger and Named Entity Classification tool for Portuguese, English, Galician, and Spanish
Conshuga			Galician verb conjugator
corpora	2	over 10 years ago	A collection of corpora of Galician (or Galicia-related) words
DepPattern	10	about 8 years ago	Dependency Syntactic Parsing for Portuguese, Spanish, English, and Galician, including MetaRomance parser
DOGA_scraper	0	almost 12 years ago	Galician Official journal scraper
elFinder-language	1	almost 10 years ago	Galician - Gallego / language for elFinder
EuroWordNetLemon	1	almost 11 years ago	EuroWordNet lemon lexicons generated from the LMF versions of the Multilingual Central Repository (MCR) EuroWordNet lexicons. It includes lexicons for Spanish, Catalan, Basque & Galician
GalegoDroid			Galician Translator for Android
galeXtra	2	about 10 years ago	Multiword Extractor for Portuguese, English, Spanish, Galician, French
Galician-Dependency-Treebank	1	almost 10 years ago	This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006
Galician-Fuzzy-Text-watch	1	over 10 years ago	Based on Fuzzy Text International by Jesse Hallett; uses the Galician language to display the time
galician-locale-for-mac	1	over 10 years ago	Galician locale for Mac OS X
gl-syllabler	1	over 10 years ago	Split Galician-language words into syllables
gl	1	almost 2 years ago	Galician OmegaT Localisation
hunspell-gl-ciencias	0	almost 13 years ago	Project oriented into developing a science and maths Galician language Hunspell dictionary
hunspell-gl	1	over 13 years ago	Galician hunspell dictionaries
hyphen-gl	1	over 14 years ago	Galician hyphenation rules
javagalician-java6	3	about 14 years ago	The Java Galician Locale is an implementation of Java localization SPIs which will allow the Java VM to use the Galician Language (locales "gl" and "gl_ES"), one of the official languages of Spain, which is not included in Sun's JVM distribution
Linguakit	65	over 2 years ago	Multilingual toolkit for NLP: dependency parser, PoS tagger, NERC, multiword extractor, sentiment analysis, etc
ParlamentoGalicia	0	over 13 years ago	Project based on the information extracted from the transcriptions of the sessions held in the Galician Parliament
poss-gl	1	about 15 years ago	Galician translation of Producing Open Source Software, by Karl Fogel
rima	1	over 10 years ago	Find rhyming words in the Galician language
stopwords-gl	1	almost 10 years ago	Galician stopwords collection
texlive-babel-galician	1	over 1 year ago	TeXLive babel-galician package
UD_Galician-CTG	1	over 1 year ago	The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus created at the University of Vigo by the TALG NLP research group
UD_Galician-TreeGal	6	over 1 year ago	The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña)
UL_Galician-TreeGal	0	about 8 years ago	CoNLL-UL Repository for UD_Galician-TreeGal
Language Specific Projects / Galician / Apertium
apertium-cat-glg	1	about 4 years ago	Apertium translation pair for Catalan and Galician
apertium-dict-en-gl	1	over 10 years ago	English-Galician language pair for Apertium
apertium-dict-es-gl	1	over 10 years ago	Spanish-Galician language pair for Apertium
apertium-dict-pt-gl	1	about 13 years ago	Portuguese-Galician language pair for Apertium
apertium-en-gl	0	about 4 years ago	Apertium translation pair for English and Galician
apertium-es-gl	1	about 5 years ago	Apertium translation pair for Spanish and Galician
apertium-glg	0	almost 4 years ago	Apertium linguistic data for Galician
Apertium-pt-gl.pt-gl-LMF	0	about 12 years ago	This is the LMF version of the Apertium bilingual dictionary for the Portuguese and Galician languages
apertium-pt-gl	0	about 5 years ago	Apertium translation pair for Portuguese and Galician
Language Specific Projects / Georgian
awesome-georgia	90	almost 3 years ago	A curated list of awesome libraries and packages specific/related to Georgia (country)
Gadatsqvetilebebi	1	over 9 years ago	გადაწყვეტილებები; Web spider and corpora importer for public legal decisions
GeoWordsDatabase	70	over 8 years ago	Around 310 000 unique Georgian words
Kartuli Speech Recognition	4	over 8 years ago	Creating a speech recognition system for Georgian-speaking Android users. Codebase to turn any webpage from any alphabet into another alphabet; the default is to turn Latin letters into Kartuli. "Do your friends keep commenting on Facebook with English keyboards (either because they forgot to switch, or because they didn't/can't install a Georgian keyboard)? Now you can read the web through კართული eyes."
KartuliChromeExtension	1	over 12 years ago	Chrome application that displays every English letter as a Georgian letter
QartuliDaBunebismetkveleba	1	over 12 years ago	Interactive textbook of mathematics and natural science for 2nd- and 3rd-grade students
SakartvelosUzenaesiSasamartloSarke	0	about 12 years ago	Mirror of the Supreme Court of Georgia
SamartlosSakonstitutsioSasamartdoSarke	0	over 9 years ago	Constitutional Court mirror
translitit-latin-to-mkhedruli-georgian	4	over 9 years ago	A Latin to ქართული (Mkhedruli Georgian) transliteration function written in JavaScript
translitit-mkhedruli-georgian-to-ipa	0	over 9 years ago	A ქართული (Mkhedruli Georgian) to IPA transliteration function written in JavaScript
Declensions	2	over 3 years ago	Methods to generate declensions for Georgian language
Language Specific Projects / Georgian / Fonts
Stichoza/font-larisome	39	over 5 years ago	Iconic font for Georgian currency inspired by Font-Awesome (CSS)
Lotuashvili/BPGNateli	0	almost 11 years ago	Bower package for BPG Nateli font (CSS)
thecotne/georgian-webfonts	17	almost 9 years ago	Package for georgian fonts (CSS)
Language Specific Projects / Georgian / Internationalization and Localization (i18n/l10n)
Stichoza/money-num-to-string	8	over 2 years ago	Convert a number/money to localized string (PHP, JavaScript)
natchkebiailia/NumberToWord	3	almost 9 years ago	Convert numbers to localized strings (JavaScript)
d0ragon/number-to-words-ka	3	about 12 years ago	Convert numbers to localized strings (PHP)
dimakura/ka	0	over 12 years ago	Common functionality for georgian projects (Ruby)
dimakura/ka.js	5	almost 12 years ago	Georgian language support for node and browser (JavaScript)
akalongman/kautilities	4	about 10 years ago	Convert Georgian letters to Latin and vice-versa (PHP)
Landish/Laravel-Ka	8	over 5 years ago	Georgian Language Pack
Landish/RedactorJS-GE			Redactor WYSIWYG HTML Editor Georgian Language Pack (JavaScript)
wenzhixin/bootstrap-table	11,746	over 1 year ago	Bootstrap table with extra features. l10n by and
moment/moment	48,013	almost 2 years ago	A lightweight date library (JavaScript)
ioseb/geokbd	57	over 16 years ago	Georgian keyboard library (JavaScript)
Language Specific Projects / Guarani
ParaMorfo	5	over 11 years ago	morphological analysis and generation of Spanish and Guarani verbs, nouns, and adjectives
Language Specific Projects / Hausa
Hausa	6	almost 11 years ago	Repository for Hausa NLP tools
Language Specific Projects / Hindi
hindi-morph	0	over 13 years ago	An open source morphological analyzer for Hindi
Language Specific Projects / Høgnorsk
hunspell-hn_NO			A beginning to a spellchecking tool for Høgnorsk, a conservative variant of Norwegian Nynorsk, based on a set of corpuses
Language Specific Projects / Icelandic
IceNLP	21	over 2 years ago	IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java
Language Specific Projects / Inuktitut
InuktitutAlignerData	3	about 14 years ago	Scripts for alignment of laboratory speech production data
InuktitutComputing	10	almost 11 years ago	Inuktitut Morphological Analyser, transcoder, transliterator, corpus tools, and lexical lists for working with Inuktitut. Usable online at
Language Specific Projects / Irish
aimsigh	1	almost 3 years ago	Source for the now-defunct aimsigh.com Irish search engine
caighdean	18	almost 2 years ago	Code for standardizing Irish language text
fleiscin	1	over 5 years ago	Irish hyphenation patterns for TeX
GaelSpell	17	over 1 year ago	Sources for an Irish language spell checker
tesseract-gle-uncial	4	over 11 years ago	OCR for old Irish fonts
Language Specific Projects / Kinyarwanda
kin-morph-fst	6	almost 13 years ago	Kinyarwanda morphological analyzer
TurboTagger & TurboParser for Kinyarwanda (download)			TurboTagger & TurboParser for Kinyarwanda
Language Specific Projects / Kurdish
Kurlex			Morphological analyser and lexicon, written in the Alexina framework, licensed under the LGPL-LR
kurmanji-stemmer	1	almost 11 years ago	NLTK based kurmanji stemmer
Language Specific Projects / Lingala
Lingala NLP			NLP tools and resources for Lingala
Language Specific Projects / Lushootseed
Lushootseed	0	about 10 years ago	Joshua Crowgey's work on Lushootseed
Language Specific Projects / Malay
MorfoMalayu	5	over 11 years ago	morphological analysis of Malay words
Language Specific Projects / Malagasy
Global Voices Malagasy Project			This page provides a link to a corpus of parallel news articles in Malagasy and English from the Global Voices project. This corpus was collected and aligned at the sentence level by Victor Chahuneau
Language Specific Projects / Manx
aspell-gv	1	about 14 years ago	Manx Gaelic dictionary for aspell
gaelg	3	almost 2 years ago	NLP resources for Manx Gaelic, mainly in support of the gv2ga MT engine
Language Specific Projects / Migmaq
migmaq-lessons	1	about 11 years ago	Repository for website building Mi'gmaq language lessons
Language Specific Projects / Minderico
fredericajordarzambarino	0	almost 12 years ago	A web-based game for mobile devices in Minderico, based on the "Who Wants to be a Millionaire" TV show
Language Specific Projects / Nishnaabe
Ojibway-iphone-app	0	almost 11 years ago	An iPhone app with audio and images for learning the Ojibway language
OjibwayMap	1	almost 11 years ago	An iPhone app with audio and images for learning Ojibway language and culture
nishanimate	1	about 11 years ago	A desktop app to facilitate Nishnaabe-language acquisition via animations produced by the natural language processing of audio-accompanied text
Language Specific Projects / Oromo
hornmorpho	5	over 11 years ago	Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs
Language Specific Projects / Quechua
AntiMorfo	5	over 11 years ago	morphological analysis and generation of Quechua nouns, adjectives, and verbs and Spanish verbs
Morphology, spellchecker			XFST and FOMA, plus OpenOffice plugin
Language Specific Projects / Sami
divvun-webdemo	2	about 3 years ago	simple webdemo for divvun grammar checker.
Giellatekno			A host of Sámi tools
Oahpa!			A learning portal for Saami languages. Includes WordPress-based, media-rich, lesson-based learning, and morphological and syntactic exercises generated from the morphological and syntactic tools
Neahttadigisánit			A morphologically sensitive dictionary, with modes for 'social media input' (which allows users to type a 'relaxed' version of the orthography ( will be recognized also as ), and also includes a JavaScript bookmarklet to offer click-to-read dictionary lookup functionality. Also available for . Giellatekno does a lot for other minority Uralic languages. Following are some keywords for CTRL+F friendliness:
Language Specific Projects / Scottish Gaelic
aspell-gd	1	about 14 years ago	Scottish Gaelic dictionary for aspell
briathrachan	2	almost 10 years ago	This is the source code to Briathrachan, a Gaelic-English dictionary app for iOS
gaidhlig	3	about 3 years ago	NLP resources for Scottish Gaelic, mainly in support of gd2ga/ga2gd MT engines
gd-fcfg	3	over 14 years ago	Context-free feature-based grammar of Scottish Gaelic in the NLTK format
gdbank	4	over 1 year ago	Some tools and resources for natural language processing of Scottish Gaelic.
hunspell-gd	10	over 3 years ago	Files for building Scottish Gaelic spell checkers
Language Specific Projects / Secwepemctsín
secwepemctsnem	2	almost 16 years ago	A project to help people learn Secwepemctsín
Language Specific Projects / Somali
somorph			Somali morphological and syntactic analyzers and generators built on XFST and VISL-CG Constraint Grammar. Up to date version checked in on repository
qaamuus.net			Morphologically aware dictionary based on lexical resources found online and the Somali morphology
Language Specific Projects / Tigrinya
HornMorpho	5	over 11 years ago	morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs
Language Specific Projects / Uralic
UralicNLP	71	over 1 year ago	A Python library for processing Uralic languages (Finnish, Skolt Sami, Erzya, Moksha, Komi-Zyrian and so on). The library provides an easy programmatic access to Giellatekno resources such as FST morphology and CG disambiguators. Other functionalities include UD parser, API for the and interface to SemFi and SemUr semantic databases. The library is under active development and new features are added from time to time
Language Specific Projects / Zulu
Ukwabelana			An open-source morphological Zulu corpus
