CorporaForFieldLinguistics

Language corpora

A collection of small datasets from various languages to test and evaluate NLP scripts

Small corpora from typologically diverse languages, useful for testing scripts

3 stars

1 watching

5 forks

Language: HTML

Last commit: about 9 years ago

Linked from 1 awesome list

Backlinks from these awesome lists:

richardlitt/low-resource-languages

Related projects:

Repository	Description	Stars
cesine/toolsforfieldlinguistics	A collection of reusable scripts and tools for field linguistics research	9
josecannete/spanish-corpora	A collection of unannotated Spanish text data, compiled from various sources and processed for natural language processing tasks.	92
fielddb/lex4all	Tool for automating pronunciation lexicon creation for low-resource languages using speech recognition and machine learning algorithms.	1
bertez/corpora	A collection of Galician language data in JSON format.	2
phonologicalcorpustools/corpustools	A collection of tools and libraries for analyzing and processing phonological data in various languages	115
languagemachines/libfolia	A C++ library for working with linguistic annotation formats	16
lex4all/lex4all	Software tool to generate pronunciation lexicons for low-resource languages using speech recognition and machine learning algorithms.	21
pld-linux/apertium-dict-es-gl	A dictionary file for Spanish-Galician machine translation with the rule-based Apertium system	1
kscanne/chichewa	A collection of NLP resources for Chichewa, a Bantu language, including a basic lexicon and a script for morphological generation.	9
somelinguist/vocablift	Language-learning tool that organizes vocabulary from LIFT-format dictionaries into digital flashcards.	3
fielddb/lucenerevolution-2013	Demos and examples for utilizing linguistics in natural language processing with Lucene and Solr	0
poio-nlp/poio-corpus	A collection of language resources extracted from publicly available sources.	7
crscardellino/sbwce	A collection of linguistic resources and trained word embeddings for the Spanish language.	45
karthikncode/nlp-datasets	A curated list of Natural Language Processing datasets used to train and evaluate NLP models.	919
digitallinguistics/data-format	A data format standard for digital humanities and linguistics corpora, specifying a JSON schema for structured representation of linguistic data.	22