CorporaForFieldLinguistics
Language corpora
A collection of small datasets from various languages to test and evaluate NLP scripts
Small corpora from typologically diverse languages, useful for testing scripts
3 stars
1 watching
5 forks
Language: HTML
last commit: over 7 years ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
cesine/toolsforfieldlinguistics | A collection of reusable scripts and tools for field linguistics research | 9 |
josecannete/spanish-corpora | A collection of unannotated Spanish text data, compiled from various sources and processed for natural language processing tasks. | 92 |
fielddb/lex4all | Tool for automating pronunciation lexicon creation for low-resource languages using speech recognition and machine learning algorithms. | 1 |
bertez/corpora | A collection of Galician language data in JSON format. | 2 |
phonologicalcorpustools/corpustools | A collection of tools and libraries for analyzing and processing phonological data in various languages | 115 |
languagemachines/libfolia | A C++ library for working with linguistic annotation formats | 16 |
lex4all/lex4all | Software tool to generate pronunciation lexicons for low-resource languages using speech recognition and machine learning algorithms. | 21 |
pld-linux/apertium-dict-es-gl | A dictionary file for Spanish-Galician machine translation with the rule-based Apertium system | 1 |
kscanne/chichewa | A collection of NLP resources for Chichewa, a Bantu language, including a basic lexicon and a script for morphological generation. | 9 |
somelinguist/vocablift | Language-learning tool that organizes vocabulary from LIFT-format dictionaries into digital flashcards. | 3 |
fielddb/lucenerevolution-2013 | Demos and examples for utilizing linguistics in natural language processing with Lucene and Solr | 0 |
poio-nlp/poio-corpus | A collection of language resources extracted from publicly available sources. | 7 |
crscardellino/sbwce | A collection of linguistic resources and trained word embeddings for the Spanish language. | 45 |
karthikncode/nlp-datasets | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |
digitallinguistics/data-format | A data format standard for digital humanities and linguistics corpora, specifying a JSON schema for structured representation of linguistic data. | 22 |