CorporaForFieldLinguistics

Language corpora

A collection of small corpora from typologically diverse languages, for testing and evaluating NLP scripts.

3 stars
1 watching
5 forks
Language: HTML
Last commit: over 7 years ago
Linked from 1 awesome list

Related projects:

| Repository | Description | Stars |
|---|---|---|
| cesine/toolsforfieldlinguistics | A collection of reusable scripts and tools for field linguistics research. | 9 |
| josecannete/spanish-corpora | A collection of unannotated Spanish text data, compiled from various sources and processed for natural language processing tasks. | 92 |
| fielddb/lex4all | Tool for automating pronunciation lexicon creation for low-resource languages using speech recognition and machine learning algorithms. | 1 |
| bertez/corpora | A collection of Galician language data in JSON format. | 2 |
| phonologicalcorpustools/corpustools | A collection of tools and libraries for analyzing and processing phonological data in various languages. | 113 |
| languagemachines/libfolia | A C++ library for working with linguistic annotation formats. | 15 |
| lex4all/lex4all | Software tool to generate pronunciation lexicons for low-resource languages using speech recognition and machine learning algorithms. | 21 |
| pld-linux/apertium-dict-es-gl | A dictionary file for machine translation between two languages using a specific rule-based machine translation system. | 1 |
| kscanne/chichewa | A collection of NLP resources for a Bantu language, including a basic lexicon and script for morphological generation. | 9 |
| somelinguist/vocablift | Language-learning tool that organizes vocabulary from LIFT-format dictionaries into digital flashcards. | 3 |
| fielddb/lucenerevolution-2013 | Demo examples and tools for exploring linguistic features in Lucene and Solr, two popular search engine technologies. | 0 |
| poio-nlp/poio-corpus | A collection of language resources extracted from publicly available sources. | 7 |
| crscardellino/sbwce | A collection of linguistic resources and trained word embeddings for the Spanish language. | 45 |
| karthikncode/nlp-datasets | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |
| digitallinguistics/data-format | A data format standard for digital humanities and linguistics corpora, specifying a JSON schema for structured representation of linguistic data. | 21 |