poio-corpus

Language dataset

A collection of language resources extracted from publicly available sources.

The Poio Corpus is a freely available collection of language resources for lesser-used languages. The data is extracted from free sources such as Wikipedia, dictionaries, documents, and websites.

GitHub

7 stars

7 watching

1 fork

Language: Python

last commit: over 1 year ago

Linked from 1 awesome list

www.poio.eu

Backlinks from these awesome lists:

richardlitt/low-resource-languages

Related projects:

Repository	Description	Stars
cidles/poio-analyzer	A collection of software tools for linguists to manage and analyze linguistic data	13
cidles/poio-api	A Python library for converting linguistic data from various formats into unified annotation graphs	18
fido-ai/ua-datasets	A collection of datasets for natural language processing in Ukrainian	57
alexa/massive	A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset	541
proycon/python-frog	A Python binding to a C++ NLP tool for Dutch language processing tasks	47
rodrigopivi/chatito	A tool for generating datasets for AI chatbots and natural language processing tasks using a simple domain-specific language	877
dativebase/old	Software for creating collaborative databases of language data	1
alvations/seedling	A corpus and API for human language data	11
01-ai/yi	A series of large language models trained from scratch to excel in multiple NLP tasks	7,743
clio-lang/clio	A functional programming language that compiles to JavaScript and is designed for distributed scientific computing	938
thu-coai/cdial-gpt	A large-scale Chinese conversation dataset and pre-trained dialog models for text generation	1,799
mirfan899/urdu	A collection of Urdu language datasets for various NLP tasks and applications	71
poio-nlp/pressagio	A Python library that uses n-gram models to predict text completions	19
louisowen6/nlp_bahasa_resources	A curated collection of NLP datasets and resources for Bahasa Indonesia	496
karthikncode/nlp-datasets	A curated list of Natural Language Processing datasets used to train and evaluate NLP models	919