poio-corpus

Language dataset

A collection of language resources extracted from publicly available sources.

The Poio Corpus is a freely available collection of language resources for lesser-used languages. The data is extracted from free sources such as Wikipedia, dictionaries, documents, and websites.

GitHub

7 stars
7 watching
1 fork
Language: Python
last commit: 11 months ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
|---|---|---|
| cidles/poio-analyzer | A collection of software tools for linguists to manage and analyze linguistic data | 13 |
| cidles/poio-api | A Python library for converting linguistic data from various formats into unified annotation graphs | 18 |
| fido-ai/ua-datasets | Provides a collection of datasets for natural language processing in Ukrainian | 55 |
| alexa/massive | A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 538 |
| proycon/python-frog | A Python binding to a C++ NLP tool for Dutch language processing tasks | 47 |
| rodrigopivi/chatito | A tool for generating datasets for AI chatbots and natural language processing tasks using a simple domain-specific language | 876 |
| dativebase/old | Software for creating collaborative databases of language data | 1 |
| alvations/seedling | A corpus and API for human language data | 11 |
| 01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,699 |
| clio-lang/clio | A functional programming language that compiles to JavaScript and is designed for distributed scientific computing | 935 |
| thu-coai/cdial-gpt | A large-scale Chinese conversation dataset and pre-trained dialog models for text generation | 1,782 |
| mirfan899/urdu | A collection of Urdu language datasets for various NLP tasks and applications | 71 |
| poio-nlp/pressagio | A Python library that uses n-gram models to predict text completions | 19 |
| louisowen6/nlp_bahasa_resources | A curated collection of NLP datasets and resources for Bahasa Indonesia | 489 |
| karthikncode/nlp-datasets | A curated list of Natural Language Processing datasets used to train and evaluate NLP models | 919 |