poio-corpus
Language dataset
A collection of language resources extracted from publicly available sources.
The Poio Corpus is a freely available collection of language resources for lesser-used languages. The data is extracted from free sources such as Wikipedia, dictionaries, documents, and websites.
7 stars
7 watching
1 fork
Language: Python
last commit: 11 months ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
cidles/poio-analyzer | A collection of software tools for linguists to manage and analyze linguistic data | 13 |
cidles/poio-api | A Python library for converting linguistic data from various formats into unified annotation graphs. | 18 |
fido-ai/ua-datasets | Provides a collection of datasets for natural language processing in Ukrainian. | 55 |
alexa/massive | A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 538 |
proycon/python-frog | A Python binding to a C++ NLP tool for Dutch language processing tasks | 47 |
rodrigopivi/chatito | A tool for generating datasets for AI chatbots and natural language processing tasks using a simple domain-specific language. | 876 |
dativebase/old | Software for creating collaborative databases of language data | 1 |
alvations/seedling | A corpus and API for human language data | 11 |
01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,699 |
clio-lang/clio | A functional programming language that compiles to JavaScript and is designed for distributed scientific computing. | 935 |
thu-coai/cdial-gpt | A large-scale Chinese conversation dataset and pre-trained dialog models for text generation | 1,782 |
mirfan899/urdu | A collection of Urdu language datasets for various NLP tasks and applications | 71 |
poio-nlp/pressagio | A Python library that uses n-gram models to predict text completions | 19 |
louisowen6/nlp_bahasa_resources | A curated collection of NLP datasets and resources for Bahasa Indonesia | 489 |
karthikncode/nlp-datasets | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |