poio-corpus
Language dataset
A collection of language resources extracted from publicly available sources.
The Poio Corpus is a freely available collection of language resources for lesser-used languages. The data is extracted from free sources such as Wikipedia, dictionaries, documents, and websites.
7 stars
7 watching
1 fork
Language: Python
Last commit: 11 months ago
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A collection of software tools for linguists to manage and analyze linguistic data | 13 |
| | A Python library for converting linguistic data from various formats into unified annotation graphs. | 18 |
| | Provides a collection of datasets for natural language processing in Ukrainian. | 57 |
| | A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 541 |
| | A Python binding to a C++ NLP tool for Dutch language processing tasks | 47 |
| | A tool for generating datasets for AI chatbots and natural language processing tasks using a simple domain-specific language. | 877 |
| | Software for creating collaborative databases of language data | 1 |
| | A corpus and API for human language data | 11 |
| | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,743 |
| | A functional programming language that compiles to JavaScript and is designed for distributed scientific computing. | 938 |
| | A large-scale Chinese conversation dataset and pre-trained dialog models for text generation | 1,799 |
| | A collection of Urdu language datasets for various NLP tasks and applications | 71 |
| | A Python library that uses n-gram models to predict text completions | 19 |
| | A curated collection of NLP datasets and resources for Bahasa Indonesia | 496 |
| | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |