poio-corpus
Language dataset
A collection of language resources extracted from publicly available sources.
The Poio Corpus is a freely available collection of language resources for lesser-used languages. The data is extracted from free sources such as Wikipedia, dictionaries, documents, and websites.
7 stars
7 watching
1 fork
Language: Python
last commit: 2 months ago
Linked from 1 awesome list
Related projects:
| Description | Stars |
|---|---|
| A collection of software tools for linguists to manage and analyze linguistic data | 13 |
| A Python library for converting linguistic data from various formats into unified annotation graphs. | 18 |
| Provides a collection of datasets for natural language processing in Ukrainian. | 57 |
| A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 541 |
| A Python binding to a C++ NLP tool for Dutch language processing tasks | 47 |
| A tool for generating datasets for AI chatbots and natural language processing tasks using a simple domain-specific language. | 877 |
| Software for creating collaborative databases of language data | 1 |
| A corpus and API for human language data | 11 |
| A series of large language models trained from scratch to excel in multiple NLP tasks | 7,743 |
| A functional programming language that compiles to JavaScript and is designed for distributed scientific computing. | 938 |
| A large-scale Chinese conversation dataset and pre-trained dialog models for text generation | 1,799 |
| A collection of Urdu language datasets for various NLP tasks and applications | 71 |
| A Python library that uses n-gram models to predict text completions | 19 |
| A curated collection of NLP datasets and resources for Bahasa Indonesia | 496 |
| A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |