 poio-corpus
 Language dataset
 A collection of language resources extracted from publicly available sources.
The Poio Corpus is a freely available collection of language resources for lesser-used languages. The data is extracted from free sources such as Wikipedia, dictionaries, documents, and websites.
7 stars
 7 watching
 1 fork
 
Language: Python 
last commit: 11 months ago 
Linked from 1 awesome list
 Related projects:
| Repository | Description | Stars | 
|---|---|---|
|  | A collection of software tools for linguists to manage and analyze linguistic data | 13 | 
|  | A Python library for converting linguistic data from various formats into unified annotation graphs. | 18 | 
|  | Provides a collection of datasets for natural language processing in Ukrainian. | 57 | 
|  | A collection of tools and modeling code for a large multilingual Natural Language Understanding dataset | 541 | 
|  | A Python binding to a C++ NLP tool for Dutch language processing tasks | 47 | 
|  | A tool for generating datasets for AI chatbots and natural language processing tasks using a simple domain-specific language. | 877 | 
|  | Software for creating collaborative databases of language data | 1 | 
|  | A corpus and API for human language data | 11 | 
|  | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,743 | 
|  | A functional programming language that compiles to JavaScript and is designed for distributed scientific computing. | 938 | 
|  | A large-scale Chinese conversation dataset and pre-trained dialog models for text generation | 1,799 | 
|  | A collection of Urdu language datasets for various NLP tasks and applications | 71 | 
|  | A Python library that uses n-gram models to predict text completions | 19 | 
|  | A curated collection of NLP datasets and resources for Bahasa Indonesia | 496 | 
|  | A curated list of Natural Language Processing datasets used to train and evaluate NLP models. | 919 |