spanish-corpora
Spanish Corpus
A collection of unannotated Spanish text data, compiled from various sources and processed for natural language processing tasks.
Unannotated Spanish 3 Billion Words Corpora
92 stars
4 watching
10 forks
Language: Python
Last commit: about 2 years ago
Linked from 1 awesome list
corpora, linguistics, natural-language-processing, nlp, spanish, spanish-language
Related projects:
Repository | Description | Stars |
---|---|---|
crscardellino/sbwce | A collection of linguistic resources and trained word embeddings for the Spanish language. | 45 |
bertez/corpora | A collection of Galician language data in JSON format. | 2 |
cesine/corporaforfieldlinguistics | A collection of small datasets from various languages for testing and evaluating NLP scripts. | 3 |
dccuchile/spanish-word-embeddings | A collection of precomputed word embeddings for the Spanish language, derived from different corpora and computational methods. | 356 |
botcenter/spanishwordembeddings | This project generates Spanish word embeddings using fastText on large corpora. | 9 |
universaldependencies/ud_galician-ctg | A collection of annotated text data for the Galician language. | 1 |
dccuchile/beto | A BERT-based language model pre-trained on Spanish text data. | 492 |
christos-c/bible-corpus | A multilingual parallel corpus created from translations of the Bible. | 176 |
cidles/pyannotation | A Python library to access and manipulate linguistically annotated corpus files in various formats. | 16 |
nytud/hucopa | A dataset of Hungarian translations of English 'cause-and-effect' questions with plausible alternative answers. | 1 |
botcenter/spanish-sent2vec | This project trains a machine learning model to generate sentence embeddings from Spanish text data using the sent2vec algorithm. | 4 |
several27/fakenewscorpus | A large dataset of news articles with labeled categories for training fake news recognition algorithms. | 387 |
dav009/latinamericantextresources | A collection of Latin American language and cultural resources for text processing and mining. | 6 |
poltextlab/hunempoli_corpus | A manually annotated corpus for training and testing machine learning models of Aspect Based Sentiment Analysis (ABSA) in Hungarian. | 0 |
vadno/korkor_pilot | A large annotated corpus of Hungarian text with various linguistic annotations, split into development and test datasets for natural language processing tasks. | 2 |