spanish-corpora

Spanish Corpus

A collection of unannotated Spanish text data, compiled from various sources and processed for natural language processing tasks.

Unannotated Spanish 3 Billion Words Corpora

GitHub

92 stars
4 watching
10 forks
Language: Python
Last commit: about 2 years ago
Linked from 1 awesome list

Topics: corpora, linguistics, natural-language-processing, nlp, spanish, spanish-language

Related projects:

| Repository | Description | Stars |
|---|---|---|
| crscardellino/sbwce | A collection of linguistic resources and trained word embeddings for the Spanish language. | 45 |
| bertez/corpora | A collection of Galician language data in JSON format. | 2 |
| cesine/corporaforfieldlinguistics | A collection of small datasets from various languages to test and evaluate NLP scripts. | 3 |
| dccuchile/spanish-word-embeddings | A collection of precomputed word embeddings for the Spanish language, derived from different corpora and computational methods. | 356 |
| botcenter/spanishwordembeddings | Generates Spanish word embeddings using fastText on large corpora. | 9 |
| universaldependencies/ud_galician-ctg | A collection of annotated text data for the Galician language. | 1 |
| dccuchile/beto | A BERT-architecture NLP model pre-trained on Spanish text data. | 492 |
| christos-c/bible-corpus | A multilingual parallel corpus created from translations of the Bible. | 176 |
| cidles/pyannotation | A Python library to access and manipulate linguistically annotated corpus files in various formats. | 16 |
| nytud/hucopa | A dataset of Hungarian translations of English 'cause-and-effect' questions with plausible alternative answers. | 1 |
| botcenter/spanish-sent2vec | Trains a machine learning model to generate sentence embeddings from Spanish text data using the sent2vec algorithm. | 4 |
| several27/fakenewscorpus | A large dataset of news articles with labeled categories to train fake news recognition algorithms. | 387 |
| dav009/latinamericantextresources | A collection of Latin American language and cultural resources for text processing and mining. | 6 |
| poltextlab/hunempoli_corpus | A manually annotated corpus for training and testing machine learning models of Aspect Based Sentiment Analysis (ABSA) in the Hungarian language. | 0 |
| vadno/korkor_pilot | A large annotated corpus of Hungarian text with various linguistic annotations, split into development and test datasets for natural language processing tasks. | 2 |