spanish-corpora

Spanish Corpus

A collection of unannotated Spanish text data, compiled from various sources and processed for natural language processing tasks.

Unannotated Spanish 3 Billion Words Corpora
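Because the corpus is unannotated raw text, a first pass over it usually means streaming the files rather than loading them into memory. The sketch below shows one way to count whitespace-delimited tokens in a single file. It assumes a UTF-8 plain-text file; the file name `sample_es.txt` and the function `count_tokens` are hypothetical and are not taken from the repository, whose actual file layout is not described here.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_tokens(path, encoding="utf-8"):
    """Stream a plain-text file and return (total tokens, per-token counts).

    Tokens are lowercased, whitespace-delimited strings. This is a
    deliberately naive tokenizer: punctuation stays attached to words.
    """
    counts = Counter()
    total = 0
    with open(path, encoding=encoding) as handle:
        for line in handle:  # one line at a time, so memory use stays flat
            tokens = line.lower().split()
            counts.update(tokens)
            total += len(tokens)
    return total, counts


if __name__ == "__main__":
    # Hypothetical stand-in for a corpus file; the real files may differ.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample_es.txt"
        sample.write_text("El perro come.\nEl gato duerme.\n", encoding="utf-8")
        total, counts = count_tokens(sample)
        print(total)                  # 6
        print(counts.most_common(1))  # [('el', 2)]
```

For a corpus on the order of billions of words, the same loop can be run per file and the resulting `Counter` objects summed, since `Counter` supports addition.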

92 stars
4 watching
10 forks
Language: Python
last commit: about 2 years ago
Linked from 1 awesome list

Topics: corpora, linguistics, natural-language-processing, nlp, spanish, spanish-language

Related projects:

Repository | Description | Stars
crscardellino/sbwce | A collection of linguistic resources and trained word embeddings for the Spanish language. | 45
bertez/corpora | A collection of Galician language data in JSON format. | 2
cesine/corporaforfieldlinguistics | A collection of small datasets from various languages to test and evaluate NLP scripts. | 3
dccuchile/spanish-word-embeddings | A collection of precomputed word embeddings for the Spanish language, derived from different corpora and computational methods. | 354
botcenter/spanishwordembeddings | This project generates Spanish word embeddings using fastText on large corpora. | 9
universaldependencies/ud_galician-ctg | This is a collection of annotated text data for the Galician language. | 1
dccuchile/beto | A pre-trained NLP model trained on Spanish text data using the BERT architecture. | 490
christos-c/bible-corpus | A multilingual parallel corpus created from translations of the Bible. | 177
cidles/pyannotation | A Python library to access and manipulate linguistically annotated corpus files in various formats. | 16
nytud/hucopa | A dataset and annotation scheme for Hungarian causal reasoning tasks. | 1
botcenter/spanish-sent2vec | This project trains a machine learning model to generate sentence embeddings from Spanish text data using the sent2vec algorithm. | 4
several27/fakenewscorpus | A large dataset of news articles with labeled categories to train fake news recognition algorithms. | 385
dav009/latinamericantextresources | A collection of linguistic and text resources for Latin America. | 6
poltextlab/hunempoli_corpus | A manually annotated corpus for training and testing machine learning models of Aspect Based Sentiment Analysis (ABSA) in Hungarian language. | 0
vadno/korkor_pilot | A large annotated corpus of Hungarian text with various linguistic annotations, split into development and test datasets for natural language processing tasks. | 2