spanish-corpora

Spanish Corpus

A collection of unannotated Spanish text data, compiled from various sources and processed for natural language processing tasks.

Unannotated Spanish 3 Billion Words Corpora

92 stars

4 watching

10 forks

Language: Python

last commit: over 3 years ago

Linked from 1 awesome list

corpora, linguistics, natural-language-processing, nlp, spanish, spanish-language

Backlinks from these awesome lists:

keon/awesome-nlp

Related projects:

Repository	Description	Stars
crscardellino/sbwce	A collection of linguistic resources and trained word embeddings for the Spanish language.	45
bertez/corpora	A collection of Galician language data in JSON format.	2
cesine/corporaforfieldlinguistics	A collection of small datasets from various languages to test and evaluate NLP scripts.	3
dccuchile/spanish-word-embeddings	A collection of precomputed word embeddings for the Spanish language, derived from different corpora and computational methods.	354
botcenter/spanishwordembeddings	This project generates Spanish word embeddings using fastText on large corpora.	9
universaldependencies/ud_galician-ctg	This is a collection of annotated text data for the Galician language.	1
dccuchile/beto	A BERT-architecture NLP model pre-trained on Spanish text data.	490
christos-c/bible-corpus	A multilingual parallel corpus created from translations of the Bible.	177
cidles/pyannotation	A Python library to access and manipulate linguistically annotated corpus files in various formats.	16
nytud/hucopa	A dataset and annotation scheme for Hungarian causal reasoning tasks.	1
botcenter/spanish-sent2vec	This project trains a machine learning model to generate sentence embeddings from Spanish text data using the sent2vec algorithm.	4
several27/fakenewscorpus	A large dataset of news articles with labeled categories to train fake news recognition algorithms.	385
dav009/latinamericantextresources	A collection of linguistic and text resources for Latin America	6
poltextlab/hunempoli_corpus	A manually annotated corpus for training and testing machine learning models for aspect-based sentiment analysis (ABSA) in Hungarian.	0
vadno/korkor_pilot	A large annotated corpus of Hungarian text with various linguistic annotations, split into development and test datasets for natural language processing tasks.	2