awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models

Archived


2k stars
78 watching
261 forks
Language: Python
last commit: over 3 years ago
Linked from 1 awesome list

awesome, awesome-list, bert, contextualized-representation, cross-lingual, embedding-models, language-model, natural-language, nlp, pretrained-embedding, pretrained-language-model, pretrained-models, sentence-embeddings, sentence-representations, subword-models, unsupervised-learning, word-embeddings, wordembedding

awesome-sentence-embedding / Word Embeddings

WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models
RusVectōrēs
Efficient Estimation of Word Representations in Vector Space
C 1,527 over 1 year ago
Word2Vec
Word Representations via Gaussian Embedding
Cython 190 over 6 years ago
A Probabilistic Model for Learning Multi-Prototype Word Embeddings
DMTK 116 over 8 years ago
Dependency-Based Word Embeddings
C++
word2vecf
GloVe: Global Vectors for Word Representation
C 6,885 about 1 year ago
GloVe 6,885 about 1 year ago
Sparse Overcomplete Word Vector Representations
C++ 54 about 7 years ago
From Paraphrase Database to Compositional Paraphrase Model and Back
Theano 30 almost 9 years ago
PARAGRAM
Non-distributional Word Vector Representations
Python 62 about 7 years ago
WordFeat 62 about 7 years ago
Joint Learning of Character and Word Embeddings
C 299 about 4 years ago
SensEmbed: Learning Sense Embeddings for Word and Relational Similarity
SensEmbed
Topical Word Embeddings
Cython 315 over 6 years ago
Swivel: Improving Embeddings by Noticing What's Missing
TF 77,177 6 days ago
Counter-fitting Word Vectors to Linguistic Constraints
Python 144 over 4 years ago
counter-fitting (broken)
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec
Chainer 3,149 about 3 years ago
Siamese CBOW: Optimizing Word Embeddings for Sentence Representations
Theano
Siamese CBOW
Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations
Go 803 almost 4 years ago
lexvec 803 almost 4 years ago
Enriching Word Vectors with Subword Information
C++ 25,945 8 months ago
fastText
Morphological Priors for Probabilistic Neural Word Embeddings
Theano 52 almost 8 years ago
A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks
C++ 23 about 1 year ago
charNgram2vec
ConceptNet 5.5: An Open Multilingual Graph of General Knowledge
Python 1,295 over 2 years ago
Numberbatch 1,295 over 2 years ago
Learning Word Meta-Embeddings
Meta-Emb (broken)
Offline bilingual word vectors, orthogonal transformations and the inverted softmax
Python 1,197 over 1 year ago
Multimodal Word Distributions
TF 283 over 5 years ago
word2gm 283 over 5 years ago
Poincaré Embeddings for Learning Hierarchical Representations
Pytorch 1,681 4 months ago
Context encoders as a simple but powerful extension of word2vec
Python 20 over 4 years ago
Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints
TF 64 about 7 years ago
Attract-Repel 64 about 7 years ago
Learning Chinese Word Representations From Glyphs Of Characters
C 30 over 6 years ago
Making Sense of Word Embeddings
Python 212 over 3 years ago
sensegram
Hash Embeddings for Efficient Word Representations
Keras 42 almost 7 years ago
BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
Gensim 1,184 about 2 months ago
BPEmb 1,184 about 2 months ago
SPINE: SParse Interpretable Neural Embeddings
Pytorch 52 almost 5 years ago
SPINE
AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP
Gensim 394 over 3 years ago
AraVec 394 over 3 years ago
Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics
C 846 about 5 years ago
Dict2vec : Learning Word Embeddings using Lexical Dictionaries
C++ 115 almost 4 years ago
Dict2vec 115 almost 4 years ago
Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components
C 99 over 5 years ago
Representation Tradeoffs for Hyperbolic Embeddings
Pytorch 372 over 1 year ago
h-MDS 372 over 1 year ago
Dynamic Meta-Embeddings for Improved Sentence Representations
Pytorch 332 about 4 years ago
DME/CDME 332 about 4 years ago
Analogical Reasoning on Chinese Morphological and Semantic Relations
ChineseWordVectors 11,837 about 1 year ago
Probabilistic FastText for Multi-Sense Word Embeddings
C++ 149 over 6 years ago
Probabilistic FastText 149 over 6 years ago
Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks
TF 290 over 1 year ago
SynGCN
FRAGE: Frequency-Agnostic Word Representation
Pytorch 118 over 5 years ago
Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia
Cython 940 7 months ago
Wikipedia2Vec
Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings
ChineseEmbedding
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
C++ 274 over 1 year ago
VCWE: Visual Character-Enhanced Word Embeddings
Pytorch 15 over 5 years ago
VCWE 15 over 5 years ago
Learning Cross-lingual Embeddings from Twitter via Distant Supervision
Text 14 over 4 years ago
An Unsupervised Character-Aware Neural Approach to Word and Context Representation Learning
TF 0 about 6 years ago
ViCo: Word Embeddings from Visual Co-occurrences
Pytorch 25 about 5 years ago
ViCo 25 about 5 years ago
Spherical Text Embedding
C 175 about 1 year ago
Unsupervised word embeddings capture latent knowledge from materials science literature
Gensim 619 over 1 year ago

awesome-sentence-embedding / OOV Handling

ALaCarte 104 about 6 years ago
Mimick 153 about 5 years ago
CompactReconstruction 9 over 1 year ago

awesome-sentence-embedding / Contextualized Word Embeddings

Language Models are Unsupervised Multitask Learners
TF 22,516 3 months ago
GPT-2 (117M) 22,516 3 months ago
Learned in Translation: Contextualized Word Vectors
Pytorch 472 almost 3 years ago
CoVe 472 almost 3 years ago
Universal Language Model Fine-tuning for Text Classification
Pytorch 26,291 about 1 month ago
ULMFit (English)
Deep contextualized word representations
Pytorch 11,757 almost 2 years ago
ELMO (AllenNLP)
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Pytorch 146 over 4 years ago
LD-Net 146 over 4 years ago
Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation
Pytorch 1,463 over 3 years ago
ELMo 1,463 over 3 years ago
Direct Output Connection for a High-Rank Language Model
Pytorch 12 almost 6 years ago
DOC
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TF 38,204 4 months ago
BERT 38,204 4 months ago
Contextual String Embeddings for Sequence Labeling
Pytorch 13,939 6 days ago
Flair 13,939 6 days ago
Improving Language Understanding by Generative Pre-Training
TF 2,160 almost 6 years ago
GPT 2,160 almost 6 years ago
Multi-Task Deep Neural Networks for Natural Language Understanding
Pytorch 2,238 9 months ago
MT-DNN 2,238 9 months ago
BioBERT: pre-trained biomedical language representation model for biomedical text mining
TF 1,954 over 1 year ago
BioBERT 667 over 4 years ago
Cross-lingual Language Model Pretraining
Pytorch 2,891 almost 2 years ago
XLM 2,891 almost 2 years ago
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
TF 3,611 about 2 years ago
Transformer-XL 3,611 about 2 years ago
Efficient Contextual Representation Learning Without Softmax Layer
Pytorch 4 over 4 years ago
SciBERT: Pretrained Contextualized Embeddings for Scientific Text
Pytorch, TF 1,521 over 2 years ago
SciBERT 1,521 over 2 years ago
Publicly Available Clinical BERT Embeddings
Text 674 about 4 years ago
clinicalBERT
ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission
Pytorch 381 about 2 years ago
ClinicalBERT
ERNIE: Enhanced Language Representation with Informative Entities
Pytorch 1,412 11 months ago
ERNIE
Unified Language Model Pre-training for Natural Language Understanding and Generation
Pytorch 20,176 12 days ago
UniLMv1 (unilm1-large-cased)
HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
Pre-Training with Whole Word Masking for Chinese BERT
Pytorch, TF 9,687 over 1 year ago
BERT-wwm 9,687 over 1 year ago
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TF 6,182 over 1 year ago
XLNet 6,182 over 1 year ago
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
PaddlePaddle 6,318 3 months ago
ERNIE 2.0 6,318 3 months ago
SpanBERT: Improving Pre-training by Representing and Predicting Spans
Pytorch 891 over 1 year ago
SpanBERT 891 over 1 year ago
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Pytorch 30,522 about 1 month ago
RoBERTa 30,522 about 1 month ago
Subword ELMo
Pytorch 12 over 4 years ago
Knowledge Enhanced Contextual Word Representations
TinyBERT: Distilling BERT for Natural Language Understanding
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Pytorch 10,562 6 days ago
Megatron-LM (BERT-345M)
MultiFiT: Efficient Multi-lingual Language Model Fine-tuning
Pytorch 284 over 4 years ago
Extreme Language Model Compression with Optimal Subwords and Shared Projections
MULE: Multimodal Universal Language Embedding
Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
K-BERT: Enabling Language Representation with Knowledge Graph
UNITER: Learning UNiversal Image-TExt Representations
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TF 3,933 about 2 years ago
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Pytorch 30,522 about 1 month ago
BART (bart.base)
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Pytorch, TF2.0 135,022 6 days ago
DistilBERT 135,022 6 days ago
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TF 6,170 2 months ago
T5 6,170 2 months ago
CamemBERT: a Tasty French Language Model
CamemBERT
ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations
Pytorch 643 over 2 years ago
Unsupervised Cross-lingual Representation Learning at Scale
Pytorch 2,891 almost 2 years ago
XLM-R (XLM-RoBERTa): xlmr.large
ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
Pytorch 691 4 months ago
ProphetNet (ProphetNet-large-16GB)
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Pytorch 2,249 over 1 year ago
CodeBERT
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training
Pytorch 20,176 12 days ago
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
TF 2,340 8 months ago
ELECTRA (ELECTRA-Small)
MPNet: Masked and Permuted Pre-training for Language Understanding
Pytorch 288 about 3 years ago
MPNet
ParsBERT: Transformer-based Model for Persian Language Understanding
Pytorch 332 over 1 year ago
ParsBERT
Language Models are Few-Shot Learners
InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training
Pytorch 20,176 12 days ago

awesome-sentence-embedding / Pooling Methods

SIF 1,083 over 5 years ago
TF-IDF 9 almost 2 years ago
P-norm 185 almost 4 years ago
DisC 54 over 4 years ago
GEM 19 almost 6 years ago
SWEM 284 almost 2 years ago
VLAWE 10 over 5 years ago
Efficient Sentence Embedding using Discrete Cosine Transform
fse: Gensim add-on for fast sentence embeddings. Supports Mean, Max, SIF, uSIF 618 over 1 year ago
Efficient Sentence Embedding via Semantic Subspace Analysis

awesome-sentence-embedding / Encoders

Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings
Python 5 over 5 years ago
Distributed Representations of Sentences and Documents
Pytorch 412 almost 2 years ago
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
Theano 426 almost 8 years ago
Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
Theano 2,047 over 4 years ago
Order-Embeddings of Images and Language
Theano 186 about 8 years ago
Towards Universal Paraphrastic Sentence Embeddings
Theano 193 almost 9 years ago
From Word Embeddings to Document Distances
C, Python 538 6 months ago
Learning Distributed Representations of Sentences from Unlabelled Data
Python 124 over 7 years ago
Charagram: Embedding Words and Sentences via Character n-grams
Theano 125 over 8 years ago
Learning Generic Sentence Representations Using Convolutional Neural Networks
Theano 34 about 7 years ago
Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features
C++ 1,193 over 2 years ago
Learning to Generate Reviews and Discovering Sentiment
TF 1,510 over 1 year ago
Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings
Theano 33 over 7 years ago
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Pytorch 2,280 about 3 years ago
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
Pytorch 489 almost 3 years ago
Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
Keras 1,518 4 months ago
StarSpace: Embed All The Things!
C++ 3,946 almost 2 years ago
DisSent: Learning Sentence Representations from Explicit Discourse Relations
Pytorch 33 over 4 years ago
Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
Theano 102 12 months ago
Dual-Path Convolutional Image-Text Embedding with Instance Loss
Matlab 287 over 1 year ago
An efficient framework for learning sentence representations
TF 205 over 5 years ago
Universal Sentence Encoder
TF-Hub
End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions
Theano 16 over 6 years ago
Learning general purpose distributed sentence representations via large scale multi-task learning
Pytorch 311 over 4 years ago
Embedding Text in Hyperbolic Spaces
TF 8 about 7 years ago
Representation Learning with Contrastive Predictive Coding
Keras 525 over 5 years ago
Context Mover’s Distance & Barycenters: Optimal transport of contexts for building representations
Python 21 almost 4 years ago
Learning Universal Sentence Representations with Mean-Max Attention Autoencoder
TF 16 about 6 years ago
Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model
TF-Hub
Improving Sentence Representations with Consensus Maximisation
BioSentVec: creating sentence embeddings for biomedical texts
Python 578 over 1 year ago
Word Mover's Embedding: From Word2Vec to Document Embedding
C, Python 81 almost 6 years ago
A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
Pytorch 1,191 over 1 year ago
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
Pytorch 3,599 7 months ago
Convolutional Neural Network for Universal Sentence Embeddings
Theano 2 over 6 years ago
No Training Required: Exploring Random Encoders for Sentence Classification
Pytorch 184 over 4 years ago
CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model
Pytorch 21 over 5 years ago
GLOSS: Generative Latent Optimization of Sentence Representations
Multilingual Universal Sentence Encoder
TF-Hub
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Pytorch 15,329 6 days ago
SBERT-WK: A Sentence Embedding Method By Dissecting BERT-based Word Models
Pytorch 177 almost 4 years ago
DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
Pytorch 379 over 1 year ago
Language-agnostic BERT Sentence Embedding
TF-Hub
On the Sentence Embeddings from Pre-trained Language Models
TF 529 over 3 years ago

awesome-sentence-embedding / Evaluation

decaNLP 2,344 11 months ago
SentEval 2,087 8 months ago
GLUE 773 over 3 years ago
Exploring Semantic Properties of Sentence Embeddings
Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
Word Embeddings Benchmarks 437 almost 4 years ago
MLDoc 152 over 2 years ago
LexNET 77,177 6 days ago
wordvectors.net 120 over 3 years ago
jiant 1,644 over 1 year ago
Evaluation of sentence embeddings in downstream and linguistic probing tasks
QVEC 75 almost 7 years ago
Grammatical Analysis of Pretrained Sentence Encoders with Acceptability Judgments
EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference
Evaluating Word Embedding Models: Methods and Experimental Results
How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions
Linguistic Knowledge and Transferability of Contextual Representations
LINSPECTOR 23 almost 5 years ago
Pitfalls in the Evaluation of Sentence Embeddings
Probing Multilingual Sentence Representations With X-Probe

awesome-sentence-embedding / Misc

Word Embedding Dimensionality Selection 329 over 4 years ago
Half-Size 128 over 3 years ago
magnitude 1,627 over 1 year ago
To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning
Improving Distributional Similarity with Lessons Learned from Word Embeddings
Misspelling Oblivious Word Embeddings
Single Training Dimension Selection for Word Embedding with PCA
Compressing Word Embeddings via Deep Compositional Code Learning
UER: An Open-Source Toolkit for Pre-training Models
Situating Sentence Embedders with Nearest Neighbor Overlap
German BERT

awesome-sentence-embedding / Vector Mapping

Cross-lingual Word Vectors Projection Using CCA 56 over 6 years ago
vecmap 645 over 1 year ago
MUSE 3,189 about 2 years ago
CrossLingualELMo 98 almost 5 years ago

awesome-sentence-embedding / Articles

Comparing Sentence Similarity Methods
The Current Best of Universal Word Embeddings and Sentence Embeddings
On sentence representations, pt. 1: what can you fit into a single #$!%@*&% blog post?
Deep-learning-free Text and Sentence Embedding, Part 1
Deep-learning-free Text and Sentence Embedding, Part 2
An Overview of Sentence Embedding Methods
Word embeddings in 2017: Trends and future directions
A Walkthrough of InferSent – Supervised Learning of Sentence Embeddings
A survey of cross-lingual word embedding models
Introducing state of the art text classification with universal language models
Document Embedding Techniques
