Awesome Image Captioning / Change Log
May 25: An up-to-date paper list about vision-and-language pre-training is available here.
Awesome Image Captioning / Papers / Survey
A Comprehensive Survey of Deep Learning for Image Captioning | Hossain M et al.
Awesome Image Captioning / Papers / Before 2015
I2T: Image Parsing to Text Description | Yao B Z et al.
Im2Text: Describing Images Using 1 Million Captioned Photographs | Ordonez V et al.
Deep Captioning with Multimodal Recurrent Neural Networks | Mao J et al.
Awesome Image Captioning / Papers / 2015
Show and Tell: A Neural Image Caption Generator | Vinyals O et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions | Karpathy A et al.
Mind's Eye: A Recurrent Visual Representation for Image Caption Generation | Chen X et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description | Donahue J et al.
Guiding the Long-Short Term Memory Model for Image Caption Generation | Jia X et al.
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images | Mao J et al.
Expressing an Image Stream with a Sequence of Natural Sentences | Park C C et al.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | Xu K et al.
Order-Embeddings of Images and Language | Vendrov I et al.
Generating Images from Captions with Attention | Mansimov E et al.
Learning FRAME Models Using CNN Filters for Knowledge Visualization | Lu Y et al.
Aligning where to see and what to tell: image caption with region-based attention and scene factorization | Jin J et al.
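Several of the 2015 entries above, beginning with Show, Attend and Tell (Xu K et al.), are built on soft visual attention: at each decoding step the model forms a weighted average of CNN region features. A minimal NumPy sketch of that weighting step (the scoring function, shapes, and random inputs here are illustrative assumptions, not any listed paper's exact formulation):

```python
import numpy as np

def soft_attention(features, hidden, w):
    """Weight CNN region features by relevance to the decoder state.

    features: (num_regions, feat_dim) one CNN feature vector per image region
    hidden:   (feat_dim,) current decoder hidden state
    w:        (feat_dim,) learned scoring vector (random here, for illustration)
    """
    # score each region by compatibility with the hidden state
    scores = features @ (w * hidden)              # (num_regions,)
    # softmax turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # context vector: attention-weighted average of region features
    context = weights @ features                  # (feat_dim,)
    return weights, context

rng = np.random.default_rng(0)
feats = rng.normal(size=(14 * 14, 512))           # e.g. a 14x14 conv feature map
alpha, ctx = soft_attention(feats, rng.normal(size=512), rng.normal(size=512))
```

In the papers above, the context vector would then feed the LSTM decoder alongside the previous word's embedding.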
Awesome Image Captioning / Papers / 2016
Image captioning with semantic attention | You Q et al.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning | Johnson J et al.
What value do explicit high level concepts have in vision to language problems? | Wu Q et al.
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data | Hendricks L A et al.
SPICE: Semantic Propositional Image Caption Evaluation | Anderson P et al.
Image Captioning with Deep Bidirectional LSTMs | Wang C et al.
Multimodal Pivots for Image Caption Translation | Hitschler J et al.
Image Caption Generation with Text-Conditional Semantic Attention | Zhou L et al.
DeepDiary: Automatic Caption Generation for Lifelogging Image Streams | Fan C et al.
Learning to generalize to new compositions in image understanding | Atzmon Y et al.
Generating captions without looking beyond objects | Heuer H et al.
Bootstrap, Review, Decode: Using Out-of-Domain Textual Data to Improve Image Captioning | Chen W et al.
Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering | Liu H et al.
Recurrent Highway Networks with Language CNN for Image Captioning | Gu J et al.
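The SPICE entry above (Anderson P et al.) argues for evaluating captions beyond n-gram overlap. For contrast, here is a minimal sketch of the clipped n-gram precision that underlies BLEU-style metrics (single reference, no brevity penalty; a deliberate simplification, not any listed paper's metric):

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    """BLEU-style clipped n-gram precision against a single reference.

    Each candidate n-gram counts as correct at most as many times as it
    appears in the reference: the clipping that stops a degenerate caption
    like "the the the" from scoring perfectly against "the cat".
    """
    ngrams = lambda toks: Counter(zip(*(toks[i:] for i in range(n))))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

# 4 of 6 candidate unigrams appear in the reference
p1 = clipped_ngram_precision("a dog runs on the grass", "a dog is running on grass")
```

Metrics like SPICE instead parse both captions into semantic tuples (objects, attributes, relations) and score tuple overlap, which this word-level counting cannot capture.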
Awesome Image Captioning / Papers / 2017
Captioning Images with Diverse Objects | Venugopalan S et al.
Top-down Visual Saliency Guided by Captions | Ramanishka V et al.
Self-Critical Sequence Training for Image Captioning | Rennie S J et al.
Dense Captioning with Joint Inference and Visual Context | Yang L et al.
Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition | Wang Y et al.
A Hierarchical Approach for Generating Descriptive Image Paragraphs | Krause J et al.
Deep Reinforcement Learning-based Image Captioning with Embedding Reward | Ren Z et al.
Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects | Yao T et al.
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning | Lu J et al.
Attend to You: Personalized Image Captioning with Context Sequence Memory Networks | Park C C et al.
SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning | Chen L et al.
Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-In-The-Blank Image Captioning | Sun Q et al.
Areas of Attention for Image Captioning | Pedersoli M et al.
Boosting Image Captioning with Attributes | Yao T et al.
An Empirical Study of Language CNN for Image Captioning | Gu J et al.
Improved Image Captioning via Policy Gradient Optimization of SPIDEr | Liu S et al.
Towards Diverse and Natural Image Descriptions via a Conditional GAN | Dai B et al.
Paying Attention to Descriptions Generated by Image Captioning Models | Tavakoli H R et al.
Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner | Chen T H et al.
Image Caption with Global-Local Attention | Li L et al.
Reference Based LSTM for Image Captioning | Chen M et al.
Attention Correctness in Neural Image Captioning | Liu C et al.
Text-guided Attention Model for Image Captioning | Mun J et al.
Contrastive Learning for Image Captioning | Dai B et al.
Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge | Vinyals O et al.
MAT: A Multimodal Attentive Translator for Image Captioning | Liu C et al.
Actor-Critic Sequence Training for Image Captioning | Zhang L et al.
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? | Tanti M et al.
Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning | Xian Y et al.
Phrase-based Image Captioning with Hierarchical LSTM Model | Tan Y H et al.
Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning | Chen H et al.
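The Self-Critical Sequence Training entry above casts captioning as REINFORCE with the reward of the model's own greedy decode as the baseline, which removes the need for a learned value function. A toy sketch of the resulting per-caption loss (the reward values below are made up; in practice they would be sentence-level CIDEr scores):

```python
import numpy as np

def scst_loss(token_log_probs, sampled_reward, greedy_reward):
    """Self-critical REINFORCE loss for one sampled caption.

    loss = -(r_sample - r_greedy) * sum_t log p(w_t)
    Minimizing this pushes up the probability of a sampled caption that
    beats the greedy decode, and pushes down one that falls short.
    """
    advantage = sampled_reward - greedy_reward
    return -advantage * float(np.sum(token_log_probs))

# per-token log-probabilities of one sampled caption (toy numbers)
log_probs = np.log([0.4, 0.3, 0.5])
loss_good = scst_loss(log_probs, sampled_reward=1.2, greedy_reward=0.9)  # sample beat greedy
loss_bad = scst_loss(log_probs, sampled_reward=0.6, greedy_reward=0.9)   # sample fell short
```

Because the baseline is just a second decode of the same model, the method needs no extra parameters, which is a large part of why it became the standard fine-tuning recipe in the later entries of this list.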
Awesome Image Captioning / Papers / 2018
Neural Baby Talk | Lu J et al.
Convolutional Image Captioning | Aneja J et al.
Learning to Evaluate Image Captioning | Cui Y et al.
Discriminability Objective for Training Descriptive Captions | Luo R et al.
SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text | Mathews A et al.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | Anderson P et al.
GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints | Chen F et al.
Unpaired Image Captioning by Language Pivoting | Gu J et al.
Recurrent Fusion Network for Image Captioning | Jiang W et al.
Exploring Visual Relationship for Image Captioning | Yao T et al.
Rethinking the Form of Latent States in Image Captioning | Dai B et al.
Boosted Attention: Leveraging Human Attention for Image Captioning | Chen S et al.
"Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention | Chen T et al.
Learning to Guide Decoding for Image Captioning | Jiang W et al.
Stack-Captioning: Coarse-to-Fine Learning for Image Captioning | Gu J et al.
Temporal-difference Learning with Sampling Baseline for Image Captioning | Chen H et al.
Partially-Supervised Image Captioning | Anderson P et al.
A Neural Compositional Paradigm for Image Captioning | Dai B et al.
Defoiling Foiled Image Captions | Wang J et al.
Punny Captions: Witty Wordplay in Image Descriptions | Chandrasekaran A et al.
Object Counts! Bringing Explicit Detections Back into Image Captioning | Aneja J et al.
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Sharma P et al.
Attacking visual language grounding with adversarial examples: A case study on neural image captioning | Chen H et al.
simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions | Liu et al.
Improved Image Captioning with Adversarial Semantic Alignment | Melnyk I et al.
Improving Image Captioning with Conditional Generative Adversarial Nets | Chen C et al.
CNN+CNN: Convolutional Decoders for Image Captioning | Wang Q et al.
Diverse and Controllable Image Captioning with Part-of-Speech Guidance | Deshpande A et al.
Awesome Image Captioning / Papers / 2019
Unsupervised Image Captioning | Yang F et al.
Engaging Image Captioning Via Personality | Shuster K et al.
Pointing Novel Objects in Image Captioning | Li Y et al.
Auto-Encoding Scene Graphs for Image Captioning | Yang X et al.
Context and Attribute Grounded Dense Captioning | Yin G et al.
Look Back and Predict Forward in Image Captioning | Qin Y et al.
Self-critical n-step Training for Image Captioning | Gao J et al.
Intention Oriented Image Captions with Guiding Objects | Zheng Y et al.
Describing like humans: on diversity in image captioning | Wang Q et al.
Adversarial Semantic Alignment for Improved Image Captions | Dognin P et al.
MSCap: Multi-Style Image Captioning With Unpaired Stylized Text | Gao L et al.
Fast, Diverse and Accurate Image Captioning Guided By Part-of-Speech | Deshpande A et al.
Good News, Everyone! Context driven entity-aware captioning for news images | Biten A F et al.
CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection | Zhang L et al.
Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning | Kim D et al.
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions | Cornia M et al.
Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables | Xu Y et al.
Meta Learning for Image Captioning | Li N et al.
Learning Object Context for Dense Captioning | Li X et al.
Hierarchical Attention Network for Image Captioning | Wang W et al.
Deliberate Residual based Attention Network for Image Captioning | Gao L et al.
Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding | Song L et al.
Dense Procedure Captioning in Narrated Instructional Videos | Shi B et al.
Informative Image Captioning with External Sources of Information | Zhao S et al.
Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning | Fan Z et al.
Image Captioning with Unseen Objects | Demirel et al.
Look and Modify: Modification Networks for Image Captioning | Sammani et al.
Show, Infer and Tell: Contextual Inference for Creative Captioning | Khare et al.
SC-RANK: Improving Convolutional Image Captioning with Self-Critical Learning and Ranking Metric-based Reward | Yan et al.
Hierarchy Parsing for Image Captioning | Yao T et al.
Entangled Transformer for Image Captioning | Li G et al.
Attention on Attention for Image Captioning | Huang L et al.
Reflective Decoding Network for Image Captioning | Ke L et al.
Learning to Collocate Neural Modules for Image Captioning | Yang X et al.
Image Captioning: Transforming Objects into Words | Herdade S et al.
Adaptively Aligned Image Captioning via Adaptive Attention Time | Huang L et al.
Variational Structured Semantic Inference for Diverse Image Captioning | Chen F et al.
Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations | Liu F et al.
Image Captioning with Compositional Neural Module Networks | Tian J et al.
Exploring and Distilling Cross-Modal Information for Image Captioning | Liu F et al.
Swell-and-Shrink: Decomposing Image Captioning by Transformation and Summarization | Wang H et al.
Hornet: a hierarchical offshoot recurrent network for improving person re-ID via image captioning | Yan S et al.
Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach | Kim D J et al.
TIGEr: Text-to-Image Grounding for Image Caption Evaluation | Jiang M et al.
REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning | Jiang M et al.
Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering | Changpinyo S et al.
Compositional Generalization in Image Captioning | Nikolaus M et al.
Awesome Image Captioning / Papers / 2020
MemCap: Memorizing Style Knowledge for Image Captioning | Zhao et al.
Unified Vision-Language Pre-Training for Image Captioning and VQA | Zhou L et al.
Show, Recall, and Tell: Image Captioning with Recall Mechanism | Wang L et al.
Reinforcing an Image Caption Generator using Off-line Human Feedback | Seo P H et al.
Interactive Dual Generative Adversarial Networks for Image Captioning | Liu et al.
Feature Deformation Meta-Networks in Image Captioning of Novel Objects | Cao et al.
Joint Commonsense and Relation Reasoning for Image and Video Captioning | Hou et al.
Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption | Zhang et al.
Normalized and Geometry-Aware Self-Attention Network for Image Captioning | Guo L et al.
Object Relational Graph with Teacher-Recommended Learning for Video Captioning | Zhang Z et al.
Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs | Chen S et al.
X-Linear Attention Networks for Image Captioning | Pan et al.
Improving Image Captioning with Better Use of Caption | Shi Z et al.
Cross-modal Coherence Modeling for Caption Generation | Alikhani M et al.
Improving Image Captioning Evaluation by Considering Inter References Variance | Yi Y et al.
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning | Lei J et al.
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | Kim H et al.
Length-Controllable Image Captioning | Deng C et al.
Captioning Images Taken by People Who Are Blind | Gurari D et al.
Towards Unique and Informative Captioning of Images | Wang Z et al.
Learning Visual Representations with Caption Annotations | Sariyildiz M et al.
Comprehensive Image Captioning via Scene Graph Decomposition | Zhong Y et al.
SODA: Story Oriented Dense Video Captioning Evaluation Framework | Fujita S et al.
TextCaps: a Dataset for Image Captioning with Reading Comprehension | Sidorov O et al.
Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets | Wang J et al.
Learning to Generate Grounded Visual Captions without Localization Supervision | Ma C et al.
Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards | Yang X et al.
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos | Chen S et al.
CapWAP: Image Captioning with a Purpose | Fisch A et al.
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers | Cho J et al.
Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning | Fang Z et al.
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements | Li Y et al.
Diverse Image Captioning with Context-Object Split Latent Spaces | Mahajan S et al.
RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning | Chiaro R et al.
Awesome Image Captioning / Dataset
nocaps
MS COCO
Flickr 8k
Flickr 30k
AI Challenger
Visual Genome
SBU Captioned Photo Dataset
IAPR TC-12
Awesome Image Captioning / Image Captioning Challenge
Microsoft COCO Image Captioning
Google AI Blog: Conceptual Captions
Awesome Image Captioning / Popular Implementations / PyTorch
ruotianluo/self-critical.pytorch (997 stars)
ruotianluo/ImageCaptioning.pytorch (1,451 stars)
jiasenlu/NeuralBabyTalk (524 stars)
Awesome Image Captioning / Popular Implementations / TensorFlow
tensorflow/models/im2txt (77,177 stars)
DeepRNN/image_captioning (786 stars)
Awesome Image Captioning / Popular Implementations / Torch
jcjohnson/densecap (1,584 stars)
karpathy/neuraltalk2 (5,511 stars)
jiasenlu/AdaptiveAttention (334 stars)
Awesome Image Captioning / Popular Implementations / Others
emansim/text2image (592 stars)
apple2373/chainer-caption (64 stars)
peteanderson80/bottom-up-attention (1,433 stars)