Awesome Image Captioning / Change Log
| May 25: An up-to-date paper list about vision-and-language pre-training is available here |
Awesome Image Captioning / Papers / Survey |
| A Comprehensive Survey of Deep Learning for Image Captioning | Hossain M et al. |
Awesome Image Captioning / Papers / Before 2015
| I2t: Image parsing to text description | Yao B Z et al. |
| Im2Text: Describing Images Using 1 Million Captioned Photographs | Ordonez V et al. |
| Deep Captioning with Multimodal Recurrent Neural Networks | Mao J et al. |
Awesome Image Captioning / Papers / 2015 |
| Show and Tell: A Neural Image Caption Generator | Vinyals O et al. |
| Deep Visual-Semantic Alignments for Generating Image Descriptions | Karpathy A et al. |
| Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation | Chen X et al. |
| Long-term Recurrent Convolutional Networks for Visual Recognition and Description | Donahue J et al. |
| Guiding the Long-Short Term Memory Model for Image Caption Generation | Jia X et al. |
| Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images | Mao J et al. |
| Expressing an Image Stream with a Sequence of Natural Sentences | Park C C et al. |
| Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | Xu K et al. |
| Order-Embeddings of Images and Language | Vendrov I et al. |
| Generating Images from Captions with Attention | Mansimov E et al. |
| Learning FRAME Models Using CNN Filters for Knowledge Visualization | Lu Y et al. |
| Aligning where to see and what to tell: image caption with region-based attention and scene factorization | Jin J et al. |
Awesome Image Captioning / Papers / 2016 |
| Image captioning with semantic attention | You Q et al. |
| DenseCap: Fully Convolutional Localization Networks for Dense Captioning | Johnson J et al. |
| What value do explicit high level concepts have in vision to language problems? | Wu Q et al. |
| Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data | Hendricks L A et al. |
| SPICE: Semantic Propositional Image Caption Evaluation | Anderson P et al. |
| Image Captioning with Deep Bidirectional LSTMs | Wang C et al. |
| Multimodal Pivots for Image Caption Translation | Hitschler J et al. |
| Image Caption Generation with Text-Conditional Semantic Attention | Zhou L et al. |
| DeepDiary: Automatic Caption Generation for Lifelogging Image Streams | Fan C et al. |
| Learning to generalize to new compositions in image understanding | Atzmon Y et al. |
| Generating captions without looking beyond objects | Heuer H et al. |
| Bootstrap, Review, Decode: Using Out-of-Domain Textual Data to Improve Image Captioning | Chen W et al. |
| Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering | Liu H et al. |
| Recurrent Highway Networks with Language CNN for Image Captioning | Gu J et al. |
Awesome Image Captioning / Papers / 2017 |
| Captioning Images with Diverse Objects | Venugopalan S et al. |
| Top-down Visual Saliency Guided by Captions | Ramanishka V et al. |
| Self-Critical Sequence Training for Image Captioning | Rennie S J et al. |
| Dense Captioning with Joint Inference and Visual Context | Yang L et al. |
| Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition | Yufei W et al. |
| A Hierarchical Approach for Generating Descriptive Image Paragraphs | Krause J et al. |
| Deep Reinforcement Learning-based Image Captioning with Embedding Reward | Ren Z et al. |
| Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects | Ting Y et al. |
| Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning | Lu J et al. |
| Attend to You: Personalized Image Captioning with Context Sequence Memory Networks | Park C C et al. |
| SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning | Chen L et al. |
| Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-In-The-Blank Image Captioning | Qing S et al. |
| Areas of Attention for Image Captioning | Pedersoli M et al. |
| Boosting Image Captioning with Attributes | Yao T et al. |
| An Empirical Study of Language CNN for Image Captioning | Gu J et al. |
| Improved Image Captioning via Policy Gradient Optimization of SPIDEr | Liu S et al. |
| Towards Diverse and Natural Image Descriptions via a Conditional GAN | Dai B et al. |
| Paying Attention to Descriptions Generated by Image Captioning Models | Tavakoliy H R et al. |
| Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner | Chen T H et al. |
| Image Caption with Global-Local Attention | Li L et al. |
| Reference Based LSTM for Image Captioning | Chen M et al. |
| Attention Correctness in Neural Image Captioning | Liu C et al. |
| Text-guided Attention Model for Image Captioning | Mun J et al. |
| Contrastive Learning for Image Captioning | Dai B et al. |
| Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge | Vinyals O et al. |
| MAT: A Multimodal Attentive Translator for Image Captioning | Liu C et al. |
| Actor-Critic Sequence Training for Image Captioning | Zhang L et al. |
| What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? | Tanti M et al. |
| Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning | Xian Y et al. |
| Phrase-based Image Captioning with Hierarchical LSTM Model | Tan Y H et al. |
| Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning | Chen H et al. |
Awesome Image Captioning / Papers / 2018 |
| Neural Baby Talk | Lu J et al. |
| Convolutional Image Captioning | Aneja J et al. |
| Learning to Evaluate Image Captioning | Cui Y et al. |
| Discriminability Objective for Training Descriptive Captions | Luo R et al. |
| SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text | Mathews A et al. |
| Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | Anderson P et al. |
| GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints | Chen F et al. |
| Unpaired Image Captioning by Language Pivoting | Gu J et al. |
| Recurrent Fusion Network for Image Captioning | Jiang W et al. |
| Exploring Visual Relationship for Image Captioning | Yao T et al. |
| Rethinking the Form of Latent States in Image Captioning | Dai B et al. |
| Boosted Attention: Leveraging Human Attention for Image Captioning | Chen S et al. |
| "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention | Chen T et al. |
| Learning to Guide Decoding for Image Captioning | Jiang W et al. |
| Stack-Captioning: Coarse-to-Fine Learning for Image Captioning | Gu J et al. |
| Temporal-difference Learning with Sampling Baseline for Image Captioning | Chen H et al. |
| Partially-Supervised Image Captioning | Anderson P et al. |
| A Neural Compositional Paradigm for Image Captioning | Dai B et al. |
| Defoiling Foiled Image Captions | Wang J et al. |
| Punny Captions: Witty Wordplay in Image Descriptions | Chandrasekaran A et al. |
| Object Counts! Bringing Explicit Detections Back into Image Captioning | Aneja J et al. |
| Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Sharma P et al. |
| Attacking visual language grounding with adversarial examples: A case study on neural image captioning | Chen H et al. |
| simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions | Liu et al. |
| Improved Image Captioning with Adversarial Semantic Alignment | Melnyk I et al. |
| Improving Image Captioning with Conditional Generative Adversarial Nets | Chen C et al. |
| CNN+CNN: Convolutional Decoders for Image Captioning | Wang Q et al. |
| Diverse and Controllable Image Captioning with Part-of-Speech Guidance | Deshpande A et al. |
Awesome Image Captioning / Papers / 2019 |
| Unsupervised Image Captioning | Yang F et al. |
| Engaging Image Captioning Via Personality | Shuster K et al. |
| Pointing Novel Objects in Image Captioning | Li Y et al. |
| Auto-Encoding Scene Graphs for Image Captioning | Yang X et al. |
| Context and Attribute Grounded Dense Captioning | Yin G et al. |
| Look Back and Predict Forward in Image Captioning | Qin Y et al. |
| Self-critical n-step Training for Image Captioning | Gao J et al. |
| Intention Oriented Image Captions with Guiding Objects | Zheng Y et al. |
| Describing like humans: on diversity in image captioning | Wang Q et al. |
| Adversarial Semantic Alignment for Improved Image Captions | Dognin P et al. |
| MSCap: Multi-Style Image Captioning With Unpaired Stylized Text | Gao L et al. |
| Fast, Diverse and Accurate Image Captioning Guided By Part-of-Speech | Aditya D et al. |
| Good News, Everyone! Context driven entity-aware captioning for news images | Biten A F et al. |
| CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection | Zhang L et al. |
| Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning | Kim D et al. |
| Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions | Cornia M et al. |
| Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables | Xu Y et al. |
| Meta Learning for Image Captioning | Li N et al. |
| Learning Object Context for Dense Captioning | Li X et al. |
| Hierarchical Attention Network for Image Captioning | Wang W et al. |
| Deliberate Residual based Attention Network for Image Captioning | Gao L et al. |
| Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding | Song L et al. |
| Dense Procedure Captioning in Narrated Instructional Videos | Shi B et al. |
| Informative Image Captioning with External Sources of Information | Zhao S et al. |
| Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning | Fan Z et al. |
| Image Captioning with Unseen Objects | Demirel et al. |
| Look and Modify: Modification Networks for Image Captioning | Sammani et al. |
| Show, Infer and Tell: Contextual Inference for Creative Captioning | Khare et al. |
| SC-RANK: Improving Convolutional Image Captioning with Self-Critical Learning and Ranking Metric-based Reward | Yan et al. |
| Hierarchy Parsing for Image Captioning | Yao T et al. |
| Entangled Transformer for Image Captioning | Li G et al. |
| Attention on Attention for Image Captioning | Huang L et al. |
| Reflective Decoding Network for Image Captioning | Ke L et al. |
| Learning to Collocate Neural Modules for Image Captioning | Yang X et al. |
| Image Captioning: Transforming Objects into Words | Herdade S et al. |
| Adaptively Aligned Image Captioning via Adaptive Attention Time | Huang L et al. |
| Variational Structured Semantic Inference for Diverse Image Captioning | Chen F et al. |
| Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations | Liu F et al. |
| Image Captioning with Compositional Neural Module Networks | Tian J et al. |
| Exploring and Distilling Cross-Modal Information for Image Captioning | Liu F et al. |
| Swell-and-Shrink: Decomposing Image Captioning by Transformation and Summarization | Wang H et al. |
| Hornet: a hierarchical offshoot recurrent network for improving person re-ID via image captioning | Yan S et al. |
| Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach | Kim D J et al. |
| TIGEr: Text-to-Image Grounding for Image Caption Evaluation | Jiang M et al. |
| REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning | Jiang M et al. |
| Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering | Changpinyo S et al. |
| Compositional Generalization in Image Captioning | Nikolaus M et al. |
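Several of the entries above (SPICE, TIGEr, REO) study caption evaluation. As background, a minimal sketch of the clipped n-gram precision idea that underlies BLEU-style caption metrics (this is an illustrative toy, not any paper's official scorer):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision of a candidate caption vs. one reference.

    Each candidate word counts as correct at most as many times as it
    appears in the reference (the "clipping" used by BLEU-style metrics),
    so repeating a correct word is not rewarded twice.
    """
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / len(cand) if cand else 0.0

# "dog" appears twice in the candidate but only once in the reference,
# so only one of the two occurrences is credited:
print(unigram_precision("a dog dog on grass", "a dog runs on the grass"))  # → 0.8
```

Real metrics extend this with higher-order n-grams, multiple references, and a brevity penalty; the clipping step shown here is the common core.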
Awesome Image Captioning / Papers / 2020 |
| MemCap: Memorizing Style Knowledge for Image Captioning | Zhao et al. |
| Unified Vision-Language Pre-Training for Image Captioning and VQA | Zhou L et al. |
| Show, Recall, and Tell: Image Captioning with Recall Mechanism | Wang L et al. |
| Reinforcing an Image Caption Generator using Off-line Human Feedback | Hongsuck Seo P et al. |
| Interactive Dual Generative Adversarial Networks for Image Captioning | Liu et al. |
| Feature Deformation Meta-Networks in Image Captioning of Novel Objects | Cao et al. |
| Joint Commonsense and Relation Reasoning for Image and Video Captioning | Hou et al. |
| Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption | Zhang et al. |
| Normalized and Geometry-Aware Self-Attention Network for Image Captioning | Guo L et al. |
| Object Relational Graph with Teacher-Recommended Learning for Video Captioning | Zhang Z et al. |
| Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs | Chen S et al. |
| X-Linear Attention Networks for Image Captioning | Pan et al. |
| Improving Image Captioning with Better Use of Caption | Shi Z et al. |
| Cross-modal Coherence Modeling for Caption Generation | Alikhani M et al. |
| Improving Image Captioning Evaluation by Considering Inter References Variance | Yi Y et al. |
| MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning | Lei J et al. |
| Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | Kim H et al. |
| Length-Controllable Image Captioning | Deng C et al. |
| Captioning Images Taken by People Who Are Blind | Gurari D et al. |
| Towards Unique and Informative Captioning of Images | Wang Z et al. |
| Learning Visual Representations with Caption Annotations | Sariyildiz M et al. |
| Comprehensive Image Captioning via Scene Graph Decomposition | Zhong Y et al. |
| SODA: Story Oriented Dense Video Captioning Evaluation Framework | Fujita S et al. |
| TextCaps: a Dataset for Image Captioning with Reading Comprehension | Sidorov O et al. |
| Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets | Wang J et al. |
| Learning to Generate Grounded Visual Captions without Localization Supervision | Ma C et al. |
| Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards | Yang X et al. |
| Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos | Chen S et al. |
| CapWAP: Image Captioning with a Purpose | Fisch A et al. |
| X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers | Cho J et al. |
| Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning | Fang Z et al. |
| Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements | Li Y et al. |
| Diverse Image Captioning with Context-Object Split Latent Spaces | Mahajan S et al. |
| RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning | Chiaro R et al. |
Awesome Image Captioning / Dataset |
| nocaps |
| MS COCO |
| Flickr 8k |
| Flickr 30k |
| AI Challenger |
| Visual Genome |
| SBUCaptionedPhotoDataset |
| IAPR TC-12 |
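Several of the datasets above (MS COCO, nocaps) ship captions as COCO-style JSON with `images` and `annotations` arrays. A minimal stdlib-only sketch of grouping reference captions by image id (the tiny inline annotation below is illustrative, not taken from the real files):

```python
import json

# Minimal COCO-style caption annotation; real files such as
# captions_train2014.json follow the same top-level schema.
coco_json = json.dumps({
    "images": [{"id": 42, "file_name": "COCO_train2014_000000000042.jpg"}],
    "annotations": [
        {"id": 1, "image_id": 42, "caption": "A dog runs on the grass."},
        {"id": 2, "image_id": 42, "caption": "A brown dog outdoors."},
    ],
})

def captions_by_image(raw):
    """Group reference captions under their image id."""
    data = json.loads(raw)
    groups = {img["id"]: [] for img in data["images"]}
    for ann in data["annotations"]:
        groups[ann["image_id"]].append(ann["caption"])
    return groups

print(captions_by_image(coco_json)[42])
# → ['A dog runs on the grass.', 'A brown dog outdoors.']
```

In practice the `pycocotools` COCO API does this indexing for you; the sketch just shows what the format contains.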
Awesome Image Captioning / Image Captioning Challenge |
| Microsoft COCO Image Captioning |
| Google AI Blog: Conceptual Captions |
Awesome Image Captioning / Popular Implementations / PyTorch |
| ruotianluo/self-critical.pytorch | 998 stars | updated about 2 years ago |
| ruotianluo/ImageCaptioning.pytorch | 1,458 stars | updated about 2 years ago |
| jiasenlu/NeuralBabyTalk | 525 stars | updated over 6 years ago |
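The first two repositories above implement self-critical sequence training (SCST): the log-likelihood of a sampled caption is weighted by how much its reward (e.g. CIDEr) beats the model's own greedy-decoded baseline. A framework-free sketch of that objective (simplified to one sampled caption; the real implementations batch this and backpropagate through the log-probabilities):

```python
def scst_loss(sampled_logprobs, sampled_reward, greedy_reward):
    """Self-critical policy-gradient loss for one sampled caption.

    sampled_logprobs: per-token log-probabilities of the sampled caption
    sampled_reward:   sentence-level reward of the sampled caption
    greedy_reward:    reward of the greedy caption, used as the baseline

    Minimizing this loss increases the probability of captions that
    score better than greedy decoding, and decreases it otherwise.
    """
    advantage = sampled_reward - greedy_reward
    return -advantage * sum(sampled_logprobs)

# Positive advantage: loss decreases as the sampled caption becomes
# more probable (sum of log-probs closer to 0).
print(scst_loss([-0.5, -1.0, -0.2], sampled_reward=1.2, greedy_reward=0.9))
```

Using the greedy score as the baseline needs no learned critic, which is the main practical appeal of SCST.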
Awesome Image Captioning / Popular Implementations / TensorFlow |
| tensorflow/models/im2txt | 77,258 stars | updated 12 months ago |
| DeepRNN/image_captioning | 790 stars | updated over 3 years ago |
Awesome Image Captioning / Popular Implementations / Torch |
| jcjohnson/densecap | 1,584 stars | updated over 7 years ago |
| karpathy/neuraltalk2 | 5,515 stars | updated about 8 years ago |
| jiasenlu/AdaptiveAttention | 335 stars | updated almost 8 years ago |
Awesome Image Captioning / Popular Implementations / Others |
| emansim/text2image | 594 stars | updated almost 9 years ago |
| apple2373/chainer-caption | 64 stars | updated over 6 years ago |
| peteanderson80/bottom-up-attention | 1,438 stars | updated almost 3 years ago |