| Awesome Papers / Multimodal Instruction Tuning | 
 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding |  |  |  | 
  | Github | 396 | 11 months ago |  | 
  | Apollo: An Exploration of Video Understanding in Large Multimodal Models |  |  |  | 
  | Github |  |  |  | 
  | Demo |  |  |  | 
  | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions |  |  |  | 
  | Github | 2,616 | 11 months ago |  | 
  | StreamChat: Chatting with Streaming Video |  |  |  | 
  | CompCap: Improving Multimodal Large Language Models with Composite Captions |  |  |  | 
  | LinVT: Empower Your Image-level Large Language Model to Understand Videos |  |  |  | 
  | Github | 13 | 11 months ago |  | 
  | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling |  |  |  | 
  | Github | 6,394 | 11 months ago |  | 
  | Demo |  |  |  | 
  | NVILA: Efficient Frontier Visual Language Models |  |  |  | 
  | Github | 2,146 | 11 months ago |  | 
  | Demo |  |  |  | 
  | T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs |  |  |  | 
  | Github | 44 | 11 months ago |  | 
  | TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability |  |  |  | 
  | Github | 67 | 11 months ago |  | 
  | ChatRex: Taming Multimodal LLM for Joint Perception and Understanding |  |  |  | 
  | Github | 106 | 11 months ago |  | 
  | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding |  |  |  | 
  | Github | 329 | 12 months ago |  | 
  | Demo |  |  |  | 
  | Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate |  |  |  | 
  | Github | 89 | 11 months ago |  | 
  | AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark |  |  |  | 
  | Github | 57 | 12 months ago |  | 
  | Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models |  |  |  | 
  | Huggingface |  |  |  | 
  | Demo |  |  |  | 
  | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |  |  |  | 
  | Github | 3,613 | 11 months ago |  | 
  | Demo |  |  |  | 
  | LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture |  |  |  | 
  | Github | 183 | about 1 year ago |  | 
  | EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders |  |  |  | 
  | Github | 549 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation |  |  |  | 
  | Github | 69 | about 1 year ago |  | 
  | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models |  |  |  | 
  | Github | 2,365 | 11 months ago |  | 
  | VITA: Towards Open-Source Interactive Omni Multimodal LLM |  |  |  | 
  | Github | 1,005 | about 1 year ago |  | 
  | LLaVA-OneVision: Easy Visual Task Transfer |  |  |  | 
  | Github | 3,099 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | MiniCPM-V: A GPT-4V Level MLLM on Your Phone |  |  |  | 
  | Github | 12,870 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | VILA^2: VILA Augmented VILA |  |  |  | 
  | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models |  |  |  | 
  | EVLM: An Efficient Vision-Language Model for Visual Understanding |  |  |  | 
  | IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model |  |  |  | 
  | Github | 26 | 11 months ago |  | 
  | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output |  |  |  | 
  | Github | 2,616 | 11 months ago |  | 
  | Demo |  |  |  | 
  | OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding |  |  |  | 
  | Github | 1,336 | 11 months ago |  | 
  | DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming |  |  |  | 
  | Github | 9 | 11 months ago |  | 
  | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs |  |  |  | 
  | Github | 1,799 | about 1 year ago |  | 
  | Long Context Transfer from Language to Vision |  |  |  | 
  | Github | 347 | 11 months ago |  | 
  | video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models |  |  |  | 
  | Github | 1,091 | 11 months ago |  | 
  | TroL: Traversal of Layers for Large Language and Vision Models |  |  |  | 
  | Github | 88 | over 1 year ago |  | 
  | Unveiling Encoder-Free Vision-Language Models |  |  |  | 
  | Github | 246 | about 1 year ago |  | 
  | VideoLLM-online: Online Video Large Language Model for Streaming Video |  |  |  | 
  | Github | 251 | about 1 year ago |  | 
  | RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics |  |  |  | 
  | Github | 64 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Comparison Visual Instruction Tuning |  |  |  | 
  | Github |  |  |  | 
  | Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models |  |  |  | 
  | Github | 143 | 12 months ago |  | 
  | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |  |  |  | 
  | Github | 957 | 11 months ago |  | 
  | Parrot: Multilingual Visual Instruction Tuning |  |  |  | 
  | Github | 34 | about 1 year ago |  | 
  | Ovis: Structural Embedding Alignment for Multimodal Large Language Model |  |  |  | 
  | Github | 575 | 11 months ago |  | 
  | Matryoshka Query Transformer for Large Vision-Language Models |  |  |  | 
  | Github | 101 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models |  |  |  | 
  | Github | 106 | over 1 year ago |  | 
  | Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models |  |  |  | 
  | Github | 102 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Libra: Building Decoupled Vision System on Large Language Models |  |  |  | 
  | Github | 153 | 11 months ago |  | 
  | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts |  |  |  | 
  | Github | 136 | over 1 year ago |  | 
  | How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites |  |  |  | 
  | Github | 6,394 | 11 months ago |  | 
  | Demo |  |  |  | 
  | Graphic Design with Large Multimodal Model |  |  |  | 
  | Github | 102 | over 1 year ago |  | 
  | BRAVE: Broadening the visual encoding of vision-language models |  |  |  | 
  | InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD |  |  |  | 
  | Github | 2,616 | 11 months ago |  | 
  | Demo |  |  |  | 
  | Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs |  |  |  | 
  | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |  |  |  | 
  | Github | 254 | over 1 year ago |  | 
  | VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing |  |  |  | 
  | Github | 406 | about 1 year ago |  | 
  | TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model |  |  |  | 
  | LITA: Language Instructed Temporal-Localization Assistant |  |  |  | 
  | Github | 151 | about 1 year ago |  | 
  | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models |  |  |  | 
  | Github | 3,229 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training |  |  |  | 
  | MoAI: Mixture of All Intelligence for Large Language and Vision Models |  |  |  | 
  | Github | 314 | over 1 year ago |  | 
  | DeepSeek-VL: Towards Real-World Vision-Language Understanding |  |  |  | 
  | Github | 2,145 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document |  |  |  | 
  | Github | 1,849 | 11 months ago |  | 
  | Demo |  |  |  | 
  | The All-Seeing Project V2: Towards General Relation Comprehension of the Open World |  |  |  | 
  | Github | 466 | about 1 year ago |  | 
  | GROUNDHOG: Grounding Large Language Models to Holistic Segmentation |  |  |  | 
  | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling |  |  |  | 
  | Github | 798 | about 1 year ago |  | 
  | Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning |  |  |  | 
  | Github | 58 | 11 months ago |  | 
  | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model |  |  |  | 
  | Github | 249 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | CoLLaVO: Crayon Large Language and Vision mOdel |  |  |  | 
  | Github | 93 | over 1 year ago |  | 
  | Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models |  |  |  | 
  | Github | 494 | over 1 year ago |  | 
  | CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations |  |  |  | 
  | Github | 153 | over 1 year ago |  | 
  | MobileVLM V2: Faster and Stronger Baseline for Vision Language Model |  |  |  | 
  | Github | 1,076 | over 1 year ago |  | 
  | GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning |  |  |  | 
  | Github | 43 | 12 months ago |  | 
  | Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study |  |  |  | 
  | Coming soon |  |  |  | 
  | LLaVA-NeXT: Improved reasoning, OCR, and world knowledge |  |  |  | 
  | Github | 20,683 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | MoE-LLaVA: Mixture of Experts for Large Vision-Language Models |  |  |  | 
  | Github | 2,023 | 11 months ago |  | 
  | Demo |  |  |  | 
  | InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model |  |  |  | 
  | Github | 2,616 | 11 months ago |  | 
  | Demo |  |  |  | 
  | Yi-VL |  |  |  | 
  | Github | 7,743 | 11 months ago |  | 
  | SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities |  |  |  | 
  | ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning |  |  |  | 
  | Github | 108 | about 1 year ago |  | 
  | MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices |  |  |  | 
  | Github | 1,076 | over 1 year ago |  | 
  | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |  |  |  | 
  | Github | 6,394 | 11 months ago |  | 
  | Demo |  |  |  | 
  | Osprey: Pixel Understanding with Visual Instruction Tuning |  |  |  | 
  | Github | 781 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | CogAgent: A Visual Language Model for GUI Agents |  |  |  | 
  | Github | 6,182 | over 1 year ago |  | 
  | Coming soon |  |  |  | 
  | Pixel Aligned Language Models |  |  |  | 
  | Coming soon |  |  |  | 
  | VILA: On Pre-training for Visual Language Models |  |  |  | 
  | Github | 2,146 | 11 months ago |  | 
  | See, Say, and Segment: Teaching LMMs to Overcome False Premises |  |  |  | 
  | Coming soon |  |  |  | 
  | Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models |  |  |  | 
  | Github | 1,831 | 11 months ago |  | 
  | Demo |  |  |  | 
  | Honeybee: Locality-enhanced Projector for Multimodal LLM |  |  |  | 
  | Github | 435 | over 1 year ago |  | 
  | Gemini: A Family of Highly Capable Multimodal Models |  |  |  | 
  | OneLLM: One Framework to Align All Modalities with Language |  |  |  | 
  | Github | 601 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Lenna: Language Enhanced Reasoning Detection Assistant |  |  |  | 
  | Github | 78 | over 1 year ago |  | 
  | VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding |  |  |  | 
  | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding |  |  |  | 
  | Github | 314 | 12 months ago |  | 
  | Making Large Multimodal Models Understand Arbitrary Visual Prompts |  |  |  | 
  | Github | 302 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Dolphins: Multimodal Language Model for Driving |  |  |  | 
  | Github | 51 | over 1 year ago |  | 
  | LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning |  |  |  | 
  | Github | 255 | over 1 year ago |  | 
  | Coming soon |  |  |  | 
  | VTimeLLM: Empower LLM to Grasp Video Moments |  |  |  | 
  | Github | 231 | over 1 year ago |  | 
  | mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model |  |  |  | 
  | Github | 1,958 | about 1 year ago |  | 
  | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |  |  |  | 
  | Github | 748 | over 1 year ago |  | 
  | Coming soon |  |  |  | 
  | LLMGA: Multimodal Large Language Model based Generation Assistant |  |  |  | 
  | Github | 463 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | ChartLlama: A Multimodal LLM for Chart Understanding and Generation |  |  |  | 
  | Github | 202 | almost 2 years ago |  | 
  | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions |  |  |  | 
  | Github | 2,616 | 11 months ago |  | 
  | Demo |  |  |  | 
  | LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge |  |  |  | 
  | Github | 124 | over 1 year ago |  | 
  | An Embodied Generalist Agent in 3D World |  |  |  | 
  | Github | 379 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |  |  |  | 
  | Github | 3,071 | 11 months ago |  | 
  | Demo |  |  |  | 
  | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |  |  |  | 
  | Github | 895 | about 1 year ago |  | 
  | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning |  |  |  | 
  | Github | 131 | almost 2 years ago |  | 
  | SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models |  |  |  | 
  | Github | 2,732 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models |  |  |  | 
  | Github | 1,849 | 11 months ago |  | 
  | Demo |  |  |  | 
  | LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents |  |  |  | 
  | Github | 717 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | NExT-Chat: An LMM for Chat, Detection and Segmentation |  |  |  | 
  | Github | 227 | over 1 year ago |  | 
  | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration |  |  |  | 
  | Github | 2,365 | 11 months ago |  | 
  | Demo |  |  |  | 
  | OtterHD: A High-Resolution Multi-modality Model |  |  |  | 
  | Github | 3,570 | over 1 year ago |  | 
  | CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding |  |  |  | 
  | Coming soon |  |  |  | 
  | GLaMM: Pixel Grounding Large Multimodal Model |  |  |  | 
  | Github | 797 | 11 months ago |  | 
  | Demo |  |  |  | 
  | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning |  |  |  | 
  | Github | 18 | almost 2 years ago |  | 
  | MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning |  |  |  | 
  | Github | 25,490 | about 1 year ago |  | 
  | SALMONN: Towards Generic Hearing Abilities for Large Language Models |  |  |  | 
  | Github | 1,091 | 11 months ago |  | 
  | Ferret: Refer and Ground Anything Anywhere at Any Granularity |  |  |  | 
  | Github | 8,509 | about 1 year ago |  | 
  | CogVLM: Visual Expert For Large Language Models |  |  |  | 
  | Github | 6,182 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Improved Baselines with Visual Instruction Tuning |  |  |  | 
  | Github | 20,683 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |  |  |  | 
  | Github | 751 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs |  |  |  | 
  | Github | 79 | over 1 year ago |  | 
  | Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants |  |  |  | 
  | Github | 59 | over 1 year ago |  | 
  | AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model |  |  |  | 
  | InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition |  |  |  | 
  | Github | 2,616 | 11 months ago |  | 
  | DreamLLM: Synergistic Multimodal Comprehension and Creation |  |  |  | 
  | Github | 402 | 11 months ago |  | 
  | Coming soon |  |  |  | 
  | An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models |  |  |  | 
  | Coming soon |  |  |  | 
  | TextBind: Multi-turn Interleaved Multimodal Instruction-following |  |  |  | 
  | Github | 47 | about 2 years ago |  | 
  | Demo |  |  |  | 
  | NExT-GPT: Any-to-Any Multimodal LLM |  |  |  | 
  | Github | 3,344 | 12 months ago |  | 
  | Demo |  |  |  | 
  | Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics |  |  |  | 
  | Github | 19 | about 2 years ago |  | 
  | ImageBind-LLM: Multi-modality Instruction Tuning |  |  |  | 
  | Github | 5,775 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning |  |  |  | 
  | PointLLM: Empowering Large Language Models to Understand Point Clouds |  |  |  | 
  | Github | 670 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models |  |  |  | 
  | Github | 43 | over 1 year ago |  | 
  | MLLM-DataEngine: An Iterative Refinement Approach for MLLM |  |  |  | 
  | Github | 39 | over 1 year ago |  | 
  | Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models |  |  |  | 
  | Github | 37 | about 2 years ago |  | 
  | Demo |  |  |  | 
  | Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities |  |  |  | 
  | Github | 5,179 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages |  |  |  | 
  | Github | 1,098 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data |  |  |  | 
  | Github | 93 | almost 2 years ago |  | 
  | BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions |  |  |  | 
  | Github | 270 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions |  |  |  | 
  | Github | 360 | over 1 year ago |  | 
  | The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World |  |  |  | 
  | Github | 466 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | LISA: Reasoning Segmentation via Large Language Model |  |  |  | 
  | Github | 1,923 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding |  |  |  | 
  | Github | 550 | 11 months ago |  | 
  | 3D-LLM: Injecting the 3D World into Large Language Models |  |  |  | 
  | Github | 979 | over 1 year ago |  | 
  | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning |  |  |  | 
  | Demo |  |  |  | 
  | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs |  |  |  | 
  | Github | 505 | over 2 years ago |  | 
  | Demo |  |  |  | 
  | SVIT: Scaling up Visual Instruction Tuning |  |  |  | 
  | Github | 164 | over 1 year ago |  | 
  | GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest |  |  |  | 
  | Github | 517 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? |  |  |  | 
  | Github | 231 | about 2 years ago |  | 
  | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding |  |  |  | 
  | Github | 1,958 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Visual Instruction Tuning with Polite Flamingo |  |  |  | 
  | Github | 63 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding |  |  |  | 
  | Github | 259 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic |  |  |  | 
  | Github | 748 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | MotionGPT: Human Motion as a Foreign Language |  |  |  | 
  | Github | 1,531 | over 1 year ago |  | 
  | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration |  |  |  | 
  | Github | 1,568 | over 1 year ago |  | 
  | Coming soon |  |  |  | 
  | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark |  |  |  | 
  | Github | 305 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models |  |  |  | 
  | Github | 1,246 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | MIMIC-IT: Multi-Modal In-Context Instruction Tuning |  |  |  | 
  | Github | 3,570 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning |  |  |  | 
  | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding |  |  |  | 
  | Github | 2,842 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day |  |  |  | 
  | Github | 1,622 | about 1 year ago |  | 
  | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction |  |  |  | 
  | Github | 762 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | PandaGPT: One Model To Instruction-Follow Them All |  |  |  | 
  | Github | 772 | over 2 years ago |  | 
  | Demo |  |  |  | 
  | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst |  |  |  | 
  | Github | 49 | about 2 years ago |  | 
  | Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models |  |  |  | 
  | Github | 513 | almost 2 years ago |  | 
  | DetGPT: Detect What You Need via Reasoning |  |  |  | 
  | Github | 761 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Pengi: An Audio Language Model for Audio Tasks |  |  |  | 
  | Github | 295 | over 1 year ago |  | 
  | VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks |  |  |  | 
  | Github | 956 | about 1 year ago |  | 
  | Listen, Think, and Understand |  |  |  | 
  | Github | 396 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Github | 4,110 | about 1 year ago |  | 
  | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering |  |  |  | 
  | Github | 180 | 11 months ago |  | 
  | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning |  |  |  | 
  | Github | 10,058 | 12 months ago |  | 
  | VideoChat: Chat-Centric Video Understanding |  |  |  | 
  | Github | 3,106 | 11 months ago |  | 
  | Demo |  |  |  | 
  | MultiModal-GPT: A Vision and Language Model for Dialogue with Humans |  |  |  | 
  | Github | 1,478 | over 2 years ago |  | 
  | Demo |  |  |  | 
  | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages |  |  |  | 
  | Github | 308 | about 2 years ago |  | 
  | LMEye: An Interactive Perception Network for Large Language Models |  |  |  | 
  | Github | 48 | over 1 year ago |  | 
  | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model |  |  |  | 
  | Github | 5,775 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality |  |  |  | 
  | Github | 2,365 | 11 months ago |  | 
  | Demo |  |  |  | 
  | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models |  |  |  | 
  | Github | 25,490 | about 1 year ago |  | 
  | Visual Instruction Tuning |  |  |  | 
  | Github | 20,683 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention |  |  |  | 
  | Github | 5,775 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning |  |  |  | 
  | Github | 134 | over 2 years ago |  | 
  | Awesome Papers / Multimodal Hallucination | 
 | Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models |  |  |  | 
  | Github | 28 | 11 months ago |  | 
  | Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations |  |  |  | 
  | Github | 46 | 12 months ago |  | 
  | FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs |  |  |  | 
  | Link |  |  |  | 
  | Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation |  |  |  | 
  | Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs |  |  |  | 
  | Github | 83 | 12 months ago |  | 
  | Evaluating and Analyzing Relationship Hallucinations in LVLMs |  |  |  | 
  | Github | 20 | about 1 year ago |  | 
  | AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention |  |  |  | 
  | Github | 18 | over 1 year ago |  | 
  | CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models |  |  |  | 
  | Coming soon |  |  |  | 
  | Mitigating Object Hallucination via Data Augmented Contrastive Tuning |  |  |  | 
  | Coming soon |  |  |  | 
  | VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap |  |  |  | 
  | Coming soon |  |  |  | 
  | Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback |  |  |  | 
  | Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding |  |  |  | 
  | What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models |  |  |  | 
  | Github | 15 | about 1 year ago |  | 
  | Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization |  |  |  | 
  | Debiasing Multimodal Large Language Models |  |  |  | 
  | Github | 75 | over 1 year ago |  | 
  | HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding |  |  |  | 
  | Github | 72 | 11 months ago |  | 
  | IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding |  |  |  | 
  | Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective |  |  |  | 
  | Github | 39 | about 1 year ago |  | 
  | Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models |  |  |  | 
  | Github | 19 | over 1 year ago |  | 
  | The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs |  |  |  | 
  | Github | 8 | over 1 year ago |  | 
  | Unified Hallucination Detection for Multimodal Large Language Models |  |  |  | 
  | Github | 48 | over 1 year ago |  | 
  | A Survey on Hallucination in Large Vision-Language Models |  |  |  | 
  | Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models |  |  |  | 
  | Hallucination Augmented Contrastive Learning for Multimodal Large Language Model |  |  |  | 
  | Github | 82 | almost 2 years ago |  | 
  | MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations |  |  |  | 
  | Github | 13 | about 1 year ago |  | 
  | Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites |  |  |  | 
  | Github | 8 | over 1 year ago |  | 
  | RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback |  |  |  | 
  | Github | 245 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation |  |  |  | 
  | Github | 293 | about 1 year ago |  | 
  | Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding |  |  |  | 
  | Github | 222 | about 1 year ago |  | 
  | Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization |  |  |  | 
  | Github | 73 | almost 2 years ago |  | 
  | Coming soon |  |  |  | 
  | Mitigating Hallucination in Visual Language Models with Visual Supervision |  |  |  | 
  | HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data |  |  |  | 
  | Github | 41 | over 1 year ago |  | 
  | An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation |  |  |  | 
  | Github | 98 | almost 2 years ago |  | 
  | FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models |  |  |  | 
  | Github | 27 | 12 months ago |  | 
  | Woodpecker: Hallucination Correction for Multimodal Large Language Models |  |  |  | 
  | Github | 617 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models |  |  |  | 
  | HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption |  |  |  | 
  | Github | 28 | over 1 year ago |  | 
  | Analyzing and Mitigating Object Hallucination in Large Vision-Language Models |  |  |  | 
  | Github | 136 | over 1 year ago |  | 
  | Aligning Large Multimodal Models with Factually Augmented RLHF |  |  |  | 
  | Github | 328 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | Evaluation and Mitigation of Agnosia in Multimodal Large Language Models |  |  |  | 
  | CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning |  |  |  | 
  | Evaluation and Analysis of Hallucination in Large Vision-Language Models |  |  |  | 
  | Github | 17 | about 2 years ago |  | 
  | VIGC: Visual Instruction Generation and Correction |  |  |  | 
  | Github | 91 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Detecting and Preventing Hallucinations in Large Vision Language Models |  |  |  | 
  | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning |  |  |  | 
  | Github | 262 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Evaluating Object Hallucination in Large Vision-Language Models |  |  |  | 
  | Github | 187 | over 1 year ago |  | 
  | Awesome Papers / Multimodal In-Context Learning | 
 | Visual In-Context Learning for Large Vision-Language Models |  |  |  | 
  | RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model |  |  |  | 
  | Github | 76 | about 1 year ago |  | 
  | Can MLLMs Perform Text-to-Image In-Context Learning? |  |  |  | 
  | Github | 30 | 12 months ago |  | 
  | Generative Multimodal Models are In-Context Learners |  |  |  | 
  | Github | 1,672 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Hijacking Context in Large Multi-modal Models |  |  |  | 
  | Towards More Unified In-context Visual Understanding |  |  |  | 
  | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning |  |  |  | 
  | Github | 337 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | Link-Context Learning for Multimodal LLMs |  |  |  | 
  | Github | 91 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models |  |  |  | 
  | Github | 3,781 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Med-Flamingo: a Multimodal Medical Few-shot Learner |  |  |  | 
  | Github | 396 | about 2 years ago |  | 
  | Generative Pretraining in Multimodality |  |  |  | 
  | Github | 1,672 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | AVIS: Autonomous Visual Information Seeking with Large Language Models |  |  |  | 
  | MIMIC-IT: Multi-Modal In-Context Instruction Tuning |  |  |  | 
  | Github | 3,570 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Exploring Diverse In-Context Configurations for Image Captioning |  |  |  | 
  | Github | 33 | 11 months ago |  | 
  | Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models |  |  |  | 
  | Github | 1,095 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace |  |  |  | 
  | Github | 23,801 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action |  |  |  | 
  | Github | 940 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction |  |  |  | 
  | Github | 50 | about 2 years ago |  | 
  | Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering |  |  |  | 
  | Github | 270 | over 2 years ago |  | 
  | Visual Programming: Compositional visual reasoning without training |  |  |  | 
  | Github | 697 | about 1 year ago |  | 
  | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA |  |  |  | 
  | Github | 85 | over 3 years ago |  | 
  | Flamingo: a Visual Language Model for Few-Shot Learning |  |  |  | 
  | Github | 3,781 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Multimodal Few-Shot Learning with Frozen Language Models |  |  |  | 
  | Awesome Papers / Multimodal Chain-of-Thought | 
 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models |  |  |  | 
  | Github | 113 | 11 months ago |  | 
  | Cantor: Inspiring Multimodal Chain-of-Thought of MLLM |  |  |  | 
  | Github | 73 | over 1 year ago |  | 
  | Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models |  |  |  | 
  | Github | 162 | 11 months ago |  | 
  | Compositional Chain-of-Thought Prompting for Large Multimodal Models |  |  |  | 
  | Github | 90 | over 1 year ago |  | 
  | DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models |  |  |  | 
  | Github | 35 | over 1 year ago |  | 
  | Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic |  |  |  | 
  | Github | 748 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Explainable Multimodal Emotion Reasoning |  |  |  | 
  | Github | 123 | over 1 year ago |  | 
  | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought |  |  |  | 
  | Github | 346 | over 1 year ago |  | 
  | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction |  |  |  | 
  | T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering |  |  |  | 
  | Caption Anything: Interactive Image Description with Diverse Multimodal Controls |  |  |  | 
  | Github | 1,693 | about 2 years ago |  | 
  | Demo |  |  |  | 
  | Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings |  |  |  | 
  | Coming soon |  |  |  | 
  | Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models |  |  |  | 
  | Github | 1,095 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | Chain of Thought Prompt Tuning in Vision Language Models |  |  |  | 
  | Coming soon |  |  |  | 
  | MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action |  |  |  | 
  | Github | 940 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models |  |  |  | 
  | Github | 34,555 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | Multimodal Chain-of-Thought Reasoning in Language Models |  |  |  | 
  | Github | 3,833 | over 1 year ago |  | 
  | Visual Programming: Compositional visual reasoning without training |  |  |  | 
  | Github | 697 | about 1 year ago |  | 
  | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering |  |  |  | 
  | Github | 615 | about 1 year ago |  | 
  | Awesome Papers / LLM-Aided Visual Reasoning | 
 | Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models |  |  |  | 
  | Github | 14 | about 1 year ago |  | 
  | V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs |  |  |  | 
  | Github | 541 | almost 2 years ago |  | 
  | LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing |  |  |  | 
  | Github | 353 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | MM-VID: Advancing Video Understanding with GPT-4V(vision) |  |  |  | 
  | ControlLLM: Augment Language Models with Tools by Searching on Graphs |  |  |  | 
  | Github | 187 | over 1 year ago |  | 
  | Woodpecker: Hallucination Correction for Multimodal Large Language Models |  |  |  | 
  | Github | 617 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | MindAgent: Emergent Gaming Interaction |  |  |  | 
  | Github | 79 | over 1 year ago |  | 
  | Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language |  |  |  | 
  | Github | 352 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models |  |  |  | 
  | AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn |  |  |  | 
  | Github | 66 | over 2 years ago |  | 
  | AVIS: Autonomous Visual Information Seeking with Large Language Models |  |  |  | 
  | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction |  |  |  | 
  | Github | 762 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | Mindstorms in Natural Language-Based Societies of Mind |  |  |  | 
  | LayoutGPT: Compositional Visual Planning and Generation with Large Language Models |  |  |  | 
  | Github | 306 | over 1 year ago |  | 
  | IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models |  |  |  | 
  | Github | 32 | about 2 years ago |  | 
  | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation |  |  |  | 
  | Github | 7 | over 2 years ago |  | 
  | Caption Anything: Interactive Image Description with Diverse Multimodal Controls |  |  |  | 
  | Github | 1,693 | about 2 years ago |  | 
  | Demo |  |  |  | 
  | Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models |  |  |  | 
  | Github | 1,095 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace |  |  |  | 
  | Github | 23,801 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action |  |  |  | 
  | Github | 940 | over 1 year ago |  | 
  | Demo |  |  |  | 
  | ViperGPT: Visual Inference via Python Execution for Reasoning |  |  |  | 
  | Github | 1,666 | almost 2 years ago |  | 
  | ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions |  |  |  | 
  | Github | 457 | over 2 years ago |  | 
  | ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction |  |  |  | 
  | Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models |  |  |  | 
  | Github | 34,555 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners |  |  |  | 
  | Github | 41 | over 2 years ago |  | 
  | From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models |  |  |  | 
  | Github | 10,058 | 12 months ago |  | 
  | Demo |  |  |  | 
  | SuS-X: Training-Free Name-Only Transfer of Vision-Language Models |  |  |  | 
  | Github | 94 | about 2 years ago |  | 
  | PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning |  |  |  | 
  | Github | 235 | about 2 years ago |  | 
  | Visual Programming: Compositional visual reasoning without training |  |  |  | 
  | Github | 697 | about 1 year ago |  | 
  | Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language |  |  |  | 
  | Github | 34,478 | 11 months ago |  | 
  | Awesome Papers / Foundation Models | 
 | Emu3: Next-Token Prediction is All You Need |  |  |  | 
  | Github | 1,911 | about 1 year ago |  | 
  | Llama 3.2: Revolutionizing edge AI and vision with open, customizable models |  |  |  | 
  | Demo |  |  |  | 
  | Pixtral-12B |  |  |  | 
  | xGen-MM (BLIP-3): A Family of Open Large Multimodal Models |  |  |  | 
  | Github | 10,058 | 12 months ago |  | 
  | The Llama 3 Herd of Models |  |  |  | 
  | Chameleon: Mixed-Modal Early-Fusion Foundation Models |  |  |  | 
  | Hello GPT-4o |  |  |  | 
  | The Claude 3 Model Family: Opus, Sonnet, Haiku |  |  |  | 
  | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |  |  |  | 
  | Gemini: A Family of Highly Capable Multimodal Models |  |  |  | 
  | Fuyu-8B: A Multimodal Architecture for AI Agents |  |  |  | 
  | Huggingface |  |  |  | 
  | Demo |  |  |  | 
  | Unified Model for Image, Video, Audio and Language Tasks |  |  |  | 
  | Github | 224 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | PaLI-3 Vision Language Models: Smaller, Faster, Stronger |  |  |  | 
  | GPT-4V(ision) System Card |  |  |  | 
  | Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization |  |  |  | 
  | Github | 544 | about 1 year ago |  | 
  | Multimodal Foundation Models: From Specialists to General-Purpose Assistants |  |  |  | 
  | Bootstrapping Vision-Language Learning with Decoupled Language Pre-training |  |  |  | 
  | Github | 24 | almost 2 years ago |  | 
  | Generative Pretraining in Multimodality |  |  |  | 
  | Github | 1,672 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Kosmos-2: Grounding Multimodal Large Language Models to the World |  |  |  | 
  | Github | 20,400 | 11 months ago |  | 
  | Demo |  |  |  | 
  | Transfer Visual Prompt Generator across LLMs |  |  |  | 
  | Github | 270 | about 2 years ago |  | 
  | Demo |  |  |  | 
  | GPT-4 Technical Report |  |  |  | 
  | PaLM-E: An Embodied Multimodal Language Model |  |  |  | 
  | Demo |  |  |  | 
  | Prismer: A Vision-Language Model with An Ensemble of Experts |  |  |  | 
  | Github | 1,299 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | Language Is Not All You Need: Aligning Perception with Language Models |  |  |  | 
  | Github | 20,400 | 11 months ago |  | 
  | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |  |  |  | 
  | Github | 10,058 | 12 months ago |  | 
  | Demo |  |  |  | 
  | VIMA: General Robot Manipulation with Multimodal Prompts |  |  |  | 
  | Github | 781 | over 1 year ago |  | 
  | MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge |  |  |  | 
  | Github | 1,843 | over 1 year ago |  | 
  | Write and Paint: Generative Vision-Language Models are Unified Modal Learners |  |  |  | 
  | Github | 43 | over 2 years ago |  | 
  | Language Models are General-Purpose Interfaces |  |  |  | 
  | Github | 20,400 | 11 months ago |  | 
  | Awesome Papers / Evaluation | 
 | MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective |  |  |  | 
  | Github | 106 | 11 months ago |  | 
  | OmniBench: Towards The Future of Universal Omni-Language Models |  |  |  | 
  | Github | 15 | 12 months ago |  | 
  | MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? |  |  |  | 
  | Github | 86 | 11 months ago |  | 
  | UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models |  |  |  | 
  | Github | 3 | about 1 year ago |  | 
  | MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation |  |  |  | 
  | Github | 22 | about 1 year ago |  | 
  | Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs |  |  |  | 
  | Github | 67 | about 1 year ago |  | 
  | CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs |  |  |  | 
  | Github | 85 | about 1 year ago |  | 
  | ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation |  |  |  | 
  | Github | 95 | over 1 year ago |  | 
  | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis |  |  |  | 
  | Github | 422 | 11 months ago |  | 
  | Benchmarking Large Multimodal Models against Common Corruptions |  |  |  | 
  | Github | 27 | almost 2 years ago |  | 
  | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs |  |  |  | 
  | Github | 296 | almost 2 years ago |  | 
  | A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise |  |  |  | 
  | Github | 13,117 | 11 months ago |  | 
  | BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models |  |  |  | 
  | Github | 84 | about 1 year ago |  | 
  | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs |  |  |  | 
  | Github | 72 | almost 2 years ago |  | 
  | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs |  |  |  | 
  | Github | 24 | about 1 year ago |  | 
  | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V |  |  |  | 
  | Github | 56 | about 1 year ago |  | 
  | VLM-Eval: A General Evaluation on Video Large Language Models |  |  |  | 
  | Coming soon |  |  |  | 
  | Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges |  |  |  | 
  | Github | 53 | over 1 year ago |  | 
  | On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving |  |  |  | 
  | Github | 288 | over 1 year ago |  | 
  | Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead |  |  |  | 
  | A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging |  |  |  | 
  | An Early Evaluation of GPT-4V(ision) |  |  |  | 
  | Github | 11 | about 2 years ago |  | 
  | Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation |  |  |  | 
  | Github | 121 | almost 2 years ago |  | 
  | HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models |  |  |  | 
  | Github | 259 | 12 months ago |  | 
  | MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models |  |  |  | 
  | Github | 253 | 11 months ago |  | 
  | Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations |  |  |  | 
  | Github | 14 | about 2 years ago |  | 
  | Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning |  |  |  | 
  | Github | 21 | over 1 year ago |  | 
  | Can We Edit Multimodal Large Language Models? |  |  |  | 
  | Github | 1,981 | 11 months ago |  | 
  | REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets |  |  |  | 
  | Github | 11 | about 2 years ago |  | 
  | The Dawn of LMMs: Preliminary Explorations with GPT-4V(vision) |  |  |  | 
  | TouchStone: Evaluating Vision-Language Models by Language Models |  |  |  | 
  | Github | 79 | almost 2 years ago |  | 
  | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models |  |  |  | 
  | Github | 43 | over 1 year ago |  | 
  | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs |  |  |  | 
  | Github | 38 | about 1 year ago |  | 
  | Tiny LVLM-eHub: Early Multimodal Experiments with Bard |  |  |  | 
  | Github | 478 | over 1 year ago |  | 
  | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities |  |  |  | 
  | Github | 274 | 12 months ago |  | 
  | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension |  |  |  | 
  | Github | 322 | over 1 year ago |  | 
  | MMBench: Is Your Multi-modal Model an All-around Player? |  |  |  | 
  | Github | 168 | about 1 year ago |  | 
  | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models |  |  |  | 
  | Github | 13,117 | 11 months ago |  | 
  | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models |  |  |  | 
  | Github | 478 | over 1 year ago |  | 
  | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark |  |  |  | 
  | Github | 305 | over 1 year ago |  | 
  | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models |  |  |  | 
  | Github | 93 | over 2 years ago |  | 
  | On The Hidden Mystery of OCR in Large Multimodal Models |  |  |  | 
  | Github | 484 | about 1 year ago |  | 
  | Awesome Papers / Multimodal RLHF | 
 | Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization |  |  |  | 
  | Silkie: Preference Distillation for Large Visual Language Models |  |  |  | 
  | Github | 88 | almost 2 years ago |  | 
  | RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback |  |  |  | 
  | Github | 245 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Aligning Large Multimodal Models with Factually Augmented RLHF |  |  |  | 
  | Github | 328 | almost 2 years ago |  | 
  | Demo |  |  |  | 
  | RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data |  |  |  | 
  | Github | 2 | about 1 year ago |  | 
  | Awesome Papers / Others | 
 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models |  |  |  | 
  | Github | 7 | 11 months ago |  | 
  | Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models |  |  |  | 
  | Github | 47 | about 1 year ago |  | 
  | VCoder: Versatile Vision Encoders for Multimodal Large Language Models |  |  |  | 
  | Github | 266 | over 1 year ago |  | 
  | Prompt Highlighter: Interactive Control for Multi-Modal LLMs |  |  |  | 
  | Github | 135 | over 1 year ago |  | 
  | Planting a SEED of Vision in Large Language Model |  |  |  | 
  | Github | 585 | about 1 year ago |  | 
  | Can Large Pre-trained Models Help Vision Models on Perception Tasks? |  |  |  | 
  | Github | 1,218 | 12 months ago |  | 
  | Contextual Object Detection with Multimodal Large Language Models |  |  |  | 
  | Github | 208 | about 1 year ago |  | 
  | Demo |  |  |  | 
  | Generating Images with Multimodal Language Models |  |  |  | 
  | Github | 440 | almost 2 years ago |  | 
  | On Evaluating Adversarial Robustness of Large Vision-Language Models |  |  |  | 
  | Github | 165 | about 2 years ago |  | 
  | Grounding Language Models to Images for Multimodal Inputs and Outputs |  |  |  | 
  | Github | 478 | about 2 years ago |  | 
  | Demo |  |  |  | 
  | Awesome Datasets / Datasets of Pre-Training for Alignment | 
 | ShareGPT4Video: Improving Video Understanding and Generation with Better Captions |  |  |  | 
  | COYO-700M: Image-Text Pair Dataset | 1,172 | almost 3 years ago |  | 
  | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions |  |  |  | 
  | The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World |  |  |  | 
  | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation |  |  |  | 
  | Microsoft COCO: Common Objects in Context |  |  |  | 
  | Im2Text: Describing Images Using 1 Million Captioned Photographs |  |  |  | 
  | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning |  |  |  | 
  | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs |  |  |  | 
  | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations |  |  |  | 
  | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models |  |  |  | 
  | AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding |  |  |  | 
  | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark |  |  |  | 
  | Kosmos-2: Grounding Multimodal Large Language Models to the World |  |  |  | 
  | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks |  |  |  | 
  | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language |  |  |  | 
  | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |  |  |  | 
  | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research |  |  |  | 
  | AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline |  |  |  | 
  | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale |  |  |  | 
  | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages |  |  |  | 
  | Awesome Datasets / Datasets of Multimodal Instruction Tuning | 
 | E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding |  |  |  | 
  | Link | 42 | 12 months ago |  | 
  | Multi-modal Situated Reasoning in 3D Scenes |  |  |  | 
  | Link |  |  |  | 
  | MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct |  |  |  | 
  | Link |  |  |  | 
  | UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models |  |  |  | 
  | Link | 3 | about 1 year ago |  | 
  | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models |  |  |  | 
  | Link | 33 | over 1 year ago |  | 
  | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model |  |  |  | 
  | Link |  |  |  | 
  | Visually Dehallucinative Instruction Generation: Know What You Don't Know |  |  |  | 
  | Link | 6 | over 1 year ago |  | 
  | Visually Dehallucinative Instruction Generation |  |  |  | 
  | Link | 5 | over 1 year ago |  | 
  | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts |  |  |  | 
  | Link | 58 | about 1 year ago |  | 
  | Making Large Multimodal Models Understand Arbitrary Visual Prompts |  |  |  | 
  | Link |  |  |  | 
  | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning |  |  |  | 
  | Link |  |  |  | 
  | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning |  |  |  | 
  | Link | 18 | almost 2 years ago |  | 
  | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models |  |  |  | 
  | Link | 43 | over 1 year ago |  | 
  | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data |  |  |  | 
  | Link | 93 | almost 2 years ago |  | 
  | Detecting and Preventing Hallucinations in Large Vision Language Models |  |  |  | 
  | Coming soon |  |  |  | 
  | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning |  |  |  | 
  | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs |  |  |  | 
  | Link |  |  |  | 
  | SVIT: Scaling up Visual Instruction Tuning |  |  |  | 
  | Link |  |  |  | 
  | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding |  |  |  | 
  | Link | 1,958 | about 1 year ago |  | 
  | Visual Instruction Tuning with Polite Flamingo |  |  |  | 
  | Link |  |  |  | 
  | ChartLlama: A Multimodal LLM for Chart Understanding and Generation |  |  |  | 
  | Link |  |  |  | 
  | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding |  |  |  | 
  | Link |  |  |  | 
  | MotionGPT: Human Motion as a Foreign Language |  |  |  | 
  | Link | 1,531 | over 1 year ago |  | 
  | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning |  |  |  | 
  | Link | 262 | over 1 year ago |  | 
  | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration |  |  |  | 
  | Link | 1,568 | over 1 year ago |  | 
  | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark |  |  |  | 
  | Link | 305 | over 1 year ago |  | 
  | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models |  |  |  | 
  | Link | 1,246 | about 1 year ago |  | 
  | MIMIC-IT: Multi-Modal In-Context Instruction Tuning |  |  |  | 
  | Link | 3,570 | over 1 year ago |  | 
  | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning |  |  |  | 
  | Link |  |  |  | 
  | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day |  |  |  | 
  | Coming soon | 1,622 | about 1 year ago |  | 
  | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction |  |  |  | 
  | Link | 762 | almost 2 years ago |  | 
  | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst |  |  |  | 
  | Coming soon |  |  |  | 
  | DetGPT: Detect What You Need via Reasoning |  |  |  | 
  | Link | 761 | about 1 year ago |  | 
  | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering |  |  |  | 
  | Coming soon |  |  |  | 
  | VideoChat: Chat-Centric Video Understanding |  |  |  | 
  | Link | 1,467 | 11 months ago |  | 
  | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages |  |  |  | 
  | Link | 308 | about 2 years ago |  | 
  | LMEye: An Interactive Perception Network for Large Language Models |  |  |  | 
  | Link |  |  |  | 
  | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models |  |  |  | 
  | Link |  |  |  | 
  | Visual Instruction Tuning |  |  |  | 
  | Link |  |  |  | 
  | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning |  |  |  | 
  | Link | 134 | over 2 years ago |  | 
  | Awesome Datasets / Datasets of In-Context Learning | 
 | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning |  |  |  | 
  | Link |  |  |  | 
  | MIMIC-IT: Multi-Modal In-Context Instruction Tuning |  |  |  | 
  | Link | 3,570 | over 1 year ago |  | 
  | Awesome Datasets / Datasets of Multimodal Chain-of-Thought | 
 | Explainable Multimodal Emotion Reasoning |  |  |  | 
  | Coming soon | 123 | over 1 year ago |  | 
  | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought |  |  |  | 
  | Coming soon | 346 | over 1 year ago |  | 
  | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction |  |  |  | 
  | Coming soon |  |  |  | 
  | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering |  |  |  | 
  | Link | 615 | about 1 year ago |  | 
  | Awesome Datasets / Datasets of Multimodal RLHF | 
 | Silkie: Preference Distillation for Large Visual Language Models |  |  |  | 
  | Link |  |  |  | 
  | Awesome Datasets / Benchmarks for Evaluation | 
 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought |  |  |  | 
  | Link | 47 | over 1 year ago |  | 
  | MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective |  |  |  | 
  | Link | 106 | 11 months ago |  | 
  | MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps |  |  |  | 
  | Link | 3 | 12 months ago |  | 
  | LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content |  |  |  | 
  | Link |  |  |  | 
  | TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models |  |  |  | 
  | Link |  |  |  | 
  | OmniBench: Towards The Future of Universal Omni-Language Models |  |  |  | 
  | Link |  |  |  | 
  | MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? |  |  |  | 
  | Link |  |  |  | 
  | VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? |  |  |  | 
  | Link | 5 | about 1 year ago |  | 
  | Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions |  |  |  | 
  | Link | 43 | about 1 year ago |  | 
  | CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs |  |  |  | 
  | Link |  |  |  | 
  | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis |  |  |  | 
  | Link | 422 | 11 months ago |  | 
  | VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning |  |  |  | 
  | Link | 31 | over 1 year ago |  | 
  | TempCompass: Do Video LLMs Really Understand Videos? |  |  |  | 
  | Link | 91 | 12 months ago |  | 
  | GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning |  |  |  | 
  | Link |  |  |  | 
  | Can MLLMs Perform Text-to-Image In-Context Learning? |  |  |  | 
  | Link |  |  |  | 
  | Visually Dehallucinative Instruction Generation: Know What You Don't Know |  |  |  | 
  | Link | 6 | over 1 year ago |  | 
  | Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset |  |  |  | 
  | Link | 74 | about 1 year ago |  | 
  | SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval |  |  |  | 
  | Link | 22 | about 1 year ago |  | 
  | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark |  |  |  | 
  | Link | 46 | about 1 year ago |  | 
  | Benchmarking Large Multimodal Models against Common Corruptions |  |  |  | 
  | Link | 27 | almost 2 years ago |  | 
  | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs |  |  |  | 
  | Link | 296 | almost 2 years ago |  | 
  | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding |  |  |  | 
  | Link |  |  |  | 
  | Making Large Multimodal Models Understand Arbitrary Visual Prompts |  |  |  | 
  | Link |  |  |  | 
  | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts |  |  |  | 
  | Link | 58 | about 1 year ago |  | 
  | Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models |  |  |  | 
  | Link | 121 | almost 2 years ago |  | 
  | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs |  |  |  | 
  | Link | 24 | about 1 year ago |  | 
  | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V |  |  |  | 
  | Link | 56 | about 1 year ago |  | 
  | BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models |  |  |  | 
  | Link |  |  |  | 
  | MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning |  |  |  | 
  | Link | 87 | about 1 year ago |  | 
  | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |  |  |  | 
  | Link | 3,106 | 11 months ago |  | 
  | Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges |  |  |  | 
  | Link | 53 | over 1 year ago |  | 
  | OtterHD: A High-Resolution Multi-modality Model |  |  |  | 
  | Link |  |  |  | 
  | HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models |  |  |  | 
  | Link | 259 | 12 months ago |  | 
  | Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond |  |  |  | 
  | Link | 99 | over 1 year ago |  | 
  | Aligning Large Multimodal Models with Factually Augmented RLHF |  |  |  | 
  | Link |  |  |  | 
  | MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models |  |  |  | 
  | Link |  |  |  | 
  | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models |  |  |  | 
  | Link | 43 | over 1 year ago |  | 
  | Link-Context Learning for Multimodal LLMs |  |  |  | 
  | Link |  |  |  | 
  | Detecting and Preventing Hallucinations in Large Vision Language Models |  |  |  | 
  | Coming soon |  |  |  | 
  | Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions |  |  |  | 
  | Link | 360 | over 1 year ago |  | 
  | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs |  |  |  | 
  | Link | 38 | about 1 year ago |  | 
  | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities |  |  |  | 
  | Link | 274 | 12 months ago |  | 
  | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension |  |  |  | 
  | Link | 322 | over 1 year ago |  | 
  | MMBench: Is Your Multi-modal Model an All-around Player? |  |  |  | 
  | Link | 168 | about 1 year ago |  | 
  | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? |  |  |  | 
  | Link | 231 | about 2 years ago |  | 
  | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning |  |  |  | 
  | Link | 262 | over 1 year ago |  | 
  | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models |  |  |  | 
  | Link | 13,117 | 11 months ago |  | 
  | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models |  |  |  | 
  | Link | 478 | over 1 year ago |  | 
  | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark |  |  |  | 
  | Link | 305 | over 1 year ago |  | 
  | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models |  |  |  | 
  | Link | 93 | over 2 years ago |  | 
  | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality |  |  |  | 
  | Link | 2,365 | 11 months ago |  | 
  | Awesome Datasets / Others | 
 | IMAD: IMage-Augmented multi-modal Dialogue |  |  |  | 
  | Link | 4 | over 2 years ago |  | 
  | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models |  |  |  | 
  | Link | 1,246 | about 1 year ago |  | 
  | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation |  |  |  | 
  | Link |  |  |  | 
  | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? |  |  |  | 
  | Link |  |  |  | 
  | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities |  |  |  | 
  | Link |  |  |  |