Awesome Papers / Multimodal Instruction Tuning
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | | | |
| Github | 396 | 10 months ago | |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | | | |
| Github | | | |
| Demo | | | |
| InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | | | |
| Github | 2,616 | 11 months ago | |
| StreamChat: Chatting with Streaming Video | | | |
| CompCap: Improving Multimodal Large Language Models with Composite Captions | | | |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | | | |
| Github | 13 | 10 months ago | |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | | | |
| Github | 6,394 | 11 months ago | |
| Demo | | | |
| NVILA: Efficient Frontier Visual Language Models | | | |
| Github | 2,146 | 11 months ago | |
| Demo | | | |
| T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | | | |
| Github | 44 | 10 months ago | |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | | | |
| Github | 67 | 11 months ago | |
| ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | | | |
| Github | 106 | 11 months ago | |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | | | |
| Github | 329 | 12 months ago | |
| Demo | | | |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | | | |
| Github | 89 | 11 months ago | |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | | | |
| Github | 57 | 11 months ago | |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | | | |
| Huggingface | | | |
| Demo | | | |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | | | |
| Github | 3,613 | 11 months ago | |
| Demo | | | |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | | | |
| Github | 183 | about 1 year ago | |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | | | |
| Github | 549 | about 1 year ago | |
| Demo | | | |
| LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | | | |
| Github | 69 | about 1 year ago | |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | | | |
| Github | 2,365 | 11 months ago | |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | | | |
| Github | 1,005 | about 1 year ago | |
| LLaVA-OneVision: Easy Visual Task Transfer | | | |
| Github | 3,099 | about 1 year ago | |
| Demo | | | |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | | | |
| Github | 12,870 | about 1 year ago | |
| Demo | | | |
| VILA^2: VILA Augmented VILA | | | |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | | | |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | | | |
| IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | | | |
| Github | 26 | 11 months ago | |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | | | |
| Github | 2,616 | 11 months ago | |
| Demo | | | |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | | | |
| Github | 1,336 | 11 months ago | |
| DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | | | |
| Github | 9 | 11 months ago | |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | | | |
| Github | 1,799 | 12 months ago | |
| Long Context Transfer from Language to Vision | | | |
| Github | 347 | 11 months ago | |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | | | |
| Github | 1,091 | 11 months ago | |
| TroL: Traversal of Layers for Large Language and Vision Models | | | |
| Github | 88 | over 1 year ago | |
| Unveiling Encoder-Free Vision-Language Models | | | |
| Github | 246 | about 1 year ago | |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | | | |
| Github | 251 | about 1 year ago | |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | | | |
| Github | 64 | about 1 year ago | |
| Demo | | | |
| Comparison Visual Instruction Tuning | | | |
| Github | | | |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | | | |
| Github | 143 | 12 months ago | |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | | | |
| Github | 957 | 11 months ago | |
| Parrot: Multilingual Visual Instruction Tuning | | | |
| Github | 34 | about 1 year ago | |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | | | |
| Github | 575 | 11 months ago | |
| Matryoshka Query Transformer for Large Vision-Language Models | | | |
| Github | 101 | over 1 year ago | |
| Demo | | | |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | | | |
| Github | 106 | about 1 year ago | |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | | | |
| Github | 102 | over 1 year ago | |
| Demo | | | |
| Libra: Building Decoupled Vision System on Large Language Models | | | |
| Github | 153 | 11 months ago | |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | | | |
| Github | 136 | over 1 year ago | |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | | |
| Github | 6,394 | 11 months ago | |
| Demo | | | |
| Graphic Design with Large Multimodal Model | | | |
| Github | 102 | over 1 year ago | |
| BRAVE: Broadening the visual encoding of vision-language models | | | |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | | | |
| Github | 2,616 | 11 months ago | |
| Demo | | | |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | | | |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | | | |
| Github | 254 | over 1 year ago | |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | | | |
| Github | 406 | about 1 year ago | |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | | | |
| LITA: Language Instructed Temporal-Localization Assistant | | | |
| Github | 151 | about 1 year ago | |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | | | |
| Github | 3,229 | over 1 year ago | |
| Demo | | | |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | | |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | | | |
| Github | 314 | over 1 year ago | |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | | | |
| Github | 2,145 | over 1 year ago | |
| Demo | | | |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | | | |
| Github | 1,849 | 11 months ago | |
| Demo | | | |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | | | |
| Github | 466 | about 1 year ago | |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | | |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | | | |
| Github | 798 | about 1 year ago | |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | | | |
| Github | 58 | 11 months ago | |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
| Github | 249 | over 1 year ago | |
| Demo | | | |
| CoLLaVO: Crayon Large Language and Vision mOdel | | | |
| Github | 93 | over 1 year ago | |
| Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | | | |
| Github | 494 | over 1 year ago | |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | | | |
| Github | 153 | over 1 year ago | |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | | | |
| Github | 1,076 | over 1 year ago | |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | | | |
| Github | 43 | 11 months ago | |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | | | |
| Coming soon | | | |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | | | |
| Github | 20,683 | about 1 year ago | |
| Demo | | | |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | | | |
| Github | 2,023 | 11 months ago | |
| Demo | | | |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | | | |
| Github | 2,616 | 11 months ago | |
| Demo | | | |
| Yi-VL | | | |
| Github | 7,743 | 11 months ago | |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | | | |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | | | |
| Github | 108 | about 1 year ago | |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | | | |
| Github | 1,076 | over 1 year ago | |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | | | |
| Github | 6,394 | 11 months ago | |
| Demo | | | |
| Osprey: Pixel Understanding with Visual Instruction Tuning | | | |
| Github | 781 | about 1 year ago | |
| Demo | | | |
| CogAgent: A Visual Language Model for GUI Agents | | | |
| Github | 6,182 | over 1 year ago | |
| Coming soon | | | |
| Pixel Aligned Language Models | | | |
| Coming soon | | | |
| VILA: On Pre-training for Visual Language Models | | | |
| Github | 2,146 | 11 months ago | |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | | | |
| Coming soon | | | |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | | | |
| Github | 1,831 | 11 months ago | |
| Demo | | | |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | | | |
| Github | 435 | over 1 year ago | |
| Gemini: A Family of Highly Capable Multimodal Models | | | |
| OneLLM: One Framework to Align All Modalities with Language | | | |
| Github | 601 | about 1 year ago | |
| Demo | | | |
| Lenna: Language Enhanced Reasoning Detection Assistant | | | |
| Github | 78 | over 1 year ago | |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | | | |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
| Github | 314 | 11 months ago | |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
| Github | 302 | over 1 year ago | |
| Demo | | | |
| Dolphins: Multimodal Language Model for Driving | | | |
| Github | 51 | over 1 year ago | |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | | | |
| Github | 255 | over 1 year ago | |
| Coming soon | | | |
| VTimeLLM: Empower LLM to Grasp Video Moments | | | |
| Github | 231 | over 1 year ago | |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | | | |
| Github | 1,958 | about 1 year ago | |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | | | |
| Github | 748 | about 1 year ago | |
| Coming soon | | | |
| LLMGA: Multimodal Large Language Model based Generation Assistant | | | |
| Github | 463 | about 1 year ago | |
| Demo | | | |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
| Github | 202 | almost 2 years ago | |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
| Github | 2,616 | 11 months ago | |
| Demo | | | |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | | | |
| Github | 124 | over 1 year ago | |
| An Embodied Generalist Agent in 3D World | | | |
| Github | 379 | about 1 year ago | |
| Demo | | | |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | | | |
| Github | 3,071 | 11 months ago | |
| Demo | | | |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | | | |
| Github | 895 | about 1 year ago | |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
| Github | 131 | almost 2 years ago | |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | | | |
| Github | 2,732 | over 1 year ago | |
| Demo | | | |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | | | |
| Github | 1,849 | 11 months ago | |
| Demo | | | |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | | | |
| Github | 717 | over 1 year ago | |
| Demo | | | |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | | | |
| Github | 227 | over 1 year ago | |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | | | |
| Github | 2,365 | 11 months ago | |
| Demo | | | |
| OtterHD: A High-Resolution Multi-modality Model | | | |
| Github | 3,570 | over 1 year ago | |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | | | |
| Coming soon | | | |
| GLaMM: Pixel Grounding Large Multimodal Model | | | |
| Github | 797 | 11 months ago | |
| Demo | | | |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
| Github | 18 | almost 2 years ago | |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | | | |
| Github | 25,490 | about 1 year ago | |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | | | |
| Github | 1,091 | 11 months ago | |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | | | |
| Github | 8,509 | about 1 year ago | |
| CogVLM: Visual Expert For Large Language Models | | | |
| Github | 6,182 | over 1 year ago | |
| Demo | | | |
| Improved Baselines with Visual Instruction Tuning | | | |
| Github | 20,683 | about 1 year ago | |
| Demo | | | |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | | | |
| Github | 751 | over 1 year ago | |
| Demo | | | |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | | | |
| Github | 79 | over 1 year ago | |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | | | |
| Github | 59 | over 1 year ago | |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | | | |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | | | |
| Github | 2,616 | 11 months ago | |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | | | |
| Github | 402 | 11 months ago | |
| Coming soon | | | |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | | | |
| Coming soon | | | |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | | | |
| Github | 47 | about 2 years ago | |
| Demo | | | |
| NExT-GPT: Any-to-Any Multimodal LLM | | | |
| Github | 3,344 | 12 months ago | |
| Demo | | | |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | | | |
| Github | 19 | about 2 years ago | |
| ImageBind-LLM: Multi-modality Instruction Tuning | | | |
| Github | 5,775 | over 1 year ago | |
| Demo | | | |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | | | |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | | | |
| Github | 670 | 12 months ago | |
| Demo | | | |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
| Github | 43 | over 1 year ago | |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | | | |
| Github | 39 | over 1 year ago | |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | | | |
| Github | 37 | about 2 years ago | |
| Demo | | | |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | | | |
| Github | 5,179 | about 1 year ago | |
| Demo | | | |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | | | |
| Github | 1,098 | over 1 year ago | |
| Demo | | | |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
| Github | 93 | almost 2 years ago | |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | | | |
| Github | 270 | over 1 year ago | |
| Demo | | | |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | | | |
| Github | 360 | over 1 year ago | |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
| Github | 466 | about 1 year ago | |
| Demo | | | |
| LISA: Reasoning Segmentation via Large Language Model | | | |
| Github | 1,923 | over 1 year ago | |
| Demo | | | |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | | | |
| Github | 550 | 11 months ago | |
| 3D-LLM: Injecting the 3D World into Large Language Models | | | |
| Github | 979 | over 1 year ago | |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
| Demo | | | |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
| Github | 505 | over 2 years ago | |
| Demo | | | |
| SVIT: Scaling up Visual Instruction Tuning | | | |
| Github | 164 | over 1 year ago | |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | | | |
| Github | 517 | over 1 year ago | |
| Demo | | | |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
| Github | 231 | about 2 years ago | |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
| Github | 1,958 | about 1 year ago | |
| Demo | | | |
| Visual Instruction Tuning with Polite Flamingo | | | |
| Github | 63 | almost 2 years ago | |
| Demo | | | |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
| Github | 259 | over 1 year ago | |
| Demo | | | |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
| Github | 748 | over 1 year ago | |
| Demo | | | |
| MotionGPT: Human Motion as a Foreign Language | | | |
| Github | 1,531 | over 1 year ago | |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
| Github | 1,568 | over 1 year ago | |
| Coming soon | | | |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
| Github | 305 | over 1 year ago | |
| Demo | | | |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
| Github | 1,246 | about 1 year ago | |
| Demo | | | |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
| Github | 3,570 | over 1 year ago | |
| Demo | | | |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | | | |
| Github | 2,842 | over 1 year ago | |
| Demo | | | |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
| Github | 1,622 | about 1 year ago | |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
| Github | 762 | almost 2 years ago | |
| Demo | | | |
| PandaGPT: One Model To Instruction-Follow Them All | | | |
| Github | 772 | over 2 years ago | |
| Demo | | | |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
| Github | 49 | about 2 years ago | |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | | | |
| Github | 513 | over 1 year ago | |
| DetGPT: Detect What You Need via Reasoning | | | |
| Github | 761 | about 1 year ago | |
| Demo | | | |
| Pengi: An Audio Language Model for Audio Tasks | | | |
| Github | 295 | over 1 year ago | |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | | | |
| Github | 956 | about 1 year ago | |
| Listen, Think, and Understand | | | |
| Github | 396 | over 1 year ago | |
| Demo | | | |
| Github | 4,110 | about 1 year ago | |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
| Github | 180 | 11 months ago | |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | | | |
| Github | 10,058 | 11 months ago | |
| VideoChat: Chat-Centric Video Understanding | | | |
| Github | 3,106 | 11 months ago | |
| Demo | | | |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | | | |
| Github | 1,478 | over 2 years ago | |
| Demo | | | |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
| Github | 308 | about 2 years ago | |
| LMEye: An Interactive Perception Network for Large Language Models | | | |
| Github | 48 | over 1 year ago | |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | | | |
| Github | 5,775 | over 1 year ago | |
| Demo | | | |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
| Github | 2,365 | 11 months ago | |
| Demo | | | |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
| Github | 25,490 | about 1 year ago | |
| Visual Instruction Tuning | | | |
| Github | 20,683 | about 1 year ago | |
| Demo | | | |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | | | |
| Github | 5,775 | over 1 year ago | |
| Demo | | | |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
| Github | 134 | over 2 years ago | |
Awesome Papers / Multimodal Hallucination
| Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | | | |
| Github | 28 | 11 months ago | |
| Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | | | |
| Github | 46 | 11 months ago | |
| FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | | | |
| Link | | | |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | | | |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | | | |
| Github | 83 | 12 months ago | |
| Evaluating and Analyzing Relationship Hallucinations in LVLMs | | | |
| Github | 20 | about 1 year ago | |
| AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | | | |
| Github | 18 | over 1 year ago | |
| CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | | | |
| Coming soon | | | |
| Mitigating Object Hallucination via Data Augmented Contrastive Tuning | | | |
| Coming soon | | | |
| VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | | | |
| Coming soon | | | |
| Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | | | |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | | | |
| What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | | | |
| Github | 15 | about 1 year ago | |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | | |
| Debiasing Multimodal Large Language Models | | | |
| Github | 75 | over 1 year ago | |
| HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | | | |
| Github | 72 | 11 months ago | |
| IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | | | |
| Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | | | |
| Github | 39 | 12 months ago | |
| Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | | | |
| Github | 19 | over 1 year ago | |
| The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | | | |
| Github | 8 | over 1 year ago | |
| Unified Hallucination Detection for Multimodal Large Language Models | | | |
| Github | 48 | over 1 year ago | |
| A Survey on Hallucination in Large Vision-Language Models | | | |
| Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | | | |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | | | |
| Github | 82 | over 1 year ago | |
| MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | | | |
| Github | 13 | about 1 year ago | |
| Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | | | |
| Github | 8 | over 1 year ago | |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
| Github | 245 | about 1 year ago | |
| Demo | | | |
| OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | | | |
| Github | 293 | about 1 year ago | |
| Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | | | |
| Github | 222 | about 1 year ago | |
| Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | | | |
| Github | 73 | over 1 year ago | |
| Coming soon | | | |
| Mitigating Hallucination in Visual Language Models with Visual Supervision | | | |
| HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | | | |
| Github | 41 | over 1 year ago | |
| An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | | | |
| Github | 98 | almost 2 years ago | |
| FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | | | |
| Github | 27 | 12 months ago | |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
| Github | 617 | over 1 year ago | |
| Demo | | | |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | | | |
| HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | | | |
| Github | 28 | over 1 year ago | |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | | | |
| Github | 136 | over 1 year ago | |
| Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
| Github | 328 | almost 2 years ago | |
| Demo | | | |
| Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | | | |
| CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | | | |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | | | |
| Github | 17 | about 2 years ago | |
| VIGC: Visual Instruction Generation and Correction | | | |
| Github | 91 | over 1 year ago | |
| Demo | | | |
| Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
| Github | 262 | over 1 year ago | |
| Demo | | | |
| Evaluating Object Hallucination in Large Vision-Language Models | | | |
| Github | 187 | over 1 year ago | |
Awesome Papers / Multimodal In-Context Learning
| Visual In-Context Learning for Large Vision-Language Models | | | |
| RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model | | | |
| Github | 76 | about 1 year ago | |
| Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
| Github | 30 | 12 months ago | |
| Generative Multimodal Models are In-Context Learners | | | |
| Github | 1,672 | about 1 year ago | |
| Demo | | | |
| Hijacking Context in Large Multi-modal Models | | | |
| Towards More Unified In-context Visual Understanding | | | |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
| Github | 337 | almost 2 years ago | |
| Demo | | | |
| Link-Context Learning for Multimodal LLMs | | | |
| Github | 91 | over 1 year ago | |
| Demo | | | |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | | | |
| Github | 3,781 | about 1 year ago | |
| Demo | | | |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | | | |
| Github | 396 | about 2 years ago | |
| Generative Pretraining in Multimodality | | | |
| Github | 1,672 | about 1 year ago | |
| Demo | | | |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
| Github | 3,570 | over 1 year ago | |
| Demo | | | |
| Exploring Diverse In-Context Configurations for Image Captioning | | | |
| Github | 33 | 11 months ago | |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
| Github | 1,095 | almost 2 years ago | |
| Demo | | | |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
| Github | 23,801 | about 1 year ago | |
| Demo | | | |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
| Github | 940 | over 1 year ago | |
| Demo | | | |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
| Github | 50 | about 2 years ago | |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | | | |
| Github | 270 | over 2 years ago | |
| Visual Programming: Compositional visual reasoning without training | | | |
| Github | 697 | about 1 year ago | |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | | | |
| Github | 85 | over 3 years ago | |
| Flamingo: a Visual Language Model for Few-Shot Learning | | | |
| Github | 3,781 | about 1 year ago | |
| Demo | | | |
| Multimodal Few-Shot Learning with Frozen Language Models | | | |
Awesome Papers / Multimodal Chain-of-Thought
| Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | | | |
| Github | 113 | 11 months ago | |
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | | | |
| Github | 73 | over 1 year ago | |
| Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | | | |
| Github | 162 | 11 months ago | |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | | | |
| Github | 90 | over 1 year ago | |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | | |
| Github | 35 | over 1 year ago | |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
| Github | 748 | over 1 year ago | |
| Demo | | | |
| Explainable Multimodal Emotion Reasoning | | | |
| Github | 123 | over 1 year ago | |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
| Github | 346 | over 1 year ago | |
| Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
| T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | | | |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
| Github | 1,693 | about 2 years ago | |
| Demo | | | |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | | | |
| Coming soon | | | |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
| Github | 1,095 | almost 2 years ago | |
| Demo | | | |
| Chain of Thought Prompt Tuning in Vision Language Models | | | |
| Coming soon | | | |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
| Github | 940 | over 1 year ago | |
| Demo | | | |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
| Github | 34,555 | almost 2 years ago | |
| Demo | | | |
| Multimodal Chain-of-Thought Reasoning in Language Models | | | |
| Github | 3,833 | over 1 year ago | |
| Visual Programming: Compositional visual reasoning without training | | | |
| Github | 697 | about 1 year ago | |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
| Github | 615 | about 1 year ago | |
Awesome Papers / LLM-Aided Visual Reasoning
| Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models | | | |
| Github | 14 | about 1 year ago | |
| V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs | | | |
| Github | 541 | almost 2 years ago | |
| LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | | | |
| Github | 353 | about 1 year ago | |
| Demo | | | |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | | | |
| ControlLLM: Augment Language Models with Tools by Searching on Graphs | | | |
| Github | 187 | over 1 year ago | |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
| Github | 617 | over 1 year ago | |
| Demo | | | |
| MindAgent: Emergent Gaming Interaction | | | |
| Github | 79 | over 1 year ago | |
| Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | | | |
| Github | 352 | almost 2 years ago | |
| Demo | | | |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | | |
| AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | | | |
| Github | 66 | over 2 years ago | |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
| Github | 762 | almost 2 years ago | |
| Demo | | | |
| Mindstorms in Natural Language-Based Societies of Mind | | | |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | | | |
| Github | 306 | over 1 year ago | |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | | | |
| Github | 32 | about 2 years ago | |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
| Github | 7 | over 2 years ago | |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
| Github | 1,693 | about 2 years ago | |
| Demo | | | |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
| Github | 1,095 | almost 2 years ago | |
| Demo | | | |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
| Github | 23,801 | about 1 year ago | |
| Demo | | | |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
| Github | 940 | over 1 year ago | |
| Demo | | | |
| ViperGPT: Visual Inference via Python Execution for Reasoning | | | |
| Github | 1,666 | over 1 year ago | |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | | | |
| Github | 457 | over 2 years ago | |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
| Github | 34,555 | almost 2 years ago | |
| Demo | | | |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | | | |
| Github | 41 | over 2 years ago | |
| From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | | | |
| Github | 10,058 | 11 months ago | |
| Demo | | | |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | | | |
| Github | 94 | about 2 years ago | |
| PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | | | |
| Github | 235 | about 2 years ago | |
| Visual Programming: Compositional visual reasoning without training | | | |
| Github | 697 | about 1 year ago | |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | | | |
| Github | 34,478 | 11 months ago | |
Awesome Papers / Foundation Models
| Emu3: Next-Token Prediction is All You Need | | | |
| Github | 1,911 | about 1 year ago | |
| Llama 3.2: Revolutionizing edge AI and vision with open, customizable models | | | |
| Demo | | | |
| Pixtral-12B | | | |
| xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | | | |
| Github | 10,058 | 11 months ago | |
| The Llama 3 Herd of Models | | | |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | | | |
| Hello GPT-4o | | | |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | | |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | | | |
| Gemini: A Family of Highly Capable Multimodal Models | | | |
| Fuyu-8B: A Multimodal Architecture for AI Agents | | | |
| Huggingface | | | |
| Demo | | | |
| Unified Model for Image, Video, Audio and Language Tasks | | | |
| Github | 224 | almost 2 years ago | |
| Demo | | | |
| PaLI-3 Vision Language Models: Smaller, Faster, Stronger | | | |
| GPT-4V(ision) System Card | | | |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | | | |
| Github | 544 | about 1 year ago | |
| Multimodal Foundation Models: From Specialists to General-Purpose Assistants | | | |
| Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | | | |
| Github | 24 | almost 2 years ago | |
| Generative Pretraining in Multimodality | | | |
| Github | 1,672 | about 1 year ago | |
| Demo | | | |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
| Github | 20,400 | 10 months ago | |
| Demo | | | |
| Transfer Visual Prompt Generator across LLMs | | | |
| Github | 270 | about 2 years ago | |
| Demo | | | |
| GPT-4 Technical Report | | | |
| PaLM-E: An Embodied Multimodal Language Model | | | |
| Demo | | | |
| Prismer: A Vision-Language Model with An Ensemble of Experts | | | |
| Github | 1,299 | almost 2 years ago | |
| Demo | | | |
| Language Is Not All You Need: Aligning Perception with Language Models | | | |
| Github | 20,400 | 10 months ago | |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | | | |
| Github | 10,058 | 11 months ago | |
| Demo | | | |
| VIMA: General Robot Manipulation with Multimodal Prompts | | | |
| Github | 781 | over 1 year ago | |
| MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | | | |
| Github | 1,843 | over 1 year ago | |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | | | |
| Github | 43 | over 2 years ago | |
| Language Models are General-Purpose Interfaces | | | |
| Github | 20,400 | 10 months ago | |
Awesome Papers / Evaluation
| MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | | | |
| Github | 106 | 11 months ago | |
| OmniBench: Towards The Future of Universal Omni-Language Models | | | |
| Github | 15 | 12 months ago | |
| MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
| Github | 86 | 11 months ago | |
| UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
| Github | 3 | about 1 year ago | |
| MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | | | |
| Github | 22 | about 1 year ago | |
| Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | | | |
| Github | 67 | about 1 year ago | |
| CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
| Github | 85 | about 1 year ago | |
| ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | | | |
| Github | 95 | over 1 year ago | |
| Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
| Github | 422 | 10 months ago | |
| Benchmarking Large Multimodal Models against Common Corruptions | | | |
| Github | 27 | almost 2 years ago | |
| Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
| Github | 296 | over 1 year ago | |
| A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | | | |
| Github | 13,117 | 10 months ago | |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
| Github | 84 | about 1 year ago | |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | | | |
| Github | 72 | almost 2 years ago | |
| Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
| Github | 24 | about 1 year ago | |
| MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
| Github | 56 | about 1 year ago | |
| VLM-Eval: A General Evaluation on Video Large Language Models | | | |
| Coming soon | | | |
| Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
| Github | 53 | over 1 year ago | |
| On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving | | | |
| Github | 288 | over 1 year ago | |
| Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead | | | |
| A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging | | | |
| An Early Evaluation of GPT-4V(ision) | | | |
| Github | 11 | almost 2 years ago | |
| Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation | | | |
| Github | 121 | almost 2 years ago | |
| HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
| Github | 259 | 11 months ago | |
| MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
| Github | 253 | 11 months ago | |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | | | |
| Github | 14 | almost 2 years ago | |
| Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | | | |
| Github | 21 | over 1 year ago | |
| Can We Edit Multimodal Large Language Models? | | | |
| Github | 1,981 | 10 months ago | |
| REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets | | | |
| Github | 11 | about 2 years ago | |
| The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | | | |
| TouchStone: Evaluating Vision-Language Models by Language Models | | | |
| Github | 79 | almost 2 years ago | |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
| Github | 43 | over 1 year ago | |
| SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
| Github | 38 | 12 months ago | |
| Tiny LVLM-eHub: Early Multimodal Experiments with Bard | | | |
| Github | 478 | over 1 year ago | |
| MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
| Github | 274 | 12 months ago | |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
| Github | 322 | over 1 year ago | |
| MMBench: Is Your Multi-modal Model an All-around Player? | | | |
| Github | 168 | about 1 year ago | |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
| Github | 13,117 | 10 months ago | |
| LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
| Github | 478 | over 1 year ago | |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
| Github | 305 | over 1 year ago | |
| M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
| Github | 93 | over 2 years ago | |
| On The Hidden Mystery of OCR in Large Multimodal Models | | | |
| Github | 484 | about 1 year ago | |
Awesome Papers / Multimodal RLHF
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | | | |
| Silkie: Preference Distillation for Large Visual Language Models | | | |
| Github | 88 | almost 2 years ago | |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
| Github | 245 | about 1 year ago | |
| Demo | | | |
| Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
| Github | 328 | almost 2 years ago | |
| Demo | | | |
| RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data | | | |
| Github | 2 | about 1 year ago | |
Awesome Papers / Others
| TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | | | |
| Github | 7 | 11 months ago | |
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | | | |
| Github | 47 | about 1 year ago | |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | | | |
| Github | 266 | over 1 year ago | |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | | | |
| Github | 135 | over 1 year ago | |
| Planting a SEED of Vision in Large Language Model | | | |
| Github | 585 | about 1 year ago | |
| Can Large Pre-trained Models Help Vision Models on Perception Tasks? | | | |
| Github | 1,218 | 12 months ago | |
| Contextual Object Detection with Multimodal Large Language Models | | | |
| Github | 208 | about 1 year ago | |
| Demo | | | |
| Generating Images with Multimodal Language Models | | | |
| Github | 440 | almost 2 years ago | |
| On Evaluating Adversarial Robustness of Large Vision-Language Models | | | |
| Github | 165 | almost 2 years ago | |
| Grounding Language Models to Images for Multimodal Inputs and Outputs | | | |
| Github | 478 | almost 2 years ago | |
| Demo | | | |
Awesome Datasets / Datasets of Pre-Training for Alignment
| ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | | | |
| COYO-700M: Image-Text Pair Dataset | 1,172 | almost 3 years ago | |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | | | |
| Microsoft COCO: Common Objects in Context | | | |
| Im2Text: Describing Images Using 1 Million Captioned Photographs | | | |
| Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | | | |
| LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | | | |
| Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | | | |
| Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | | | |
| AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding | | | |
| Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | | | |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | | | |
| MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | | | |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | | | |
| WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | | | |
| AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline | | | |
| AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | | | |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Awesome Datasets / Datasets of Multimodal Instruction Tuning
| E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | | | |
| Link | 42 | 12 months ago | |
| Multi-modal Situated Reasoning in 3D Scenes | | | |
| Link | | | |
| MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | | | |
| Link | | | |
| UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
| Link | 3 | about 1 year ago | |
| VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | | | |
| Link | 33 | over 1 year ago | |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
| Link | | | |
| Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
| Link | 6 | over 1 year ago | |
| Visually Dehallucinative Instruction Generation | | | |
| Link | 5 | over 1 year ago | |
| M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
| Link | 58 | about 1 year ago | |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
| Link | | | |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
| Link | | | |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
| Link | 18 | almost 2 years ago | |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
| Link | 43 | over 1 year ago | |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
| Link | 93 | almost 2 years ago | |
| Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
| Coming soon | | | |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
| Link | | | |
| SVIT: Scaling up Visual Instruction Tuning | | | |
| Link | | | |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
| Link | 1,958 | about 1 year ago | |
| Visual Instruction Tuning with Polite Flamingo | | | |
| Link | | | |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
| Link | | | |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
| Link | | | |
| MotionGPT: Human Motion as a Foreign Language | | | |
| Link | 1,531 | over 1 year ago | |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
| Link | 262 | over 1 year ago | |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
| Link | 1,568 | over 1 year ago | |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
| Link | 305 | over 1 year ago | |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
| Link | 1,246 | about 1 year ago | |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
| Link | 3,570 | over 1 year ago | |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
| Link | | | |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
| Coming soon | 1,622 | about 1 year ago | |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
| Link | 762 | almost 2 years ago | |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
| Coming soon | | | |
| DetGPT: Detect What You Need via Reasoning | | | |
| Link | 761 | about 1 year ago | |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
| Coming soon | | | |
| VideoChat: Chat-Centric Video Understanding | | | |
| Link | 1,467 | 11 months ago | |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
| Link | 308 | about 2 years ago | |
| LMEye: An Interactive Perception Network for Large Language Models | | | |
| Link | | | |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
| Link | | | |
| Visual Instruction Tuning | | | |
| Link | | | |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
| Link | 134 | over 2 years ago | |
Awesome Datasets / Datasets of In-Context Learning
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
| Link | | | |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
| Link | 3,570 | over 1 year ago | |
Awesome Datasets / Datasets of Multimodal Chain-of-Thought
| Explainable Multimodal Emotion Reasoning | | | |
| Coming soon | 123 | over 1 year ago | |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
| Coming soon | 346 | over 1 year ago | |
| Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
| Coming soon | | | |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
| Link | 615 | about 1 year ago | |
Awesome Datasets / Datasets of Multimodal RLHF
| Silkie: Preference Distillation for Large Visual Language Models | | | |
| Link | | | |
Awesome Datasets / Benchmarks for Evaluation
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | | | |
| Link | 47 | over 1 year ago | |
| MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | | | |
| Link | 106 | 11 months ago | |
| MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | | | |
| Link | 3 | 12 months ago | |
| LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content | | | |
| Link | | | |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | | | |
| Link | | | |
| OmniBench: Towards The Future of Universal Omni-Language Models | | | |
| Link | | | |
| MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
| Link | | | |
| VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? | | | |
| Link | 5 | about 1 year ago | |
| Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions | | | |
| Link | 43 | about 1 year ago | |
| CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
| Link | | | |
| Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
| Link | 422 | 10 months ago | |
| VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning | | | |
| Link | 31 | over 1 year ago | |
| TempCompass: Do Video LLMs Really Understand Videos? | | | |
| Link | 91 | 11 months ago | |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | | | |
| Link | | | |
| Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
| Link | | | |
| Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
| Link | 6 | over 1 year ago | |
| Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | | | |
| Link | 74 | about 1 year ago | |
| SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval | | | |
| Link | 22 | about 1 year ago | |
| CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | | | |
| Link | 46 | about 1 year ago | |
| Benchmarking Large Multimodal Models against Common Corruptions | | | |
| Link | 27 | almost 2 years ago | |
| Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
| Link | 296 | over 1 year ago | |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
| Link | | | |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
| Link | | | |
| M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
| Link | 58 | about 1 year ago | |
| Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | | | |
| Link | 121 | almost 2 years ago | |
| Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
| Link | 24 | about 1 year ago | |
| MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
| Link | 56 | about 1 year ago | |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
| Link | | | |
| MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | | | |
| Link | 87 | about 1 year ago | |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | | | |
| Link | 3,106 | 11 months ago | |
| Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
| Link | 53 | over 1 year ago | |
| OtterHD: A High-Resolution Multi-modality Model | | | |
| Link | | | |
| HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
| Link | 259 | 11 months ago | |
| Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | | | |
| Link | 99 | over 1 year ago | |
| Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
| Link | | | |
| MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
| Link | | | |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
| Link | 43 | over 1 year ago | |
| Link-Context Learning for Multimodal LLMs | | | |
| Link | | | |
| Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
| Coming soon | | | |
| Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | | | |
| Link | 360 | over 1 year ago | |
| SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
| Link | 38 | 12 months ago | |
| MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
| Link | 274 | 12 months ago | |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
| Link | 322 | over 1 year ago | |
| MMBench: Is Your Multi-modal Model an All-around Player? | | | |
| Link | 168 | about 1 year ago | |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
| Link | 231 | about 2 years ago | |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
| Link | 262 | over 1 year ago | |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
| Link | 13,117 | 10 months ago | |
| LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
| Link | 478 | over 1 year ago | |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
| Link | 305 | over 1 year ago | |
| M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
| Link | 93 | over 2 years ago | |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
| Link | 2,365 | 11 months ago | |
Awesome Datasets / Others
| IMAD: IMage-Augmented multi-modal Dialogue | | | |
| Link | 4 | over 2 years ago | |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
| Link | 1,246 | about 1 year ago | |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
| Link | | | |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
| Link | | | |
| Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | | | |
| Link | | | |
| Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | | | |
| Link | | | |