Awesome-Multimodal-Large-Language-Models

Conversational AI resources

A collection of resources and papers on multimodal large language models for understanding and building advanced conversational AI systems.

✨✨ Latest Advances on Multimodal Large Language Models

GitHub

13k stars
256 watching
837 forks
last commit: about 1 month ago

Tags: chain-of-thought, in-context-learning, instruction-following, instruction-tuning, large-language-models, large-vision-language-model, large-vision-language-models, multi-modality, multimodal-chain-of-thought, multimodal-in-context-learning, multimodal-instruction-tuning, multimodal-large-language-models, visual-instruction-tuning

Awesome Papers / Multimodal Instruction Tuning

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Github 396 about 1 month ago
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Github
Demo
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Github 2,616 about 1 month ago
StreamChat: Chatting with Streaming Video
CompCap: Improving Multimodal Large Language Models with Composite Captions
LinVT: Empower Your Image-level Large Language Model to Understand Videos
Github 13 about 1 month ago
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Github 6,394 about 1 month ago
Demo
NVILA: Efficient Frontier Visual Language Models
Github 2,146 about 1 month ago
Demo
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs
Github 44 about 1 month ago
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
Github 67 about 2 months ago
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Github 106 about 2 months ago
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Github 329 2 months ago
Demo
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Github 89 about 2 months ago
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Github 57 2 months ago
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Huggingface
Demo
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Github 3,613 about 2 months ago
Demo
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Github 183 3 months ago
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Github 549 4 months ago
Demo
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Github 69 3 months ago
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Github 2,365 about 2 months ago
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Github 1,005 3 months ago
LLaVA-OneVision: Easy Visual Task Transfer
Github 3,099 3 months ago
Demo
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Github 12,870 3 months ago
Demo
VILA^2: VILA Augmented VILA
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
EVLM: An Efficient Vision-Language Model for Visual Understanding
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Github 26 about 2 months ago
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Github 2,616 about 1 month ago
Demo
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Github 1,336 about 1 month ago
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Github 9 about 1 month ago
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Github 1,799 3 months ago
Long Context Transfer from Language to Vision
Github 347 about 2 months ago
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Github 1,091 about 1 month ago
TroL: Traversal of Layers for Large Language and Vision Models
Github 88 7 months ago
Unveiling Encoder-Free Vision-Language Models
Github 246 4 months ago
VideoLLM-online: Online Video Large Language Model for Streaming Video
Github 251 5 months ago
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
Github 64 3 months ago
Demo
Comparison Visual Instruction Tuning
Github
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Github 143 2 months ago
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Github 957 about 2 months ago
Parrot: Multilingual Visual Instruction Tuning
Github 34 5 months ago
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Github 575 about 2 months ago
Matryoshka Query Transformer for Large Vision-Language Models
Github 101 7 months ago
Demo
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Github 106 6 months ago
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Github 102 8 months ago
Demo
Libra: Building Decoupled Vision System on Large Language Models
Github 153 about 2 months ago
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Github 136 7 months ago
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Github 6,394 about 1 month ago
Demo
Graphic Design with Large Multimodal Model
Github 102 9 months ago
BRAVE: Broadening the visual encoding of vision-language models
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Github 2,616 about 1 month ago
Demo
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Github 254 6 months ago
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Github 406 3 months ago
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model
LITA: Language Instructed Temporal-Localization Assistant
Github 151 3 months ago
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Github 3,229 9 months ago
Demo
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Github 314 10 months ago
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Github 2,145 9 months ago
Demo
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Github 1,849 about 2 months ago
Demo
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Github 466 5 months ago
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Github 798 5 months ago
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Github 58 about 2 months ago
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Github 249 7 months ago
Demo
CoLLaVO: Crayon Large Language and Vision mOdel
Github 93 7 months ago
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Github 494 7 months ago
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
Github 153 7 months ago
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Github 1,076 9 months ago
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
Github 43 2 months ago
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study
Coming soon
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
Github 20,683 5 months ago
Demo
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Github 2,023 about 2 months ago
Demo
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Github 2,616 about 1 month ago
Demo
Yi-VL
Github 7,743 about 2 months ago
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
Github 108 4 months ago
MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices
Github 1,076 9 months ago
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Github 6,394 about 1 month ago
Demo
Osprey: Pixel Understanding with Visual Instruction Tuning
Github 781 6 months ago
Demo
CogAgent: A Visual Language Model for GUI Agents
Github 6,182 8 months ago
Coming soon
Pixel Aligned Language Models
Coming soon
VILA: On Pre-training for Visual Language Models
Github 2,146 about 1 month ago
See, Say, and Segment: Teaching LMMs to Overcome False Premises
Coming soon
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Github 1,831 about 2 months ago
Demo
Honeybee: Locality-enhanced Projector for Multimodal LLM
Github 435 8 months ago
Gemini: A Family of Highly Capable Multimodal Models
OneLLM: One Framework to Align All Modalities with Language
Github 601 3 months ago
Demo
Lenna: Language Enhanced Reasoning Detection Assistant
Github 78 12 months ago
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Github 314 about 2 months ago
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Github 302 6 months ago
Demo
Dolphins: Multimodal Language Model for Driving
Github 51 6 months ago
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Github 255 6 months ago
Coming soon
VTimeLLM: Empower LLM to Grasp Video Moments
Github 231 7 months ago
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
Github 1,958 4 months ago
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Github 748 6 months ago
Coming soon
LLMGA: Multimodal Large Language Model based Generation Assistant
Github 463 5 months ago
Demo
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
Github 202 about 1 year ago
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Github 2,616 about 1 month ago
Demo
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Github 124 6 months ago
An Embodied Generalist Agent in 3D World
Github 379 3 months ago
Demo
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Github 3,071 about 2 months ago
Demo
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Github 895 3 months ago
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Github 131 about 1 year ago
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Github 2,732 8 months ago
Demo
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Github 1,849 about 2 months ago
Demo
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Github 717 12 months ago
Demo
NExT-Chat: An LMM for Chat, Detection and Segmentation
Github 227 12 months ago
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Github 2,365 about 2 months ago
Demo
OtterHD: A High-Resolution Multi-modality Model
Github 3,570 11 months ago
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
Coming soon
GLaMM: Pixel Grounding Large Multimodal Model
Github 797 about 2 months ago
Demo
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Github 18 about 1 year ago
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Github 25,490 5 months ago
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Github 1,091 about 1 month ago
Ferret: Refer and Ground Anything Anywhere at Any Granularity
Github 8,509 3 months ago
CogVLM: Visual Expert For Large Language Models
Github 6,182 8 months ago
Demo
Improved Baselines with Visual Instruction Tuning
Github 20,683 5 months ago
Demo
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Github 751 10 months ago
Demo
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Github 79 7 months ago
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Github 59 12 months ago
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Github 2,616 about 1 month ago
DreamLLM: Synergistic Multimodal Comprehension and Creation
Github 402 about 2 months ago
Coming soon
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models
Coming soon
TextBind: Multi-turn Interleaved Multimodal Instruction-following
Github 47 over 1 year ago
Demo
NExT-GPT: Any-to-Any Multimodal LLM
Github 3,344 3 months ago
Demo
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Github 19 over 1 year ago
ImageBind-LLM: Multi-modality Instruction Tuning
Github 5,775 10 months ago
Demo
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
PointLLM: Empowering Large Language Models to Understand Point Clouds
Github 670 3 months ago
Demo
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Github 43 7 months ago
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
Github 39 8 months ago
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
Github 37 over 1 year ago
Demo
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Github 5,179 5 months ago
Demo
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Github 1,098 7 months ago
Demo
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
Github 93 about 1 year ago
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
Github 270 9 months ago
Demo
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
Github 360 8 months ago
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Github 466 5 months ago
Demo
LISA: Reasoning Segmentation via Large Language Model
Github 1,923 7 months ago
Demo
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Github 550 about 1 month ago
3D-LLM: Injecting the 3D World into Large Language Models
Github 979 8 months ago
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
Demo
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Github 505 over 1 year ago
Demo
SVIT: Scaling up Visual Instruction Tuning
Github 164 7 months ago
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Github 517 7 months ago
Demo
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Github 231 over 1 year ago
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Github 1,958 4 months ago
Demo
Visual Instruction Tuning with Polite Flamingo
Github 63 about 1 year ago
Demo
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Github 259 7 months ago
Demo
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Github 748 6 months ago
Demo
MotionGPT: Human Motion as a Foreign Language
Github 1,531 10 months ago
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Github 1,568 7 months ago
Coming soon
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Github 305 9 months ago
Demo
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Github 1,246 5 months ago
Demo
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Github 3,570 11 months ago
Demo
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Github 2,842 8 months ago
Demo
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Github 1,622 5 months ago
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Github 762 about 1 year ago
Demo
PandaGPT: One Model To Instruction-Follow Them All
Github 772 over 1 year ago
Demo
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Github 49 over 1 year ago
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Github 513 12 months ago
DetGPT: Detect What You Need via Reasoning
Github 761 5 months ago
Demo
Pengi: An Audio Language Model for Audio Tasks
Github 295 9 months ago
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Github 956 3 months ago
Listen, Think, and Understand
Github 396 9 months ago
Demo
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Github 180 about 1 month ago
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Github 10,058 2 months ago
VideoChat: Chat-Centric Video Understanding
Github 3,106 about 2 months ago
Demo
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
Github 1,478 over 1 year ago
Demo
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Github 308 over 1 year ago
LMEye: An Interactive Perception Network for Large Language Models
Github 48 6 months ago
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Github 5,775 10 months ago
Demo
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Github 2,365 about 2 months ago
Demo
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Github 25,490 5 months ago
Visual Instruction Tuning
GitHub 20,683 5 months ago
Demo
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Github 5,775 10 months ago
Demo
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Github 134 over 1 year ago
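
The papers above share a common data recipe: conversations grounded in an image, with an image placeholder spliced into the text. A minimal, hypothetical sketch of such a sample and how it is flattened into a training string (the <image> token, field names, and file path are illustrative assumptions, not any specific repo's format):

```python
# Illustrative visual-instruction-tuning sample; all names are assumptions.
IMAGE_TOKEN = "<image>"

sample = {
    "image": "example.jpg",  # hypothetical image path
    "conversations": [
        {"role": "user", "content": f"{IMAGE_TOKEN}\nWhat is unusual about this picture?"},
        {"role": "assistant", "content": "A cat is sitting inside the mailbox."},
    ],
}

def build_prompt(conversations):
    # Flatten the conversation into one training string. In practice the loss
    # is masked so that only assistant turns contribute to the objective.
    return "\n".join(f"{t['role'].upper()}: {t['content']}" for t in conversations)

print(build_prompt(sample["conversations"]))
```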

Awesome Papers / Multimodal Hallucination

Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
Github 28 about 2 months ago
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Github 46 2 months ago
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs
Link
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Github 83 2 months ago
Evaluating and Analyzing Relationship Hallucinations in LVLMs
Github 20 3 months ago
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Github 18 6 months ago
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Coming soon
Mitigating Object Hallucination via Data Augmented Contrastive Tuning
Coming soon
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap
Coming soon
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models
Github 15 4 months ago
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
Debiasing Multimodal Large Language Models
Github 75 10 months ago
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Github 72 about 2 months ago
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective
Github 39 3 months ago
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models
Github 19 7 months ago
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs
Github 8 12 months ago
Unified Hallucination Detection for Multimodal Large Language Models
Github 48 9 months ago
A Survey on Hallucination in Large Vision-Language Models
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
Github 82 12 months ago
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations
Github 13 3 months ago
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites
Github 8 12 months ago
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Github 245 4 months ago
Demo
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Github 293 5 months ago
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Github 222 3 months ago
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Github 73 12 months ago
Coming soon
Mitigating Hallucination in Visual Language Models with Visual Supervision
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
Github 41 6 months ago
An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Github 98 about 1 year ago
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
Github 27 2 months ago
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Github 617 7 months ago
Demo
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption
Github 28 9 months ago
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
Github 136 9 months ago
Aligning Large Multimodal Models with Factually Augmented RLHF
Github 328 about 1 year ago
Demo
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning
Evaluation and Analysis of Hallucination in Large Vision-Language Models
Github 17 over 1 year ago
VIGC: Visual Instruction Generation and Correction
Github 91 12 months ago
Demo
Detecting and Preventing Hallucinations in Large Vision Language Models
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Github 262 10 months ago
Demo
Evaluating Object Hallucination in Large Vision-Language Models
Github 187 10 months ago
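
Several of the decoding-side papers above (e.g., the visual contrastive decoding line) share one idea: contrast next-token logits computed with the real image against logits computed with a distorted or missing image, so tokens driven purely by language priors are down-weighted. A toy sketch with made-up numbers (alpha is a hyperparameter, not a value from any paper):

```python
import numpy as np

def contrastive_logits(logits_image, logits_distorted, alpha=1.0):
    # (1 + alpha) * l_img - alpha * l_distorted, the usual contrastive form.
    return (1 + alpha) * logits_image - alpha * logits_distorted

vocab = ["cat", "dog", "mailbox"]
l_img = np.array([2.0, 0.5, 1.8])    # conditioned on the real image
l_noise = np.array([1.9, 0.4, 0.2])  # conditioned on a noised image

adjusted = contrastive_logits(l_img, l_noise)
print(vocab[int(np.argmax(adjusted))])  # "mailbox": the visually grounded token wins
```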

Awesome Papers / Multimodal In-Context Learning

Visual In-Context Learning for Large Vision-Language Models
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
Github 76 3 months ago
Can MLLMs Perform Text-to-Image In-Context Learning?
Github 30 2 months ago
Generative Multimodal Models are In-Context Learners
Github 1,672 4 months ago
Demo
Hijacking Context in Large Multi-modal Models
Towards More Unified In-context Visual Understanding
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Github 337 about 1 year ago
Demo
Link-Context Learning for Multimodal LLMs
Github 91 8 months ago
Demo
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Github 3,781 5 months ago
Demo
Med-Flamingo: a Multimodal Medical Few-shot Learner
Github 396 over 1 year ago
Generative Pretraining in Multimodality
Github 1,672 4 months ago
Demo
AVIS: Autonomous Visual Information Seeking with Large Language Models
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Github 3,570 11 months ago
Demo
Exploring Diverse In-Context Configurations for Image Captioning
Github 33 about 2 months ago
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,095 about 1 year ago
Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
Github 23,801 4 months ago
Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 940 12 months ago
Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
Github 50 over 1 year ago
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering
Github 270 over 1 year ago
Visual Programming: Compositional visual reasoning without training
Github 697 5 months ago
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Github 85 almost 3 years ago
Flamingo: a Visual Language Model for Few-Shot Learning
Github 3,781 5 months ago
Demo
Multimodal Few-Shot Learning with Frozen Language Models
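
The Flamingo family above conditions generation on interleaved image-text few-shot prompts. A minimal sketch of how such a prompt is assembled; the <image> and <|endofchunk|> markers follow OpenFlamingo's convention and should be treated as assumptions for any other model:

```python
shots = [
    ("dog.jpg", "An image of a dog."),
    ("cat.jpg", "An image of a cat."),
]

def build_icl_prompt(shots, query_stub="An image of"):
    # Demonstrations first, then the query image whose caption the model completes.
    pieces = [f"<image>{caption}<|endofchunk|>" for _img, caption in shots]
    pieces.append(f"<image>{query_stub}")
    return "".join(pieces)

# Image files are passed to the vision encoder separately, in the same order
# as the <image> placeholders appear in the text.
print(build_icl_prompt(shots))
```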

Awesome Papers / Multimodal Chain-of-Thought

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Github 113 about 2 months ago
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
Github 73 8 months ago
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models
Github 162 about 2 months ago
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Github 90 7 months ago
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
Github 35 10 months ago
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Github 748 6 months ago
Demo
Explainable Multimodal Emotion Reasoning
Github 123 9 months ago
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Github 346 9 months ago
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Github 1,693 over 1 year ago
Demo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings
Coming soon
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,095 about 1 year ago
Demo
Chain of Thought Prompt Tuning in Vision Language Models
Coming soon
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 940 12 months ago
Demo
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Github 34,555 about 1 year ago
Demo
Multimodal Chain-of-Thought Reasoning in Language Models
Github 3,833 7 months ago
Visual Programming: Compositional visual reasoning without training
Github 697 5 months ago
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Github 615 4 months ago
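
A recurring recipe in this section (popularized by "Multimodal Chain-of-Thought Reasoning in Language Models") is two-stage inference: generate a rationale conditioned on the image and question, then generate the answer conditioned on the question plus that rationale. A compact sketch in which `generate` is a stub standing in for any vision-language model call:

```python
def generate(prompt, image=None):
    # Stand-in for a real VLM call; returns canned text so the sketch runs.
    if "Answer:" in prompt:
        return "attract"
    return "Opposite poles face each other, so the magnets attract."

def multimodal_cot(question, image):
    rationale = generate(f"Question: {question}\nRationale:", image=image)
    answer = generate(f"Question: {question}\nRationale: {rationale}\nAnswer:", image=image)
    return rationale, answer

print(multimodal_cot("Will these two magnets attract or repel?", "magnets.jpg"))
```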

Awesome Papers / LLM-Aided Visual Reasoning

Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
Github 14 3 months ago
V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Github 541 about 1 year ago
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Github 353 6 months ago
Demo
MM-VID: Advancing Video Understanding with GPT-4V(ision)
ControlLLM: Augment Language Models with Tools by Searching on Graphs
Github 187 6 months ago
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Github 617 7 months ago
Demo
MindAgent: Emergent Gaming Interaction
Github 79 7 months ago
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
Github 352 about 1 year ago
Demo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Github 66 over 1 year ago
AVIS: Autonomous Visual Information Seeking with Large Language Models
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Github 762 about 1 year ago
Demo
Mindstorms in Natural Language-Based Societies of Mind
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
Github 306 9 months ago
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
Github 32 over 1 year ago
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Github 7 over 1 year ago
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Github 1,693 over 1 year ago
Demo
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,095 about 1 year ago
Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
Github 23,801 4 months ago
Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 940 12 months ago
Demo
ViperGPT: Visual Inference via Python Execution for Reasoning
Github 1,666 12 months ago
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Github 457 almost 2 years ago
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Github 34,555 about 1 year ago
Demo
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
Github 41 over 1 year ago
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Github 10,058 2 months ago
Demo
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
Github 94 over 1 year ago
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
Github 235 over 1 year ago
Visual Programming: Compositional visual reasoning without training
Github 697 5 months ago
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Github 34,478 about 1 month ago
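
Most systems in this section (MM-REACT, HuggingGPT, Visual ChatGPT, Chameleon, ...) follow the same controller pattern: an LLM plans which vision tool to call, the tool result is appended to the context, and the loop repeats until the LLM emits a final answer. A toy sketch in which the planner and tools are stubs and all names are illustrative:

```python
TOOLS = {
    "caption": lambda img: "a cat sitting in a mailbox",
    "detect": lambda img: ["cat", "mailbox"],
}

def llm_plan(history):
    # Stand-in for a planning LLM; a real system prompts a model here.
    if not history:
        return ("caption", None)
    return ("final", f"The image shows {history[-1][1]}.")

def run(image, max_steps=5):
    history = []
    for _ in range(max_steps):
        action, payload = llm_plan(history)
        if action == "final":
            return payload
        history.append((action, TOOLS[action](image)))
    return "No answer within the step budget."

print(run("mailbox.jpg"))
```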

Awesome Papers / Foundation Models

Emu3: Next-Token Prediction is All You Need
Github 1,911 3 months ago
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
Demo
Pixtral-12B
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Github 10,058 2 months ago
The Llama 3 Herd of Models
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Hello GPT-4o
The Claude 3 Model Family: Opus, Sonnet, Haiku
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini: A Family of Highly Capable Multimodal Models
Fuyu-8B: A Multimodal Architecture for AI Agents
Huggingface
Demo
Unified Model for Image, Video, Audio and Language Tasks
Github 224 about 1 year ago
Demo
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
GPT-4V(ision) System Card
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
Github 544 3 months ago
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Github 24 about 1 year ago
Generative Pretraining in Multimodality
Github 1,672 4 months ago
Demo
Kosmos-2: Grounding Multimodal Large Language Models to the World
Github 20,400 about 1 month ago
Demo
Transfer Visual Prompt Generator across LLMs
Github 270 over 1 year ago
Demo
GPT-4 Technical Report
PaLM-E: An Embodied Multimodal Language Model
Demo
Prismer: A Vision-Language Model with An Ensemble of Experts
Github 1,299 about 1 year ago
Demo
Language Is Not All You Need: Aligning Perception with Language Models
Github 20,400 about 1 month ago
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Github 10,058 2 months ago
Demo
VIMA: General Robot Manipulation with Multimodal Prompts
Github 781 9 months ago
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Github 1,843 10 months ago
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
Github 43 over 1 year ago
Language Models are General-Purpose Interfaces
Github 20,400 about 1 month ago
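
Several of these foundation models ship with Hugging Face transformers integrations. A minimal usage sketch with BLIP-2 (the checkpoint id and image path are assumptions; the weights are several GB, so treat this as a sketch rather than a quick test):

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, text="Question: what is shown? Answer:", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```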

Awesome Papers / Evaluation

MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
Github 106 about 2 months ago
OmniBench: Towards The Future of Universal Omni-Language Models
Github 15 2 months ago
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Github 86 about 2 months ago
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Github 3 5 months ago
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Github 22 4 months ago
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Github 67 3 months ago
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Github 85 3 months ago
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Github 95 6 months ago
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Github 422 about 1 month ago
Benchmarking Large Multimodal Models against Common Corruptions
Github 27 12 months ago
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Github 296 12 months ago
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Github 13,117 about 1 month ago
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Github 84 5 months ago
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
Github 72 about 1 year ago
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Github 24 4 months ago
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
Github 56 3 months ago
VLM-Eval: A General Evaluation on Video Large Language Models
Coming soon
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
Github 53 10 months ago
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
Github 288 10 months ago
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging
An Early Evaluation of GPT-4V(ision)
Github 11 about 1 year ago
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation
Github 121 about 1 year ago
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Github 259 2 months ago
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
Github 253 about 2 months ago
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
Github 14 about 1 year ago
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Github 21 11 months ago
Can We Edit Multimodal Large Language Models?
Github 1,981 about 1 month ago
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets
Github 11 over 1 year ago
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
TouchStone: Evaluating Vision-Language Models by Language Models
Github 79 12 months ago
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Github 43 7 months ago
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
Github 38 3 months ago
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
Github 478 9 months ago
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Github 274 2 months ago
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Github 322 6 months ago
MMBench: Is Your Multi-modal Model an All-around Player?
Github 168 5 months ago
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Github 13,117 about 1 month ago
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Github 478 9 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Github 305 9 months ago
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Github 93 over 1 year ago
On The Hidden Mystery of OCR in Large Multimodal Models
Github 484 3 months ago
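
Many perception benchmarks above (MME being the canonical example) score models with paired yes/no questions per image, reporting both plain accuracy and a stricter per-image accuracy where both questions must be answered correctly. A minimal scoring sketch over fabricated records:

```python
records = [
    {"image": "1.jpg", "answers": ["yes", "no"], "preds": ["yes", "no"]},
    {"image": "2.jpg", "answers": ["yes", "no"], "preds": ["yes", "yes"]},
]

total = correct = images_all_correct = 0
for rec in records:
    hits = [a == p for a, p in zip(rec["answers"], rec["preds"])]
    total += len(hits)
    correct += sum(hits)
    images_all_correct += all(hits)

print(f"acc  = {correct / total:.2f}")                    # 0.75
print(f"acc+ = {images_all_correct / len(records):.2f}")  # 0.50 (both right per image)
```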

Awesome Papers / Multimodal RLHF

Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
Silkie: Preference Distillation for Large Visual Language Models
Github 88 about 1 year ago
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Github 245 4 months ago
Demo
Aligning Large Multimodal Models with Factually Augmented RLHF
Github 328 about 1 year ago
Demo
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Github 2 3 months ago
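
Much of the multimodal preference work above (e.g., RLHF-V, Silkie) builds on direct preference optimization rather than full PPO-style RLHF: given log-probabilities of a chosen and a rejected response under the policy and a frozen reference model, minimize -log sigmoid(beta * margin). A condensed sketch with placeholder values:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # margin = (policy - reference) log-prob gap between chosen and rejected.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(
    pi_chosen=torch.tensor([-12.0]), pi_rejected=torch.tensor([-15.0]),
    ref_chosen=torch.tensor([-13.0]), ref_rejected=torch.tensor([-14.0]),
)
print(loss)  # smaller when the chosen response is relatively more likely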

Awesome Papers / Others

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
Github 7 about 1 month ago
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Github 47 5 months ago
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Github 266 9 months ago
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
Github 135 6 months ago
Planting a SEED of Vision in Large Language Model
Github 585 4 months ago
Can Large Pre-trained Models Help Vision Models on Perception Tasks?
Github 1,218 2 months ago
Contextual Object Detection with Multimodal Large Language Models
Github 208 3 months ago
Demo
Generating Images with Multimodal Language Models
Github 440 12 months ago
On Evaluating Adversarial Robustness of Large Vision-Language Models
Github 165 about 1 year ago
Grounding Language Models to Images for Multimodal Inputs and Outputs
Github 478 about 1 year ago
Demo

Awesome Datasets / Datasets of Pre-Training for Alignment

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
COYO-700M: Image-Text Pair Dataset 1,172 about 2 years ago
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Microsoft COCO: Common Objects in Context
Im2Text: Describing Images Using 1 Million Captioned Photographs
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Kosmos-2: Grounding Multimodal Large Language Models to the World
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
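
Whatever their release format, the corpora above reduce at training time to (image, caption) pairs. A generic PyTorch loading sketch; the tab-separated "path<TAB>caption" layout is an assumption, since each release ships its own format and tooling:

```python
import csv
from PIL import Image
from torch.utils.data import Dataset

class ImageTextPairs(Dataset):
    def __init__(self, tsv_path, transform=None):
        # Each row: image path, caption (assumed layout for illustration).
        with open(tsv_path, newline="") as f:
            self.rows = list(csv.reader(f, delimiter="\t"))
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        path, caption = self.rows[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, caption
```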

Awesome Datasets / Datasets of Multimodal Instruction Tuning

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Link 42 2 months ago
Multi-modal Situated Reasoning in 3D Scenes
Link
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Link
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Link 3 5 months ago
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Link 33 6 months ago
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Link
Visually Dehallucinative Instruction Generation: Know What You Don't Know
Link 6 11 months ago
Visually Dehallucinative Instruction Generation
Link 5 10 months ago
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Link 58 4 months ago
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Link
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Link
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Link 18 about 1 year ago
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Link 43 7 months ago
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
Link 93 about 1 year ago
Detecting and Preventing Hallucinations in Large Vision Language Models
Coming soon
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Link
SVIT: Scaling up Visual Instruction Tuning
Link
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Link 1,958 4 months ago
Visual Instruction Tuning with Polite Flamingo
Link
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
Link
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Link
MotionGPT: Human Motion as a Foreign Language
Link 1,531 10 months ago
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Link 262 10 months ago
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Link 1,568 7 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Link 305 9 months ago
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Link 1,246 5 months ago
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Link 3,570 11 months ago
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
Link
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Coming soon 1,622 5 months ago
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Link 762 about 1 year ago
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Coming soon
DetGPT: Detect What You Need via Reasoning
Link 761 5 months ago
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Coming soon
VideoChat: Chat-Centric Video Understanding
Link 1,467 about 1 month ago
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Link 308 over 1 year ago
LMEye: An Interactive Perception Network for Large Language Models
Link
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Link
Visual Instruction Tuning
Link
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Link 134 over 1 year ago

Awesome Datasets / Datasets of In-Context Learning

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Link
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Link 3,570 11 months ago

Awesome Datasets / Datasets of Multimodal Chain-of-Thought

Explainable Multimodal Emotion Reasoning
Coming soon 123 9 months ago
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Coming soon 346 9 months ago
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction
Coming soon
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Link 615 4 months ago

Awesome Datasets / Datasets of Multimodal RLHF

Silkie: Preference Distillation for Large Visual Language Models
Link

Awesome Datasets / Benchmarks for Evaluation

M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
Link 47 8 months ago
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
Link 106 about 2 months ago
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps
Link 3 2 months ago
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Link
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Link
OmniBench: Towards The Future of Universal Omni-Language Models
Link
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Link
VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?
Link 5 5 months ago
Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions
Link 43 3 months ago
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Link
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Link 422 about 1 month ago
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning
Link 31 9 months ago
TempCompass: Do Video LLMs Really Understand Videos?
Link 91 2 months ago
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
Link
Can MLLMs Perform Text-to-Image In-Context Learning?
Link
Visually Dehallucinative Instruction Generation: Know What You Don't Know
Link 6 11 months ago
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
Link 74 3 months ago
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval
Link 22 6 months ago
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Link 46 4 months ago
Benchmarking Large Multimodal Models against Common Corruptions
Link 27 12 months ago
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Link 296 12 months ago
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Link
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Link
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Link 58 4 months ago
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Link 121 about 1 year ago
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Link 24 4 months ago
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
Link 56 3 months ago
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Link
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
Link 87 4 months ago
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Link 3,106 about 2 months ago
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
Link 53 10 months ago
OtterHD: A High-Resolution Multi-modality Model
Link
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Link 259 2 months ago
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond
Link 99 10 months ago
Aligning Large Multimodal Models with Factually Augmented RLHF
Link
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
Link
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Link 43 7 months ago
Link-Context Learning for Multimodal LLMs
Link
Detecting and Preventing Hallucinations in Large Vision Language Models
Coming soon
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
Link 360 8 months ago
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
Link 38 3 months ago
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Link 274 2 months ago
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Link 322 6 months ago
MMBench: Is Your Multi-modal Model an All-around Player?
Link 168 5 months ago
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Link 231 over 1 year ago
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Link 262 10 months ago
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Link 13,117 about 1 month ago
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Link 478 9 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Link 305 9 months ago
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Link 93 over 1 year ago
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Link 2,365 about 2 months ago

Awesome Datasets / Others

IMAD: IMage-Augmented multi-modal Dialogue
Link 4 over 1 year ago
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Link 1,246 5 months ago
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Link
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Link
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Link
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Link
