Awesome Papers / Multimodal Instruction Tuning |
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | | | |
Github | 396 | about 1 month ago | |
Apollo: An Exploration of Video Understanding in Large Multimodal Models | | | |
Github | | | |
Demo | | | |
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | | | |
Github | 2,616 | about 1 month ago | |
StreamChat: Chatting with Streaming Video | | | |
CompCap: Improving Multimodal Large Language Models with Composite Captions | | | |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | | | |
Github | 13 | about 1 month ago | |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | | | |
Github | 6,394 | about 1 month ago | |
Demo | | | |
NVILA: Efficient Frontier Visual Language Models | | | |
Github | 2,146 | about 1 month ago | |
Demo | | | |
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | | | |
Github | 44 | about 1 month ago | |
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | | | |
Github | 67 | about 2 months ago | |
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | | | |
Github | 106 | about 2 months ago | |
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | | | |
Github | 329 | 2 months ago | |
Demo | | | |
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | | | |
Github | 89 | about 2 months ago | |
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | | | |
Github | 57 | 2 months ago | |
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | | | |
Huggingface | | | |
Demo | | | |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | | | |
Github | 3,613 | about 2 months ago | |
Demo | | | |
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | | | |
Github | 183 | 3 months ago | |
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | | | |
Github | 549 | 4 months ago | |
Demo | | | |
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | | | |
Github | 69 | 3 months ago | |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | | | |
Github | 2,365 | about 2 months ago | |
VITA: Towards Open-Source Interactive Omni Multimodal LLM | | | |
Github | 1,005 | 3 months ago | |
LLaVA-OneVision: Easy Visual Task Transfer | | | |
Github | 3,099 | 3 months ago | |
Demo | | | |
MiniCPM-V: A GPT-4V Level MLLM on Your Phone | | | |
Github | 12,870 | 3 months ago | |
Demo | | | |
VILA^2: VILA Augmented VILA | | | |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | | | |
EVLM: An Efficient Vision-Language Model for Visual Understanding | | | |
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | | | |
Github | 26 | about 2 months ago | |
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | | | |
Github | 2,616 | about 1 month ago | |
Demo | | | |
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | | | |
Github | 1,336 | about 1 month ago | |
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | | | |
Github | 9 | about 1 month ago | |
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | | | |
Github | 1,799 | 3 months ago | |
Long Context Transfer from Language to Vision | | | |
Github | 347 | about 2 months ago | |
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | | | |
Github | 1,091 | about 1 month ago | |
TroL: Traversal of Layers for Large Language and Vision Models | | | |
Github | 88 | 7 months ago | |
Unveiling Encoder-Free Vision-Language Models | | | |
Github | 246 | 4 months ago | |
VideoLLM-online: Online Video Large Language Model for Streaming Video | | | |
Github | 251 | 5 months ago | |
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | | | |
Github | 64 | 3 months ago | |
Demo | | | |
Comparison Visual Instruction Tuning | | | |
Github | | | |
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | | | |
Github | 143 | 2 months ago | |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | | | |
Github | 957 | about 2 months ago | |
Parrot: Multilingual Visual Instruction Tuning | | | |
Github | 34 | 5 months ago | |
Ovis: Structural Embedding Alignment for Multimodal Large Language Model | | | |
Github | 575 | about 2 months ago | |
Matryoshka Query Transformer for Large Vision-Language Models | | | |
Github | 101 | 7 months ago | |
Demo | | | |
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | | | |
Github | 106 | 6 months ago | |
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | | | |
Github | 102 | 8 months ago | |
Demo | | | |
Libra: Building Decoupled Vision System on Large Language Models | | | |
Github | 153 | about 2 months ago | |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | | | |
Github | 136 | 7 months ago | |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | | |
Github | 6,394 | about 1 month ago | |
Demo | | | |
Graphic Design with Large Multimodal Model | | | |
Github | 102 | 9 months ago | |
BRAVE: Broadening the visual encoding of vision-language models | | | |
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | | | |
Github | 2,616 | about 1 month ago | |
Demo | | | |
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | | | |
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | | | |
Github | 254 | 6 months ago | |
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | | | |
Github | 406 | 3 months ago | |
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | | | |
LITA: Language Instructed Temporal-Localization Assistant | | | |
Github | 151 | 3 months ago | |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | | | |
Github | 3,229 | 9 months ago | |
Demo | | | |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | | |
MoAI: Mixture of All Intelligence for Large Language and Vision Models | | | |
Github | 314 | 10 months ago | |
DeepSeek-VL: Towards Real-World Vision-Language Understanding | | | |
Github | 2,145 | 9 months ago | |
Demo | | | |
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | | | |
Github | 1,849 | about 2 months ago | |
Demo | | | |
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | | | |
Github | 466 | 5 months ago | |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | | |
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | | | |
Github | 798 | 5 months ago | |
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | | | |
Github | 58 | about 2 months ago | |
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
Github | 249 | 7 months ago | |
Demo | | | |
CoLLaVO: Crayon Large Language and Vision mOdel | | | |
Github | 93 | 7 months ago | |
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | | | |
Github | 494 | 7 months ago | |
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | | | |
Github | 153 | 7 months ago | |
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | | | |
Github | 1,076 | 9 months ago | |
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | | | |
Github | 43 | 2 months ago | |
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | | | |
Coming soon | | | |
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | | | |
Github | 20,683 | 5 months ago | |
Demo | | | |
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | | | |
Github | 2,023 | about 2 months ago | |
Demo | | | |
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | | | |
Github | 2,616 | about 1 month ago | |
Demo | | | |
Yi-VL | | | |
Github | 7,743 | about 2 months ago | |
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | | | |
ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | | | |
Github | 108 | 4 months ago | |
MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | | | |
Github | 1,076 | 9 months ago | |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | | | |
Github | 6,394 | about 1 month ago | |
Demo | | | |
Osprey: Pixel Understanding with Visual Instruction Tuning | | | |
Github | 781 | 6 months ago | |
Demo | | | |
CogAgent: A Visual Language Model for GUI Agents | | | |
Github | 6,182 | 8 months ago | |
Coming soon | | | |
Pixel Aligned Language Models | | | |
Coming soon | | | |
VILA: On Pre-training for Visual Language Models | | | |
Github | 2,146 | about 1 month ago | |
See, Say, and Segment: Teaching LMMs to Overcome False Premises | | | |
Coming soon | | | |
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | | | |
Github | 1,831 | about 2 months ago | |
Demo | | | |
Honeybee: Locality-enhanced Projector for Multimodal LLM | | | |
Github | 435 | 8 months ago | |
Gemini: A Family of Highly Capable Multimodal Models | | | |
OneLLM: One Framework to Align All Modalities with Language | | | |
Github | 601 | 3 months ago | |
Demo | | | |
Lenna: Language Enhanced Reasoning Detection Assistant | | | |
Github | 78 | 12 months ago | |
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | | | |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
Github | 314 | about 2 months ago | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Github | 302 | 6 months ago | |
Demo | | | |
Dolphins: Multimodal Language Model for Driving | | | |
Github | 51 | 6 months ago | |
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | | | |
Github | 255 | 6 months ago | |
Coming soon | | | |
VTimeLLM: Empower LLM to Grasp Video Moments | | | |
Github | 231 | 7 months ago | |
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | | | |
Github | 1,958 | 4 months ago | |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | | | |
Github | 748 | 6 months ago | |
Coming soon | | | |
LLMGA: Multimodal Large Language Model based Generation Assistant | | | |
Github | 463 | 5 months ago | |
Demo | | | |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
Github | 202 | about 1 year ago | |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
Github | 2,616 | about 1 month ago | |
Demo | | | |
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | | | |
Github | 124 | 6 months ago | |
An Embodied Generalist Agent in 3D World | | | |
Github | 379 | 3 months ago | |
Demo | | | |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | | | |
Github | 3,071 | about 2 months ago | |
Demo | | | |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | | | |
Github | 895 | 3 months ago | |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
Github | 131 | about 1 year ago | |
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | | | |
Github | 2,732 | 8 months ago | |
Demo | | | |
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | | | |
Github | 1,849 | about 2 months ago | |
Demo | | | |
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | | | |
Github | 717 | 12 months ago | |
Demo | | | |
NExT-Chat: An LMM for Chat, Detection and Segmentation | | | |
Github | 227 | 12 months ago | |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | | | |
Github | 2,365 | about 2 months ago | |
Demo | | | |
OtterHD: A High-Resolution Multi-modality Model | | | |
Github | 3,570 | 11 months ago | |
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | | | |
Coming soon | | | |
GLaMM: Pixel Grounding Large Multimodal Model | | | |
Github | 797 | about 2 months ago | |
Demo | | | |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
Github | 18 | about 1 year ago | |
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | | | |
Github | 25,490 | 5 months ago | |
SALMONN: Towards Generic Hearing Abilities for Large Language Models | | | |
Github | 1,091 | about 1 month ago | |
Ferret: Refer and Ground Anything Anywhere at Any Granularity | | | |
Github | 8,509 | 3 months ago | |
CogVLM: Visual Expert For Large Language Models | | | |
Github | 6,182 | 8 months ago | |
Demo | | | |
Improved Baselines with Visual Instruction Tuning | | | |
Github | 20,683 | 5 months ago | |
Demo | | | |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | | | |
Github | 751 | 10 months ago | |
Demo | | | |
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | | | |
Github | 79 | 7 months ago | |
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | | | |
Github | 59 | 12 months ago | |
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | | | |
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | | | |
Github | 2,616 | about 1 month ago | |
DreamLLM: Synergistic Multimodal Comprehension and Creation | | | |
Github | 402 | about 2 months ago | |
Coming soon | | | |
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | | | |
Coming soon | | | |
TextBind: Multi-turn Interleaved Multimodal Instruction-following | | | |
Github | 47 | over 1 year ago | |
Demo | | | |
NExT-GPT: Any-to-Any Multimodal LLM | | | |
Github | 3,344 | 3 months ago | |
Demo | | | |
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | | | |
Github | 19 | over 1 year ago | |
ImageBind-LLM: Multi-modality Instruction Tuning | | | |
Github | 5,775 | 10 months ago | |
Demo | | | |
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | | | |
PointLLM: Empowering Large Language Models to Understand Point Clouds | | | |
Github | 670 | 3 months ago | |
Demo | | | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Github | 43 | 7 months ago | |
MLLM-DataEngine: An Iterative Refinement Approach for MLLM | | | |
Github | 39 | 8 months ago | |
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | | | |
Github | 37 | over 1 year ago | |
Demo | | | |
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | | | |
Github | 5,179 | 5 months ago | |
Demo | | | |
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | | | |
Github | 1,098 | 7 months ago | |
Demo | | | |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
Github | 93 | about 1 year ago | |
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | | | |
Github | 270 | 9 months ago | |
Demo | | | |
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | | | |
Github | 360 | 8 months ago | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
Github | 466 | 5 months ago | |
Demo | | | |
LISA: Reasoning Segmentation via Large Language Model | | | |
Github | 1,923 | 7 months ago | |
Demo | | | |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | | | |
Github | 550 | about 1 month ago | |
3D-LLM: Injecting the 3D World into Large Language Models | | | |
Github | 979 | 8 months ago | |
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
Demo | | | |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
Github | 505 | over 1 year ago | |
Demo | | | |
SVIT: Scaling up Visual Instruction Tuning | | | |
Github | 164 | 7 months ago | |
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | | | |
Github | 517 | 7 months ago | |
Demo | | | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
Github | 231 | over 1 year ago | |
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
Github | 1,958 | 4 months ago | |
Demo | | | |
Visual Instruction Tuning with Polite Flamingo | | | |
Github | 63 | about 1 year ago | |
Demo | | | |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
Github | 259 | 7 months ago | |
Demo | | | |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
Github | 748 | 6 months ago | |
Demo | | | |
MotionGPT: Human Motion as a Foreign Language | | | |
Github | 1,531 | 10 months ago | |
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
Github | 1,568 | 7 months ago | |
Coming soon | | | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Github | 305 | 9 months ago | |
Demo | | | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Github | 1,246 | 5 months ago | |
Demo | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Github | 3,570 | 11 months ago | |
Demo | | | |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | | | |
Github | 2,842 | 8 months ago | |
Demo | | | |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
Github | 1,622 | 5 months ago | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Github | 762 | about 1 year ago | |
Demo | | | |
PandaGPT: One Model To Instruction-Follow Them All | | | |
Github | 772 | over 1 year ago | |
Demo | | | |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
Github | 49 | over 1 year ago | |
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | | | |
Github | 513 | 12 months ago | |
DetGPT: Detect What You Need via Reasoning | | | |
Github | 761 | 5 months ago | |
Demo | | | |
Pengi: An Audio Language Model for Audio Tasks | | | |
Github | 295 | 9 months ago | |
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | | | |
Github | 956 | 3 months ago | |
Listen, Think, and Understand | | | |
Github | 396 | 9 months ago | |
Demo | | | |
Github | 4,110 | 5 months ago | |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
Github | 180 | about 1 month ago | |
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | | | |
Github | 10,058 | 2 months ago | |
VideoChat: Chat-Centric Video Understanding | | | |
Github | 3,106 | about 2 months ago | |
Demo | | | |
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | | | |
Github | 1,478 | over 1 year ago | |
Demo | | | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Github | 308 | over 1 year ago | |
LMEye: An Interactive Perception Network for Large Language Models | | | |
Github | 48 | 6 months ago | |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | | | |
Github | 5,775 | 10 months ago | |
Demo | | | |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
Github | 2,365 | about 2 months ago | |
Demo | | | |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
Github | 25,490 | 5 months ago | |
Visual Instruction Tuning | | | |
Github | 20,683 | 5 months ago | |
Demo | | | |
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | | | |
Github | 5,775 | 10 months ago | |
Demo | | | |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
Github | 134 | over 1 year ago | |
Awesome Papers / Multimodal Hallucination |
Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | | | |
Github | 28 | about 2 months ago | |
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | | | |
Github | 46 | 2 months ago | |
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | | | |
Link | | | |
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | | | |
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | | | |
Github | 83 | 2 months ago | |
Evaluating and Analyzing Relationship Hallucinations in LVLMs | | | |
Github | 20 | 3 months ago | |
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | | | |
Github | 18 | 6 months ago | |
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | | | |
Coming soon | | | |
Mitigating Object Hallucination via Data Augmented Contrastive Tuning | | | |
Coming soon | | | |
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | | | |
Coming soon | | | |
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | | | |
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | | | |
What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | | | |
Github | 15 | 4 months ago | |
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | | |
Debiasing Multimodal Large Language Models | | | |
Github | 75 | 10 months ago | |
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | | | |
Github | 72 | about 2 months ago | |
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | | | |
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | | | |
Github | 39 | 3 months ago | |
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | | | |
Github | 19 | 7 months ago | |
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | | | |
Github | 8 | 12 months ago | |
Unified Hallucination Detection for Multimodal Large Language Models | | | |
Github | 48 | 9 months ago | |
A Survey on Hallucination in Large Vision-Language Models | | | |
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | | | |
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | | | |
Github | 82 | 12 months ago | |
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | | | |
Github | 13 | 3 months ago | |
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | | | |
Github | 8 | 12 months ago | |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
Github | 245 | 4 months ago | |
Demo | | | |
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | | | |
Github | 293 | 5 months ago | |
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | | | |
Github | 222 | 3 months ago | |
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | | | |
Github | 73 | 12 months ago | |
Coming soon | | | |
Mitigating Hallucination in Visual Language Models with Visual Supervision | | | |
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | | | |
Github | 41 | 6 months ago | |
An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | | | |
Github | 98 | about 1 year ago | |
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | | | |
Github | 27 | 2 months ago | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
Github | 617 | 7 months ago | |
Demo | | | |
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | | | |
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | | | |
Github | 28 | 9 months ago | |
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | | | |
Github | 136 | 9 months ago | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Github | 328 | about 1 year ago | |
Demo | | | |
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | | | |
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | | | |
Evaluation and Analysis of Hallucination in Large Vision-Language Models | | | |
Github | 17 | over 1 year ago | |
VIGC: Visual Instruction Generation and Correction | | | |
Github | 91 | 12 months ago | |
Demo | | | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Github | 262 | 10 months ago | |
Demo | | | |
Evaluating Object Hallucination in Large Vision-Language Models | | | |
Github | 187 | 10 months ago | |
Awesome Papers / Multimodal In-Context Learning |
Visual In-Context Learning for Large Vision-Language Models | | | |
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model | | | |
Github | 76 | 3 months ago | |
Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
Github | 30 | 2 months ago | |
Generative Multimodal Models are In-Context Learners | | | |
Github | 1,672 | 4 months ago | |
Demo | | | |
Hijacking Context in Large Multi-modal Models | | | |
Towards More Unified In-context Visual Understanding | | | |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
Github | 337 | about 1 year ago | |
Demo | | | |
Link-Context Learning for Multimodal LLMs | | | |
Github | 91 | 8 months ago | |
Demo | | | |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | | | |
Github | 3,781 | 5 months ago | |
Demo | | | |
Med-Flamingo: a Multimodal Medical Few-shot Learner | | | |
Github | 396 | over 1 year ago | |
Generative Pretraining in Multimodality | | | |
Github | 1,672 | 4 months ago | |
Demo | | | |
AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Github | 3,570 | 11 months ago | |
Demo | | | |
Exploring Diverse In-Context Configurations for Image Captioning | | | |
Github | 33 | about 2 months ago | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,095 | about 1 year ago | |
Demo | | | |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
Github | 23,801 | 4 months ago | |
Demo | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 940 | 12 months ago | |
Demo | | | |
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
Github | 50 | over 1 year ago | |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | | | |
Github | 270 | over 1 year ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 697 | 5 months ago | |
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | | | |
Github | 85 | almost 3 years ago | |
Flamingo: a Visual Language Model for Few-Shot Learning | | | |
Github | 3,781 | 5 months ago | |
Demo | | | |
Multimodal Few-Shot Learning with Frozen Language Models | | | |
Awesome Papers / Multimodal Chain-of-Thought |
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | | | |
Github | 113 | about 2 months ago | |
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | | | |
Github | 73 | 8 months ago | |
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | | | |
Github | 162 | about 2 months ago | |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | | | |
Github | 90 | 7 months ago | |
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | | |
Github | 35 | 10 months ago | |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
Github | 748 | 6 months ago | |
Demo | | | |
Explainable Multimodal Emotion Reasoning | | | |
Github | 123 | 9 months ago | |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
Github | 346 | 9 months ago | |
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | | | |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
Github | 1,693 | over 1 year ago | |
Demo | | | |
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | | | |
Coming soon | | | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,095 | about 1 year ago | |
Demo | | | |
Chain of Thought Prompt Tuning in Vision Language Models | | | |
Coming soon | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 940 | 12 months ago | |
Demo | | | |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
Github | 34,555 | about 1 year ago | |
Demo | | | |
Multimodal Chain-of-Thought Reasoning in Language Models | | | |
Github | 3,833 | 7 months ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 697 | 5 months ago | |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
Github | 615 | 4 months ago | |
Awesome Papers / LLM-Aided Visual Reasoning |
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models | | | |
Github | 14 | 3 months ago | |
V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs | | | |
Github | 541 | about 1 year ago | |
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | | | |
Github | 353 | 6 months ago | |
Demo | | | |
MM-VID: Advancing Video Understanding with GPT-4V(ision) | | | |
ControlLLM: Augment Language Models with Tools by Searching on Graphs | | | |
Github | 187 | 6 months ago | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
Github | 617 | 7 months ago | |
Demo | | | |
MindAgent: Emergent Gaming Interaction | | | |
Github | 79 | 7 months ago | |
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | | | |
Github | 352 | about 1 year ago | |
Demo | | | |
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | | |
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | | | |
Github | 66 | over 1 year ago | |
AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Github | 762 | about 1 year ago | |
Demo | | | |
Mindstorms in Natural Language-Based Societies of Mind | | | |
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | | | |
Github | 306 | 9 months ago | |
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | | | |
Github | 32 | over 1 year ago | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Github | 7 | over 1 year ago | |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
Github | 1,693 | over 1 year ago | |
Demo | | | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,095 | about 1 year ago | |
Demo | | | |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
Github | 23,801 | 4 months ago | |
Demo | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 940 | 12 months ago | |
Demo | | | |
ViperGPT: Visual Inference via Python Execution for Reasoning | | | |
Github | 1,666 | 12 months ago | |
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | | | |
Github | 457 | almost 2 years ago | |
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
Github | 34,555 | about 1 year ago | |
Demo | | | |
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | | | |
Github | 41 | over 1 year ago | |
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | | | |
Github | 10,058 | 2 months ago | |
Demo | | | |
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | | | |
Github | 94 | over 1 year ago | |
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | | | |
Github | 235 | over 1 year ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 697 | 5 months ago | |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | | | |
Github | 34,478 | about 1 month ago | |
Awesome Papers / Foundation Models |
Emu3: Next-Token Prediction is All You Need | | | |
Github | 1,911 | 3 months ago | |
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models | | | |
Demo | | | |
Pixtral-12B | | | |
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | | | |
Github | 10,058 | 2 months ago | |
The Llama 3 Herd of Models | | | |
Chameleon: Mixed-Modal Early-Fusion Foundation Models | | | |
Hello GPT-4o | | | |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | | |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | | | |
Gemini: A Family of Highly Capable Multimodal Models | | | |
Fuyu-8B: A Multimodal Architecture for AI Agents | | | |
Huggingface | | | |
Demo | | | |
Unified Model for Image, Video, Audio and Language Tasks | | | |
Github | 224 | about 1 year ago | |
Demo | | | |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | | | |
GPT-4V(ision) System Card | | | |
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | | | |
Github | 544 | 3 months ago | |
Multimodal Foundation Models: From Specialists to General-Purpose Assistants | | | |
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | | | |
Github | 24 | about 1 year ago | |
Generative Pretraining in Multimodality | | | |
Github | 1,672 | 4 months ago | |
Demo | | | |
Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
Github | 20,400 | about 1 month ago | |
Demo | | | |
Transfer Visual Prompt Generator across LLMs | | | |
Github | 270 | over 1 year ago | |
Demo | | | |
GPT-4 Technical Report | | | |
PaLM-E: An Embodied Multimodal Language Model | | | |
Demo | | | |
Prismer: A Vision-Language Model with An Ensemble of Experts | | | |
Github | 1,299 | about 1 year ago | |
Demo | | | |
Language Is Not All You Need: Aligning Perception with Language Models | | | |
Github | 20,400 | about 1 month ago | |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | | | |
Github | 10,058 | 2 months ago | |
Demo | | | |
VIMA: General Robot Manipulation with Multimodal Prompts | | | |
Github | 781 | 9 months ago | |
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | | | |
Github | 1,843 | 10 months ago | |
Write and Paint: Generative Vision-Language Models are Unified Modal Learners | | | |
Github | 43 | over 1 year ago | |
Language Models are General-Purpose Interfaces | | | |
Github | 20,400 | about 1 month ago | |
Awesome Papers / Evaluation |
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | | | |
Github | 106 | about 2 months ago | |
OmniBench: Towards The Future of Universal Omni-Language Models | | | |
Github | 15 | 2 months ago | |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
Github | 86 | about 2 months ago | |
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
Github | 3 | 5 months ago | |
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | | | |
Github | 22 | 4 months ago | |
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | | | |
Github | 67 | 3 months ago | |
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
Github | 85 | 3 months ago | |
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | | | |
Github | 95 | 6 months ago | |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
Github | 422 | about 1 month ago | |
Benchmarking Large Multimodal Models against Common Corruptions | | | |
Github | 27 | 12 months ago | |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
Github | 296 | 12 months ago | |
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | | | |
Github | 13,117 | about 1 month ago | |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
Github | 84 | 5 months ago | |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | | | |
Github | 72 | about 1 year ago | |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
Github | 24 | 4 months ago | |
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
Github | 56 | 3 months ago | |
VLM-Eval: A General Evaluation on Video Large Language Models | | | |
Coming soon | | | |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
Github | 53 | 10 months ago | |
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving | | | |
Github | 288 | 10 months ago | |
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead | | | |
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging | | | |
An Early Evaluation of GPT-4V(ision) | | | |
Github | 11 | about 1 year ago | |
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation | | | |
Github | 121 | about 1 year ago | |
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
Github | 259 | 2 months ago | |
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
Github | 253 | about 2 months ago | |
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | | | |
Github | 14 | about 1 year ago | |
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | | | |
Github | 21 | 11 months ago | |
Can We Edit Multimodal Large Language Models? | | | |
Github | 1,981 | about 1 month ago | |
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets | | | |
Github | 11 | over 1 year ago | |
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | | | |
TouchStone: Evaluating Vision-Language Models by Language Models | | | |
Github | 79 | 12 months ago | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Github | 43 | 7 months ago | |
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
Github | 38 | 3 months ago | |
Tiny LVLM-eHub: Early Multimodal Experiments with Bard | | | |
Github | 478 | 9 months ago | |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
Github | 274 | 2 months ago | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
Github | 322 | 6 months ago | |
MMBench: Is Your Multi-modal Model an All-around Player? | | | |
Github | 168 | 5 months ago | |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
Github | 13,117 | about 1 month ago | |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
Github | 478 | 9 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Github | 305 | 9 months ago | |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
Github | 93 | over 1 year ago | |
On The Hidden Mystery of OCR in Large Multimodal Models | | | |
Github | 484 | 3 months ago | |
Awesome Papers / Multimodal RLHF |
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | | | |
Silkie: Preference Distillation for Large Visual Language Models | | | |
Github | 88 | about 1 year ago | |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
Github | 245 | 4 months ago | |
Demo | | | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Github | 328 | about 1 year ago | |
Demo | | | |
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data | | | |
Github | 2 | 3 months ago | |
Awesome Papers / Others |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | | | |
Github | 7 | about 1 month ago | |
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | | | |
Github | 47 | 5 months ago | |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | | | |
Github | 266 | 9 months ago | |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | | | |
Github | 135 | 6 months ago | |
Planting a SEED of Vision in Large Language Model | | | |
Github | 585 | 4 months ago | |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | | | |
Github | 1,218 | 2 months ago | |
Contextual Object Detection with Multimodal Large Language Models | | | |
Github | 208 | 3 months ago | |
Demo | | | |
Generating Images with Multimodal Language Models | | | |
Github | 440 | 12 months ago | |
On Evaluating Adversarial Robustness of Large Vision-Language Models | | | |
Github | 165 | about 1 year ago | |
Grounding Language Models to Images for Multimodal Inputs and Outputs | | | |
Github | 478 | about 1 year ago | |
Demo | | | |
Awesome Datasets / Datasets of Pre-Training for Alignment |
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | | | |
COYO-700M: Image-Text Pair Dataset | 1,172 | about 2 years ago | |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | | | |
Microsoft COCO: Common Objects in Context | | | |
Im2Text: Describing Images Using 1 Million Captioned Photographs | | | |
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | | | |
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | | | |
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | | | |
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | | | |
AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding | | | |
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | | | |
Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | | | |
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | | | |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | | | |
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | | | |
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline | | | |
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | | | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Awesome Datasets / Datasets of Multimodal Instruction Tuning |
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | | | |
Link | 42 | 2 months ago | |
Multi-modal Situated Reasoning in 3D Scenes | | | |
Link | | | |
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | | | |
Link | | | |
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
Link | 3 | 5 months ago | |
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | | | |
Link | 33 | 6 months ago | |
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
Link | | | |
Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
Link | 6 | 11 months ago | |
Visually Dehallucinative Instruction Generation | | | |
Link | 5 | 10 months ago | |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
Link | 58 | 4 months ago | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Link | | | |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
Link | | | |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
Link | 18 | about 1 year ago | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Link | 43 | 7 months ago | |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
Link | 93 | about 1 year ago | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Coming soon | | | |
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
Link | | | |
SVIT: Scaling up Visual Instruction Tuning | | | |
Link | | | |
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
Link | 1,958 | 4 months ago | |
Visual Instruction Tuning with Polite Flamingo | | | |
Link | | | |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
Link | | | |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
Link | | | |
MotionGPT: Human Motion as a Foreign Language | | | |
Link | 1,531 | 10 months ago | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Link | 262 | 10 months ago | |
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
Link | 1,568 | 7 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Link | 305 | 9 months ago | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Link | 1,246 | 5 months ago | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Link | 3,570 | 11 months ago | |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
Link | | | |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
Coming soon | 1,622 | 5 months ago | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Link | 762 | about 1 year ago | |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
Coming soon | | | |
DetGPT: Detect What You Need via Reasoning | | | |
Link | 761 | 5 months ago | |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
Coming soon | | | |
VideoChat: Chat-Centric Video Understanding | | | |
Link | 1,467 | about 1 month ago | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Link | 308 | over 1 year ago | |
LMEye: An Interactive Perception Network for Large Language Models | | | |
Link | | | |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
Link | | | |
Visual Instruction Tuning | | | |
Link | | | |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
Link | 134 | over 1 year ago | |
Awesome Datasets / Datasets of In-Context Learning |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
Link | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Link | 3,570 | 11 months ago | |
Awesome Datasets / Datasets of Multimodal Chain-of-Thought |
Explainable Multimodal Emotion Reasoning | | | |
Coming soon | 123 | 9 months ago | |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
Coming soon | 346 | 9 months ago | |
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
Coming soon | | | |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
Link | 615 | 4 months ago | |
Awesome Datasets / Datasets of Multimodal RLHF |
Silkie: Preference Distillation for Large Visual Language Models | | | |
Link | | | |
Awesome Datasets / Benchmarks for Evaluation |
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | | | |
Link | 47 | 8 months ago | |
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | | | |
Link | 106 | about 2 months ago | |
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | | | |
Link | 3 | 2 months ago | |
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content | | | |
Link | | | |
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | | | |
Link | | | |
OmniBench: Towards The Future of Universal Omni-Language Models | | | |
Link | | | |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
Link | | | |
VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? | | | |
Link | 5 | 5 months ago | |
Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions | | | |
Link | 43 | 3 months ago | |
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
Link | | | |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
Link | 422 | about 1 month ago | |
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning | | | |
Link | 31 | 9 months ago | |
TempCompass: Do Video LLMs Really Understand Videos? | | | |
Link | 91 | 2 months ago | |
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | | | |
Link | | | |
Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
Link | | | |
Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
Link | 6 | 11 months ago | |
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | | | |
Link | 74 | 3 months ago | |
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval | | | |
Link | 22 | 6 months ago | |
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | | | |
Link | 46 | 4 months ago | |
Benchmarking Large Multimodal Models against Common Corruptions | | | |
Link | 27 | 12 months ago | |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
Link | 296 | 12 months ago | |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
Link | | | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Link | | | |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
Link | 58 | 4 months ago | |
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | | | |
Link | 121 | about 1 year ago | |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
Link | 24 | 4 months ago | |
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
Link | 56 | 3 months ago | |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
Link | | | |
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | | | |
Link | 87 | 4 months ago | |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | | | |
Link | 3,106 | about 2 months ago | |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
Link | 53 | 10 months ago | |
OtterHD: A High-Resolution Multi-modality Model | | | |
Link | | | |
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
Link | 259 | 2 months ago | |
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | | | |
Link | 99 | 10 months ago | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Link | | | |
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
Link | | | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Link | 43 | 7 months ago | |
Link-Context Learning for Multimodal LLMs | | | |
Link | | | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Coming soon | | | |
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | | | |
Link | 360 | 8 months ago | |
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
Link | 38 | 3 months ago | |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
Link | 274 | 2 months ago | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
Link | 322 | 6 months ago | |
MMBench: Is Your Multi-modal Model an All-around Player? | | | |
Link | 168 | 5 months ago | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
Link | 231 | over 1 year ago | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Link | 262 | 10 months ago | |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
Link | 13,117 | about 1 month ago | |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
Link | 478 | 9 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Link | 305 | 9 months ago | |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
Link | 93 | over 1 year ago | |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
Link | 2,365 | about 2 months ago | |
Awesome Datasets / Others |
IMAD: IMage-Augmented multi-modal Dialogue | | | |
Link | 4 | over 1 year ago | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Link | 1,246 | 5 months ago | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Link | | | |
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | | | |
Link | | | |
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | | | |
Link | | | |