Awesome-Multimodal-Large-Language-Models

✨✨ Latest Advances on Multimodal Large Language Models

GitHub: 12k stars · 272 watching · 769 forks · last commit: 11 days ago

Topics: chain-of-thought · in-context-learning · instruction-following · instruction-tuning · large-language-models · large-vision-language-model · large-vision-language-models · multi-modality · multimodal-chain-of-thought · multimodal-in-context-learning · multimodal-instruction-tuning · multimodal-large-language-models · visual-instruction-tuning

Awesome Papers / Multimodal Instruction Tuning

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Github 2,329 12 days ago
Demo
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Github 151 11 days ago
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Github 503 17 days ago
Demo
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Github 2,248 13 days ago
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Github 791 14 days ago
LLaVA-OneVision: Easy Visual Task Transfer
Github 2,504 10 days ago
Demo
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Github 12,047 22 days ago
Demo
VILA²: VILA Augmented VILA
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
EVLM: An Efficient Vision-Language Model for Visual Understanding
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Github 2,467 about 1 month ago
Demo
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Github 1,236 10 days ago
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Github 1,694 16 days ago
Long Context Transfer from Language to Vision
Github 297 about 1 month ago
Unveiling Encoder-Free Vision-Language Models
Github 208 3 months ago
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Github 130 10 days ago
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Github 750 23 days ago
Parrot: Multilingual Visual Instruction Tuning
Github 25 about 2 months ago
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Github 321 17 days ago
Matryoshka Query Transformer for Large Vision-Language Models
Github 91 3 months ago
Demo
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Github 100 2 months ago
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Github 100 4 months ago
Demo
Libra: Building Decoupled Vision System on Large Language Models
Github 41 4 months ago
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Github 132 4 months ago
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Github 5,616 17 days ago
Demo
Graphic Design with Large Multimodal Model
Github 93 6 months ago
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Github 2,467 about 1 month ago
Demo
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Github 215 3 months ago
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Github 3,186 5 months ago
Demo
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Github 305 6 months ago
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Github 1,780 10 days ago
Demo
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Github 446 about 2 months ago
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Github 746 about 1 month ago
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Github 43 3 months ago
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Github 239 3 months ago
Demo
CoLLaVO: Crayon Large Language and Vision mOdel
Github 87 3 months ago
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
Github 146 3 months ago
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Github 990 6 months ago
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study
Coming soon
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
Github 19,505 about 2 months ago
Demo
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Github 1,933 5 months ago
Demo
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Github 2,467 about 1 month ago
Demo
Yi-VL
Github 7,615 13 days ago
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices
Github 990 6 months ago
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Github 5,616 17 days ago
Demo
Osprey: Pixel Understanding with Visual Instruction Tuning
Github 752 2 months ago
Demo
CogAgent: A Visual Language Model for GUI Agents
Github 5,901 4 months ago
Coming soon
Pixel Aligned Language Models
Coming soon
See, Say, and Segment: Teaching LMMs to Overcome False Premises
Coming soon
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Github 1,758 about 1 month ago
Demo
Honeybee: Locality-enhanced Projector for Multimodal LLM
Github 415 5 months ago
Gemini: A Family of Highly Capable Multimodal Models
OneLLM: One Framework to Align All Modalities with Language
Github 560 23 days ago
Demo
Lenna: Language Enhanced Reasoning Detection Assistant
Github 77 8 months ago
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Github 273 4 months ago
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Github 282 3 months ago
Demo
Dolphins: Multimodal Language Model for Driving
Github 29 3 months ago
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Github 227 3 months ago
Coming soon
VTimeLLM: Empower LLM to Grasp Video Moments
Github 208 4 months ago
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
Github 1,335 9 days ago
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Github 693 2 months ago
Coming soon
LLMGA: Multimodal Large Language Model based Generation Assistant
Github 451 about 2 months ago
Demo
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
Github 180 10 months ago
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Github 2,467 about 1 month ago
Demo
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Github 116 3 months ago
An Embodied Generalist Agent in 3D World
Github 341 2 months ago
Demo
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Github 2,875 11 days ago
Demo
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Github 765 3 months ago
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Github 130 10 months ago
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Github 2,696 4 months ago
Demo
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Github 1,780 10 days ago
Demo
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Github 694 8 months ago
Demo
NExT-Chat: An LMM for Chat, Detection and Segmentation
Github 205 8 months ago
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Github 2,248 13 days ago
Demo
OtterHD: A High-Resolution Multi-modality Model
Github 3,557 7 months ago
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
Coming soon
GLaMM: Pixel Grounding Large Multimodal Model
Github 747 4 months ago
Demo
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Github 18 11 months ago
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Github 25,327 about 1 month ago
Ferret: Refer and Ground Anything Anywhere at Any Granularity
Github 8,325 8 months ago
CogVLM: Visual Expert For Large Language Models
Github 5,901 4 months ago
Demo
Improved Baselines with Visual Instruction Tuning
Github 19,505 about 2 months ago
Demo
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Github 689 6 months ago
Demo
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Github 73 4 months ago
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Github 53 8 months ago
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Github 2,467 about 1 month ago
DreamLLM: Synergistic Multimodal Comprehension and Creation
Github 382 6 months ago
Coming soon
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models
Coming soon
TextBind: Multi-turn Interleaved Multimodal Instruction-following
Github 49 about 1 year ago
Demo
NExT-GPT: Any-to-Any Multimodal LLM
Github 3,217 9 months ago
Demo
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Github 19 about 1 year ago
ImageBind-LLM: Multi-modality Instruction Tuning
Github 5,701 7 months ago
Demo
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
PointLLM: Empowering Large Language Models to Understand Point Clouds
Github 529 10 days ago
Demo
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Github 41 4 months ago
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
Github 34 4 months ago
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
Github 36 about 1 year ago
Demo
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Github 4,857 about 2 months ago
Demo
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Github 1,074 4 months ago
Demo
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
Github 90 10 months ago
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
Github 263 6 months ago
Demo
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
Github 354 4 months ago
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Github 446 about 2 months ago
Demo
LISA: Reasoning Segmentation via Large Language Model
Github 1,779 3 months ago
Demo
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Github 495 29 days ago
3D-LLM: Injecting the 3D World into Large Language Models
Github 911 4 months ago
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
Demo
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Github 498 about 1 year ago
Demo
SVIT: Scaling up Visual Instruction Tuning
Github 159 4 months ago
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Github 497 4 months ago
Demo
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Github 227 about 1 year ago
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Github 1,335 9 days ago
Demo
Visual Instruction Tuning with Polite Flamingo
Github 63 10 months ago
Demo
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Github 254 4 months ago
Demo
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Github 733 3 months ago
Demo
MotionGPT: Human Motion as a Foreign Language
Github 1,458 6 months ago
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Github 1,531 4 months ago
Coming soon
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Github 297 6 months ago
Demo
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Github 1,164 about 1 month ago
Demo
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Github 3,557 7 months ago
Demo
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Github 2,726 4 months ago
Demo
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Github 1,470 about 2 months ago
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Github 756 10 months ago
Demo
PandaGPT: One Model To Instruction-Follow Them All
Github 753 over 1 year ago
Demo
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Github 46 about 1 year ago
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Github 502 8 months ago
DetGPT: Detect What You Need via Reasoning
Github 755 about 2 months ago
Demo
Pengi: An Audio Language Model for Audio Tasks
Github 282 6 months ago
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Github 856 23 days ago
Listen, Think, and Understand
Github 366 5 months ago
Demo
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Github 166 7 months ago
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Github 9,717 about 1 month ago
VideoChat: Chat-Centric Video Understanding
Github 2,993 about 1 month ago
Demo
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
Github 1,467 over 1 year ago
Demo
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Github 302 about 1 year ago
LMEye: An Interactive Perception Network for Large Language Models
Github 48 3 months ago
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Github 5,701 7 months ago
Demo
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Github 2,248 13 days ago
Demo
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Github 25,327 about 1 month ago
Visual Instruction Tuning
Github 19,505 about 2 months ago
Demo
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Github 5,701 7 months ago
Demo
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Github 133 over 1 year ago

Awesome Papers / Multimodal Hallucination

FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs
Link
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Github 48 about 2 months ago
Evaluating and Analyzing Relationship Hallucinations in LVLMs
Github 18 about 1 month ago
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Github 13 3 months ago
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Coming soon
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap
Coming soon
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models
Github 14 10 days ago
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
Debiasing Multimodal Large Language Models
Github 70 6 months ago
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Github 66 5 months ago
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective
Github 26 2 months ago
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models
Github 16 4 months ago
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs
Github 8 8 months ago
Unified Hallucination Detection for Multimodal Large Language Models
Github 49 6 months ago
A Survey on Hallucination in Large Vision-Language Models
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
Github 76 8 months ago
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations
Github 11 about 1 month ago
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites
Github 7 8 months ago
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Github 220 25 days ago
Demo
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Github 260 about 1 month ago
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Github 181 3 months ago
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Github 58 8 months ago
Coming soon
Mitigating Hallucination in Visual Language Models with Visual Supervision
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
Github 40 3 months ago
An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Github 89 9 months ago
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
Github 25 5 months ago
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Github 599 4 months ago
Demo
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption
Github 25 6 months ago
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
Github 129 5 months ago
Aligning Large Multimodal Models with Factually Augmented RLHF
Github 308 11 months ago
Demo
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning
Evaluation and Analysis of Hallucination in Large Vision-Language Models
Github 17 about 1 year ago
VIGC: Visual Instruction Generation and Correction
Github 87 8 months ago
Demo
Detecting and Preventing Hallucinations in Large Vision Language Models
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Github 246 7 months ago
Demo
Evaluating Object Hallucination in Large Vision-Language Models
Github 172 6 months ago

Awesome Papers / Multimodal In-Context Learning

Visual In-Context Learning for Large Vision-Language Models
Can MLLMs Perform Text-to-Image In-Context Learning?
Github 23 2 months ago
Generative Multimodal Models are In-Context Learners
Github 1,614 9 days ago
Demo
Hijacking Context in Large Multi-modal Models
Towards More Unified In-context Visual Understanding
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Github 324 10 months ago
Demo
Link-Context Learning for Multimodal LLMs
Github 80 5 months ago
Demo
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Github 3,670 about 1 month ago
Demo
Med-Flamingo: a Multimodal Medical Few-shot Learner
Github 375 about 1 year ago
Generative Pretraining in Multimodality
Github 1,614 9 days ago
Demo
AVIS: Autonomous Visual Information Seeking with Large Language Models
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Github 3,557 7 months ago
Demo
Exploring Diverse In-Context Configurations for Image Captioning
Github 28 3 months ago
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,080 10 months ago
Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
Github 23,575 10 days ago
Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 929 8 months ago
Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
Github 50 about 1 year ago
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering
Github 263 over 1 year ago
Visual Programming: Compositional visual reasoning without training
Github 684 about 1 month ago
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Github 83 over 2 years ago
Flamingo: a Visual Language Model for Few-Shot Learning
Github 3,670 about 1 month ago
Demo
Multimodal Few-Shot Learning with Frozen Language Models

Awesome Papers / Multimodal Chain-of-Thought

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
Github 64 5 months ago
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models
Github 97 3 months ago
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
Github 29 7 months ago
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Github 733 3 months ago
Demo
Explainable Multimodal Emotion Reasoning
Github 115 5 months ago
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Github 330 5 months ago
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Github 1,661 about 1 year ago
Demo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings
Coming soon
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,080 10 months ago
Demo
Chain of Thought Prompt Tuning in Vision Language Models
Coming soon
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 929 8 months ago
Demo
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Github 34,519 9 months ago
Demo
Multimodal Chain-of-Thought Reasoning in Language Models
Github 3,761 4 months ago
Visual Programming: Compositional visual reasoning without training
Github 684 about 1 month ago
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Github 587 17 days ago

Awesome Papers / LLM-Aided Visual Reasoning

Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
Github 12 6 months ago
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Github 507 9 months ago
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Github 346 2 months ago
Demo
MM-VID: Advancing Video Understanding with GPT-4V(ision)
ControlLLM: Augment Language Models with Tools by Searching on Graphs
Github 185 3 months ago
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Github 599 4 months ago
Demo
MindAgent: Emergent Gaming Interaction
Github 65 4 months ago
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
Github 348 10 months ago
Demo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Github 65 over 1 year ago
AVIS: Autonomous Visual Information Seeking with Large Language Models
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Github 756 10 months ago
Demo
Mindstorms in Natural Language-Based Societies of Mind
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
Github 284 6 months ago
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
Github 31 12 months ago
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Github 7 over 1 year ago
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Github 1,661 about 1 year ago
Demo
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,080 10 months ago
Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
Github 23,575 10 days ago
Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 929 8 months ago
Demo
ViperGPT: Visual Inference via Python Execution for Reasoning
Github 1,651 8 months ago
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Github 450 over 1 year ago
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Github 34,519 9 months ago
Demo
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
Github 37 over 1 year ago
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Github 9,717 about 1 month ago
Demo
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
Github 91 about 1 year ago
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
Github 223 about 1 year ago
Visual Programming: Compositional visual reasoning without training
Github 684 about 1 month ago
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Github 33,973 3 days ago

Awesome Papers / Foundation Models

Pixtral-12B
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Github 9,717 about 1 month ago
The Llama 3 Herd of Models
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Hello GPT-4o
The Claude 3 Model Family: Opus, Sonnet, Haiku
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini: A Family of Highly Capable Multimodal Models
Fuyu-8B: A Multimodal Architecture for AI Agents
Huggingface
Demo
Unified Model for Image, Video, Audio and Language Tasks
Github 223 10 months ago
Demo
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
GPT-4V(ision) System Card
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
Github 504 3 months ago
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Github 24 10 months ago
Generative Pretraining in Multimodality
Github 1,614 9 days ago
Demo
Kosmos-2: Grounding Multimodal Large Language Models to the World
Github 19,604 about 1 month ago
Demo
Transfer Visual Prompt Generator across LLMs
Github 269 12 months ago
Demo
GPT-4 Technical Report
PaLM-E: An Embodied Multimodal Language Model
Demo
Prismer: A Vision-Language Model with An Ensemble of Experts
Github 1,294 9 months ago
Demo
Language Is Not All You Need: Aligning Perception with Language Models
Github 19,604 about 1 month ago
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Github 9,717 about 1 month ago
Demo
VIMA: General Robot Manipulation with Multimodal Prompts
Github 760 6 months ago
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Github 1,761 7 months ago
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
Github 42 over 1 year ago
Language Models are General-Purpose Interfaces
Github 19,604 about 1 month ago

Awesome Papers / Evaluation

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Github 68 14 days ago
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Github 1 about 2 months ago
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Github 21 10 days ago
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Github 56 about 2 months ago
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Github 69 about 2 months ago
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Github 83 3 months ago
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Github 371 4 months ago
Benchmarking Large Multimodal Models against Common Corruptions
Github 27 9 months ago
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Github 280 8 months ago
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Github 12,008 11 days ago
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Github 81 about 2 months ago
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
Github 63 10 months ago
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Github 24 about 1 month ago
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
Github 50 about 2 months ago
VLM-Eval: A General Evaluation on Video Large Language Models
Coming soon
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
Github 53 6 months ago
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
Github 286 7 months ago
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging
An Early Evaluation of GPT-4V(ision)
Github 11 12 months ago
Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation
Github 117 11 months ago
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Github 226 9 days ago
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
Github 225 20 days ago
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
Github 14 11 months ago
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Github 20 7 months ago
Can We Edit Multimodal Large Language Models?
Github 1,790 15 days ago
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets
Github 11 12 months ago
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
TouchStone: Evaluating Vision-Language Models by Language Models
Github 75 9 months ago
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Github 41 4 months ago
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
Github 34 about 1 year ago
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
Github 451 6 months ago
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Github 254 about 1 month ago
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Github 309 3 months ago
MMBench: Is Your Multi-modal Model an All-around Player?
Github 149 about 1 month ago
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Github 12,008 11 days ago
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Github 451 6 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Github 297 6 months ago
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Github 91 over 1 year ago
On The Hidden Mystery of OCR in Large Multimodal Models
Github 444 14 days ago

Awesome Papers / Multimodal RLHF

Silkie: Preference Distillation for Large Visual Language Models
Github 75 10 months ago
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Github 220 25 days ago
Demo
Aligning Large Multimodal Models with Factually Augmented RLHF
Github 308 11 months ago
Demo

Awesome Papers / Others

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Github 38 about 2 months ago
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Github 258 6 months ago
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
Github 123 2 months ago
Planting a SEED of Vision in Large Language Model
Github 564 14 days ago
Can Large Pre-trained Models Help Vision Models on Perception Tasks?
Github 1,185 3 months ago
Contextual Object Detection with Multimodal Large Language Models
Github 182 over 1 year ago
Demo
Generating Images with Multimodal Language Models
Github 420 9 months ago
On Evaluating Adversarial Robustness of Large Vision-Language Models
Github 150 11 months ago
Grounding Language Models to Images for Multimodal Inputs and Outputs
Github 473 11 months ago
Demo

Awesome Datasets / Datasets of Pre-Training for Alignment

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
COYO-700M: Image-Text Pair Dataset
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Microsoft COCO: Common Objects in Context
Im2Text: Describing Images Using 1 Million Captioned Photographs
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Kosmos-2: Grounding Multimodal Large Language Models to the World
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Awesome Datasets / Datasets of Multimodal Instruction Tuning

UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Link 1 about 2 months ago
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Link 31 3 months ago
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Link
Visually Dehallucinative Instruction Generation: Know What You Don't Know
Link 6 8 months ago
Visually Dehallucinative Instruction Generation
Link 5 7 months ago
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Link 54 10 months ago
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Link
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Link
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Link 18 11 months ago
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Link 41 4 months ago
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
Link 90 10 months ago
Detecting and Preventing Hallucinations in Large Vision Language Models
Coming soon
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Link
SVIT: Scaling up Visual Instruction Tuning
Link
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Link 1,335 9 days ago
Visual Instruction Tuning with Polite Flamingo
Link
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
Link
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Link
MotionGPT: Human Motion as a Foreign Language
Link 1,458 6 months ago
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Link 246 7 months ago
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Link 1,531 4 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Link 297 6 months ago
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Link 1,164 about 1 month ago
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Link 3,557 7 months ago
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
Link
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Coming soon
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Link 756 10 months ago
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Coming soon
DetGPT: Detect What You Need via Reasoning
Link 755 about 2 months ago
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Coming soon
VideoChat: Chat-Centric Video Understanding
Link 1,319 13 days ago
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Link 302 about 1 year ago
LMEye: An Interactive Perception Network for Large Language Models
Link
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Link
Visual Instruction Tuning
Link
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Link 133 over 1 year ago

Awesome Datasets / Datasets of In-Context Learning

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Link
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Link 3,557 7 months ago

Awesome Datasets / Datasets of Multimodal Chain-of-Thought

Explainable Multimodal Emotion Reasoning
Coming soon
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Coming soon
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction
Coming soon
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Link 587 17 days ago

Awesome Datasets / Datasets of Multimodal RLHF

Silkie: Preference Distillation for Large Visual Language Models
Link

Awesome Datasets / Benchmarks for Evaluation

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Link
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Link
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Link 371 4 months ago
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning
Link 24 6 months ago
TempCompass: Do Video LLMs Really Understand Videos?
Link 76 about 1 month ago
Can MLLMs Perform Text-to-Image In-Context Learning?
Link
Visually Dehallucinative Instruction Generation: Know What You Don't Know
Link 6 8 months ago
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
Link 58 about 1 month ago
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Link 44 about 1 month ago
Benchmarking Large Multimodal Models against Common Corruptions
Link 27 9 months ago
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Link 280 8 months ago
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Link
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Link
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Link 54 10 months ago
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Link 116 9 months ago
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Link 24 about 1 month ago
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
Link 50 about 2 months ago
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Link
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
Link 77 13 days ago
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Link 2,993 about 1 month ago
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
Link 53 6 months ago
OtterHD: A High-Resolution Multi-modality Model
Link
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Link 226 9 days ago
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond
Link 98 7 months ago
Aligning Large Multimodal Models with Factually Augmented RLHF
Link
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
Link
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Link 41 4 months ago
Link-Context Learning for Multimodal LLMs
Link
Detecting and Preventing Hallucinations in Large Vision Language Models
Coming soon
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
Link 354 4 months ago
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
Link 34 about 1 year ago
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Link 254 about 1 month ago
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Link 309 3 months ago
MMBench: Is Your Multi-modal Model an All-around Player?
Link 149 about 1 month ago
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Link 227 about 1 year ago
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Link 246 7 months ago
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Link 12,008 11 days ago
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Link 451 6 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Link 297 6 months ago
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Link 91 over 1 year ago
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Link 2,248 13 days ago

Awesome Datasets / Others

IMAD: IMage-Augmented multi-modal Dialogue
Link 4 over 1 year ago
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Link 1,164 about 1 month ago
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Link
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Link
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Link
