Awesome Papers / Multimodal Instruction Tuning |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | | | |
Github | 2,329 | 12 days ago | |
Demo | | | |
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | | | |
Github | 151 | 11 days ago | |
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | | | |
Github | 503 | 17 days ago | |
Demo | | | |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | | | |
Github | 2,248 | 13 days ago | |
VITA: Towards Open-Source Interactive Omni Multimodal LLM | | | |
Github | 791 | 14 days ago | |
LLaVA-OneVision: Easy Visual Task Transfer | | | |
Github | 2,504 | 10 days ago | |
Demo | | | |
MiniCPM-V: A GPT-4V Level MLLM on Your Phone | | | |
Github | 12,047 | 22 days ago | |
Demo | | | |
VILA²: VILA Augmented VILA | | | |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | | | |
EVLM: An Efficient Vision-Language Model for Visual Understanding | | | |
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | | | |
Github | 2,467 | about 1 month ago | |
Demo | | | |
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | | | |
Github | 1,236 | 10 days ago | |
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | | | |
Github | 1,694 | 16 days ago | |
Long Context Transfer from Language to Vision | | | |
Github | 297 | about 1 month ago | |
Unveiling Encoder-Free Vision-Language Models | | | |
Github | 208 | 3 months ago | |
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | | | |
Github | 130 | 10 days ago | |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | | | |
Github | 750 | 23 days ago | |
Parrot: Multilingual Visual Instruction Tuning | | | |
Github | 25 | about 2 months ago | |
Ovis: Structural Embedding Alignment for Multimodal Large Language Model | | | |
Github | 321 | 17 days ago | |
Matryoshka Query Transformer for Large Vision-Language Models | | | |
Github | 91 | 3 months ago | |
Demo | | | |
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | | | |
Github | 100 | 2 months ago | |
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | | | |
Github | 100 | 4 months ago | |
Demo | | | |
Libra: Building Decoupled Vision System on Large Language Models | | | |
Github | 41 | 4 months ago | |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | | | |
Github | 132 | 4 months ago | |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | | |
Github | 5,616 | 17 days ago | |
Demo | | | |
Graphic Design with Large Multimodal Model | | | |
Github | 93 | 6 months ago | |
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | | | |
Github | 2,467 | about 1 month ago | |
Demo | | | |
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | | | |
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | | | |
Github | 215 | 3 months ago | |
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | | | |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | | | |
Github | 3,186 | 5 months ago | |
Demo | | | |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | | |
MoAI: Mixture of All Intelligence for Large Language and Vision Models | | | |
Github | 305 | 6 months ago | |
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | | | |
Github | 1,780 | 10 days ago | |
Demo | | | |
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | | | |
Github | 446 | about 2 months ago | |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | | |
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | | | |
Github | 746 | about 1 month ago | |
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | | | |
Github | 43 | 3 months ago | |
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
Github | 239 | 3 months ago | |
Demo | | | |
CoLLaVO: Crayon Large Language and Vision mOdel | | | |
Github | 87 | 3 months ago | |
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | | | |
Github | 146 | 3 months ago | |
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | | | |
Github | 990 | 6 months ago | |
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | | | |
Coming soon | | | |
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | | | |
Github | 19,505 | about 2 months ago | |
Demo | | | |
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | | | |
Github | 1,933 | 5 months ago | |
Demo | | | |
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | | | |
Github | 2,467 | about 1 month ago | |
Demo | | | |
Yi-VL | | | |
Github | 7,615 | 12 days ago | |
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | | | |
MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | | | |
Github | 990 | 6 months ago | |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | | | |
Github | 5,616 | 17 days ago | |
Demo | | | |
Osprey: Pixel Understanding with Visual Instruction Tuning | | | |
Github | 752 | 2 months ago | |
Demo | | | |
CogAgent: A Visual Language Model for GUI Agents | | | |
Github | 5,901 | 4 months ago | |
Coming soon | | | |
Pixel Aligned Language Models | | | |
Coming soon | | | |
See, Say, and Segment: Teaching LMMs to Overcome False Premises | | | |
Coming soon | | | |
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | | | |
Github | 1,758 | about 1 month ago | |
Demo | | | |
Honeybee: Locality-enhanced Projector for Multimodal LLM | | | |
Github | 415 | 5 months ago | |
Gemini: A Family of Highly Capable Multimodal Models | | | |
OneLLM: One Framework to Align All Modalities with Language | | | |
Github | 560 | 23 days ago | |
Demo | | | |
Lenna: Language Enhanced Reasoning Detection Assistant | | | |
Github | 77 | 8 months ago | |
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | | | |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
Github | 273 | 4 months ago | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Github | 282 | 3 months ago | |
Demo | | | |
Dolphins: Multimodal Language Model for Driving | | | |
Github | 29 | 3 months ago | |
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | | | |
Github | 227 | 3 months ago | |
Coming soon | | | |
VTimeLLM: Empower LLM to Grasp Video Moments | | | |
Github | 208 | 4 months ago | |
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | | | |
Github | 1,335 | 9 days ago | |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | | | |
Github | 693 | 2 months ago | |
Coming soon | | | |
LLMGA: Multimodal Large Language Model based Generation Assistant | | | |
Github | 451 | about 2 months ago | |
Demo | | | |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
Github | 180 | 10 months ago | |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
Github | 2,467 | about 1 month ago | |
Demo | | | |
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | | | |
Github | 116 | 3 months ago | |
An Embodied Generalist Agent in 3D World | | | |
Github | 341 | 2 months ago | |
Demo | | | |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | | | |
Github | 2,875 | 11 days ago | |
Demo | | | |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | | | |
Github | 765 | 3 months ago | |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
Github | 130 | 10 months ago | |
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | | | |
Github | 2,696 | 4 months ago | |
Demo | | | |
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | | | |
Github | 1,780 | 10 days ago | |
Demo | | | |
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | | | |
Github | 694 | 8 months ago | |
Demo | | | |
NExT-Chat: An LMM for Chat, Detection and Segmentation | | | |
Github | 205 | 8 months ago | |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | | | |
Github | 2,248 | 13 days ago | |
Demo | | | |
OtterHD: A High-Resolution Multi-modality Model | | | |
Github | 3,557 | 7 months ago | |
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | | | |
Coming soon | | | |
GLaMM: Pixel Grounding Large Multimodal Model | | | |
Github | 747 | 4 months ago | |
Demo | | | |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
Github | 18 | 11 months ago | |
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | | | |
Github | 25,327 | about 1 month ago | |
Ferret: Refer and Ground Anything Anywhere at Any Granularity | | | |
Github | 8,325 | 8 months ago | |
CogVLM: Visual Expert For Large Language Models | | | |
Github | 5,901 | 4 months ago | |
Demo | | | |
Improved Baselines with Visual Instruction Tuning | | | |
Github | 19,505 | about 2 months ago | |
Demo | | | |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | | | |
Github | 689 | 6 months ago | |
Demo | | | |
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | | | |
Github | 73 | 4 months ago | |
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | | | |
Github | 53 | 8 months ago | |
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | | | |
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | | | |
Github | 2,467 | about 1 month ago | |
DreamLLM: Synergistic Multimodal Comprehension and Creation | | | |
Github | 382 | 6 months ago | |
Coming soon | | | |
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | | | |
Coming soon | | | |
TextBind: Multi-turn Interleaved Multimodal Instruction-following | | | |
Github | 49 | about 1 year ago | |
Demo | | | |
NExT-GPT: Any-to-Any Multimodal LLM | | | |
Github | 3,217 | 9 months ago | |
Demo | | | |
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | | | |
Github | 19 | about 1 year ago | |
ImageBind-LLM: Multi-modality Instruction Tuning | | | |
Github | 5,701 | 7 months ago | |
Demo | | | |
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | | | |
PointLLM: Empowering Large Language Models to Understand Point Clouds | | | |
Github | 529 | 10 days ago | |
Demo | | | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Github | 41 | 4 months ago | |
MLLM-DataEngine: An Iterative Refinement Approach for MLLM | | | |
Github | 34 | 4 months ago | |
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | | | |
Github | 36 | about 1 year ago | |
Demo | | | |
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | | | |
Github | 4,857 | about 2 months ago | |
Demo | | | |
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | | | |
Github | 1,074 | 4 months ago | |
Demo | | | |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
Github | 90 | 10 months ago | |
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | | | |
Github | 263 | 6 months ago | |
Demo | | | |
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | | | |
Github | 354 | 4 months ago | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
Github | 446 | about 2 months ago | |
Demo | | | |
LISA: Reasoning Segmentation via Large Language Model | | | |
Github | 1,779 | 3 months ago | |
Demo | | | |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | | | |
Github | 495 | 29 days ago | |
3D-LLM: Injecting the 3D World into Large Language Models | | | |
Github | 911 | 4 months ago | |
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
Demo | | | |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
Github | 498 | about 1 year ago | |
Demo | | | |
SVIT: Scaling up Visual Instruction Tuning | | | |
Github | 159 | 4 months ago | |
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | | | |
Github | 497 | 4 months ago | |
Demo | | | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
Github | 227 | about 1 year ago | |
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
Github | 1,335 | 9 days ago | |
Demo | | | |
Visual Instruction Tuning with Polite Flamingo | | | |
Github | 63 | 10 months ago | |
Demo | | | |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
Github | 254 | 4 months ago | |
Demo | | | |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
Github | 733 | 3 months ago | |
Demo | | | |
MotionGPT: Human Motion as a Foreign Language | | | |
Github | 1,458 | 6 months ago | |
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
Github | 1,531 | 4 months ago | |
Coming soon | | | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Github | 297 | 6 months ago | |
Demo | | | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Github | 1,164 | about 1 month ago | |
Demo | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Github | 3,557 | 7 months ago | |
Demo | | | |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | | | |
Github | 2,726 | 4 months ago | |
Demo | | | |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
Github | 1,470 | about 2 months ago | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Github | 756 | 10 months ago | |
Demo | | | |
PandaGPT: One Model To Instruction-Follow Them All | | | |
Github | 753 | over 1 year ago | |
Demo | | | |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
Github | 46 | about 1 year ago | |
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | | | |
Github | 502 | 8 months ago | |
DetGPT: Detect What You Need via Reasoning | | | |
Github | 755 | about 2 months ago | |
Demo | | | |
Pengi: An Audio Language Model for Audio Tasks | | | |
Github | 282 | 6 months ago | |
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | | | |
Github | 856 | 23 days ago | |
Listen, Think, and Understand | | | |
Github | 366 | 5 months ago | |
Demo | | | |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
Github | 166 | 7 months ago | |
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | | | |
Github | 9,717 | about 1 month ago | |
VideoChat: Chat-Centric Video Understanding | | | |
Github | 2,993 | about 1 month ago | |
Demo | | | |
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | | | |
Github | 1,467 | over 1 year ago | |
Demo | | | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Github | 302 | about 1 year ago | |
LMEye: An Interactive Perception Network for Large Language Models | | | |
Github | 48 | 3 months ago | |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | | | |
Github | 5,701 | 7 months ago | |
Demo | | | |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
Github | 2,248 | 13 days ago | |
Demo | | | |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
Github | 25,327 | about 1 month ago | |
Visual Instruction Tuning | | | |
Github | 19,505 | about 2 months ago | |
Demo | | | |
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | | | |
Github | 5,701 | 7 months ago | |
Demo | | | |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
Github | 133 | over 1 year ago | |
Awesome Papers / Multimodal Hallucination |
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | | | |
Link | | | |
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | | | |
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | | | |
Github | 48 | about 2 months ago | |
Evaluating and Analyzing Relationship Hallucinations in LVLMs | | | |
Github | 18 | about 1 month ago | |
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | | | |
Github | 13 | 3 months ago | |
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | | | |
Coming soon | | | |
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | | | |
Coming soon | | | |
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | | | |
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | | | |
What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | | | |
Github | 14 | 10 days ago | |
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | | |
Debiasing Multimodal Large Language Models | | | |
Github | 70 | 6 months ago | |
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | | | |
Github | 66 | 5 months ago | |
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | | | |
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | | | |
Github | 26 | 2 months ago | |
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | | | |
Github | 16 | 4 months ago | |
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | | | |
Github | 8 | 8 months ago | |
Unified Hallucination Detection for Multimodal Large Language Models | | | |
Github | 49 | 6 months ago | |
A Survey on Hallucination in Large Vision-Language Models | | | |
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | | | |
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | | | |
Github | 76 | 8 months ago | |
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | | | |
Github | 11 | about 1 month ago | |
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | | | |
Github | 7 | 8 months ago | |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
Github | 220 | 25 days ago | |
Demo | | | |
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | | | |
Github | 260 | about 1 month ago | |
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | | | |
Github | 181 | 3 months ago | |
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | | | |
Github | 58 | 8 months ago | |
Coming soon | | | |
Mitigating Hallucination in Visual Language Models with Visual Supervision | | | |
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | | | |
Github | 40 | 3 months ago | |
An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | | | |
Github | 89 | 9 months ago | |
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | | | |
Github | 25 | 5 months ago | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
Github | 599 | 4 months ago | |
Demo | | | |
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | | | |
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | | | |
Github | 25 | 6 months ago | |
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | | | |
Github | 129 | 5 months ago | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Github | 308 | 11 months ago | |
Demo | | | |
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | | | |
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | | | |
Evaluation and Analysis of Hallucination in Large Vision-Language Models | | | |
Github | 17 | about 1 year ago | |
VIGC: Visual Instruction Generation and Correction | | | |
Github | 87 | 8 months ago | |
Demo | | | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Github | 246 | 7 months ago | |
Demo | | | |
Evaluating Object Hallucination in Large Vision-Language Models | | | |
Github | 172 | 6 months ago | |
Awesome Papers / Multimodal In-Context Learning |
Visual In-Context Learning for Large Vision-Language Models | | | |
Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
Github | 23 | 2 months ago | |
Generative Multimodal Models are In-Context Learners | | | |
Github | 1,614 | 9 days ago | |
Demo | | | |
Hijacking Context in Large Multi-modal Models | | | |
Towards More Unified In-context Visual Understanding | | | |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
Github | 324 | 10 months ago | |
Demo | | | |
Link-Context Learning for Multimodal LLMs | | | |
Github | 80 | 5 months ago | |
Demo | | | |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | | | |
Github | 3,670 | about 1 month ago | |
Demo | | | |
Med-Flamingo: a Multimodal Medical Few-shot Learner | | | |
Github | 375 | about 1 year ago | |
Generative Pretraining in Multimodality | | | |
Github | 1,614 | 9 days ago | |
Demo | | | |
AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Github | 3,557 | 7 months ago | |
Demo | | | |
Exploring Diverse In-Context Configurations for Image Captioning | | | |
Github | 28 | 3 months ago | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,080 | 10 months ago | |
Demo | | | |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
Github | 23,575 | 10 days ago | |
Demo | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 929 | 8 months ago | |
Demo | | | |
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
Github | 50 | about 1 year ago | |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | | | |
Github | 263 | over 1 year ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 684 | about 1 month ago | |
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | | | |
Github | 83 | over 2 years ago | |
Flamingo: a Visual Language Model for Few-Shot Learning | | | |
Github | 3,670 | about 1 month ago | |
Demo | | | |
Multimodal Few-Shot Learning with Frozen Language Models | | | |
Awesome Papers / Multimodal Chain-of-Thought |
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | | | |
Github | 64 | 5 months ago | |
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | | | |
Github | 97 | 3 months ago | |
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | | |
Github | 29 | 7 months ago | |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
Github | 733 | 3 months ago | |
Demo | | | |
Explainable Multimodal Emotion Reasoning | | | |
Github | 115 | 5 months ago | |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
Github | 330 | 5 months ago | |
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | | | |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
Github | 1,661 | about 1 year ago | |
Demo | | | |
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | | | |
Coming soon | | | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,080 | 10 months ago | |
Demo | | | |
Chain of Thought Prompt Tuning in Vision Language Models | | | |
Coming soon | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 929 | 8 months ago | |
Demo | | | |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
Github | 34,519 | 9 months ago | |
Demo | | | |
Multimodal Chain-of-Thought Reasoning in Language Models | | | |
Github | 3,761 | 4 months ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 684 | about 1 month ago | |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
Github | 587 | 17 days ago | |
Awesome Papers / LLM-Aided Visual Reasoning |
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models | | | |
Github | 12 | 6 months ago | |
V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs | | | |
Github | 507 | 9 months ago | |
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | | | |
Github | 346 | 2 months ago | |
Demo | | | |
MM-VID: Advancing Video Understanding with GPT-4V(vision) | | | |
ControlLLM: Augment Language Models with Tools by Searching on Graphs | | | |
Github | 185 | 3 months ago | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
Github | 599 | 4 months ago | |
Demo | | | |
MindAgent: Emergent Gaming Interaction | | | |
Github | 65 | 4 months ago | |
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | | | |
Github | 348 | 10 months ago | |
Demo | | | |
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | | |
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | | | |
Github | 65 | over 1 year ago | |
AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Github | 756 | 10 months ago | |
Demo | | | |
Mindstorms in Natural Language-Based Societies of Mind | | | |
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | | | |
Github | 284 | 6 months ago | |
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | | | |
Github | 31 | 12 months ago | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Github | 7 | over 1 year ago | |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
Github | 1,661 | about 1 year ago | |
Demo | | | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,080 | 10 months ago | |
Demo | | | |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
Github | 23,575 | 10 days ago | |
Demo | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 929 | 8 months ago | |
Demo | | | |
ViperGPT: Visual Inference via Python Execution for Reasoning | | | |
Github | 1,651 | 8 months ago | |
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | | | |
Github | 450 | over 1 year ago | |
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
Github | 34,519 | 9 months ago | |
Demo | | | |
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | | | |
Github | 37 | over 1 year ago | |
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | | | |
Github | 9,717 | about 1 month ago | |
Demo | | | |
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | | | |
Github | 91 | about 1 year ago | |
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | | | |
Github | 223 | about 1 year ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 684 | about 1 month ago | |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | | | |
Github | 33,973 | 2 days ago | |
Awesome Papers / Foundation Models |
Pixtral-12B | | | |
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | | | |
Github | 9,717 | about 1 month ago | |
The Llama 3 Herd of Models | | | |
Chameleon: Mixed-Modal Early-Fusion Foundation Models | | | |
Hello GPT-4o | | | |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | | |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | | | |
Gemini: A Family of Highly Capable Multimodal Models | | | |
Fuyu-8B: A Multimodal Architecture for AI Agents | | | |
Huggingface | | | |
Demo | | | |
Unified Model for Image, Video, Audio and Language Tasks | | | |
Github | 223 | 10 months ago | |
Demo | | | |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | | | |
GPT-4V(ision) System Card | | | |
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | | | |
Github | 504 | 3 months ago | |
Multimodal Foundation Models: From Specialists to General-Purpose Assistants | | | |
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | | | |
Github | 24 | 10 months ago | |
Generative Pretraining in Multimodality | | | |
Github | 1,614 | 9 days ago | |
Demo | | | |
Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
Github | 19,604 | about 1 month ago | |
Demo | | | |
Transfer Visual Prompt Generator across LLMs | | | |
Github | 269 | 12 months ago | |
Demo | | | |
GPT-4 Technical Report | | | |
PaLM-E: An Embodied Multimodal Language Model | | | |
Demo | | | |
Prismer: A Vision-Language Model with An Ensemble of Experts | | | |
Github | 1,294 | 9 months ago | |
Demo | | | |
Language Is Not All You Need: Aligning Perception with Language Models | | | |
Github | 19,604 | about 1 month ago | |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | | | |
Github | 9,717 | about 1 month ago | |
Demo | | | |
VIMA: General Robot Manipulation with Multimodal Prompts | | | |
Github | 760 | 6 months ago | |
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | | | |
Github | 1,761 | 7 months ago | |
Write and Paint: Generative Vision-Language Models are Unified Modal Learners | | | |
Github | 42 | over 1 year ago | |
Language Models are General-Purpose Interfaces | | | |
Github | 19,604 | about 1 month ago | |
Awesome Papers / Evaluation |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
Github | 68 | 13 days ago | |
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
Github | 1 | about 2 months ago | |
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | | | |
Github | 21 | 10 days ago | |
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | | | |
Github | 56 | about 2 months ago | |
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
Github | 69 | about 2 months ago | |
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | | | |
Github | 83 | 3 months ago | |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
Github | 371 | 4 months ago | |
Benchmarking Large Multimodal Models against Common Corruptions | | | |
Github | 27 | 9 months ago | |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
Github | 280 | 8 months ago | |
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | | | |
Github | 12,008 | 11 days ago | |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
Github | 81 | about 2 months ago | |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | | | |
Github | 63 | 10 months ago | |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
Github | 24 | about 1 month ago | |
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
Github | 50 | about 2 months ago | |
VLM-Eval: A General Evaluation on Video Large Language Models | | | |
Coming soon | | | |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
Github | 53 | 6 months ago | |
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving | | | |
Github | 286 | 7 months ago | |
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead | | | |
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging | | | |
An Early Evaluation of GPT-4V(ision) | | | |
Github | 11 | 12 months ago | |
Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation | | | |
Github | 117 | 11 months ago | |
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
Github | 226 | 9 days ago | |
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
Github | 225 | 20 days ago | |
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | | | |
Github | 14 | 11 months ago | |
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | | | |
Github | 20 | 7 months ago | |
Can We Edit Multimodal Large Language Models? | | | |
Github | 1,790 | 15 days ago | |
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets | | | |
Github | 11 | 12 months ago | |
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | | | |
TouchStone: Evaluating Vision-Language Models by Language Models | | | |
Github | 75 | 9 months ago | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Github | 41 | 4 months ago | |
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
Github | 34 | about 1 year ago | |
Tiny LVLM-eHub: Early Multimodal Experiments with Bard | | | |
Github | 451 | 6 months ago | |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
Github | 254 | about 1 month ago | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
Github | 309 | 3 months ago | |
MMBench: Is Your Multi-modal Model an All-around Player? | | | |
Github | 149 | about 1 month ago | |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
Github | 12,008 | 11 days ago | |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
Github | 451 | 6 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Github | 297 | 6 months ago | |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
Github | 91 | over 1 year ago | |
On the Hidden Mystery of OCR in Large Multimodal Models | | | |
Github | 444 | 14 days ago | |
Awesome Papers / Multimodal RLHF |
Silkie: Preference Distillation for Large Visual Language Models | | | |
Github | 75 | 10 months ago | |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
Github | 220 | 25 days ago | |
Demo | | | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Github | 308 | 11 months ago | |
Demo | | | |
Awesome Papers / Others |
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | | | |
Github | 38 | about 2 months ago | |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | | | |
Github | 258 | 6 months ago | |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | | | |
Github | 123 | 2 months ago | |
Planting a SEED of Vision in Large Language Model | | | |
Github | 564 | 14 days ago | |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | | | |
Github | 1,185 | 3 months ago | |
Contextual Object Detection with Multimodal Large Language Models | | | |
Github | 182 | over 1 year ago | |
Demo | | | |
Generating Images with Multimodal Language Models | | | |
Github | 420 | 9 months ago | |
On Evaluating Adversarial Robustness of Large Vision-Language Models | | | |
Github | 150 | 11 months ago | |
Grounding Language Models to Images for Multimodal Inputs and Outputs | | | |
Github | 473 | 11 months ago | |
Demo | | | |
Awesome Datasets / Datasets of Pre-Training for Alignment |
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | | | |
COYO-700M: Image-Text Pair Dataset | 1,142 | almost 2 years ago | |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | | | |
Microsoft COCO: Common Objects in Context | | | |
Im2Text: Describing Images Using 1 Million Captioned Photographs | | | |
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | | | |
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | | | |
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | | | |
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | | | |
AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding | | | |
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | | | |
Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | | | |
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | | | |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | | | |
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | | | |
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline | | | |
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | | | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Awesome Datasets / Datasets of Multimodal Instruction Tuning |
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
Link | 1 | about 2 months ago | |
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | | | |
Link | 31 | 3 months ago | |
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
Link | | | |
Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
Link | 6 | 8 months ago | |
Visually Dehallucinative Instruction Generation | | | |
Link | 5 | 7 months ago | |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
Link | 54 | 10 months ago | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Link | | | |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
Link | | | |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
Link | 18 | 11 months ago | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Link | 41 | 4 months ago | |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
Link | 90 | 10 months ago | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Coming soon | | | |
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
Link | | | |
SVIT: Scaling up Visual Instruction Tuning | | | |
Link | | | |
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
Link | 1,335 | 9 days ago | |
Visual Instruction Tuning with Polite Flamingo | | | |
Link | | | |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
Link | | | |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
Link | | | |
MotionGPT: Human Motion as a Foreign Language | | | |
Link | 1,458 | 6 months ago | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Link | 246 | 7 months ago | |
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
Link | 1,531 | 4 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Link | 297 | 6 months ago | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Link | 1,164 | about 1 month ago | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Link | 3,557 | 7 months ago | |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
Link | | | |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
Coming soon | 1,470 | about 2 months ago | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Link | 756 | 10 months ago | |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
Coming soon | | | |
DetGPT: Detect What You Need via Reasoning | | | |
Link | 755 | about 2 months ago | |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
Coming soon | | | |
VideoChat: Chat-Centric Video Understanding | | | |
Link | 1,319 | 13 days ago | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Link | 302 | about 1 year ago | |
LMEye: An Interactive Perception Network for Large Language Models | | | |
Link | | | |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
Link | | | |
Visual Instruction Tuning | | | |
Link | | | |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
Link | 133 | over 1 year ago | |
Awesome Datasets / Datasets of In-Context Learning |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
Link | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Link | 3,557 | 7 months ago | |
Awesome Datasets / Datasets of Multimodal Chain-of-Thought |
Explainable Multimodal Emotion Reasoning | | | |
Coming soon | 115 | 5 months ago | |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
Coming soon | 330 | 5 months ago | |
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
Coming soon | | | |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
Link | 587 | 17 days ago | |
Awesome Datasets / Datasets of Multimodal RLHF |
Silkie: Preference Distillation for Large Visual Language Models | | | |
Link | | | |
Awesome Datasets / Benchmarks for Evaluation |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
Link | | | |
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
Link | | | |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
Link | 371 | 4 months ago | |
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning | | | |
Link | 24 | 6 months ago | |
TempCompass: Do Video LLMs Really Understand Videos? | | | |
Link | 76 | about 1 month ago | |
Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
Link | | | |
Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
Link | 6 | 8 months ago | |
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | | | |
Link | 58 | about 1 month ago | |
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | | | |
Link | 44 | about 1 month ago | |
Benchmarking Large Multimodal Models against Common Corruptions | | | |
Link | 27 | 9 months ago | |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
Link | 280 | 8 months ago | |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
Link | | | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Link | | | |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
Link | 54 | 10 months ago | |
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | | | |
Link | 116 | 9 months ago | |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
Link | 24 | about 1 month ago | |
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
Link | 50 | about 2 months ago | |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
Link | | | |
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | | | |
Link | 77 | 13 days ago | |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | | | |
Link | 2,993 | about 1 month ago | |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
Link | 53 | 6 months ago | |
OtterHD: A High-Resolution Multi-modality Model | | | |
Link | | | |
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
Link | 226 | 9 days ago | |
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | | | |
Link | 98 | 7 months ago | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Link | | | |
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
Link | | | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Link | 41 | 4 months ago | |
Link-Context Learning for Multimodal LLMs | | | |
Link | | | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Coming soon | | | |
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | | | |
Link | 354 | 4 months ago | |
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
Link | 34 | about 1 year ago | |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
Link | 254 | about 1 month ago | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
Link | 309 | 3 months ago | |
MMBench: Is Your Multi-modal Model an All-around Player? | | | |
Link | 149 | about 1 month ago | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
Link | 227 | about 1 year ago | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Link | 246 | 7 months ago | |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
Link | 12,008 | 11 days ago | |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
Link | 451 | 6 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Link | 297 | 6 months ago | |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
Link | 91 | over 1 year ago | |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
Link | 2,248 | 13 days ago | |
Awesome Datasets / Others |
IMAD: IMage-Augmented multi-modal Dialogue | | | |
Link | 4 | over 1 year ago | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Link | 1,164 | about 1 month ago | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Link | | | |
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | | | |
Link | | | |
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | | | |
Link | | | |