Awesome Papers / Multimodal Instruction Tuning |
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | | | |
Github | 270 | 16 days ago | |
Demo | | | |
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | | | |
Github | 80 | 24 days ago | |
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | | | |
Huggingface | | | |
Demo | | | |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | | | |
Github | 3,093 | about 2 months ago | |
Demo | | | |
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | | | |
Github | 179 | about 1 month ago | |
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | | | |
Github | 539 | 2 months ago | |
Demo | | | |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | | | |
Github | 2,321 | about 1 month ago | |
VITA: Towards Open-Source Interactive Omni Multimodal LLM | | | |
Github | 961 | 28 days ago | |
LLaVA-OneVision: Easy Visual Task Transfer | | | |
Github | 2,872 | about 1 month ago | |
Demo | | | |
MiniCPM-V: A GPT-4V Level MLLM on Your Phone | | | |
Github | 12,619 | about 1 month ago | |
Demo | | | |
VILA^2: VILA Augmented VILA | | | |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | | | |
EVLM: An Efficient Vision-Language Model for Visual Understanding | | | |
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | | | |
Github | 25 | about 1 month ago | |
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | | | |
Github | 2,521 | about 1 month ago | |
Demo | | | |
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | | | |
Github | 1,300 | about 2 months ago | |
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | | | |
Github | 1,759 | 22 days ago | |
Long Context Transfer from Language to Vision | | | |
Github | 334 | 26 days ago | |
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | | | |
Github | 1,053 | 15 days ago | |
Unveiling Encoder-Free Vision-Language Models | | | |
Github | 230 | about 2 months ago | |
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | | | |
Github | 53 | about 1 month ago | |
Demo | | | |
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | | | |
Github | 137 | 14 days ago | |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | | | |
Github | 871 | 8 days ago | |
Parrot: Multilingual Visual Instruction Tuning | | | |
Github | 30 | 3 months ago | |
Ovis: Structural Embedding Alignment for Multimodal Large Language Model | | | |
Github | 517 | 17 days ago | |
Matryoshka Query Transformer for Large Vision-Language Models | | | |
Github | 97 | 5 months ago | |
Demo | | | |
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | | | |
Github | 104 | 4 months ago | |
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | | | |
Github | 102 | 6 months ago | |
Demo | | | |
Libra: Building Decoupled Vision System on Large Language Models | | | |
Github | 143 | about 1 month ago | |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | | | |
Github | 134 | 6 months ago | |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | | |
Github | 6,014 | 6 days ago | |
Demo | | | |
Graphic Design with Large Multimodal Model | | | |
Github | 98 | 7 months ago | |
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | | | |
Github | 2,521 | about 1 month ago | |
Demo | | | |
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | | | |
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | | | |
Github | 244 | 4 months ago | |
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | | | |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | | | |
Github | 3,211 | 7 months ago | |
Demo | | | |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | | |
MoAI: Mixture of All Intelligence for Large Language and Vision Models | | | |
Github | 311 | 8 months ago | |
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | | | |
Github | 1,825 | 9 days ago | |
Demo | | | |
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | | | |
Github | 459 | 3 months ago | |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | | |
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | | | |
Github | 779 | 3 months ago | |
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | | | |
Github | 54 | 5 months ago | |
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
Github | 246 | 5 months ago | |
Demo | | | |
CoLLaVO: Crayon Large Language and Vision mOdel | | | |
Github | 93 | 5 months ago | |
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | | | |
Github | 152 | 5 months ago | |
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | | | |
Github | 1,039 | 7 months ago | |
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | | | |
Github | 40 | 6 days ago | |
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | | | |
Coming soon | | | |
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | | | |
Github | 20,232 | 3 months ago | |
Demo | | | |
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | | | |
Github | 1,980 | 6 months ago | |
Demo | | | |
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | | | |
Github | 2,521 | about 1 month ago | |
Demo | | | |
Yi-VL | | | |
Github | 7,699 | 11 days ago | |
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | | | |
ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | | | |
Github | 107 | 3 months ago | |
MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | | | |
Github | 1,039 | 7 months ago | |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | | | |
Github | 6,014 | 6 days ago | |
Demo | | | |
Osprey: Pixel Understanding with Visual Instruction Tuning | | | |
Github | 770 | 4 months ago | |
Demo | | | |
CogAgent: A Visual Language Model for GUI Agents | | | |
Github | 6,080 | 6 months ago | |
Coming soon | | | |
Pixel Aligned Language Models | | | |
Coming soon | | | |
See, Say, and Segment: Teaching LMMs to Overcome False Premises | | | |
Coming soon | | | |
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | | | |
Github | 1,817 | about 2 months ago | |
Demo | | | |
Honeybee: Locality-enhanced Projector for Multimodal LLM | | | |
Github | 432 | 7 months ago | |
Gemini: A Family of Highly Capable Multimodal Models | | | |
OneLLM: One Framework to Align All Modalities with Language | | | |
Github | 588 | about 1 month ago | |
Demo | | | |
Lenna: Language Enhanced Reasoning Detection Assistant | | | |
Github | 78 | 10 months ago | |
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | | | |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
Github | 286 | 6 months ago | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Github | 294 | 4 months ago | |
Demo | | | |
Dolphins: Multimodal Language Model for Driving | | | |
Github | 42 | 4 months ago | |
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | | | |
Github | 248 | 4 months ago | |
Coming soon | | | |
VTimeLLM: Empower LLM to Grasp Video Moments | | | |
Github | 225 | 5 months ago | |
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | | | |
Github | 1,563 | about 2 months ago | |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | | | |
Github | 733 | 4 months ago | |
Coming soon | | | |
LLMGA: Multimodal Large Language Model based Generation Assistant | | | |
Github | 461 | 3 months ago | |
Demo | | | |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
Github | 196 | 12 months ago | |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
Github | 2,521 | about 1 month ago | |
Demo | | | |
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | | | |
Github | 121 | 4 months ago | |
An Embodied Generalist Agent in 3D World | | | |
Github | 365 | about 1 month ago | |
Demo | | | |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | | | |
Github | 2,990 | about 2 months ago | |
Demo | | | |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | | | |
Github | 847 | about 1 month ago | |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
Github | 131 | 11 months ago | |
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | | | |
Github | 2,720 | 6 months ago | |
Demo | | | |
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | | | |
Github | 1,825 | 9 days ago | |
Demo | | | |
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | | | |
Github | 704 | 10 months ago | |
Demo | | | |
NExT-Chat: An LMM for Chat, Detection and Segmentation | | | |
Github | 217 | 10 months ago | |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | | | |
Github | 2,321 | about 1 month ago | |
Demo | | | |
OtterHD: A High-Resolution Multi-modality Model | | | |
Github | 3,563 | 9 months ago | |
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | | | |
Coming soon | | | |
GLaMM: Pixel Grounding Large Multimodal Model | | | |
Github | 781 | 6 months ago | |
Demo | | | |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
Github | 18 | about 1 year ago | |
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | | | |
Github | 25,422 | 3 months ago | |
SALMONN: Towards Generic Hearing Abilities for Large Language Models | | | |
Github | 1,053 | 15 days ago | |
Ferret: Refer and Ground Anything Anywhere at Any Granularity | | | |
Github | 8,476 | about 1 month ago | |
CogVLM: Visual Expert For Large Language Models | | | |
Github | 6,080 | 6 months ago | |
Demo | | | |
Improved Baselines with Visual Instruction Tuning | | | |
Github | 20,232 | 3 months ago | |
Demo | | | |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | | | |
Github | 723 | 8 months ago | |
Demo | | | |
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | | | |
Github | 76 | 5 months ago | |
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | | | |
Github | 57 | 10 months ago | |
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | | | |
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | | | |
Github | 2,521 | about 1 month ago | |
DreamLLM: Synergistic Multimodal Comprehension and Creation | | | |
Github | 394 | 7 months ago | |
Coming soon | | | |
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | | | |
Coming soon | | | |
TextBind: Multi-turn Interleaved Multimodal Instruction-following | | | |
Github | 48 | about 1 year ago | |
Demo | | | |
NExT-GPT: Any-to-Any Multimodal LLM | | | |
Github | 3,303 | 19 days ago | |
Demo | | | |
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | | | |
Github | 19 | about 1 year ago | |
ImageBind-LLM: Multi-modality Instruction Tuning | | | |
Github | 5,754 | 8 months ago | |
Demo | | | |
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | | | |
PointLLM: Empowering Large Language Models to Understand Point Clouds | | | |
Github | 647 | 23 days ago | |
Demo | | | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Github | 41 | 5 months ago | |
MLLM-DataEngine: An Iterative Refinement Approach for MLLM | | | |
Github | 36 | 6 months ago | |
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | | | |
Github | 36 | about 1 year ago | |
Demo | | | |
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | | | |
Github | 5,045 | 4 months ago | |
Demo | | | |
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | | | |
Github | 1,089 | 5 months ago | |
Demo | | | |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
Github | 91 | 11 months ago | |
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | | | |
Github | 269 | 7 months ago | |
Demo | | | |
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | | | |
Github | 356 | 6 months ago | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
Github | 459 | 3 months ago | |
Demo | | | |
LISA: Reasoning Segmentation via Large Language Model | | | |
Github | 1,861 | 5 months ago | |
Demo | | | |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | | | |
Github | 525 | 23 days ago | |
3D-LLM: Injecting the 3D World into Large Language Models | | | |
Github | 961 | 6 months ago | |
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
Demo | | | |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
Github | 502 | over 1 year ago | |
Demo | | | |
SVIT: Scaling up Visual Instruction Tuning | | | |
Github | 163 | 5 months ago | |
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | | | |
Github | 506 | 5 months ago | |
Demo | | | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
Github | 229 | over 1 year ago | |
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
Github | 1,563 | about 2 months ago | |
Demo | | | |
Visual Instruction Tuning with Polite Flamingo | | | |
Github | 63 | 12 months ago | |
Demo | | | |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
Github | 258 | 5 months ago | |
Demo | | | |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
Github | 744 | 5 months ago | |
Demo | | | |
MotionGPT: Human Motion as a Foreign Language | | | |
Github | 1,505 | 8 months ago | |
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
Github | 1,550 | 5 months ago | |
Coming soon | | | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Github | 301 | 7 months ago | |
Demo | | | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Github | 1,213 | 3 months ago | |
Demo | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Github | 3,563 | 9 months ago | |
Demo | | | |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | | | |
Github | 2,802 | 6 months ago | |
Demo | | | |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
Github | 1,556 | 3 months ago | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Github | 760 | 11 months ago | |
Demo | | | |
PandaGPT: One Model To Instruction-Follow Them All | | | |
Github | 764 | over 1 year ago | |
Demo | | | |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
Github | 47 | about 1 year ago | |
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | | | |
Github | 508 | 10 months ago | |
DetGPT: Detect What You Need via Reasoning | | | |
Github | 755 | 4 months ago | |
Demo | | | |
Pengi: An Audio Language Model for Audio Tasks | | | |
Github | 290 | 7 months ago | |
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | | | |
Github | 915 | about 1 month ago | |
Listen, Think, and Understand | | | |
Github | 385 | 7 months ago | |
Demo | | | |
Github | 4,094 | 3 months ago | |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
Github | 174 | 8 months ago | |
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | | | |
Github | 9,926 | about 1 month ago | |
VideoChat: Chat-Centric Video Understanding | | | |
Github | 3,068 | 3 months ago | |
Demo | | | |
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | | | |
Github | 1,477 | over 1 year ago | |
Demo | | | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Github | 306 | over 1 year ago | |
LMEye: An Interactive Perception Network for Large Language Models | | | |
Github | 48 | 4 months ago | |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | | | |
Github | 5,754 | 8 months ago | |
Demo | | | |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
Github | 2,321 | about 1 month ago | |
Demo | | | |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
Github | 25,422 | 3 months ago | |
Visual Instruction Tuning | | | |
Github | 20,232 | 3 months ago | |
Demo | | | |
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | | | |
Github | 5,754 | 8 months ago | |
Demo | | | |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
Github | 133 | over 1 year ago | |
Awesome Papers / Multimodal Hallucination |
Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | | | |
Github | 27 | 7 days ago | |
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | | | |
Github | 31 | 6 days ago | |
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | | | |
Link | | | |
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | | | |
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | | | |
Github | 67 | 15 days ago | |
Evaluating and Analyzing Relationship Hallucinations in LVLMs | | | |
Github | 20 | about 1 month ago | |
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | | | |
Github | 15 | 4 months ago | |
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | | | |
Coming soon | | | |
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | | | |
Coming soon | | | |
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | | | |
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | | | |
What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | | | |
Github | 15 | about 2 months ago | |
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | | |
Debiasing Multimodal Large Language Models | | | |
Github | 71 | 8 months ago | |
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | | | |
Github | 69 | 6 months ago | |
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | | | |
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | | | |
Github | 31 | 24 days ago | |
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | | | |
Github | 16 | 5 months ago | |
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | | | |
Github | 8 | 10 months ago | |
Unified Hallucination Detection for Multimodal Large Language Models | | | |
Github | 48 | 7 months ago | |
A Survey on Hallucination in Large Vision-Language Models | | | |
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | | | |
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | | | |
Github | 79 | 10 months ago | |
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | | | |
Github | 12 | about 1 month ago | |
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | | | |
Github | 7 | 10 months ago | |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
Github | 233 | 2 months ago | |
Demo | | | |
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | | | |
Github | 287 | 3 months ago | |
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | | | |
Github | 209 | about 2 months ago | |
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | | | |
Github | 65 | 10 months ago | |
Coming soon | | | |
Mitigating Hallucination in Visual Language Models with Visual Supervision | | | |
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | | | |
Github | 41 | 4 months ago | |
An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | | | |
Github | 93 | 10 months ago | |
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | | | |
Github | 25 | 13 days ago | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
Github | 611 | 5 months ago | |
Demo | | | |
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | | | |
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | | | |
Github | 28 | 8 months ago | |
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | | | |
Github | 134 | 7 months ago | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Github | 319 | about 1 year ago | |
Demo | | | |
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | | | |
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | | | |
Evaluation and Analysis of Hallucination in Large Vision-Language Models | | | |
Github | 17 | about 1 year ago | |
VIGC: Visual Instruction Generation and Correction | | | |
Github | 90 | 10 months ago | |
Demo | | | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Github | 255 | 8 months ago | |
Demo | | | |
Evaluating Object Hallucination in Large Vision-Language Models | | | |
Github | 179 | 8 months ago | |
Awesome Papers / Multimodal In-Context Learning |
Visual In-Context Learning for Large Vision-Language Models | | | |
Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
Github | 28 | 13 days ago | |
Generative Multimodal Models are In-Context Learners | | | |
Github | 1,659 | about 2 months ago | |
Demo | | | |
Hijacking Context in Large Multi-modal Models | | | |
Towards More Unified In-context Visual Understanding | | | |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
Github | 334 | 11 months ago | |
Demo | | | |
Link-Context Learning for Multimodal LLMs | | | |
Github | 89 | 6 months ago | |
Demo | | | |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | | | |
Github | 3,742 | 3 months ago | |
Demo | | | |
Med-Flamingo: a Multimodal Medical Few-shot Learner | | | |
Github | 384 | about 1 year ago | |
Generative Pretraining in Multimodality | | | |
Github | 1,659 | about 2 months ago | |
Demo | | | |
AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Github | 3,563 | 9 months ago | |
Demo | | | |
Exploring Diverse In-Context Configurations for Image Captioning | | | |
Github | 27 | 5 months ago | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,087 | 11 months ago | |
Demo | | | |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
Github | 23,712 | about 2 months ago | |
Demo | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 933 | 10 months ago | |
Demo | | | |
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
Github | 50 | over 1 year ago | |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | | | |
Github | 267 | over 1 year ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 693 | 3 months ago | |
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | | | |
Github | 84 | over 2 years ago | |
Flamingo: a Visual Language Model for Few-Shot Learning | | | |
Github | 3,742 | 3 months ago | |
Demo | | | |
Multimodal Few-Shot Learning with Frozen Language Models | | | |
Awesome Papers / Multimodal Chain-of-Thought |
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | | | |
Github | 68 | 7 months ago | |
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | | | |
Github | 134 | about 1 month ago | |
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | | |
Github | 33 | 8 months ago | |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
Github | 744 | 5 months ago | |
Demo | | | |
Explainable Multimodal Emotion Reasoning | | | |
Github | 119 | 7 months ago | |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
Github | 340 | 7 months ago | |
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | | | |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
Github | 1,682 | about 1 year ago | |
Demo | | | |
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | | | |
Coming soon | | | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,087 | 11 months ago | |
Demo | | | |
Chain of Thought Prompt Tuning in Vision Language Models | | | |
Coming soon | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 933 | 10 months ago | |
Demo | | | |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
Github | 34,551 | 11 months ago | |
Demo | | | |
Multimodal Chain-of-Thought Reasoning in Language Models | | | |
Github | 3,810 | 5 months ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 693 | 3 months ago | |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
Github | 606 | 2 months ago | |
Awesome Papers / LLM-Aided Visual Reasoning |
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models | | | |
Github | 14 | about 1 month ago | |
V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs | | | |
Github | 527 | 11 months ago | |
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | | | |
Github | 351 | 4 months ago | |
Demo | | | |
MM-VID: Advancing Video Understanding with GPT-4V(ision) | | | |
ControlLLM: Augment Language Models with Tools by Searching on Graphs | | | |
Github | 186 | 4 months ago | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
Github | 611 | 5 months ago | |
Demo | | | |
MindAgent: Emergent Gaming Interaction | | | |
Github | 74 | 5 months ago | |
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | | | |
Github | 351 | 12 months ago | |
Demo | | | |
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | | |
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | | | |
Github | 65 | over 1 year ago | |
AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Github | 760 | 11 months ago | |
Demo | | | |
Mindstorms in Natural Language-Based Societies of Mind | | | |
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | | | |
Github | 300 | 8 months ago | |
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | | | |
Github | 32 | about 1 year ago | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Github | 7 | over 1 year ago | |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
Github | 1,682 | about 1 year ago | |
Demo | | | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,087 | 11 months ago | |
Demo | | | |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
Github | 23,712 | about 2 months ago | |
Demo | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 933 | 10 months ago | |
Demo | | | |
ViperGPT: Visual Inference via Python Execution for Reasoning | | | |
Github | 1,660 | 10 months ago | |
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | | | |
Github | 452 | over 1 year ago | |
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
Github | 34,551 | 11 months ago | |
Demo | | | |
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | | | |
Github | 40 | over 1 year ago | |
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | | | |
Github | 9,926 | about 1 month ago | |
Demo | | | |
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | | | |
Github | 94 | about 1 year ago | |
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | | | |
Github | 228 | about 1 year ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 693 | 3 months ago | |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | | | |
Github | 34,295 | 6 days ago | |
Awesome Papers / Foundation Models |
Emu3: Next-Token Prediction is All You Need | | | |
Github | 1,820 | 28 days ago | |
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models | | | |
Demo | | | |
Pixtral-12B | | | |
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | | | |
Github | 9,926 | about 1 month ago | |
The Llama 3 Herd of Models | | | |
Chameleon: Mixed-Modal Early-Fusion Foundation Models | | | |
Hello GPT-4o | | | |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | | |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | | | |
Gemini: A Family of Highly Capable Multimodal Models | | | |
Fuyu-8B: A Multimodal Architecture for AI Agents | | | |
Huggingface | | | |
Demo | | | |
Unified Model for Image, Video, Audio and Language Tasks | | | |
Github | 224 | 11 months ago | |
Demo | | | |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | | | |
GPT-4V(ision) System Card | | | |
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | | | |
Github | 528 | about 2 months ago | |
Multimodal Foundation Models: From Specialists to General-Purpose Assistants | | | |
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | | | |
Github | 24 | 12 months ago | |
Generative Pretraining in Multimodality | | | |
Github | 1,659 | about 2 months ago | |
Demo | | | |
Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
Github | 20,176 | 12 days ago | |
Demo | | | |
Transfer Visual Prompt Generator across LLMs | | | |
Github | 269 | about 1 year ago | |
Demo | | | |
GPT-4 Technical Report | | | |
PaLM-E: An Embodied Multimodal Language Model | | | |
Demo | | | |
Prismer: A Vision-Language Model with An Ensemble of Experts | | | |
Github | 1,298 | 10 months ago | |
Demo | | | |
Language Is Not All You Need: Aligning Perception with Language Models | | | |
Github | 20,176 | 12 days ago | |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | | | |
Github | 9,926 | about 1 month ago | |
Demo | | | |
VIMA: General Robot Manipulation with Multimodal Prompts | | | |
Github | 774 | 7 months ago | |
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | | | |
Github | 1,816 | 8 months ago | |
Write and Paint: Generative Vision-Language Models are Unified Modal Learners | | | |
Github | 43 | over 1 year ago | |
Language Models are General-Purpose Interfaces | | | |
Github | 20,176 | 12 days ago | |
Awesome Papers / Evaluation |
OmniBench: Towards The Future of Universal Omni-Language Models | | | |
Github | 14 | 16 days ago | |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
Github | 78 | 7 days ago | |
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
Github | 2 | 3 months ago | |
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | | | |
Github | 22 | about 2 months ago | |
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | | | |
Github | 62 | 29 days ago | |
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
Github | 75 | about 1 month ago | |
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | | | |
Github | 94 | 4 months ago | |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
Github | 406 | 5 months ago | |
Benchmarking Large Multimodal Models against Common Corruptions | | | |
Github | 27 | 10 months ago | |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
Github | 288 | 10 months ago | |
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | | | |
Github | 12,711 | 2 days ago | |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
Github | 83 | 3 months ago | |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | | | |
Github | 67 | 12 months ago | |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
Github | 24 | 3 months ago | |
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
Github | 55 | about 1 month ago | |
VLM-Eval: A General Evaluation on Video Large Language Models | | | |
Coming soon | | | |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
Github | 53 | 8 months ago | |
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving | | | |
Github | 287 | 8 months ago | |
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead | | | |
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging | | | |
An Early Evaluation of GPT-4V(ision) | | | |
Github | 11 | about 1 year ago | |
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation | | | |
Github | 120 | about 1 year ago | |
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
Github | 243 | 8 days ago | |
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
Github | 237 | 2 months ago | |
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | | | |
Github | 14 | about 1 year ago | |
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | | | |
Github | 20 | 9 months ago | |
Can We Edit Multimodal Large Language Models? | | | |
Github | 1,931 | 6 days ago | |
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets | | | |
Github | 11 | about 1 year ago | |
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | | | |
TouchStone: Evaluating Vision-Language Models by Language Models | | | |
Github | 78 | 10 months ago | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Github | 41 | 5 months ago | |
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
Github | 37 | 27 days ago | |
Tiny LVLM-eHub: Early Multimodal Experiments with Bard | | | |
Github | 467 | 7 months ago | |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
Github | 267 | 17 days ago | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
Github | 315 | 4 months ago | |
MMBench: Is Your Multi-modal Model an All-around Player? | | | |
Github | 163 | 3 months ago | |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
Github | 12,711 | 2 days ago | |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
Github | 467 | 7 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Github | 301 | 7 months ago | |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
Github | 92 | over 1 year ago | |
On The Hidden Mystery of OCR in Large Multimodal Models | | | |
Github | 471 | about 1 month ago | |
Awesome Papers / Multimodal RLHF |
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | | | |
Silkie: Preference Distillation for Large Visual Language Models | | | |
Github | 85 | 11 months ago | |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
Github | 233 | 2 months ago | |
Demo | | | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Github | 319 | about 1 year ago | |
Demo | | | |
Awesome Papers / Others |
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | | | |
Github | 45 | 3 months ago | |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | | | |
Github | 261 | 7 months ago | |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | | | |
Github | 132 | 4 months ago | |
Planting a SEED of Vision in Large Language Model | | | |
Github | 576 | 2 months ago | |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | | | |
Github | 1,202 | 16 days ago | |
Contextual Object Detection with Multimodal Large Language Models | | | |
Github | 202 | about 1 month ago | |
Demo | | | |
Generating Images with Multimodal Language Models | | | |
Github | 430 | 10 months ago | |
On Evaluating Adversarial Robustness of Large Vision-Language Models | | | |
Github | 161 | about 1 year ago | |
Grounding Language Models to Images for Multimodal Inputs and Outputs | | | |
Github | 478 | about 1 year ago | |
Demo | | | |
Awesome Datasets / Datasets of Pre-Training for Alignment |
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | | | |
COYO-700M: Image-Text Pair Dataset | 1,163 | almost 2 years ago | |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | | | |
Microsoft COCO: Common Objects in Context | | | |
Im2Text: Describing Images Using 1 Million Captioned Photographs | | | |
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | | | |
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | | | |
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | | | |
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | | | |
AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding | | | |
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | | | |
Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | | | |
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | | | |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | | | |
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | | | |
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline | | | |
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | | | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Awesome Datasets / Datasets of Multimodal Instruction Tuning |
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
Link | 2 | 3 months ago | |
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | | | |
Link | 33 | 5 months ago | |
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
Link | | | |
Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
Link | 6 | 9 months ago | |
Visually Dehallucinative Instruction Generation | | | |
Link | 5 | 8 months ago | |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
Link | 57 | about 2 months ago | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Link | | | |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
Link | | | |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
Link | 18 | about 1 year ago | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Link | 41 | 5 months ago | |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
Link | 91 | 11 months ago | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Coming soon | | | |
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
Link | | | |
SVIT: Scaling up Visual Instruction Tuning | | | |
Link | | | |
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
Link | 1,563 | about 2 months ago | |
Visual Instruction Tuning with Polite Flamingo | | | |
Link | | | |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
Link | | | |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
Link | | | |
MotionGPT: Human Motion as a Foreign Language | | | |
Link | 1,505 | 8 months ago | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Link | 255 | 8 months ago | |
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
Link | 1,550 | 5 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Link | 301 | 7 months ago | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Link | 1,213 | 3 months ago | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Link | 3,563 | 9 months ago | |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
Link | | | |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
Coming soon | 1,556 | 3 months ago | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Link | 760 | 11 months ago | |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
Coming soon | | | |
DetGPT: Detect What You Need via Reasoning | | | |
Link | 755 | 4 months ago | |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
Coming soon | | | |
VideoChat: Chat-Centric Video Understanding | | | |
Link | 1,413 | about 2 months ago | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Link | 306 | over 1 year ago | |
LMEye: An Interactive Perception Network for Large Language Models | | | |
Link | | | |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
Link | | | |
Visual Instruction Tuning | | | |
Link | | | |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
Link | 133 | over 1 year ago | |
Awesome Datasets / Datasets of In-Context Learning |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
Link | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Link | 3,563 | 9 months ago | |
Awesome Datasets / Datasets of Multimodal Chain-of-Thought |
Explainable Multimodal Emotion Reasoning | | | |
Coming soon | 119 | 7 months ago | |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
Coming soon | 340 | 7 months ago | |
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
Coming soon | | | |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
Link | 606 | 2 months ago | |
Awesome Datasets / Datasets of Multimodal RLHF |
Silkie: Preference Distillation for Large Visual Language Models | | | |
Link | | | |
Awesome Datasets / Benchmarks for Evaluation |
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content | | | |
Link | | | |
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | | | |
Link | | | |
OmniBench: Towards The Future of Universal Omni-Language Models | | | |
Link | | | |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
Link | | | |
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
Link | | | |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
Link | 406 | 5 months ago | |
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning | | | |
Link | 28 | 8 months ago | |
TempCompass: Do Video LLMs Really Understand Videos? | | | |
Link | 84 | 7 days ago | |
Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
Link | | | |
Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
Link | 6 | 9 months ago | |
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | | | |
Link | 69 | about 1 month ago | |
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | | | |
Link | 46 | 3 months ago | |
Benchmarking Large Multimodal Models against Common Corruptions | | | |
Link | 27 | 10 months ago | |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
Link | 288 | 10 months ago | |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
Link | | | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Link | | | |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
Link | 57 | about 2 months ago | |
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | | | |
Link | 117 | 11 months ago | |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
Link | 24 | 3 months ago | |
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
Link | 55 | about 1 month ago | |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
Link | | | |
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | | | |
Link | 84 | about 2 months ago | |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | | | |
Link | 3,068 | 3 months ago | |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
Link | 53 | 8 months ago | |
OtterHD: A High-Resolution Multi-modality Model | | | |
Link | | | |
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
Link | 243 | 8 days ago | |
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | | | |
Link | 100 | 8 months ago | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Link | | | |
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
Link | | | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Link | 41 | 5 months ago | |
Link-Context Learning for Multimodal LLMs | | | |
Link | | | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Coming soon | | | |
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | | | |
Link | 356 | 6 months ago | |
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
Link | 37 | 27 days ago | |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
Link | 267 | 17 days ago | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
Link | 315 | 4 months ago | |
MMBench: Is Your Multi-modal Model an All-around Player? | | | |
Link | 163 | 3 months ago | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
Link | 229 | over 1 year ago | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Link | 255 | 8 months ago | |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
Link | 12,711 | 2 days ago | |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
Link | 467 | 7 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Link | 301 | 7 months ago | |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
Link | 92 | over 1 year ago | |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
Link | 2,321 | about 1 month ago | |
Awesome Datasets / Others |
IMAD: IMage-Augmented multi-modal Dialogue | | | |
Link | 4 | over 1 year ago | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Link | 1,213 | 3 months ago | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Link | | | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Link | | | |
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | | | |
Link | | | |
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | | | |
Link | | | |