Awesome Papers / Multimodal Instruction Tuning |
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | | | |
Github | 396 | about 1 month ago | |
Apollo: An Exploration of Video Understanding in Large Multimodal Models | | | |
Github | | | |
Demo | | | |
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | | | |
Github | 2,616 | about 1 month ago | |
StreamChat: Chatting with Streaming Video | | | |
CompCap: Improving Multimodal Large Language Models with Composite Captions | | | |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | | | |
Github | 13 | about 1 month ago | |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | | | |
Github | 6,394 | about 1 month ago | |
Demo | | | |
NVILA: Efficient Frontier Visual Language Models | | | |
Github | 2,146 | about 1 month ago | |
Demo | | | |
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | | | |
Github | 44 | about 1 month ago | |
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | | | |
Github | 67 | about 2 months ago | |
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | | | |
Github | 106 | about 2 months ago | |
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | | | |
Github | 329 | 2 months ago | |
Demo | | | |
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | | | |
Github | 89 | about 2 months ago | |
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | | | |
Github | 57 | 2 months ago | |
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | | | |
Huggingface | | | |
Demo | | | |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | | | |
Github | 3,613 | about 2 months ago | |
Demo | | | |
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | | | |
Github | 183 | 3 months ago | |
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | | | |
Github | 549 | 4 months ago | |
Demo | | | |
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | | | |
Github | 69 | 3 months ago | |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | | | |
Github | 2,365 | about 2 months ago | |
VITA: Towards Open-Source Interactive Omni Multimodal LLM | | | |
Github | 1,005 | 3 months ago | |
LLaVA-OneVision: Easy Visual Task Transfer | | | |
Github | 3,099 | 3 months ago | |
Demo | | | |
MiniCPM-V: A GPT-4V Level MLLM on Your Phone | | | |
Github | 12,870 | 3 months ago | |
Demo | | | |
VILA^2: VILA Augmented VILA | | | |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | | | |
EVLM: An Efficient Vision-Language Model for Visual Understanding | | | |
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | | | |
Github | 26 | about 2 months ago | |
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | | | |
Github | 2,616 | about 1 month ago | |
Demo | | | |
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | | | |
Github | 1,336 | about 1 month ago | |
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | | | |
Github | 9 | about 1 month ago | |
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | | | |
Github | 1,799 | 3 months ago | |
Long Context Transfer from Language to Vision | | | |
Github | 347 | about 2 months ago | |
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | | | |
Github | 1,091 | about 1 month ago | |
TroL: Traversal of Layers for Large Language and Vision Models | | | |
Github | 88 | 7 months ago | |
Unveiling Encoder-Free Vision-Language Models | | | |
Github | 246 | 4 months ago | |
VideoLLM-online: Online Video Large Language Model for Streaming Video | | | |
Github | 251 | 5 months ago | |
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | | | |
Github | 64 | 3 months ago | |
Demo | | | |
Comparison Visual Instruction Tuning | | | |
Github | | | |
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | | | |
Github | 143 | 2 months ago | |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | | | |
Github | 957 | about 2 months ago | |
Parrot: Multilingual Visual Instruction Tuning | | | |
Github | 34 | 5 months ago | |
Ovis: Structural Embedding Alignment for Multimodal Large Language Model | | | |
Github | 575 | about 2 months ago | |
Matryoshka Query Transformer for Large Vision-Language Models | | | |
Github | 101 | 7 months ago | |
Demo | | | |
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | | | |
Github | 106 | 6 months ago | |
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | | | |
Github | 102 | 8 months ago | |
Demo | | | |
Libra: Building Decoupled Vision System on Large Language Models | | | |
Github | 153 | about 2 months ago | |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | | | |
Github | 136 | 7 months ago | |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | | |
Github | 6,394 | about 1 month ago | |
Demo | | | |
Graphic Design with Large Multimodal Model | | | |
Github | 102 | 9 months ago | |
BRAVE: Broadening the visual encoding of vision-language models | | | |
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | | | |
Github | 2,616 | about 1 month ago | |
Demo | | | |
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | | | |
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | | | |
Github | 254 | 6 months ago | |
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | | | |
Github | 406 | 3 months ago | |
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | | | |
LITA: Language Instructed Temporal-Localization Assistant | | | |
Github | 151 | 3 months ago | |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | | | |
Github | 3,229 | 9 months ago | |
Demo | | | |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | | |
MoAI: Mixture of All Intelligence for Large Language and Vision Models | | | |
Github | 314 | 10 months ago | |
DeepSeek-VL: Towards Real-World Vision-Language Understanding | | | |
Github | 2,145 | 9 months ago | |
Demo | | | |
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | | | |
Github | 1,849 | about 2 months ago | |
Demo | | | |
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | | | |
Github | 466 | 5 months ago | |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | | |
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | | | |
Github | 798 | 5 months ago | |
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | | | |
Github | 58 | about 2 months ago | |
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
Github | 249 | 7 months ago | |
Demo | | | |
CoLLaVO: Crayon Large Language and Vision mOdel | | | |
Github | 93 | 7 months ago | |
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | | | |
Github | 494 | 7 months ago | |
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | | | |
Github | 153 | 7 months ago | |
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | | | |
Github | 1,076 | 9 months ago | |
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | | | |
Github | 43 | 2 months ago | |
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | | | |
Coming soon | | | |
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | | | |
Github | 20,683 | 5 months ago | |
Demo | | | |
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | | | |
Github | 2,023 | about 2 months ago | |
Demo | | | |
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | | | |
Github | 2,616 | about 1 month ago | |
Demo | | | |
Yi-VL | | | |
Github | 7,743 | about 2 months ago | |
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | | | |
ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | | | |
Github | 108 | 4 months ago | |
MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | | | |
Github | 1,076 | 9 months ago | |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | | | |
Github | 6,394 | about 1 month ago | |
Demo | | | |
Osprey: Pixel Understanding with Visual Instruction Tuning | | | |
Github | 781 | 6 months ago | |
Demo | | | |
CogAgent: A Visual Language Model for GUI Agents | | | |
Github | 6,182 | 8 months ago | |
Coming soon | | | |
Pixel Aligned Language Models | | | |
Coming soon | | | |
VILA: On Pre-training for Visual Language Models | | | |
Github | 2,146 | about 1 month ago | |
See, Say, and Segment: Teaching LMMs to Overcome False Premises | | | |
Coming soon | | | |
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | | | |
Github | 1,831 | about 2 months ago | |
Demo | | | |
Honeybee: Locality-enhanced Projector for Multimodal LLM | | | |
Github | 435 | 8 months ago | |
Gemini: A Family of Highly Capable Multimodal Models | | | |
OneLLM: One Framework to Align All Modalities with Language | | | |
Github | 601 | 3 months ago | |
Demo | | | |
Lenna: Language Enhanced Reasoning Detection Assistant | | | |
Github | 78 | 12 months ago | |
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | | | |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
Github | 314 | about 2 months ago | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Github | 302 | 6 months ago | |
Demo | | | |
Dolphins: Multimodal Language Model for Driving | | | |
Github | 51 | 6 months ago | |
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | | | |
Github | 255 | 6 months ago | |
Coming soon | | | |
VTimeLLM: Empower LLM to Grasp Video Moments | | | |
Github | 231 | 7 months ago | |
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | | | |
Github | 1,958 | 4 months ago | |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | | | |
Github | 748 | 6 months ago | |
Coming soon | | | |
LLMGA: Multimodal Large Language Model based Generation Assistant | | | |
Github | 463 | 5 months ago | |
Demo | | | |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
Github | 202 | about 1 year ago | |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
Github | 2,616 | about 1 month ago | |
Demo | | | |
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | | | |
Github | 124 | 6 months ago | |
An Embodied Generalist Agent in 3D World | | | |
Github | 379 | 3 months ago | |
Demo | | | |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | | | |
Github | 3,071 | about 2 months ago | |
Demo | | | |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | | | |
Github | 895 | 3 months ago | |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
Github | 131 | about 1 year ago | |
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | | | |
Github | 2,732 | 8 months ago | |
Demo | | | |
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | | | |
Github | 1,849 | about 2 months ago | |
Demo | | | |
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | | | |
Github | 717 | 12 months ago | |
Demo | | | |
NExT-Chat: An LMM for Chat, Detection and Segmentation | | | |
Github | 227 | 12 months ago | |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | | | |
Github | 2,365 | about 2 months ago | |
Demo | | | |
OtterHD: A High-Resolution Multi-modality Model | | | |
Github | 3,570 | 11 months ago | |
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | | | |
Coming soon | | | |
GLaMM: Pixel Grounding Large Multimodal Model | | | |
Github | 797 | about 2 months ago | |
Demo | | | |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
Github | 18 | about 1 year ago | |
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | | | |
Github | 25,490 | 5 months ago | |
SALMONN: Towards Generic Hearing Abilities for Large Language Models | | | |
Github | 1,091 | about 1 month ago | |
Ferret: Refer and Ground Anything Anywhere at Any Granularity | | | |
Github | 8,509 | 3 months ago | |
CogVLM: Visual Expert For Large Language Models | | | |
Github | 6,182 | 8 months ago | |
Demo | | | |
Improved Baselines with Visual Instruction Tuning | | | |
Github | 20,683 | 5 months ago | |
Demo | | | |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | | | |
Github | 751 | 10 months ago | |
Demo | | | |
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | | | |
Github | 79 | 7 months ago | |
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | | | |
Github | 59 | 12 months ago | |
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | | | |
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | | | |
Github | 2,616 | about 1 month ago | |
DreamLLM: Synergistic Multimodal Comprehension and Creation | | | |
Github | 402 | about 2 months ago | |
Coming soon | | | |
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | | | |
Coming soon | | | |
TextBind: Multi-turn Interleaved Multimodal Instruction-following | | | |
Github | 47 | over 1 year ago | |
Demo | | | |
NExT-GPT: Any-to-Any Multimodal LLM | | | |
Github | 3,344 | 3 months ago | |
Demo | | | |
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | | | |
Github | 19 | over 1 year ago | |
ImageBind-LLM: Multi-modality Instruction Tuning | | | |
Github | 5,775 | 10 months ago | |
Demo | | | |
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | | | |
PointLLM: Empowering Large Language Models to Understand Point Clouds | | | |
Github | 670 | 3 months ago | |
Demo | | | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Github | 43 | 7 months ago | |
MLLM-DataEngine: An Iterative Refinement Approach for MLLM | | | |
Github | 39 | 8 months ago | |
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | | | |
Github | 37 | over 1 year ago | |
Demo | | | |
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | | | |
Github | 5,179 | 5 months ago | |
Demo | | | |
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | | | |
Github | 1,098 | 7 months ago | |
Demo | | | |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
Github | 93 | about 1 year ago | |
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | | | |
Github | 270 | 9 months ago | |
Demo | | | |
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | | | |
Github | 360 | 8 months ago | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
Github | 466 | 5 months ago | |
Demo | | | |
LISA: Reasoning Segmentation via Large Language Model | | | |
Github | 1,923 | 7 months ago | |
Demo | | | |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | | | |
Github | 550 | about 1 month ago | |
3D-LLM: Injecting the 3D World into Large Language Models | | | |
Github | 979 | 8 months ago | |
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
Demo | | | |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
Github | 505 | over 1 year ago | |
Demo | | | |
SVIT: Scaling up Visual Instruction Tuning | | | |
Github | 164 | 7 months ago | |
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | | | |
Github | 517 | 7 months ago | |
Demo | | | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
Github | 231 | over 1 year ago | |
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
Github | 1,958 | 4 months ago | |
Demo | | | |
Visual Instruction Tuning with Polite Flamingo | | | |
Github | 63 | about 1 year ago | |
Demo | | | |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
Github | 259 | 7 months ago | |
Demo | | | |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
Github | 748 | 6 months ago | |
Demo | | | |
MotionGPT: Human Motion as a Foreign Language | | | |
Github | 1,531 | 10 months ago | |
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
Github | 1,568 | 7 months ago | |
Coming soon | | | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Github | 305 | 9 months ago | |
Demo | | | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Github | 1,246 | 5 months ago | |
Demo | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Github | 3,570 | 11 months ago | |
Demo | | | |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | | | |
Github | 2,842 | 8 months ago | |
Demo | | | |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
Github | 1,622 | 5 months ago | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Github | 762 | about 1 year ago | |
Demo | | | |
PandaGPT: One Model To Instruction-Follow Them All | | | |
Github | 772 | over 1 year ago | |
Demo | | | |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
Github | 49 | over 1 year ago | |
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | | | |
Github | 513 | 12 months ago | |
DetGPT: Detect What You Need via Reasoning | | | |
Github | 761 | 5 months ago | |
Demo | | | |
Pengi: An Audio Language Model for Audio Tasks | | | |
Github | 295 | 9 months ago | |
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | | | |
Github | 956 | 3 months ago | |
Listen, Think, and Understand | | | |
Github | 396 | 9 months ago | |
Demo | | | |
Github | 4,110 | 5 months ago | |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
Github | 180 | about 1 month ago | |
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | | | |
Github | 10,058 | 2 months ago | |
VideoChat: Chat-Centric Video Understanding | | | |
Github | 3,106 | about 2 months ago | |
Demo | | | |
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | | | |
Github | 1,478 | over 1 year ago | |
Demo | | | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Github | 308 | over 1 year ago | |
LMEye: An Interactive Perception Network for Large Language Models | | | |
Github | 48 | 6 months ago | |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | | | |
Github | 5,775 | 10 months ago | |
Demo | | | |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
Github | 2,365 | about 2 months ago | |
Demo | | | |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
Github | 25,490 | 5 months ago | |
Visual Instruction Tuning | | | |
Github | 20,683 | 5 months ago | |
Demo | | | |
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | | | |
Github | 5,775 | 10 months ago | |
Demo | | | |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
Github | 134 | over 1 year ago | |
Awesome Papers / Multimodal Hallucination |
Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | | | |
Github | 28 | about 2 months ago | |
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | | | |
Github | 46 | 2 months ago | |
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | | | |
Link | | | |
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | | | |
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | | | |
Github | 83 | 2 months ago | |
Evaluating and Analyzing Relationship Hallucinations in LVLMs | | | |
Github | 20 | 3 months ago | |
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | | | |
Github | 18 | 6 months ago | |
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | | | |
Coming soon | | | |
Mitigating Object Hallucination via Data Augmented Contrastive Tuning | | | |
Coming soon | | | |
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | | | |
Coming soon | | | |
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | | | |
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | | | |
What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | | | |
Github | 15 | 4 months ago | |
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | | |
Debiasing Multimodal Large Language Models | | | |
Github | 75 | 10 months ago | |
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | | | |
Github | 72 | about 2 months ago | |
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | | | |
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | | | |
Github | 39 | 3 months ago | |
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | | | |
Github | 19 | 7 months ago | |
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | | | |
Github | 8 | 12 months ago | |
Unified Hallucination Detection for Multimodal Large Language Models | | | |
Github | 48 | 9 months ago | |
A Survey on Hallucination in Large Vision-Language Models | | | |
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | | | |
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | | | |
Github | 82 | 12 months ago | |
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | | | |
Github | 13 | 3 months ago | |
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | | | |
Github | 8 | 12 months ago | |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
Github | 245 | 4 months ago | |
Demo | | | |
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | | | |
Github | 293 | 5 months ago | |
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | | | |
Github | 222 | 3 months ago | |
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | | | |
Github | 73 | 12 months ago | |
Coming soon | | | |
Mitigating Hallucination in Visual Language Models with Visual Supervision | | | |
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | | | |
Github | 41 | 6 months ago | |
An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | | | |
Github | 98 | about 1 year ago | |
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | | | |
Github | 27 | 2 months ago | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
Github | 617 | 7 months ago | |
Demo | | | |
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | | | |
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | | | |
Github | 28 | 9 months ago | |
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | | | |
Github | 136 | 9 months ago | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Github | 328 | about 1 year ago | |
Demo | | | |
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | | | |
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | | | |
Evaluation and Analysis of Hallucination in Large Vision-Language Models | | | |
Github | 17 | over 1 year ago | |
VIGC: Visual Instruction Generation and Correction | | | |
Github | 91 | 12 months ago | |
Demo | | | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Github | 262 | 10 months ago | |
Demo | | | |
Evaluating Object Hallucination in Large Vision-Language Models | | | |
Github | 187 | 10 months ago | |
Awesome Papers / Multimodal In-Context Learning |
Visual In-Context Learning for Large Vision-Language Models | | | |
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model | | | |
Github | 76 | 3 months ago | |
Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
Github | 30 | 2 months ago | |
Generative Multimodal Models are In-Context Learners | | | |
Github | 1,672 | 4 months ago | |
Demo | | | |
Hijacking Context in Large Multi-modal Models | | | |
Towards More Unified In-context Visual Understanding | | | |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
Github | 337 | about 1 year ago | |
Demo | | | |
Link-Context Learning for Multimodal LLMs | | | |
Github | 91 | 8 months ago | |
Demo | | | |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | | | |
Github | 3,781 | 5 months ago | |
Demo | | | |
Med-Flamingo: a Multimodal Medical Few-shot Learner | | | |
Github | 396 | over 1 year ago | |
Generative Pretraining in Multimodality | | | |
Github | 1,672 | 4 months ago | |
Demo | | | |
AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Github | 3,570 | 11 months ago | |
Demo | | | |
Exploring Diverse In-Context Configurations for Image Captioning | | | |
Github | 33 | about 2 months ago | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,095 | about 1 year ago | |
Demo | | | |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
Github | 23,801 | 4 months ago | |
Demo | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 940 | 12 months ago | |
Demo | | | |
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
Github | 50 | over 1 year ago | |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | | | |
Github | 270 | over 1 year ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 697 | 5 months ago | |
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | | | |
Github | 85 | almost 3 years ago | |
Flamingo: a Visual Language Model for Few-Shot Learning | | | |
Github | 3,781 | 5 months ago | |
Demo | | | |
Multimodal Few-Shot Learning with Frozen Language Models | | | |
Awesome Papers / Multimodal Chain-of-Thought |
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | | | |
Github | 113 | about 2 months ago | |
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | | | |
Github | 73 | 8 months ago | |
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | | | |
Github | 162 | about 2 months ago | |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | | | |
Github | 90 | 7 months ago | |
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | | |
Github | 35 | 10 months ago | |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | | | |
Github | 748 | 6 months ago | |
Demo | | | |
Explainable Multimodal Emotion Reasoning | | | |
Github | 123 | 9 months ago | |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
Github | 346 | 9 months ago | |
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | | | |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
Github | 1,693 | over 1 year ago | |
Demo | | | |
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | | | |
Coming soon | | | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,095 | about 1 year ago | |
Demo | | | |
Chain of Thought Prompt Tuning in Vision Language Models | | | |
Coming soon | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 940 | 12 months ago | |
Demo | | | |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
Github | 34,555 | about 1 year ago | |
Demo | | | |
Multimodal Chain-of-Thought Reasoning in Language Models | | | |
Github | 3,833 | 7 months ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 697 | 5 months ago | |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
Github | 615 | 4 months ago | |
Awesome Papers / LLM-Aided Visual Reasoning |
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models | | | |
Github | 14 | 3 months ago | |
V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs | | | |
Github | 541 | about 1 year ago | |
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | | | |
Github | 353 | 6 months ago | |
Demo | | | |
MM-VID: Advancing Video Understanding with GPT-4V(ision) | | | |
ControlLLM: Augment Language Models with Tools by Searching on Graphs | | | |
Github | 187 | 6 months ago | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models | | | |
Github | 617 | 7 months ago | |
Demo | | | |
MindAgent: Emergent Gaming Interaction | | | |
Github | 79 | 7 months ago | |
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | | | |
Github | 352 | about 1 year ago | |
Demo | | | |
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | | |
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | | | |
Github | 66 | over 1 year ago | |
AVIS: Autonomous Visual Information Seeking with Large Language Models | | | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Github | 762 | about 1 year ago | |
Demo | | | |
Mindstorms in Natural Language-Based Societies of Mind | | | |
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | | | |
Github | 306 | 9 months ago | |
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | | | |
Github | 32 | over 1 year ago | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Github | 7 | over 1 year ago | |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | | | |
Github | 1,693 | over 1 year ago | |
Demo | | | |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | | | |
Github | 1,095 | about 1 year ago | |
Demo | | | |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | | | |
Github | 23,801 | 4 months ago | |
Demo | | | |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | | | |
Github | 940 | 12 months ago | |
Demo | | | |
ViperGPT: Visual Inference via Python Execution for Reasoning | | | |
Github | 1,666 | 12 months ago | |
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | | | |
Github | 457 | almost 2 years ago | |
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | | | |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | | | |
Github | 34,555 | about 1 year ago | |
Demo | | | |
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | | | |
Github | 41 | over 1 year ago | |
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | | | |
Github | 10,058 | 2 months ago | |
Demo | | | |
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | | | |
Github | 94 | over 1 year ago | |
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | | | |
Github | 235 | over 1 year ago | |
Visual Programming: Compositional visual reasoning without training | | | |
Github | 697 | 5 months ago | |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | | | |
Github | 34,478 | about 1 month ago | |
Awesome Papers / Foundation Models |
Emu3: Next-Token Prediction is All You Need | | | |
Github | 1,911 | 3 months ago | |
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models | | | |
Demo | | | |
Pixtral-12B | | | |
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | | | |
Github | 10,058 | 2 months ago | |
The Llama 3 Herd of Models | | | |
Chameleon: Mixed-Modal Early-Fusion Foundation Models | | | |
Hello GPT-4o | | | |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | | |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | | | |
Gemini: A Family of Highly Capable Multimodal Models | | | |
Fuyu-8B: A Multimodal Architecture for AI Agents | | | |
Huggingface | | | |
Demo | | | |
Unified Model for Image, Video, Audio and Language Tasks | | | |
Github | 224 | about 1 year ago | |
Demo | | | |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | | | |
GPT-4V(ision) System Card | | | |
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | | | |
Github | 544 | 3 months ago | |
Multimodal Foundation Models: From Specialists to General-Purpose Assistants | | | |
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | | | |
Github | 24 | about 1 year ago | |
Generative Pretraining in Multimodality | | | |
Github | 1,672 | 4 months ago | |
Demo | | | |
Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
Github | 20,400 | about 1 month ago | |
Demo | | | |
Transfer Visual Prompt Generator across LLMs | | | |
Github | 270 | over 1 year ago | |
Demo | | | |
GPT-4 Technical Report | | | |
PaLM-E: An Embodied Multimodal Language Model | | | |
Demo | | | |
Prismer: A Vision-Language Model with An Ensemble of Experts | | | |
Github | 1,299 | about 1 year ago | |
Demo | | | |
Language Is Not All You Need: Aligning Perception with Language Models | | | |
Github | 20,400 | about 1 month ago | |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | | | |
Github | 10,058 | 2 months ago | |
Demo | | | |
VIMA: General Robot Manipulation with Multimodal Prompts | | | |
Github | 781 | 9 months ago | |
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | | | |
Github | 1,843 | 10 months ago | |
Write and Paint: Generative Vision-Language Models are Unified Modal Learners | | | |
Github | 43 | over 1 year ago | |
Language Models are General-Purpose Interfaces | | | |
Github | 20,400 | about 1 month ago | |
Awesome Papers / Evaluation |
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | | | |
Github | 106 | about 2 months ago | |
OmniBench: Towards The Future of Universal Omni-Language Models | | | |
Github | 15 | 2 months ago | |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
Github | 86 | about 2 months ago | |
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
Github | 3 | 5 months ago | |
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | | | |
Github | 22 | 4 months ago | |
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | | | |
Github | 67 | 3 months ago | |
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
Github | 85 | 3 months ago | |
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | | | |
Github | 95 | 6 months ago | |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
Github | 422 | about 1 month ago | |
Benchmarking Large Multimodal Models against Common Corruptions | | | |
Github | 27 | 12 months ago | |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
Github | 296 | 12 months ago | |
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | | | |
Github | 13,117 | about 1 month ago | |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
Github | 84 | 5 months ago | |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | | | |
Github | 72 | about 1 year ago | |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
Github | 24 | 4 months ago | |
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
Github | 56 | 3 months ago | |
VLM-Eval: A General Evaluation on Video Large Language Models | | | |
Coming soon | | | |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
Github | 53 | 10 months ago | |
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving | | | |
Github | 288 | 10 months ago | |
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead | | | |
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging | | | |
An Early Evaluation of GPT-4V(ision) | | | |
Github | 11 | about 1 year ago | |
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation | | | |
Github | 121 | about 1 year ago | |
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
Github | 259 | 2 months ago | |
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
Github | 253 | about 2 months ago | |
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | | | |
Github | 14 | about 1 year ago | |
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | | | |
Github | 21 | 11 months ago | |
Can We Edit Multimodal Large Language Models? | | | |
Github | 1,981 | about 1 month ago | |
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets | | | |
Github | 11 | over 1 year ago | |
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | | | |
TouchStone: Evaluating Vision-Language Models by Language Models | | | |
Github | 79 | 12 months ago | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Github | 43 | 7 months ago | |
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
Github | 38 | 3 months ago | |
Tiny LVLM-eHub: Early Multimodal Experiments with Bard | | | |
Github | 478 | 9 months ago | |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
Github | 274 | 2 months ago | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
Github | 322 | 6 months ago | |
MMBench: Is Your Multi-modal Model an All-around Player? | | | |
Github | 168 | 5 months ago | |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
Github | 13,117 | about 1 month ago | |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
Github | 478 | 9 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Github | 305 | 9 months ago | |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
Github | 93 | over 1 year ago | |
On The Hidden Mystery of OCR in Large Multimodal Models | | | |
Github | 484 | 3 months ago | |
Awesome Papers / Multimodal RLHF |
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | | | |
Silkie: Preference Distillation for Large Visual Language Models | | | |
Github | 88 | about 1 year ago | |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | | |
Github | 245 | 4 months ago | |
Demo | | | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Github | 328 | about 1 year ago | |
Demo | | | |
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data | | | |
Github | 2 | 3 months ago | |
Awesome Papers / Others |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | | | |
Github | 7 | about 1 month ago | |
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | | | |
Github | 47 | 5 months ago | |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | | | |
Github | 266 | 9 months ago | |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | | | |
Github | 135 | 6 months ago | |
Planting a SEED of Vision in Large Language Model | | | |
Github | 585 | 4 months ago | |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | | | |
Github | 1,218 | 2 months ago | |
Contextual Object Detection with Multimodal Large Language Models | | | |
Github | 208 | 3 months ago | |
Demo | | | |
Generating Images with Multimodal Language Models | | | |
Github | 440 | 12 months ago | |
On Evaluating Adversarial Robustness of Large Vision-Language Models | | | |
Github | 165 | about 1 year ago | |
Grounding Language Models to Images for Multimodal Inputs and Outputs | | | |
Github | 478 | about 1 year ago | |
Demo | | | |
Awesome Datasets / Datasets of Pre-Training for Alignment |
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | | | |
COYO-700M: Image-Text Pair Dataset | 1,172 | about 2 years ago | |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | | | |
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | | | |
Microsoft COCO: Common Objects in Context | | | |
Im2Text: Describing Images Using 1 Million Captioned Photographs | | | |
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | | | |
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | | | |
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | | | |
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | | | |
AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding | | | |
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | | | |
Kosmos-2: Grounding Multimodal Large Language Models to the World | | | |
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | | | |
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | | | |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | | | |
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | | | |
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline | | | |
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | | | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Awesome Datasets / Datasets of Multimodal Instruction Tuning |
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | | | |
Link | 42 | 2 months ago | |
Multi-modal Situated Reasoning in 3D Scenes | | | |
Link | | | |
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | | | |
Link | | | |
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | | | |
Link | 3 | 5 months ago | |
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | | | |
Link | 33 | 6 months ago | |
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | | |
Link | | | |
Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
Link | 6 | 11 months ago | |
Visually Dehallucinative Instruction Generation | | | |
Link | 5 | 10 months ago | |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
Link | 58 | 4 months ago | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Link | | | |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | | | |
Link | | | |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | | | |
Link | 18 | about 1 year ago | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Link | 43 | 7 months ago | |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | | | |
Link | 93 | about 1 year ago | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Coming soon | | | |
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | | |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | | |
Link | | | |
SVIT: Scaling up Visual Instruction Tuning | | | |
Link | | | |
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | | |
Link | 1,958 | 4 months ago | |
Visual Instruction Tuning with Polite Flamingo | | | |
Link | | | |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | | | |
Link | | | |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | | | |
Link | | | |
MotionGPT: Human Motion as a Foreign Language | | | |
Link | 1,531 | 10 months ago | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Link | 262 | 10 months ago | |
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | | | |
Link | 1,568 | 7 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Link | 305 | 9 months ago | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Link | 1,246 | 5 months ago | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Link | 3,570 | 11 months ago | |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | | | |
Link | | | |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | | | |
Coming soon | 1,622 | 5 months ago | |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | | |
Link | 762 | about 1 year ago | |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | | | |
Coming soon | | | |
DetGPT: Detect What You Need via Reasoning | | | |
Link | 761 | 5 months ago | |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | | | |
Coming soon | | | |
VideoChat: Chat-Centric Video Understanding | | | |
Link | 1,467 | about 1 month ago | |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | | | |
Link | 308 | over 1 year ago | |
LMEye: An Interactive Perception Network for Large Language Models | | | |
Link | | | |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | | | |
Link | | | |
Visual Instruction Tuning | | | |
Link | | | |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | | | |
Link | 134 | over 1 year ago | |
Awesome Datasets / Datasets of In-Context Learning |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | | |
Link | | | |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | | | |
Link | 3,570 | 11 months ago | |
Awesome Datasets / Datasets of Multimodal Chain-of-Thought |
Explainable Multimodal Emotion Reasoning | | | |
Coming soon | 123 | 9 months ago | |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | | |
Coming soon | 346 | 9 months ago | |
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | | | |
Coming soon | | | |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | | |
Link | 615 | 4 months ago | |
Awesome Datasets / Datasets of Multimodal RLHF |
Silkie: Preference Distillation for Large Visual Language Models | | | |
Link | | | |
Awesome Datasets / Benchmarks for Evaluation |
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | | | |
Link | 47 | 8 months ago | |
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | | | |
Link | 106 | about 2 months ago | |
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | | | |
Link | 3 | 2 months ago | |
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content | | | |
Link | | | |
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | | | |
Link | | | |
OmniBench: Towards The Future of Universal Omni-Language Models | | | |
Link | | | |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | | |
Link | | | |
VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? | | | |
Link | 5 | 5 months ago | |
Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions | | | |
Link | 43 | 3 months ago | |
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | | | |
Link | | | |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | | | |
Link | 422 | about 1 month ago | |
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning | | | |
Link | 31 | 9 months ago | |
TempCompass: Do Video LLMs Really Understand Videos? | | | |
Link | 91 | 2 months ago | |
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | | | |
Link | | | |
Can MLLMs Perform Text-to-Image In-Context Learning? | | | |
Link | | | |
Visually Dehallucinative Instruction Generation: Know What You Don't Know | | | |
Link | 6 | 11 months ago | |
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | | | |
Link | 74 | 3 months ago | |
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval | | | |
Link | 22 | 6 months ago | |
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | | | |
Link | 46 | 4 months ago | |
Benchmarking Large Multimodal Models against Common Corruptions | | | |
Link | 27 | 12 months ago | |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | | | |
Link | 296 | 12 months ago | |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | | | |
Link | | | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | |
Link | | | |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | | | |
Link | 58 | 4 months ago | |
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | | | |
Link | 121 | about 1 year ago | |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | | | |
Link | 24 | 4 months ago | |
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | | | |
Link | 56 | 3 months ago | |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | | | |
Link | | | |
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | | | |
Link | 87 | 4 months ago | |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | | | |
Link | 3,106 | about 2 months ago | |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | | | |
Link | 53 | 10 months ago | |
OtterHD: A High-Resolution Multi-modality Model | | | |
Link | | | |
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | | | |
Link | 259 | 2 months ago | |
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | | | |
Link | 99 | 10 months ago | |
Aligning Large Multimodal Models with Factually Augmented RLHF | | | |
Link | | | |
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | | | |
Link | | | |
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | | | |
Link | 43 | 7 months ago | |
Link-Context Learning for Multimodal LLMs | | | |
Link | | | |
Detecting and Preventing Hallucinations in Large Vision Language Models | | | |
Coming soon | | | |
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | | | |
Link | 360 | 8 months ago | |
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | | | |
Link | 38 | 3 months ago | |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
Link | 274 | 2 months ago | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | | |
Link | 322 | 6 months ago | |
MMBench: Is Your Multi-modal Model an All-around Player? | | | |
Link | 168 | 5 months ago | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | | | |
Link | 231 | over 1 year ago | |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | | |
Link | 262 | 10 months ago | |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | | | |
Link | 13,117 | about 1 month ago | |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | | | |
Link | 478 | 9 months ago | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | | |
Link | 305 | 9 months ago | |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | | | |
Link | 93 | over 1 year ago | |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | | | |
Link | 2,365 | about 2 months ago | |
Awesome Datasets / Others |
IMAD: IMage-Augmented multi-modal Dialogue | | | |
Link | 4 | over 1 year ago | |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | | |
Link | 1,246 | 5 months ago | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | | | |
Link | | | |
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | | | |
Link | | | |
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | | | |
Link | | | |