Awesome-Multimodal-Large-Language-Models

MLM Hub

A curated collection of papers, projects, and resources on large language models that incorporate multiple forms of input and output.

✨✨ Latest Advances on Multimodal Large Language Models

GitHub

13k stars
271 watching
809 forks
last commit: 2 days ago
Linked from 2 awesome lists

chain-of-thought, in-context-learning, instruction-following, instruction-tuning, large-language-models, large-vision-language-model, large-vision-language-models, multi-modality, multimodal-chain-of-thought, multimodal-in-context-learning, multimodal-instruction-tuning, multimodal-large-language-models, visual-instruction-tuning

Awesome Papers / Multimodal Instruction Tuning

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Github 270 16 days ago
Demo
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Github 80 24 days ago
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Huggingface
Demo
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Github 3,093 about 2 months ago
Demo
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Github 179 about 1 month ago
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Github 539 2 months ago
Demo
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Github 2,321 about 1 month ago
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Github 961 28 days ago
LLaVA-OneVision: Easy Visual Task Transfer
Github 2,872 about 1 month ago
Demo
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Github 12,619 about 1 month ago
Demo
VILA^2: VILA Augmented VILA
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
EVLM: An Efficient Vision-Language Model for Visual Understanding
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Github 25 about 1 month ago
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Github 2,521 about 1 month ago
Demo
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Github 1,300 about 2 months ago
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Github 1,759 22 days ago
Long Context Transfer from Language to Vision
Github 334 26 days ago
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Github 1,053 15 days ago
Unveiling Encoder-Free Vision-Language Models
Github 230 about 2 months ago
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
Github 53 about 1 month ago
Demo
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Github 137 14 days ago
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Github 871 8 days ago
Parrot: Multilingual Visual Instruction Tuning
Github 30 3 months ago
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Github 517 17 days ago
Matryoshka Query Transformer for Large Vision-Language Models
Github 97 5 months ago
Demo
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Github 104 4 months ago
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Github 102 6 months ago
Demo
Libra: Building Decoupled Vision System on Large Language Models
Github 143 about 1 month ago
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Github 134 6 months ago
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Github 6,014 6 days ago
Demo
Graphic Design with Large Multimodal Model
Github 98 7 months ago
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Github 2,521 about 1 month ago
Demo
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Github 244 4 months ago
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Github 3,211 7 months ago
Demo
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Github 311 8 months ago
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Github 1,825 9 days ago
Demo
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Github 459 3 months ago
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Github 779 3 months ago
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Github 54 5 months ago
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Github 246 5 months ago
Demo
CoLLaVO: Crayon Large Language and Vision mOdel
Github 93 5 months ago
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
Github 152 5 months ago
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Github 1,039 7 months ago
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
Github 40 6 days ago
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study
Coming soon
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
Github 20,232 3 months ago
Demo
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Github 1,980 6 months ago
Demo
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Github 2,521 about 1 month ago
Demo
Yi-VL
Github 7,699 11 days ago
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
Github 107 3 months ago
MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices
Github 1,039 7 months ago
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Github 6,014 6 days ago
Demo
Osprey: Pixel Understanding with Visual Instruction Tuning
Github 770 4 months ago
Demo
CogAgent: A Visual Language Model for GUI Agents
Github 6,080 6 months ago
Coming soon
Pixel Aligned Language Models
Coming soon
See, Say, and Segment: Teaching LMMs to Overcome False Premises
Coming soon
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Github 1,817 about 2 months ago
Demo
Honeybee: Locality-enhanced Projector for Multimodal LLM
Github 432 7 months ago
Gemini: A Family of Highly Capable Multimodal Models
OneLLM: One Framework to Align All Modalities with Language
Github 588 about 1 month ago
Demo
Lenna: Language Enhanced Reasoning Detection Assistant
Github 78 10 months ago
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Github 286 6 months ago
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Github 294 4 months ago
Demo
Dolphins: Multimodal Language Model for Driving
Github 42 4 months ago
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Github 248 4 months ago
Coming soon
VTimeLLM: Empower LLM to Grasp Video Moments
Github 225 5 months ago
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
Github 1,563 about 2 months ago
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Github 733 4 months ago
Coming soon
LLMGA: Multimodal Large Language Model based Generation Assistant
Github 461 3 months ago
Demo
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
Github 196 12 months ago
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Github 2,521 about 1 month ago
Demo
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Github 121 4 months ago
An Embodied Generalist Agent in 3D World
Github 365 about 1 month ago
Demo
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Github 2,990 about 2 months ago
Demo
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Github 847 about 1 month ago
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Github 131 11 months ago
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Github 2,720 6 months ago
Demo
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Github 1,825 9 days ago
Demo
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Github 704 10 months ago
Demo
NExT-Chat: An LMM for Chat, Detection and Segmentation
Github 217 10 months ago
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Github 2,321 about 1 month ago
Demo
OtterHD: A High-Resolution Multi-modality Model
Github 3,563 9 months ago
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
Coming soon
GLaMM: Pixel Grounding Large Multimodal Model
Github 781 6 months ago
Demo
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Github 18 about 1 year ago
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Github 25,422 3 months ago
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Github 1,053 15 days ago
Ferret: Refer and Ground Anything Anywhere at Any Granularity
Github 8,476 about 1 month ago
CogVLM: Visual Expert For Large Language Models
Github 6,080 6 months ago
Demo
Improved Baselines with Visual Instruction Tuning
Github 20,232 3 months ago
Demo
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Github 723 8 months ago
Demo
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Github 76 5 months ago
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Github 57 10 months ago
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Github 2,521 about 1 month ago
DreamLLM: Synergistic Multimodal Comprehension and Creation
Github 394 7 months ago
Coming soon
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models
Coming soon
TextBind: Multi-turn Interleaved Multimodal Instruction-following
Github 48 about 1 year ago
Demo
NExT-GPT: Any-to-Any Multimodal LLM
Github 3,303 19 days ago
Demo
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Github 19 about 1 year ago
ImageBind-LLM: Multi-modality Instruction Tuning
Github 5,754 8 months ago
Demo
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
PointLLM: Empowering Large Language Models to Understand Point Clouds
Github 647 23 days ago
Demo
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Github 41 5 months ago
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
Github 36 6 months ago
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
Github 36 about 1 year ago
Demo
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Github 5,045 4 months ago
Demo
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Github 1,089 5 months ago
Demo
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
Github 91 11 months ago
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
Github 269 7 months ago
Demo
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
Github 356 6 months ago
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Github 459 3 months ago
Demo
LISA: Reasoning Segmentation via Large Language Model
Github 1,861 5 months ago
Demo
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Github 525 23 days ago
3D-LLM: Injecting the 3D World into Large Language Models
Github 961 6 months ago
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
Demo
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Github 502 over 1 year ago
Demo
SVIT: Scaling up Visual Instruction Tuning
Github 163 5 months ago
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Github 506 5 months ago
Demo
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Github 229 over 1 year ago
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Github 1,563 about 2 months ago
Demo
Visual Instruction Tuning with Polite Flamingo
Github 63 12 months ago
Demo
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Github 258 5 months ago
Demo
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Github 744 5 months ago
Demo
MotionGPT: Human Motion as a Foreign Language
Github 1,505 8 months ago
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Github 1,550 5 months ago
Coming soon
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Github 301 7 months ago
Demo
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Github 1,213 3 months ago
Demo
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Github 3,563 9 months ago
Demo
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Github 2,802 6 months ago
Demo
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Github 1,556 3 months ago
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Github 760 11 months ago
Demo
PandaGPT: One Model To Instruction-Follow Them All
Github 764 over 1 year ago
Demo
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Github 47 about 1 year ago
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Github 508 10 months ago
DetGPT: Detect What You Need via Reasoning
Github 755 4 months ago
Demo
Pengi: An Audio Language Model for Audio Tasks
Github 290 7 months ago
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Github 915 about 1 month ago
Listen, Think, and Understand
Github 385 7 months ago
Demo
Github 4,094 3 months ago
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Github 174 8 months ago
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Github 9,926 about 1 month ago
VideoChat: Chat-Centric Video Understanding
Github 3,068 3 months ago
Demo
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
Github 1,477 over 1 year ago
Demo
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Github 306 over 1 year ago
LMEye: An Interactive Perception Network for Large Language Models
Github 48 4 months ago
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Github 5,754 8 months ago
Demo
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Github 2,321 about 1 month ago
Demo
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Github 25,422 3 months ago
Visual Instruction Tuning
GitHub 20,232 3 months ago
Demo
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Github 5,754 8 months ago
Demo
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Github 133 over 1 year ago

Awesome Papers / Multimodal Hallucination

Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
Github 27 7 days ago
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Github 31 6 days ago
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs
Link
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Github 67 15 days ago
Evaluating and Analyzing Relationship Hallucinations in LVLMs
Github 20 about 1 month ago
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Github 15 4 months ago
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Coming soon
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap
Coming soon
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models
Github 15 about 2 months ago
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
Debiasing Multimodal Large Language Models
Github 71 8 months ago
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Github 69 6 months ago
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective
Github 31 24 days ago
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models
Github 16 5 months ago
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs
Github 8 10 months ago
Unified Hallucination Detection for Multimodal Large Language Models
Github 48 7 months ago
A Survey on Hallucination in Large Vision-Language Models
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
Github 79 10 months ago
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations
Github 12 about 1 month ago
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites
Github 7 10 months ago
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Github 233 2 months ago
Demo
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Github 287 3 months ago
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Github 209 about 2 months ago
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Github 65 10 months ago
Coming soon
Mitigating Hallucination in Visual Language Models with Visual Supervision
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
Github 41 4 months ago
An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Github 93 10 months ago
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
Github 25 13 days ago
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Github 611 5 months ago
Demo
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption
Github 28 8 months ago
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
Github 134 7 months ago
Aligning Large Multimodal Models with Factually Augmented RLHF
Github 319 about 1 year ago
Demo
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning
Evaluation and Analysis of Hallucination in Large Vision-Language Models
Github 17 about 1 year ago
VIGC: Visual Instruction Generation and Correction
Github 90 10 months ago
Demo
Detecting and Preventing Hallucinations in Large Vision Language Models
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Github 255 8 months ago
Demo
Evaluating Object Hallucination in Large Vision-Language Models
Github 179 8 months ago

Awesome Papers / Multimodal In-Context Learning

Visual In-Context Learning for Large Vision-Language Models
Can MLLMs Perform Text-to-Image In-Context Learning?
Github 28 13 days ago
Generative Multimodal Models are In-Context Learners
Github 1,659 about 2 months ago
Demo
Hijacking Context in Large Multi-modal Models
Towards More Unified In-context Visual Understanding
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Github 334 11 months ago
Demo
Link-Context Learning for Multimodal LLMs
Github 89 6 months ago
Demo
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Github 3,742 3 months ago
Demo
Med-Flamingo: a Multimodal Medical Few-shot Learner
Github 384 about 1 year ago
Generative Pretraining in Multimodality
Github 1,659 about 2 months ago
Demo
AVIS: Autonomous Visual Information Seeking with Large Language Models
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Github 3,563 9 months ago
Demo
Exploring Diverse In-Context Configurations for Image Captioning
Github 27 5 months ago
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,087 11 months ago
Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
Github 23,712 about 2 months ago
Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 933 10 months ago
Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
Github 50 over 1 year ago
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering
Github 267 over 1 year ago
Visual Programming: Compositional visual reasoning without training
Github 693 3 months ago
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Github 84 over 2 years ago
Flamingo: a Visual Language Model for Few-Shot Learning
Github 3,742 3 months ago
Demo
Multimodal Few-Shot Learning with Frozen Language Models

Awesome Papers / Multimodal Chain-of-Thought

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
Github 68 7 months ago
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models
Github 134 about 1 month ago
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
Github 33 8 months ago
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Github 744 5 months ago
Demo
Explainable Multimodal Emotion Reasoning
Github 119 7 months ago
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Github 340 7 months ago
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Github 1,682 about 1 year ago
Demo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings
Coming soon
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,087 11 months ago
Demo
Chain of Thought Prompt Tuning in Vision Language Models
Coming soon
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 933 10 months ago
Demo
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Github 34,551 11 months ago
Demo
Multimodal Chain-of-Thought Reasoning in Language Models
Github 3,810 5 months ago
Visual Programming: Compositional visual reasoning without training
Github 693 3 months ago
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Github 606 2 months ago

Awesome Papers / LLM-Aided Visual Reasoning

Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
Github 14 about 1 month ago
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Github 527 11 months ago
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Github 351 4 months ago
Demo
MM-VID: Advancing Video Understanding with GPT-4V(ision)
ControlLLM: Augment Language Models with Tools by Searching on Graphs
Github 186 4 months ago
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Github 611 5 months ago
Demo
MindAgent: Emergent Gaming Interaction
Github 74 5 months ago
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
Github 351 12 months ago
Demo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Github 65 over 1 year ago
AVIS: Autonomous Visual Information Seeking with Large Language Models
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Github 760 11 months ago
Demo
Mindstorms in Natural Language-Based Societies of Mind
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
Github 300 8 months ago
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
Github 32 about 1 year ago
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Github 7 over 1 year ago
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Github 1,682 about 1 year ago
Demo
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Github 1,087 11 months ago
Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
Github 23,712 about 2 months ago
Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Github 933 10 months ago
Demo
ViperGPT: Visual Inference via Python Execution for Reasoning
Github 1,660 10 months ago
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Github 452 over 1 year ago
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Github 34,551 11 months ago
Demo
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
Github 40 over 1 year ago
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Github 9,926 about 1 month ago
Demo
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
Github 94 about 1 year ago
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
Github 228 about 1 year ago
Visual Programming: Compositional visual reasoning without training
Github 693 3 months ago
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Github 34,295 6 days ago

Awesome Papers / Foundation Models

Emu3: Next-Token Prediction is All You Need
Github 1,820 28 days ago
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
Demo
Pixtral-12B
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Github 9,926 about 1 month ago
The Llama 3 Herd of Models
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Hello GPT-4o
The Claude 3 Model Family: Opus, Sonnet, Haiku
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini: A Family of Highly Capable Multimodal Models
Fuyu-8B: A Multimodal Architecture for AI Agents
Huggingface
Demo
Unified Model for Image, Video, Audio and Language Tasks
Github 224 11 months ago
Demo
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
GPT-4V(ision) System Card
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
Github 528 about 2 months ago
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Github 24 12 months ago
Generative Pretraining in Multimodality
Github 1,659 about 2 months ago
Demo
Kosmos-2: Grounding Multimodal Large Language Models to the World
Github 20,176 12 days ago
Demo
Transfer Visual Prompt Generator across LLMs
Github 269 about 1 year ago
Demo
GPT-4 Technical Report
PaLM-E: An Embodied Multimodal Language Model
Demo
Prismer: A Vision-Language Model with An Ensemble of Experts
Github 1,298 10 months ago
Demo
Language Is Not All You Need: Aligning Perception with Language Models
Github 20,176 12 days ago
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Github 9,926 about 1 month ago
Demo
VIMA: General Robot Manipulation with Multimodal Prompts
Github 774 7 months ago
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Github 1,816 8 months ago
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
Github 43 over 1 year ago
Language Models are General-Purpose Interfaces
Github 20,176 12 days ago

Awesome Papers / Evaluation

OmniBench: Towards The Future of Universal Omni-Language Models
Github 14 16 days ago
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Github 78 7 days ago
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Github 2 3 months ago
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Github 22 about 2 months ago
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Github 62 29 days ago
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Github 75 about 1 month ago
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Github 94 4 months ago
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Github 406 5 months ago
Benchmarking Large Multimodal Models against Common Corruptions
Github 27 10 months ago
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Github 288 10 months ago
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Github 12,711 2 days ago
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Github 83 3 months ago
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
Github 67 12 months ago
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Github 24 3 months ago
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
Github 55 about 1 month ago
VLM-Eval: A General Evaluation on Video Large Language Models
Coming soon
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
Github 53 8 months ago
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
Github 287 8 months ago
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging
An Early Evaluation of GPT-4V(ision)
Github 11 about 1 year ago
Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation
Github 120 about 1 year ago
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Github 243 8 days ago
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
Github 237 2 months ago
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
Github 14 about 1 year ago
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Github 20 9 months ago
Can We Edit Multimodal Large Language Models?
Github 1,931 6 days ago
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets
Github 11 about 1 year ago
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
TouchStone: Evaluating Vision-Language Models by Language Models
Github 78 10 months ago
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Github 41 5 months ago
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
Github 37 27 days ago
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
Github 467 7 months ago
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Github 267 17 days ago
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Github 315 4 months ago
MMBench: Is Your Multi-modal Model an All-around Player?
Github 163 3 months ago
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Github 12,711 2 days ago
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Github 467 7 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Github 301 7 months ago
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Github 92 over 1 year ago
On The Hidden Mystery of OCR in Large Multimodal Models
Github 471 about 1 month ago

Awesome Papers / Multimodal RLHF

Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
Silkie: Preference Distillation for Large Visual Language Models
Github 85 11 months ago
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Github 233 2 months ago
Demo
Aligning Large Multimodal Models with Factually Augmented RLHF
Github 319 about 1 year ago
Demo

Awesome Papers / Others

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Github 45 3 months ago
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Github 261 7 months ago
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
Github 132 4 months ago
Planting a SEED of Vision in Large Language Model
Github 576 2 months ago
Can Large Pre-trained Models Help Vision Models on Perception Tasks?
Github 1,202 16 days ago
Contextual Object Detection with Multimodal Large Language Models
Github 202 about 1 month ago
Demo
Generating Images with Multimodal Language Models
Github 430 10 months ago
On Evaluating Adversarial Robustness of Large Vision-Language Models
Github 161 about 1 year ago
Grounding Language Models to Images for Multimodal Inputs and Outputs
Github 478 about 1 year ago
Demo

Awesome Datasets / Datasets of Pre-Training for Alignment

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
COYO-700M: Image-Text Pair Dataset
1,163 almost 2 years ago
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Microsoft COCO: Common Objects in Context
Im2Text: Describing Images Using 1 Million Captioned Photographs
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Kosmos-2: Grounding Multimodal Large Language Models to the World
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Awesome Datasets / Datasets of Multimodal Instruction Tuning

UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Link 2 3 months ago
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Link 33 5 months ago
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Link
Visually Dehallucinative Instruction Generation: Know What You Don't Know
Link 6 9 months ago
Visually Dehallucinative Instruction Generation
Link 5 8 months ago
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Link 57 about 2 months ago
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Link
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Link
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Link 18 about 1 year ago
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Link 41 5 months ago
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
Link 91 11 months ago
Detecting and Preventing Hallucinations in Large Vision Language Models
Coming soon
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Link
SVIT: Scaling up Visual Instruction Tuning
Link
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Link 1,563 about 2 months ago
Visual Instruction Tuning with Polite Flamingo
Link
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
Link
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Link
MotionGPT: Human Motion as a Foreign Language
Link 1,505 8 months ago
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Link 255 8 months ago
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Link 1,550 5 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Link 301 7 months ago
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Link 1,213 3 months ago
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Link 3,563 9 months ago
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
Link
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Coming soon 1,556 3 months ago
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Link 760 11 months ago
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Coming soon
DetGPT: Detect What You Need via Reasoning
Link 755 4 months ago
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Coming soon
VideoChat: Chat-Centric Video Understanding
Link 1,413 about 2 months ago
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Link 306 over 1 year ago
LMEye: An Interactive Perception Network for Large Language Models
Link
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Link
Visual Instruction Tuning
Link
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Link 133 over 1 year ago

Awesome Datasets / Datasets of In-Context Learning

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Link
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Link 3,563 9 months ago

Awesome Datasets / Datasets of Multimodal Chain-of-Thought

Explainable Multimodal Emotion Reasoning
Coming soon 119 7 months ago
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Coming soon 340 7 months ago
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction
Coming soon
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Link 606 2 months ago

Awesome Datasets / Datasets of Multimodal RLHF

Silkie: Preference Distillation for Large Visual Language Models
Link

Awesome Datasets / Benchmarks for Evaluation

LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Link
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Link
OmniBench: Towards The Future of Universal Omni-Language Models
Link
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Link
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Link
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Link 406 5 months ago
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning
Link 28 8 months ago
TempCompass: Do Video LLMs Really Understand Videos?
Link 84 7 days ago
Can MLLMs Perform Text-to-Image In-Context Learning?
Link
Visually Dehallucinative Instruction Generation: Know What You Don't Know
Link 6 9 months ago
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
Link 69 about 1 month ago
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Link 46 3 months ago
Benchmarking Large Multimodal Models against Common Corruptions
Link 27 10 months ago
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Link 288 10 months ago
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Link
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Link
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Link 57 about 2 months ago
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Link 117 11 months ago
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Link 24 3 months ago
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
Link 55 about 1 month ago
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Link
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
Link 84 about 2 months ago
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Link 3,068 3 months ago
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
Link 53 8 months ago
OtterHD: A High-Resolution Multi-modality Model
Link
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Link 243 8 days ago
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond
Link 100 8 months ago
Aligning Large Multimodal Models with Factually Augmented RLHF
Link
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
Link
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Link 41 5 months ago
Link-Context Learning for Multimodal LLMs
Link
Detecting and Preventing Hallucinations in Large Vision Language Models
Coming soon
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
Link 356 6 months ago
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
Link 37 27 days ago
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Link 267 17 days ago
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Link 315 4 months ago
MMBench: Is Your Multi-modal Model an All-around Player?
Link 163 3 months ago
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Link 229 over 1 year ago
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Link 255 8 months ago
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Link 12,711 2 days ago
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Link 467 7 months ago
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Link 301 7 months ago
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Link 92 over 1 year ago
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Link 2,321 about 1 month ago

Awesome Datasets / Others

IMAD: IMage-Augmented multi-modal Dialogue
Link 4 over 1 year ago
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Link 1,213 3 months ago
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Link
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Link
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Link

More related projects: