LLaMA-VID

Video image processor

A vision-language model that represents each video frame with two tokens (a context token and a content token), so that a large language model can process long videos as well as single images

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)

GitHub

733 stars
14 watching
44 forks
Language: Python
Last commit: 4 months ago

Related projects:

Repository | Description | Stars
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 351
dvlab-research/lisa | A system that uses large language models to generate segmentation masks for images based on complex queries and world knowledge | 1,861
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 294
dvlab-research/llmga | An implementation of a multimodal generation assistant using large language models and various image editing techniques | 461
opengvlab/visionllm | A large language model designed to process and generate visual information | 915
freedomintelligence/longllava | A system for scaling large language models to process and understand visual information from multiple images efficiently | 179
360cvgroup/360vl | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities | 30
damo-nlp-sg/videollama2 | An audio-visual language model designed to understand and generate video content | 871
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 576
mlpc-ucsd/bliva | A multimodal LLM designed to handle text-rich visual questions | 269
libav/libav | A collection of libraries and tools for processing multimedia content | 1,082
liuzhao1225/youdub-webui | A web-based video processing tool that uses AI to facilitate cultural and linguistic tasks such as transcription, translation, and audio synthesis | 1,940
libvips/lua-vips | A Lua binding for a fast image processing library with low memory needs | 127
luispedro/mahotas | A library of fast computer vision algorithms implemented in C++ for speed, operating over numpy arrays | 844
dvlab-research/prompt-highlighter | An interactive control system for text generation in multi-modal language models | 132