LLaMA-VID

Video image processor

A video-language model that represents each video frame with two tokens, so that a large language model can process long videos alongside text

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)

GitHub

748 stars
14 watching
45 forks
Language: Python
Last commit: 6 months ago

Related projects:

Repository | Description | Stars
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 353
dvlab-research/lisa | A system that uses large language models to generate segmentation masks for images based on complex queries and world knowledge | 1,923
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302
dvlab-research/llmga | An implementation of a multimodal generation assistant using large language models and various image editing techniques | 463
opengvlab/visionllm | A large language model designed to process and generate visual information | 956
freedomintelligence/longllava | A system for scaling large language models to process and understand visual information from multiple images efficiently | 183
360cvgroup/360vl | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities | 32
damo-nlp-sg/videollama2 | An audio-visual language model designed to advance spatial-temporal modeling and audio understanding in video processing | 957
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 585
mlpc-ucsd/bliva | A multimodal LLM designed to handle text-rich visual questions | 270
libav/libav | A collection of libraries and tools for processing multimedia content | 1,086
liuzhao1225/youdub-webui | A web-based video processing tool that uses AI to facilitate cultural and linguistic tasks such as transcription, translation, and audio synthesis | 1,980
libvips/lua-vips | A Lua binding for a fast image processing library with low memory needs | 129
luispedro/mahotas | A library of fast computer vision algorithms implemented in C++ for speed, operating over numpy arrays | 855
dvlab-research/prompt-highlighter | An interactive control system for text generation in multi-modal language models | 135