LLaMA-VID
Video image processor
A vision-language model that represents each video frame with two tokens (a context token and a content token) so that a large language model can process long videos
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
748 stars
14 watching
45 forks
Language: Python
last commit: 6 months ago

Related projects:

Repository | Description | Stars |
---|---|---|
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 353 |
dvlab-research/lisa | A system that uses large language models to generate segmentation masks for images based on complex queries and world knowledge | 1,923 |
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
dvlab-research/llmga | An implementation of a multimodal generation assistant using large language models and various image editing techniques | 463 |
opengvlab/visionllm | A large language model designed to process and generate visual information | 956 |
freedomintelligence/longllava | A system for scaling large language models to process and understand visual information from multiple images efficiently | 183 |
360cvgroup/360vl | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities | 32 |
damo-nlp-sg/videollama2 | An audio-visual language model designed to advance spatial-temporal modeling and audio understanding in video processing | 957 |
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 585 |
mlpc-ucsd/bliva | A multimodal LLM designed to handle text-rich visual questions | 270 |
libav/libav | A collection of libraries and tools for processing multimedia content | 1,086 |
liuzhao1225/youdub-webui | A web-based video processing tool that uses AI to facilitate cultural and linguistic tasks such as transcription, translation, and audio synthesis | 1,980 |
libvips/lua-vips | A Lua binding for a fast image processing library with low memory needs | 129 |
luispedro/mahotas | A library of fast computer vision algorithms implemented in C++ for speed, operating over NumPy arrays | 855 |
dvlab-research/prompt-highlighter | An interactive control system for text generation in multi-modal language models | 135 |