VisionLLM
Visual decoder
A large language model designed to process and generate visual information
VisionLLM Series
956 stars
45 watching
29 forks
Language: Python
last commit: 4 months ago generalist-modellarge-language-modelsobject-detection
Related projects:
Repository | Description | Stars |
---|---|---|
| Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 270 |
| An interactive tool that connects multiple visual models and an LLM to facilitate text-based conversations. | 1,213 |
| A system that uses large language models to generate segmentation masks for images based on complex queries and world knowledge. | 1,923 |
| A research project that develops tools and models for understanding visual data in the open world, enabling applications such as image-text retrieval and relation comprehension. | 466 |
| Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,336 |
| An image-based language model that uses large language models to generate visual and text features from videos | 748 |
| A deep learning framework for training multi-modal models with vision and language capabilities. | 1,299 |
| An open-source implementation of a vision-language instructed large language model | 513 |
| An open-source project that enables the transfer of language understanding to vision capabilities through long context processing. | 347 |
| An open-source framework for augmenting large language models with tools by searching on graphs to solve complex real-world tasks. | 187 |
| A guide to using pre-trained large language models in source code analysis and generation | 1,789 |
| A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities. | 32 |
| A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens. | 101 |
| A family of large multimodal models supporting multimodal conversational capabilities and text-to-image generation in multiple languages | 1,098 |
| An implementation of a multimodal language model with capabilities for comprehension and generation | 585 |