VisionLLM

Visual decoder

A large language model designed to process and generate visual information

VisionLLM Series

GitHub

956 stars
45 watching
29 forks
Language: Python
Last commit: 3 months ago
Topics: generalist-model, large-language-models, object-detection

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 270 |
| visual-openllm/visual-openllm | An interactive tool that connects multiple visual models and an LLM to enable text-based conversations | 1,213 |
| dvlab-research/lisa | A system that uses large language models to generate segmentation masks for images based on complex queries and world knowledge | 1,923 |
| opengvlab/all-seeing | A research project that develops tools and models for understanding visual data in the open world, enabling applications such as image-text retrieval and relation comprehension | 466 |
| lxtgh/omg-seg | An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336 |
| dvlab-research/llama-vid | A video-oriented language model that uses large language models to generate visual and text features from videos | 748 |
| nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities | 1,299 |
| luogen1996/lavin | An open-source implementation of a vision-language instructed large language model | 513 |
| evolvinglmms-lab/longva | An open-source project that transfers language understanding to vision capabilities through long-context processing | 347 |
| opengvlab/controlllm | An open-source framework that augments large language models with tools by searching on graphs to solve complex real-world tasks | 187 |
| vhellendoorn/code-lms | A guide to using pre-trained large language models for source code analysis and generation | 1,789 |
| 360cvgroup/360vl | A large multi-modal model built on the Llama3 language model, designed to improve image understanding | 32 |
| gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens, allowing a flexible choice of the number of visual tokens | 101 |
| openbmb/viscpm | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,098 |
| ailab-cvc/seed | An implementation of a multimodal language model with capabilities for both comprehension and generation | 585 |