OMG-Seg
Visual Model
Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model.
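The single-encoder, single-decoder, single-LLM design described above can be illustrated with a minimal numpy sketch: one frozen visual encoder produces patch features, a universal decoder turns them into object queries, and a projector maps the result into the LLM's embedding space. All names, dimensions, and the attention form below are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, llm_dim, num_queries = 256, 512, 100

# Frozen visual encoder (stand-in: a fixed linear projection of ViT-style patches)
W_enc = rng.standard_normal((768, feat_dim)) / np.sqrt(768)

def encode(patches):                     # (N, 768) -> (N, feat_dim)
    return patches @ W_enc

# Universal decoder: learnable object queries cross-attend over patch features
queries = rng.standard_normal((num_queries, feat_dim))

def decode(feats):                       # (N, feat_dim) -> (num_queries, feat_dim)
    attn = queries @ feats.T             # (num_queries, N) attention logits
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ feats                  # weighted pooling of patch features

# Projector into the LLM token-embedding space
W_proj = rng.standard_normal((feat_dim, llm_dim)) / np.sqrt(feat_dim)

patches = rng.standard_normal((196, 768))        # a 14x14 patch grid
visual_tokens = decode(encode(patches)) @ W_proj
print(visual_tokens.shape)                       # (100, 512)
```

In this sketch the same decoder output serves both roles the description implies: the queries can be scored against pixels for segmentation masks, and, once projected, fed to the LLM as visual tokens for reasoning.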
OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24]
1k stars
22 watching
50 forks
Language: Python
last commit: 11 months ago

Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A large language model designed to process and generate visual information | 956 |
| | A guide to using pre-trained large language models in source code analysis and generation | 1,789 |
| | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 270 |
| | An implementation of DeepMind's Relational Recurrent Neural Networks (Santoro et al. 2018) in PyTorch for word language modeling | 245 |
| | A framework for training multi-modal language models with a focus on visual inputs and providing interpretable thoughts | 162 |
| | An open-source implementation of a vision-language instructed large language model | 513 |
| | A pre-trained language model designed for various NLP tasks, including dialogue generation, code completion, and retrieval | 94 |
| | A Visual Question Answering model using a deeper LSTM and normalized CNN architecture | 377 |
| | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities. | 32 |
| | A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens. | 101 |
| | Provides PyTorch implementations of several computer vision tasks, including object detection, segmentation, and parsing | 1,191 |
| | Develops a multimodal Chinese language model with visual capabilities | 429 |
| | Develops large multimodal models for high-resolution understanding and analysis of text, images, and other data types. | 143 |
| | A platform for training and deploying large language and vision models that can use tools to perform tasks | 717 |
| | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 259 |