OMG-Seg

Visual Model

An end-to-end model that handles multiple visual perception and reasoning tasks with a single shared encoder, decoder, and large language model.

Codebase for OMG-Seg (CVPR 2024) and OMG-LLaVA (NeurIPS 2024)
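The shared encoder/decoder design described above can be sketched in miniature. The code below is a hypothetical toy illustration, not the actual OMG-Seg implementation: a single encoder produces feature tokens, a single query-based decoder produces query embeddings, and multiple task heads (here, segmentation masks and classification) reuse those same embeddings. All class and function names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedEncoder:
    """Toy stand-in for a single shared image encoder (hypothetical)."""
    def __init__(self, dim=8):
        self.w = rng.standard_normal((3, dim))

    def __call__(self, image):
        # image: (H*W, 3) flattened RGB pixels -> (H*W, dim) feature tokens
        return image @ self.w

class SharedDecoder:
    """Toy query-based decoder: N learned queries attend to the features."""
    def __init__(self, num_queries=4, dim=8):
        self.queries = rng.standard_normal((num_queries, dim))

    def __call__(self, feats):
        attn = self.queries @ feats.T                      # (N, H*W) similarity
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)            # softmax over pixels
        return attn @ feats                                # (N, dim) query embeddings

def segment(emb, feats):
    """Task head 1: per-query mask logits over all pixels."""
    return emb @ feats.T                                   # (N, H*W)

def classify(emb, num_classes=5):
    """Task head 2: per-query class logits, reusing the same embeddings."""
    w = rng.standard_normal((emb.shape[1], num_classes))
    return emb @ w                                         # (N, num_classes)

# One forward pass: both tasks share the encoder and decoder outputs.
image = rng.standard_normal((16, 3))   # a 4x4 "image", flattened
feats = SharedEncoder()(image)
emb = SharedDecoder()(feats)
masks = segment(emb, feats)            # (4, 16)
logits = classify(emb)                 # (4, 5)
```

The point of the sketch is the sharing: both task heads consume the same query embeddings, so adding a task means adding a small head rather than a new backbone.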

GitHub

1k stars
23 watching
49 forks
Language: Python
Last commit: about 2 months ago

Related projects:

| Repository | Description | Stars |
|---|---|---|
| opengvlab/visionllm | A large language model designed to process and generate visual information | 915 |
| vhellendoorn/code-lms | A guide to using pre-trained large language models in source code analysis and generation | 1,782 |
| vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 269 |
| l0sg/relational-rnn-pytorch | An implementation of DeepMind's Relational Recurrent Neural Networks (Santoro et al. 2018) in PyTorch for word language modeling | 244 |
| deepcs233/visual-cot | Develops a multi-modal language model with a comprehensive dataset and benchmark for chain-of-thought reasoning | 134 |
| luogen1996/lavin | An open-source implementation of a vision-language instructed large language model | 508 |
| opennlg/openba | A pre-trained language model designed for various NLP tasks, including dialogue generation, code completion, and retrieval | 94 |
| gt-vision-lab/vqa_lstm_cnn | A Visual Question Answering model using a deeper LSTM and normalized CNN architecture | 376 |
| 360cvgroup/360vl | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities | 30 |
| gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens | 97 |
| openseg-group/openseg.pytorch | Provides a PyTorch implementation of several computer vision tasks including object detection, segmentation, and parsing | 1,190 |
| airaria/visual-chinese-llama-alpaca | Develops a multimodal Chinese language model with visual capabilities | 424 |
| yfzhang114/slime | Develops large multimodal models for high-resolution understanding and analysis of text, images, and other data types | 137 |
| llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 704 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 243 |