OMG-Seg
Visual Model
Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model.
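The single-shared-encoder design described above can be sketched roughly as follows. This is an illustrative toy, not the actual OMG-Seg/OMG-LLaVA implementation: all class names, the scaling "encoder", and the threshold "decoder" are placeholders standing in for a real vision backbone, mask decoder, and LLM.

```python
# Hypothetical sketch: one shared visual encoder feeds both a
# segmentation decoder (perception) and a language model (reasoning).
# Every name and the toy math here are illustrative placeholders.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VisualTokens:
    """Features produced by the shared image encoder."""
    features: List[float]


class SharedEncoder:
    def encode(self, image: List[float]) -> VisualTokens:
        # Stand-in for a ViT/CLIP-style backbone: trivial scaling.
        return VisualTokens(features=[x * 0.5 for x in image])


class SegmentationDecoder:
    def decode(self, tokens: VisualTokens) -> List[int]:
        # Stand-in for mask queries: threshold features into a binary mask.
        return [1 if f > 0.25 else 0 for f in tokens.features]


class LanguageModel:
    def answer(self, tokens: VisualTokens, question: str) -> str:
        # Stand-in for an LLM conditioned on the same visual tokens.
        n_fg = sum(1 for f in tokens.features if f > 0.25)
        return f"{question} -> {n_fg} foreground tokens"


def run_pipeline(image: List[float], question: str) -> Tuple[List[int], str]:
    tokens = SharedEncoder().encode(image)            # single shared encoder
    mask = SegmentationDecoder().decode(tokens)       # perception task
    reply = LanguageModel().answer(tokens, question)  # reasoning task
    return mask, reply


mask, reply = run_pipeline([0.1, 0.9, 0.6], "How many objects?")
print(mask, reply)
```

The point of the sketch is the data flow: both task heads consume the same `VisualTokens`, so the image is encoded once regardless of how many perception or reasoning tasks run on it.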
OMG-LLaVA and OMG-Seg codebase [CVPR 2024 and NeurIPS 2024]
1k stars
23 watching
49 forks
Language: Python
Last commit: about 2 months ago

Related projects:
Repository | Description | Stars |
---|---|---|
opengvlab/visionllm | A large language model designed to process and generate visual information | 915 |
vhellendoorn/code-lms | A guide to using pre-trained large language models in source code analysis and generation | 1,782 |
vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 269 |
l0sg/relational-rnn-pytorch | An implementation of DeepMind's Relational Recurrent Neural Networks (Santoro et al. 2018) in PyTorch for word language modeling | 244 |
deepcs233/visual-cot | Develops a multi-modal language model with a comprehensive dataset and benchmark for chain-of-thought reasoning | 134 |
luogen1996/lavin | An open-source implementation of a vision-language instructed large language model | 508 |
opennlg/openba | A pre-trained language model designed for various NLP tasks, including dialogue generation, code completion, and retrieval | 94 |
gt-vision-lab/vqa_lstm_cnn | A Visual Question Answering model using a deeper LSTM and normalized CNN architecture | 376 |
360cvgroup/360vl | A large multi-modal model built on the Llama3 language model, designed to improve image understanding capabilities | 30 |
gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens and allows a flexible choice of the number of visual tokens | 97 |
openseg-group/openseg.pytorch | Provides PyTorch implementations of several computer vision tasks, including object detection, segmentation, and parsing | 1,190 |
airaria/visual-chinese-llama-alpaca | Develops a multimodal Chinese language model with visual capabilities | 424 |
yfzhang114/slime | Develops large multimodal models for high-resolution understanding and analysis of text, images, and other data types. | 137 |
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 704 |
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 243 |