Visual-CoT
Visual reasoning engine
A framework for training multi-modal language models that focuses on visual inputs and produces interpretable intermediate reasoning steps.
[NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
162 stars
1 watching
7 forks
Language: Python
last commit: about 2 months ago
Related projects:
Repository | Description | Stars |
---|---|---|
rowanz/r2c | An open-source project providing PyTorch code and data for a deep learning model that enables visual commonsense reasoning. | 466 |
lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,336 |
zhegan27/semantic_compositional_nets | A deep learning framework providing a model architecture and training code for image captioning using semantic compositional networks | 70 |
davidmascharka/tbd-nets | An open-source implementation of a deep learning model designed to improve the balance between performance and interpretability in visual reasoning tasks. | 348 |
cod3licious/conec | A library for training and evaluating a type of word embedding model that extends the original Word2Vec algorithm | 20 |
rucaibox/comvint | A project that creates synthetic visual reasoning instructions to improve the performance of large language models on image-related tasks | 18 |
cadene/vqa.pytorch | A PyTorch implementation of visual question answering with multimodal representation learning | 718 |
bigredt/vico | Multi-sense word embeddings learned from visual cooccurrences | 25 |
360cvgroup/360vl | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities. | 32 |
deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications | 2,145 |
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 259 |
kazuto1011/deeplab-pytorch | PyTorch implementation of DeepLab v2 for semantic segmentation on COCO-Stuff and PASCAL VOC datasets | 1,098 |
jayleicn/clipbert | An efficient framework for end-to-end learning on image-text and video-text tasks | 709 |
satwikkottur/visualword2vec | Learning word embeddings from abstract images to improve language understanding | 19 |
kdexd/virtex | A pretraining approach that uses semantically dense captions to learn visual representations and improve image understanding tasks. | 556 |