Visual-CoT
Visual reasoning engine
A framework for training multi-modal language models that focuses on visual inputs and produces interpretable intermediate reasoning steps.
[NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
162 stars
1 watching
7 forks
Language: Python
Last commit: 3 months ago
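
For context, the chain-of-thought pipeline the paper describes can be pictured as a two-step loop: the model first emits an interpretable intermediate step (a bounding box around the question-relevant image region), then answers using both the full image and a zoomed-in crop of that region. The sketch below is a minimal, hypothetical illustration of that idea; `VisCoTModel`, `predict_region`, and `generate` are placeholder names, not this repository's actual API.

```python
from PIL import Image


def visual_cot_answer(model: "VisCoTModel", image: Image.Image, question: str) -> str:
    # Step 1: localize the region relevant to the question. The predicted
    # box is the visible "thought" that makes the reasoning interpretable.
    # (predict_region is a hypothetical placeholder method.)
    prompt = (
        f"{question} "
        "First output the bounding box of the image region needed to answer."
    )
    left, top, right, bottom = model.predict_region(image, prompt)

    # Step 2: zoom into that region and answer using both views.
    # (generate is likewise a placeholder for the model's answer step.)
    crop = image.crop((left, top, right, bottom))
    return model.generate(images=[image, crop], prompt=question)
```

Separating localization from answering is what makes the reasoning inspectable: a reader can check the predicted box before trusting the final answer.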
Related projects:

| Repository | Description | Stars |
|---|---|---|
| | An open-source project providing PyTorch code and data for a deep learning model that enables visual commonsense reasoning. | 466 |
| | An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,336 |
| | A deep learning framework providing a model architecture and training code for image captioning using semantic compositional networks. | 70 |
| | An open-source implementation of a deep learning model designed to improve the balance between performance and interpretability in visual reasoning tasks. | 348 |
| | A library for training and evaluating a word embedding model that extends the original Word2Vec algorithm. | 20 |
| | A project that creates synthetic visual reasoning instructions to improve the performance of large language models on image-related tasks. | 18 |
| | A PyTorch implementation of visual question answering with multimodal representation learning. | 718 |
| | A method for learning multi-sense word embeddings from visual co-occurrences. | 25 |
| | A large multi-modal model built on the Llama3 language model, designed to improve image understanding capabilities. | 32 |
| | A multimodal AI model that enables real-world vision-language understanding applications. | 2,145 |
| | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy. | 259 |
| | A PyTorch implementation of DeepLab v2 for semantic segmentation on the COCO-Stuff and PASCAL VOC datasets. | 1,098 |
| | An efficient framework for end-to-end learning on image-text and video-text tasks. | 709 |
| | A method for learning word embeddings from abstract images to improve language understanding. | 19 |
| | A pretraining approach that uses semantically dense captions to learn visual representations and improve image understanding tasks. | 556 |