Visual-CoT

Visual reasoning engine

A framework for training multi-modal language models that focus on visual inputs and produce interpretable reasoning steps.

[NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

GitHub

162 stars
1 watching
7 forks
Language: Python
Last commit: about 2 months ago

Related projects:

Repository | Description | Stars
rowanz/r2c | An open-source project providing PyTorch code and data for a deep learning model that enables visual commonsense reasoning. | 466
lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,336
zhegan27/semantic_compositional_nets | A deep learning framework providing a model architecture and training code for image captioning using semantic compositional networks. | 70
davidmascharka/tbd-nets | An open-source implementation of a deep learning model designed to improve the balance between performance and interpretability in visual reasoning tasks. | 348
cod3licious/conec | A library for training and evaluating a type of word embedding model that extends the original Word2Vec algorithm. | 20
rucaibox/comvint | Creating synthetic visual reasoning instructions to improve the performance of large language models on image-related tasks. | 18
cadene/vqa.pytorch | A PyTorch implementation of visual question answering with multimodal representation learning. | 718
bigredt/vico | Multi-sense word embeddings learned from visual cooccurrences. | 25
360cvgroup/360vl | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities. | 32
deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications. | 2,145
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy. | 259
kazuto1011/deeplab-pytorch | PyTorch implementation of DeepLab v2 for semantic segmentation on COCO-Stuff and PASCAL VOC datasets. | 1,098
jayleicn/clipbert | An efficient framework for end-to-end learning on image-text and video-text tasks. | 709
satwikkottur/visualword2vec | Learning word embeddings from abstract images to improve language understanding. | 19
kdexd/virtex | A pretraining approach that uses semantically dense captions to learn visual representations and improve image understanding tasks. | 556