BuboGPT

Visual grounding framework

An open-source framework that enables multi-modal LLMs to jointly understand text, vision, and audio, and to ground that knowledge in visual objects.
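In miniature, "grounding knowledge in visual objects" means linking entity mentions in a model's text output to detected regions of the image. The sketch below is a hypothetical illustration of that idea only; the `Region` class and `ground_mentions` function are invented for this example and are not part of BuboGPT's actual codebase or API.

```python
# Hypothetical sketch of visual grounding: link entity mentions in an
# LLM's answer text to detected image regions by label matching.
# (Illustrative only -- not BuboGPT's real interface.)
from dataclasses import dataclass


@dataclass
class Region:
    """A detected object with its bounding box (x1, y1, x2, y2)."""
    label: str
    box: tuple


def ground_mentions(answer: str, regions: list) -> dict:
    """Return {label: box} for each region whose label appears in the answer."""
    answer_lower = answer.lower()
    return {r.label: r.box for r in regions if r.label.lower() in answer_lower}


regions = [Region("dog", (10, 20, 120, 200)), Region("frisbee", (130, 40, 180, 90))]
answer = "A dog is jumping to catch a frisbee in the park."
print(ground_mentions(answer, regions))
# → {'dog': (10, 20, 120, 200), 'frisbee': (130, 40, 180, 90)}
```

A real system would replace the string match with an off-the-shelf detector plus an entity-matching module, but the output shape, text spans tied to boxes, is the same.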

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

GitHub

502 stars · 10 watching · 35 forks
Language: Python
Last commit: over 1 year ago

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| theshadow29/zsgnet-pytorch | An implementation of a computer vision model that grounds objects in images using natural language queries. | 69 |
| google-research/visu3d | An abstraction layer between various deep learning frameworks and your program. | 147 |
| hxyou/idealgpt | A deep learning framework for iteratively decomposing vision and language reasoning via large language models. | 32 |
| lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,300 |
| mbzuai-oryx/groundinglmm | An end-to-end trained model capable of generating natural language responses integrated with object segmentation masks. | 781 |
| jy0205/lavit | A unified framework for training large language models to understand and generate visual content. | 528 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy. | 243 |
| jhcho99/coformer | An implementation of a deep learning model for grounding situation recognition in images. | 43 |
| penghao-wu/vstar | A PyTorch implementation of a guided visual search mechanism for multimodal LLMs. | 527 |
| vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs. | 269 |
| davidmascharka/tbd-nets | An open-source implementation of a deep learning model designed to improve the balance between performance and interpretability in visual reasoning tasks. | 348 |
| asappresearch/flambe | An ML framework for accelerating research and its integration into production workflows. | 262 |
| nvlabs/bongard-hoi | A benchmarking tool and software framework for evaluating few-shot visual reasoning capabilities in computer vision models. | 64 |
| aifeg/benchlmm | An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models. | 83 |
| s-gupta/visual-concepts | A framework for detecting visual concepts in images by leveraging image captions and pre-trained models. | 151 |