bubogpt — Visual grounding framework

An open-source framework enabling multi-modal LLMs to jointly understand text, vision, and audio, and to ground knowledge in visual objects.

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Stars: 505 | Watching: 10 | Forks: 35
Language: Python
Last commit: over 1 year ago

Related projects:
| Repository | Description | Stars |
|---|---|---|
| theshadow29/zsgnet-pytorch | An implementation of a computer vision model that grounds objects in images using natural language queries. | 69 |
| google-research/visu3d | An abstraction layer between various deep learning frameworks and your program. | 149 |
| hxyou/idealgpt | A deep learning framework for iteratively decomposing vision and language reasoning via large language models. | 32 |
| lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,336 |
| mbzuai-oryx/groundinglmm | An end-to-end trained model capable of generating natural language responses integrated with object segmentation masks for interactive visual conversations. | 797 |
| jy0205/lavit | A unified framework for training large language models to understand and generate visual content. | 544 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy. | 259 |
| jhcho99/coformer | An implementation of a deep learning model for grounded situation recognition in images. | 45 |
| penghao-wu/vstar | A PyTorch implementation of a guided visual search mechanism for multimodal LLMs. | 541 |
| vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs. | 270 |
| davidmascharka/tbd-nets | An open-source implementation of a deep learning model designed to improve the balance between performance and interpretability in visual reasoning tasks. | 348 |
| asappresearch/flambe | An ML framework for accelerating research and its integration into production workflows. | 264 |
| nvlabs/bongard-hoi | A benchmarking tool and software framework for evaluating few-shot visual reasoning capabilities in computer vision models. | 64 |
| aifeg/benchlmm | An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models. | 84 |
| s-gupta/visual-concepts | A framework for detecting visual concepts in images by leveraging image captions and pre-trained models. | 150 |