Osprey

Visual guidance

This project presents a new approach to fine-grained visual understanding that incorporates pixel-wise mask regions into language instructions.

[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"

GitHub

770 stars
14 watching
43 forks
Language: Python
Last commit: 4 months ago
Topics: mllm, pixel-understanding, sam, visual-instruction-tuning

Related projects:

Repository | Description | Stars
rucaibox/comvint | Creating synthetic visual reasoning instructions to improve the performance of large language models on image-related tasks | 18
salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets | 258
roboflow/maestro | A tool to streamline fine-tuning of multimodal models for vision-language tasks | 1,386
ys-zong/vlguard | Improves safety and helpfulness of large language models by fine-tuning them using safety-critical tasks | 45
jshilong/gpt4roi | Training and deploying large language models on computer vision tasks using region-of-interest inputs | 506
aidc-ai/parrot | A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages | 30
aidc-ai/ovis | An architecture designed to align visual and textual embeddings in multimodal learning | 517
penghao-wu/vstar | PyTorch implementation of guided visual search mechanism for multimodal LLMs | 527
bigredt/vico | Multi-sense word embeddings learned from visual cooccurrences | 25
codeplant/simple-navigation | A Ruby gem for creating hierarchical navigation structures in web applications | 885
baai-dcai/visual-instruction-tuning | A dataset and model designed to scale visual instruction tuning using language-only GPT-4 models | 163
byungkwanlee/moai | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 311
kunpengli1994/vsrn | An open-source PyTorch implementation of a visual semantic reasoning model for image-text matching | 294
sy-xuan/pink | This project enables multi-modal language models to understand and generate text about visual content using referential comprehension | 76
dannnylo/rtesseract | A Ruby library providing an interface to the Tesseract OCR system | 828