Osprey

Visual guidance

This project presents a new approach to fine-grained visual understanding that incorporates pixel-wise mask regions into language instructions.

[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"

GitHub

781 stars

14 watching

42 forks

Language: Python

Last commit: 11 months ago

mllm, pixel-understanding, sam, visual-instruction-tuning

Related projects:

Repository	Description	Stars
rucaibox/comvint	Creating synthetic visual reasoning instructions to improve the performance of large language models on image-related tasks	18
salt-nlp/llavar	An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets	259
roboflow/maestro	A tool to streamline fine-tuning of multimodal models for vision-language tasks	1,415
ys-zong/vlguard	Improves safety and helpfulness of large language models by fine-tuning them using safety-critical tasks	47
jshilong/gpt4roi	Training and deploying large language models on computer vision tasks using region-of-interest inputs	517
aidc-ai/parrot	A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages	34
aidc-ai/ovis	An MLLM architecture designed to align visual and textual embeddings through structural alignment	575
penghao-wu/vstar	PyTorch implementation of guided visual search mechanism for multimodal LLMs	541
bigredt/vico	Multi-sense word embeddings learned from visual cooccurrences	25
codeplant/simple-navigation	A Ruby gem for creating hierarchical navigation structures in web applications	886
baai-dcai/visual-instruction-tuning	A dataset and model designed to scale visual instruction tuning using language-only GPT-4 models	164
byungkwanlee/moai	Improves performance of vision language tasks by integrating computer vision capabilities into large language models	314
kunpengli1994/vsrn	An open-source PyTorch implementation of a visual semantic reasoning model for image-text matching	294
sy-xuan/pink	Enables multimodal language models to understand and generate text about visual content using referential comprehension	79
dannnylo/rtesseract	A Ruby library providing an interface to the Tesseract OCR system.	838