LLaVAR

Visual Instruction Tuning

An open-source project that enhances visual instruction tuning for text-rich image understanding, using OCR results from text-rich images together with text-only GPT-4 to generate multimodal instruction-following data.

Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
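To make "visual instruction tuning" concrete, the sketch below shows what one training sample typically looks like in the LLaVA-style conversation schema that projects in this space commonly use. This is an illustrative example, not LLaVAR's actual code; the image path, OCR-derived question, and answer text are invented placeholders.

```python
# Illustrative sketch (not LLaVAR's released code): a visual instruction
# tuning sample in the widely used LLaVA-style JSON schema. In LLaVAR's
# pipeline, a text-only GPT-4 is prompted with OCR output from a text-rich
# image to produce instruction-response pairs like the one below.
import json

sample = {
    "id": "000000001",
    "image": "train_images/poster_001.jpg",  # hypothetical path
    "conversations": [
        {
            "from": "human",
            # "<image>" marks where visual tokens are spliced into the prompt
            "value": "<image>\nWhat event does this poster advertise?",
        },
        {
            "from": "gpt",
            "value": "The poster advertises a jazz concert on June 14.",
        },
    ],
}

print(json.dumps(sample, indent=2))
```

At training time, the model sees the image features in place of the `<image>` token and is supervised to reproduce the assistant ("gpt") turns.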

GitHub

258 stars
5 watching
12 forks
Language: Python
Last commit: 5 months ago
Topics: chatbot, chatgpt, gpt-4, instruction-tuning, llava, multimodal, ocr, vision-and-language

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| icoz69/stablellava | A tool for generating and evaluating multimodal large language models with visual instruction tuning capabilities. | 91 |
| git-cloner/llama2-lora-fine-tuning | Fine-tuning the LLaMA 2 chat model with DeepSpeed and LoRA for improved performance on a large dataset. | 167 |
| baai-dcai/visual-instruction-tuning | A dataset and model designed to scale visual instruction tuning using language-only GPT-4 models. | 163 |
| unlp-workshop/unlp-2024-shared-task | A shared task for fine-tuning large language models to answer questions and generate responses in Ukrainian. | 13 |
| llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation. | 351 |
| gordonhu608/mqt-llava | A vision-language model that encodes images as visual tokens with a query transformer, allowing a flexible number of visual tokens. | 97 |
| dvlab-research/llama-vid | A video language model that represents each frame with a small number of tokens so that LLMs can process long videos. | 733 |
| circleradon/osprey | A fine-grained visual understanding approach that incorporates pixel-wise mask regions into language instructions. | 770 |
| rucaibox/comvint | Synthetic visual reasoning instructions for improving large language models on image-related tasks. | 18 |
| aidc-ai/parrot | A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages. | 30 |
| aidc-ai/ovis | An architecture designed to align visual and textual embeddings in multimodal learning. | 517 |
| wisconsinaivision/vip-llava | A system that enables large multimodal models to understand arbitrary visual prompts. | 294 |
| haotian-liu/llava | A system that uses large language and vision models to generate and process visual instructions. | 20,232 |
| ucsc-vlaa/sight-beyond-text | The official implementation of a research paper exploring how multimodal training enhances language models' truthfulness and ethics. | 19 |
| alibaba/conv-llava | An optimization technique that reduces the computational cost of large-scale image models while maintaining performance. | 104 |