LLaVAR
Visual Instruction Tuning
An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets.
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
259 stars
5 watching
12 forks
Language: Python
Last commit: 8 months ago
Topics: chatbot, chatgpt, gpt-4, instruction-tuning, llava, multimodal, ocr, vision-and-language
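For context, LLaVAR (like LLaVA, which it extends) is trained on image-grounded instruction-following conversations. The sketch below shows what one training record in the LLaVA-style JSON conversation format typically looks like; the field names follow that public format, but the sample values and file names here are illustrative assumptions, not taken verbatim from the LLaVAR dataset.

```python
import json

# A minimal sketch of one visual instruction-tuning record, assuming the
# LLaVA-style conversation format that LLaVAR builds on. The image path
# and conversation text below are hypothetical examples.
example = {
    "id": "example-0001",
    "image": "ocr_example.jpg",  # a text-rich image, e.g. a poster or receipt
    "conversations": [
        {
            "from": "human",
            # "<image>" marks where the visual tokens are spliced into the prompt
            "value": "<image>\nWhat does the text in this image say?",
        },
        {
            "from": "gpt",
            "value": "The poster reads: 'Grand Opening, Saturday 10 AM'.",
        },
    ],
}

# Instruction-tuning sets are typically a JSON list of many such records.
print(json.dumps(example, indent=2))
```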
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| | A tool for generating and evaluating multimodal large language models with visual instruction tuning capabilities | 93 |
| | Fine-tuning the LLaMA 2 chat model with DeepSpeed and LoRA for improved performance on a large dataset | 171 |
| | A dataset and model designed to scale visual instruction tuning using language-only GPT-4 models | 164 |
| | A shared task for fine-tuning large language models to answer questions and generate responses in Ukrainian | 13 |
| | An all-in-one demo for interactive image processing and generation | 353 |
| | A vision-language model that uses a query transformer to encode images as visual tokens and allows a flexible choice of the number of visual tokens | 101 |
| | A video-language model that uses large language models to generate visual and text features from videos | 748 |
| | A new approach to fine-grained visual understanding using pixel-wise mask regions in language instructions | 781 |
| | Synthetic visual reasoning instructions for improving the performance of large language models on image-related tasks | 18 |
| | A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages | 34 |
| | An MLLM architecture designed to align visual and textual embeddings through structural alignment | 575 |
| | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
| | A system that uses large language and vision models to generate and process visual instructions | 20,683 |
| | An implementation of a multimodal LLM training paradigm that enhances truthfulness and ethics in language models | 19 |
| | An optimization technique for large-scale image models that reduces computational requirements while maintaining performance | 106 |