MGM

Vision-LM Framework

An open-source framework for training large language models with vision capabilities.

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

GitHub

3k stars
28 watching
278 forks
Language: Python
Last commit: 7 months ago
Topics: generation, large-language-models, vision-language-model

Related projects:

Repository | Description | Stars
haotian-liu/llava | A system that uses large language and vision models to generate and process visual instructions | 20,232
alpha-vllm/llama2-accessory | An open-source toolkit for pretraining and fine-tuning large language models | 2,720
opengvlab/llama-adapter | An implementation of a method for efficiently and accurately fine-tuning language models to follow instructions | 5,754
llava-vl/llava-next | Develops large multimodal models for various computer vision tasks, including image and video analysis | 2,872
borisdayma/dalle-mini | Generates images from text prompts using a variant of the DALL-E model | 14,751
vision-cair/minigpt-4 | Enables vision-language understanding by fine-tuning large language models on visual data | 25,422
pku-yuangroup/video-llava | Enables large language models to perform visual reasoning on images and videos simultaneously by learning a united visual representation before projection | 2,990
openbmb/minicpm-v | A multimodal language model designed to understand image, video, and text inputs and generate high-quality text outputs | 12,619
nvlabs/eagle | Develops high-resolution multimodal LLMs by combining vision encoders and various input resolutions | 539
eleutherai/lm-evaluation-harness | Provides a unified framework for testing generative language models on various evaluation tasks | 6,970
amazon-science/mm-cot | An implementation of multimodal chain-of-thought reasoning in language models, using a decoupled training framework for rationale generation and answer inference | 3,810
open-mmlab/mmcv | A foundational library for computer vision research and for training deep learning models, with high-quality implementations of common CPU and CUDA ops | 5,906
yfzhang114/llava-align | Debiasing techniques to minimize hallucinations in large visual language models | 71
qwenlm/qwen-vl | A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,045
luodian/otter | A multimodal AI model developed for improved instruction following and in-context learning, using large-scale architectures and various training datasets | 3,563