Qwen2-VL

Multimodal LM

Qwen2-VL is a multimodal large language model series developed by the Qwen team at Alibaba Cloud to understand and process images, videos, and text.

GitHub

Stars: 3k
Watching: 28
Forks: 187
Language: Python
Last commit: about 2 months ago

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| qwenlm/qwen-vl | A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,045 |
| qwenlm/qwen2.5 | A large language model series with various sizes and variants for text generation and understanding | 9,710 |
| qwenlm/qwen | Large language models and chat capabilities based on pre-trained Chinese models | 14,164 |
| qwenlm/qwen-audio | A multimodal audio language model developed by Alibaba Cloud that supports various tasks and languages | 1,486 |
| internlm/internlm-xcomposer | A large vision language model that can understand and generate text from visual inputs, with capabilities for long-contextual input and output, high-resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue | 2,521 |
| sgl-project/sglang | A framework for serving large language models and vision models with an efficient runtime and a flexible interface | 6,082 |
| haotian-liu/llava | A system that uses large language and vision models to generate and process visual instructions | 20,232 |
| alpha-vllm/llama2-accessory | An open-source toolkit for pretraining and fine-tuning large language models | 2,720 |
| qwenlm/qwen2-audio | An audio-language model that can analyze or respond to speech instructions based on audio input | 1,229 |
| vision-cair/minigpt-4 | Enables vision-language understanding by fine-tuning large language models on visual data | 25,422 |
| opengvlab/llama-adapter | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy | 5,754 |
| eleutherai/lm-evaluation-harness | A unified framework for testing generative language models on various evaluation tasks | 6,970 |
| llava-vl/llava-next | Large multimodal models for various computer vision tasks, including image and video analysis | 2,872 |
| wang-bin/qtav | A multimedia framework that provides an easy-to-use API for building video players across multiple platforms | 3,985 |
| pku-yuangroup/video-llava | Enables large language models to perform visual reasoning on images and videos simultaneously by learning united visual representations before projection | 2,990 |