Qwen2-VL
Multimodal LM
Qwen2-VL is a multimodal large language model series developed by the Qwen team at Alibaba Cloud to understand and process images, videos, and text.
3k stars
28 watching
187 forks
Language: Python
last commit: about 2 months ago

Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| qwenlm/qwen-vl | A large vision-language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks. | 5,045 |
| qwenlm/qwen2.5 | A large language model series with various sizes and variants for text generation and understanding. | 9,710 |
| qwenlm/qwen | Provides large language models and chat capabilities based on pretrained Chinese models. | 14,164 |
| qwenlm/qwen-audio | A multimodal audio-language model developed by Alibaba Cloud that supports various tasks and languages. | 1,486 |
| internlm/internlm-xcomposer | A large vision-language model that understands and generates text from visual inputs, with long-context input and output, high-resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. | 2,521 |
| sgl-project/sglang | A framework for serving large language models and vision models with an efficient runtime and a flexible interface. | 6,082 |
| haotian-liu/llava | A system that uses large language and vision models to generate and process visual instructions. | 20,232 |
| alpha-vllm/llama2-accessory | An open-source toolkit for pretraining and fine-tuning large language models. | 2,720 |
| qwenlm/qwen2-audio | An audio-language model that can analyze audio or respond to speech instructions. | 1,229 |
| vision-cair/minigpt-4 | Enables vision-language understanding by fine-tuning large language models on visual data. | 25,422 |
| opengvlab/llama-adapter | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy. | 5,754 |
| eleutherai/lm-evaluation-harness | Provides a unified framework to test generative language models on various evaluation tasks. | 6,970 |
| llava-vl/llava-next | Develops large multimodal models for various computer vision tasks, including image and video analysis. | 2,872 |
| wang-bin/qtav | A multimedia framework that provides an easy-to-use API for building video players across multiple platforms. | 3,985 |
| pku-yuangroup/video-llava | Enables large language models to perform visual reasoning on images and videos simultaneously by learning united visual representations before projection. | 2,990 |