Qwen2-VL
Multimodal LM
Qwen2-VL is a multimodal large language model series developed by the Qwen team at Alibaba Cloud to understand and process images, videos, and text.
4k stars
30 watching
224 forks
Language: Python
Last commit: 11 months ago
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,179 |
| | A large language model series with various sizes and variants for text generation and understanding. | 10,959 |
| | This repository provides large language models and chat capabilities based on pre-trained Chinese models. | 14,797 |
| | A multimodal audio language model developed by Alibaba Cloud that supports various tasks and languages | 1,515 |
| | A comprehensive multimodal system for long-term streaming video and audio interactions with capabilities including text-image comprehension and composition | 2,616 |
| | A fast serving framework for large language models and vision language models. | 6,551 |
| | A system that uses large language and vision models to generate and process visual instructions | 20,683 |
| | An open-source toolkit for pretraining and fine-tuning large language models | 2,732 |
| | An audio-language model that can analyze or respond to speech instructions based on audio input | 1,306 |
| | Enabling vision-language understanding by fine-tuning large language models on visual data. | 25,490 |
| | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy | 5,775 |
| | Provides a unified framework to test generative language models on various evaluation tasks. | 7,200 |
| | Develops large multimodal models for various computer vision tasks including image and video analysis | 3,099 |
| | A multimedia framework that provides an easy-to-use API for building video players across multiple platforms. | 4,001 |
| | A deep learning framework for generating videos from text inputs and visual features. | 3,071 |