Qwen-VL
Large vision language model
A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks
The official repository for Qwen-VL (通义千问-VL), the chat and pretrained large vision-language model from Alibaba Cloud.
5k stars
49 watching
392 forks
Language: Python
last commit: about 1 year ago
Tags: large-language-models, vision-language-model
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A multimodal large language model series developed by the Qwen team to understand and process images, videos, and text. | 3,613 |
| | This repository provides large language models and chat capabilities based on pre-trained Chinese models. | 14,797 |
| | A large language model series with various sizes and variants for text generation and understanding. | 10,959 |
| | A system that uses large language and vision models to generate and process visual instructions. | 20,683 |
| | A comprehensive multimodal system for long-term streaming video and audio interactions, with capabilities including text-image comprehension and composition. | 2,616 |
| | A multimodal audio language model developed by Alibaba Cloud that supports various tasks and languages. | 1,515 |
| | A fast serving framework for large language models and vision language models. | 6,551 |
| | An open-source toolkit for pretraining and fine-tuning large language models. | 2,732 |
| | An open-source framework for training large language models with vision capabilities. | 3,229 |
| | Enabling vision-language understanding by fine-tuning large language models on visual data. | 25,490 |
| | A multimodal language model designed to understand images, videos, and text inputs and generate high-quality text outputs. | 12,870 |
| | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy. | 5,775 |
| | Supports large-scale vision model training on GPU machines or Google Cloud TPUs using scalable input pipelines. | 2,439 |
| | Develops a state-of-the-art visual language model with applications in image understanding and dialogue systems. | 6,182 |
| | Develops large multimodal models for various computer vision tasks, including image and video analysis. | 3,099 |