 InternLM-XComposer
 Multimodal model
 A comprehensive multimodal system for long-term streaming video and audio interactions, with capabilities including text-image comprehension and composition.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
3k stars
 44 watching
 158 forks
 
Language: Python 
last commit: 11 months ago
Topics: chatgpt, foundation, gpt, gpt-4, instruction-tuning, language-model, large-language-model, large-vision-language-model, llm, mllm, multi-modality, multimodal, supervised-finetuning, vision-language-model, vision-transformer, visual-language-learning
 Related projects:
| Repository | Description | Stars | 
|---|---|---|
|  | A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,179 | 
|  | A multimodal large language model series developed by the Qwen team to understand and process images, videos, and text. | 3,613 | 
|  | A collection of large language models designed to improve reasoning and tool use capabilities in chatbots. | 6,572 | 
|  | Enabling vision-language understanding by fine-tuning large language models on visual data. | 25,490 | 
|  | A multimodal language model designed to understand images, videos, and text inputs and generate high-quality text outputs. | 12,870 | 
|  | Develops large language models that can understand and generate human-like visual and video content | 2,365 | 
|  | A deep learning framework for generating videos from text inputs and visual features. | 3,071 | 
|  | A fast serving framework for large language models and vision language models. | 6,551 | 
|  | An open-source toolkit for pretraining and fine-tuning large language models | 2,732 | 
|  | A system that uses large language and vision models to generate and process visual instructions | 20,683 | 
|  | Develops large language models capable of processing multiple data types and modalities | 6,394 | 
|  | A toolkit for optimizing and serving large language models | 4,854 | 
|  | Supports large-scale vision model training on GPU machines or Google Cloud TPUs using scalable input pipelines. | 2,439 | 
|  | Develops a state-of-the-art visual language model with applications in image understanding and dialogue systems. | 6,182 | 
|  | A framework for training and serving large language models using JAX/Flax | 2,428 |