MiniCPM-V
Multimodal LLM
A multimodal large language model that accepts image, video, and text inputs and generates high-quality text outputs.
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
- 13k stars
- 108 watching
- 902 forks
- Language: Python
- Last commit: 4 months ago
- Linked from 2 awesome lists
Topics: minicpm, minicpm-v, multi-modal
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | Develops large language models capable of processing multiple data types and modalities | 6,394 |
| | Provides a foundational library for computer vision research and training deep learning models, with high-quality implementations of common CPU and CUDA ops | 5,948 |
| | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,098 |
| | Enables vision-language understanding by fine-tuning large language models on visual data | 25,490 |
| | An open-source framework for training large language models with vision capabilities | 3,229 |
| | A large vision-language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,179 |
| | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy | 5,775 |
| | A comprehensive video understanding toolbox and benchmark with a modular design, supporting tasks such as action recognition, localization, and retrieval | 4,360 |
| | A platform for machine vision development with programmable cameras and extensive image processing capabilities | 2,446 |
| | A comprehensive multimodal system for long-term streaming video and audio interactions, with capabilities including text-image comprehension and composition | 2,616 |
| | A platform for training, serving, and evaluating large language models to enable tool-use capability | 4,888 |
| | Develops a state-of-the-art visual language model with applications in image understanding and dialogue systems | 6,182 |
| | A multi-modal large language model that integrates natural language and visual capabilities, with fine-tuning for various tasks | 73 |
| | A multi-modal AI model developed for improved instruction following and in-context learning, utilizing large-scale architectures and various training datasets | 3,570 |
| | An open-source multimodal LLM project with a vision-centric design | 1,799 |