MiniCPM-V

Multimodal LLM

A multimodal large language model that accepts image, video, and text inputs and generates high-quality text outputs.

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

GitHub

13k stars
108 watching
902 forks
Language: Python
last commit: 4 months ago
Linked from 2 awesome lists

Tags: minicpm, minicpm-v, multi-modal

Related projects:

| Repository | Description | Stars |
|---|---|---|
| opengvlab/internvl | Develops large language models capable of processing multiple data types and modalities | 6,394 |
| open-mmlab/mmcv | Provides a foundational library for computer vision research and deep learning training, with high-quality implementations of common CPU and CUDA ops | 5,948 |
| openbmb/viscpm | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,098 |
| vision-cair/minigpt-4 | Enables vision-language understanding by fine-tuning large language models on visual data | 25,490 |
| dvlab-research/mgm | An open-source framework for training large language models with vision capabilities | 3,229 |
| qwenlm/qwen-vl | A large vision-language model with improved image reasoning and text recognition, suitable for various multimodal tasks | 5,179 |
| opengvlab/llama-adapter | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy | 5,775 |
| open-mmlab/mmaction2 | A comprehensive video understanding toolbox and benchmark with a modular design, supporting tasks such as action recognition, localization, and retrieval | 4,360 |
| openmv/openmv | A platform for machine vision development with programmable cameras and extensive image processing capabilities | 2,446 |
| internlm/internlm-xcomposer | A comprehensive multimodal system for long-term streaming video and audio interaction, including text-image comprehension and composition | 2,616 |
| openbmb/toolbench | A platform for training, serving, and evaluating large language models for tool-use capability | 4,888 |
| thudm/cogvlm | Develops a state-of-the-art visual language model with applications in image understanding and dialogue systems | 6,182 |
| pleisto/yuren-baichuan-7b | A multimodal large language model that integrates natural language and visual capabilities, with fine-tuning for various tasks | 73 |
| luodian/otter | A multimodal AI model built for improved instruction following and in-context learning, using large-scale architectures and varied training datasets | 3,570 |
| cambrian-mllm/cambrian | An open-source multimodal LLM project with a vision-centric design | 1,799 |