MiniCPM-V

Multimodal LLM

A multimodal language model designed to understand images, videos, and text inputs and generate high-quality text outputs.

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

GitHub

13k stars
105 watching
889 forks
Language: Python
last commit: about 1 month ago
Linked from 2 awesome lists

minicpm, minicpm-v, multi-modal

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| opengvlab/internvl | A pioneering open-source alternative to commercial multimodal models with a family of large-scale language and vision models. | 6,014 |
| open-mmlab/mmcv | Provides a foundational library for computer vision research and training deep learning models with high-quality implementation of common CPU and CUDA ops. | 5,906 |
| openbmb/viscpm | A family of large multimodal models supporting multimodal conversational capabilities and text-to-image generation in multiple languages | 1,089 |
| vision-cair/minigpt-4 | Enabling vision-language understanding by fine-tuning large language models on visual data. | 25,422 |
| dvlab-research/mgm | An open-source framework for training large language models with vision capabilities. | 3,211 |
| qwenlm/qwen-vl | A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,079 |
| opengvlab/llama-adapter | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy | 5,754 |
| open-mmlab/mmaction2 | A comprehensive video understanding toolbox and benchmark with modular design, supporting various tasks such as action recognition, localization, and retrieval. | 4,296 |
| openmv/openmv | A platform for machine vision development with programmable cameras and extensive image processing capabilities | 2,438 |
| internlm/internlm-xcomposer | A large vision language model that can understand and generate text from visual inputs, with capabilities for long-contextual input and output, high-resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. | 2,521 |
| openbmb/toolbench | A platform for training, serving, and evaluating large language models to enable tool use capability | 4,843 |
| thudm/cogvlm | Develops a state-of-the-art visual language model with applications in image understanding and dialogue systems. | 6,080 |
| pleisto/yuren-baichuan-7b | A multi-modal large language model that integrates natural language and visual capabilities with fine-tuning for various tasks | 72 |
| luodian/otter | A multi-modal AI model developed for improved instruction-following and in-context learning, utilizing large-scale architectures and various training datasets. | 3,563 |
| cambrian-mllm/cambrian | An open-source multimodal LLM project with a vision-centric design | 1,759 |