MiniCPM-V
Multimodal LLM
A multimodal language model designed to understand images, videos, and text inputs and generate high-quality text outputs.
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
13k stars
105 watching
889 forks
Language: Python
Last commit: about 1 month ago
Linked from 2 awesome lists
Tags: minicpm, minicpm-v, multi-modal
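For a quick sense of how the model is used, the sketch below runs a single-image query through the Hugging Face Transformers interface. This is a minimal sketch only: the checkpoint name `openbmb/MiniCPM-V-2_6`, the `trust_remote_code` chat interface, and the message format are assumptions based on the project's published usage examples and may differ in the current release.

```python
# Minimal single-image query with MiniCPM-V 2.6 (sketch; checkpoint id and
# chat() interface are assumed from the project's published usage examples).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"  # assumed Hugging Face checkpoint id

# The model ships custom modeling code, so trust_remote_code=True is required.
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
question = "What is shown in this image?"

# Multi-image and multi-turn dialogue use the same structure: each message's
# content is a list mixing PIL images and strings.
msgs = [{"role": "user", "content": [image, question]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```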
Related projects:
Repository | Description | Stars |
---|---|---|
opengvlab/internvl | A pioneering open-source alternative to commercial multimodal models with a family of large-scale language and vision models. | 6,014 |
open-mmlab/mmcv | Provides a foundational library for computer vision research and training deep learning models with high-quality implementation of common CPU and CUDA ops. | 5,906 |
openbmb/viscpm | A family of large multimodal models supporting multimodal conversational capabilities and text-to-image generation in multiple languages. | 1,089 |
vision-cair/minigpt-4 | Enables vision-language understanding by fine-tuning large language models on visual data. | 25,422 |
dvlab-research/mgm | An open-source framework for training large language models with vision capabilities. | 3,211 |
qwenlm/qwen-vl | A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks. | 5,079 |
opengvlab/llama-adapter | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy. | 5,754 |
open-mmlab/mmaction2 | A comprehensive video understanding toolbox and benchmark with modular design, supporting various tasks such as action recognition, localization, and retrieval. | 4,296 |
openmv/openmv | A platform for machine vision development with programmable cameras and extensive image processing capabilities. | 2,438 |
internlm/internlm-xcomposer | A large vision language model that can understand and generate text from visual inputs, with capabilities for long-contextual input and output, high-resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. | 2,521 |
openbmb/toolbench | A platform for training, serving, and evaluating large language models to enable tool use capability | 4,843 |
thudm/cogvlm | Develops a state-of-the-art visual language model with applications in image understanding and dialogue systems. | 6,080 |
pleisto/yuren-baichuan-7b | A multi-modal large language model that integrates natural language and visual capabilities with fine-tuning for various tasks. | 72 |
luodian/otter | A multi-modal AI model developed for improved instruction-following and in-context learning, utilizing large-scale architectures and various training datasets. | 3,563 |
cambrian-mllm/cambrian | An open-source multimodal LLM project with a vision-centric design | 1,759 |