Monkey
Conversational AI model kit
A toolkit for building conversational AI models that can process images and text inputs.
【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
2k stars
22 watching
131 forks
Language: Python
last commit: 9 days ago

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| kohjingyu/fromage | A framework for grounding language models to images and handling multimodal inputs and outputs | 478 |
| lyuchenyang/macaw-llm | A multimodal language model that integrates image, video, audio, and text data to improve language understanding and generation | 1,550 |
| yuliang-liu/multimodalocr | An evaluation benchmark for OCR capabilities in large multimodal models | 471 |
| pleisto/yuren-baichuan-7b | A multimodal large language model that integrates natural-language and visual capabilities and is fine-tuned for various tasks | 72 |
| openbmb/viscpm | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,089 |
| yfzhang114/slime | Develops large multimodal models for high-resolution understanding and analysis of text, images, and other data types | 137 |
| zhourax/vega | Develops a multimodal task and dataset to assess vision-language models' ability to handle interleaved image-text inputs | 33 |
| bytedance/lynx-llm | A framework for training GPT-4-style language models with multimodal inputs using large datasets and pre-trained models | 229 |
| yuweihao/mm-vet | Evaluates the capabilities of large multimodal models using a set of diverse tasks and metrics | 267 |
| runpeidong/dreamllm | A framework for building versatile multimodal large language models with synergistic comprehension and creation capabilities | 394 |
| phellonchen/x-llm | A framework that enables large language models to process and understand multimodal inputs from sources such as images and speech | 306 |
| ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 576 |
| multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously | 14 |
| mbzuai-oryx/groundinglmm | An end-to-end trained model that generates natural language responses integrated with object segmentation masks | 781 |
| yuxie11/r2d2 | A framework for large-scale cross-modal benchmarks and vision-language tasks in Chinese | 157 |
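
Since Monkey is a Python project distributed as a large multimodal model, a minimal inference sketch may help situate it among the projects above. This is a generic Hugging Face `transformers` loading pattern, not the project's confirmed interface: the checkpoint name `echo840/Monkey`, the `<img>...</img>` prompt layout, and the generation settings are assumptions here and should be verified against the Monkey README.

```python
# Minimal inference sketch for a Monkey-style multimodal model.
# Assumptions (verify against the Monkey README): the checkpoint id
# "echo840/Monkey" and the <img>...</img> image-in-prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey"  # assumed Hugging Face checkpoint id
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# An image path plus a question, in the assumed prompt layout.
prompt = "<img>./demo.jpg</img> Describe the image in detail. Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    do_sample=False,       # greedy decoding for a deterministic answer
    max_new_tokens=256,
)
# Strip the prompt tokens and decode only the generated continuation.
answer = tokenizer.decode(
    output_ids[0][inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)
```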