Ovis
Multimodal aligner
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
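The structural-alignment idea can be illustrated with a minimal sketch: instead of projecting continuous vision features directly into the LLM, each image patch is mapped to a probability distribution over a learnable visual vocabulary, and its embedding is the probability-weighted average of a visual embedding table, mirroring how text tokens index a text embedding table. All names and dimensions below are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes (assumptions for the sketch, not Ovis's real config)
vocab_size = 8    # visual vocabulary size
embed_dim = 4     # embedding dimension shared with the LLM
num_patches = 3   # image patches produced by the visual tokenizer

rng = np.random.default_rng(0)

# Learnable visual embedding table, structurally analogous to the
# LLM's text embedding table.
visual_embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Visual tokenizer output: per-patch logits over the visual vocabulary.
patch_logits = rng.normal(size=(num_patches, vocab_size))

# Probabilistic visual tokens: each patch is a distribution, not one index.
token_probs = softmax(patch_logits)

# Patch embedding = expectation over the visual embedding table.
visual_embeds = token_probs @ visual_embedding_table

print(visual_embeds.shape)  # (3, 4)
```

The resulting `visual_embeds` live in the same kind of discrete, table-indexed embedding space as text tokens, which is the structural alignment the description refers to.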
575 stars
7 watching
33 forks
Language: Python
last commit: 3 months ago
Topics: chatbot, llama3, multimodal, multimodal-large-language-models, multimodality, qwen, vision-language-learning, vision-language-model
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages. | 34 |
| | Aligns large language models' behavior through fine-grained correctional human feedback to improve trustworthiness and accuracy. | 245 |
| | An implementation of a multimodal language model with capabilities for comprehension and generation. | 585 |
| | An implementation of a multimodal LLM training paradigm to enhance truthfulness and ethics in language models. | 19 |
| | Aligns large multimodal models with human intentions and values using various algorithms and fine-tuning methods. | 270 |
| | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,299 |
| | Extends pretrained models to handle multiple modalities by aligning language and video representations. | 751 |
| | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets. | 259 |
| | A multimodal AI model that enables real-world vision-language understanding applications. | 2,145 |
| | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 15 |
| | A plugin for aligning bilingual parallel texts by re-positioning text and applying highlighting. | 7 |
| | Proposes a novel method for calibrating attention distributions in multimodal models to improve contextualized representations of image-text pairs. | 30 |
| | A system designed to enable large multimodal models to understand arbitrary visual prompts. | 302 |
| | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks. | 2,023 |
| | A large language model designed to process and generate visual information. | 956 |