Ovis
Multimodal alignment
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
536 stars
7 watching
32 forks
Language: Python
last commit: 20 days ago
Topics: chatbot, llama3, multimodal, multimodal-large-language-models, multimodality, qwen, vision-language-learning, vision-language-model
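The description above frames Ovis as structurally aligning visual embeddings with textual ones, i.e. giving visual inputs the same discrete, lookup-table structure that text token embeddings have. The sketch below is only an illustration of that general idea under assumptions drawn from the one-line description; it is not code from the Ovis repository, and all names (`ProbabilisticVisualEmbedding`, `visual_vocab_size`, the shapes in the example) are hypothetical.

```python
# Illustrative sketch only -- NOT the official Ovis implementation.
# Idea: map patch features to probabilities over a learnable "visual vocabulary",
# then form each visual embedding as a probability-weighted mix of that table,
# mirroring how text tokens index a textual embedding table.
import torch
import torch.nn as nn


class ProbabilisticVisualEmbedding(nn.Module):
    """Hypothetical module for structural alignment of visual and textual embeddings."""

    def __init__(self, feature_dim: int, visual_vocab_size: int, embed_dim: int):
        super().__init__()
        # Projects patch features to logits over the visual vocabulary.
        self.to_logits = nn.Linear(feature_dim, visual_vocab_size)
        # Learnable visual embedding table, analogous to nn.Embedding for text tokens.
        self.visual_embed_table = nn.Parameter(
            torch.randn(visual_vocab_size, embed_dim) * 0.02
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, feature_dim)
        probs = torch.softmax(self.to_logits(patch_features), dim=-1)
        # Probability-weighted lookup -> (batch, num_patches, embed_dim)
        return probs @ self.visual_embed_table


if __name__ == "__main__":
    layer = ProbabilisticVisualEmbedding(feature_dim=1024, visual_vocab_size=8192, embed_dim=4096)
    dummy_patches = torch.randn(2, 256, 1024)   # e.g. ViT patch features (assumed shapes)
    visual_tokens = layer(dummy_patches)        # ready to interleave with text embeddings
    print(visual_tokens.shape)                  # torch.Size([2, 256, 4096])
```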
Related projects:
Repository | Description | Stars |
---|---|---|
aidc-ai/parrot | A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages. | 32 |
rlhf-v/rlhf-v | Aligns large language models' behavior through fine-grained correctional human feedback to improve trustworthiness and accuracy. | 233 |
ailab-cvc/seed | An implementation of a multimodal language model capable of both comprehension and generation. | 582 |
ucsc-vlaa/sight-beyond-text | Official implementation of a research paper on using multimodal training to improve language models' truthfulness and ethics. | 19 |
pku-alignment/align-anything | Aligns large models with human values and intentions across various modalities. | 244 |
nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,298 |
pku-yuangroup/languagebind | Extends pretrained models to handle multiple modalities by aligning language and video representations. | 723 |
salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets. | 258 |
deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications | 2,077 |
multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 14 |
tanloong/interlaced.nvim | Aligns bilingual parallel texts by repositioning lines. | 6 |
lancopku/iais | This project proposes a novel method for calibrating attention distributions in multimodal models to improve contextualized representations of image-text pairs. | 30 |
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 294 |
pku-yuangroup/moe-llava | Develops a neural network architecture for multi-modal learning with large vision-language models | 1,980 |
opengvlab/visionllm | A large language model designed to process and generate visual information | 915 |