Ovis

Multimodal alignment

A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.

GitHub

536 stars · 7 watching · 32 forks
Language: Python
Last commit: 20 days ago
Topics: chatbot, llama3, multimodal, multimodal-large-language-models, multimodality, qwen, vision-language-learning, vision-language-model

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| aidc-ai/parrot | A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages. | 32 |
| rlhf-v/rlhf-v | Aligns large language models' behavior through fine-grained correctional human feedback to improve trustworthiness and accuracy. | 233 |
| ailab-cvc/seed | An implementation of a multimodal language model capable of both comprehension and generation. | 582 |
| ucsc-vlaa/sight-beyond-text | Official implementation of a research paper exploring multi-modal training to enhance language models' truthfulness and ethics. | 19 |
| pku-alignment/align-anything | Aligns large models with human values and intentions across various modalities. | 244 |
| nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,298 |
| pku-yuangroup/languagebind | Extends pretrained models to multiple modalities by aligning language and video representations. | 723 |
| salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets. | 258 |
| deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications. | 2,077 |
| multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 14 |
| tanloong/interlaced.nvim | Aligns bilingual parallel texts by repositioning lines. | 6 |
| lancopku/iais | Proposes a novel method for calibrating attention distributions in multimodal models to improve contextualized representations of image-text pairs. | 30 |
| wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts. | 294 |
| pku-yuangroup/moe-llava | A neural network architecture for multi-modal learning with large vision-language models. | 1,980 |
| opengvlab/visionllm | A large language model designed to process and generate visual information. | 915 |