Ovis

Multimodal aligner

A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
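
Ovis's structural alignment replaces the usual continuous-feature connector with a learnable visual embedding table: each image patch produces a probability distribution over a visual vocabulary, and its embedding is the probability-weighted mix of table rows, mirroring how a text token indexes the textual embedding table. Below is a minimal PyTorch sketch of that pattern, with hypothetical names and dimensions throughout; it illustrates the idea, not the repository's actual API.

```python
import torch
import torch.nn as nn

class StructuralVisualEmbedding(nn.Module):
    """Illustrative sketch (not Ovis's real API): map patch features to a
    probability distribution over a learnable visual vocabulary, then take
    the probability-weighted mix of embedding-table rows -- the structural
    analogue of a text token's hard embedding lookup."""

    def __init__(self, feature_dim: int, vocab_size: int, embed_dim: int):
        super().__init__()
        # Projects patch features to logits over the visual vocabulary.
        self.to_logits = nn.Linear(feature_dim, vocab_size)
        # Learnable visual embedding table, mirroring the text embedding table.
        self.table = nn.Embedding(vocab_size, embed_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, feature_dim)
        probs = self.to_logits(patch_features).softmax(dim=-1)
        # Soft lookup: (batch, num_patches, vocab) @ (vocab, embed) -> embeddings.
        return probs @ self.table.weight

# Text side for comparison: a hard (one-hot) lookup into its own table.
text_table = nn.Embedding(32000, 4096)
visual = StructuralVisualEmbedding(feature_dim=1024, vocab_size=8192, embed_dim=4096)

image_embeds = visual(torch.randn(1, 256, 1024))      # soft lookup
text_embeds = text_table(torch.tensor([[1, 2, 3]]))   # hard lookup
# Both streams now share the same embedding-table structure before the LLM.
```

Because both modalities pass through an embedding table of the same form, visual and textual tokens enter the language model on structurally equal footing, which is the alignment the description above refers to.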

GitHub

575 stars
7 watching
33 forks
Language: Python
Last commit: 2 months ago
Topics: chatbot, llama3, multimodal, multimodal-large-language-models, multimodality, qwen, vision-language-learning, vision-language-model

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| aidc-ai/parrot | A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages. | 34 |
| rlhf-v/rlhf-v | Aligns large language models' behavior through fine-grained correctional human feedback to improve trustworthiness and accuracy. | 245 |
| ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation. | 585 |
| ucsc-vlaa/sight-beyond-text | An implementation of a multimodal LLM training paradigm to enhance truthfulness and ethics in language models. | 19 |
| pku-alignment/align-anything | Aligns large multimodal models with human intentions and values using various algorithms and fine-tuning methods. | 270 |
| nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,299 |
| pku-yuangroup/languagebind | Extends pretrained models to handle multiple modalities by aligning language and video representations. | 751 |
| salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets. | 259 |
| deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications. | 2,145 |
| multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 15 |
| tanloong/interlaced.nvim | A plugin for aligning bilingual parallel texts by re-positioning text and applying highlighting. | 7 |
| lancopku/iais | Proposes a novel method for calibrating attention distributions in multimodal models to improve contextualized representations of image-text pairs. | 30 |
| wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts. | 302 |
| pku-yuangroup/moe-llava | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks. | 2,023 |
| opengvlab/visionllm | A large language model designed to process and generate visual information. | 956 |