Ovis
Multimodal aligner
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
575 stars
7 watching
33 forks
Language: Python
last commit: 2 months ago
Topics: chatbot, llama3, multimodal, multimodal-large-language-models, multimodality, qwen, vision-language-learning, vision-language-model
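The page itself ships no code, so here is a minimal sketch of the structural-alignment idea the description refers to: visual patches are mapped to a probability distribution over a learnable visual vocabulary, and each visual embedding is a probability-weighted mix of table rows, mirroring the table lookup used for text tokens. All module names, sizes, and the fake tokenizer output below are illustrative assumptions, not Ovis's actual API.

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Illustrative sketch (not Ovis's real interface): visual embeddings
    are built by a 'soft' lookup into a learnable visual vocabulary,
    structurally matching how a text token indexes a textual embedding table."""

    def __init__(self, vocab_size: int = 8192, embed_dim: int = 4096):
        super().__init__()
        # Learnable visual vocabulary, analogous to a text embedding table.
        self.table = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_logits: torch.Tensor) -> torch.Tensor:
        # token_logits: (batch, num_patches, vocab_size),
        # assumed to come from a visual tokenizer head.
        probs = token_logits.softmax(dim=-1)
        # Soft lookup: each patch embedding is a convex combination of rows.
        return probs @ self.table.weight  # (batch, num_patches, embed_dim)

# Stand-in for visual-tokenizer output; real logits would come from a vision encoder.
logits = torch.randn(2, 16, 8192)
visual_embeds = VisualEmbeddingTable()(logits)
print(visual_embeds.shape)  # torch.Size([2, 16, 4096])
```

The point of the soft lookup is that visual and textual embeddings then share the same table-indexed structure before entering the LLM, which is the "structural alignment" the description names.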
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| aidc-ai/parrot | A method and toolkit for fine-tuning large language models to perform visual instruction tasks in multiple languages. | 34 |
| rlhf-v/rlhf-v | Aligns multimodal large language models' behavior through fine-grained correctional human feedback to improve trustworthiness and accuracy. | 245 |
| ailab-cvc/seed | An implementation of a multimodal language model with both comprehension and generation capabilities. | 585 |
| ucsc-vlaa/sight-beyond-text | An implementation of a multimodal LLM training paradigm that enhances truthfulness and ethics in language models. | 19 |
| pku-alignment/align-anything | Aligns large multimodal models with human intentions and values using various algorithms and fine-tuning methods. | 270 |
| nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,299 |
| pku-yuangroup/languagebind | Extends pretrained models to handle multiple modalities by aligning language and video representations. | 751 |
| salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by combining GPT-4-generated instructions with multimodal datasets. | 259 |
| deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications. | 2,145 |
| multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 15 |
| tanloong/interlaced.nvim | A plugin for aligning bilingual parallel texts by re-positioning text and applying highlighting. | 7 |
| lancopku/iais | Proposes a method for calibrating attention distributions in multimodal models to improve contextualized representations of image-text pairs. | 30 |
| wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts. | 302 |
| pku-yuangroup/moe-llava | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks. | 2,023 |
| opengvlab/visionllm | A large language model designed to process and generate visual information. | 956 |