EVE

Vision-Language Model

A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities.

[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models

GitHub

230 stars
8 watching
3 forks
Language: Python
last commit: about 2 months ago
Topics: clip, encoder-free-vlm, instruction-following, large-language-models, llm, mllm, multimodal-large-language-models, vision-language-models, vlm
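The "encoder-free" design the repository name refers to drops the pretrained vision encoder (e.g. CLIP) entirely: raw image patches are projected straight into the language model's embedding space and processed jointly with text tokens. The sketch below is illustrative only; all class names, layer sizes, and the patch-projection scheme are assumptions for demonstration, not EVE's actual architecture.

```python
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    """Minimal sketch of an encoder-free VLM (illustrative, not EVE's
    real configuration): no vision encoder stack — a single linear
    projection maps flattened image patches into the LLM token space."""

    def __init__(self, vocab_size=1000, d_model=64, patch=16, n_layers=2):
        super().__init__()
        self.patch = patch
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # This one linear layer stands in for the entire vision encoder.
        self.patch_proj = nn.Linear(3 * patch * patch, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, text_ids):
        B, C, H, W = image.shape
        p = self.patch
        # Cut the image into non-overlapping p x p patches and flatten each
        # to a (3 * p * p)-dim vector: (B, num_patches, 3*p*p).
        patches = image.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        vis_tok = self.patch_proj(patches)          # visual "tokens"
        txt_tok = self.tok_emb(text_ids)            # text tokens
        seq = torch.cat([vis_tok, txt_tok], dim=1)  # one joint sequence
        # Causal mask so the model can be trained autoregressively.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.blocks(seq, mask=mask)
        return self.lm_head(h)                      # (B, seq_len, vocab)
```

For a 32x32 image with 16x16 patches, the model prepends 4 visual tokens to the text sequence; the output logits cover the concatenated sequence, so the text positions can be trained with a standard next-token loss.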

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| baai-wudao/brivl | Pre-trains a multilingual model to bridge vision and language modalities for various downstream applications | 279 |
| nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities | 1,298 |
| deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications | 2,077 |
| nvlabs/eagle | Develops high-resolution multimodal LLMs by combining vision encoders and various input resolutions | 539 |
| byungkwanlee/moai | Improves performance on vision-language tasks by integrating computer vision capabilities into large language models | 311 |
| jayleicn/clipbert | An efficient framework for end-to-end learning on image-text and video-text tasks | 704 |
| byungkwanlee/collavo | A PyTorch implementation of an enhanced vision-language model | 93 |
| yiren-jian/blitext | Develops and trains models for vision-language learning with decoupled language pre-training | 24 |
| baaivision/emu | A multimodal generative model framework | 1,659 |
| freedomintelligence/allava | A collection of datasets and models designed to support the training of lite vision-language models | 246 |
| shizhediao/davinci | An implementation of generative vision-language models that can be fine-tuned for various multimodal applications | 43 |
| paganpasta/eqxvision | A package of pre-trained computer vision models for image classification and segmentation | 102 |
| awni/speech | A PyTorch implementation of end-to-end speech recognition models | 754 |
| vishaal27/sus-x | An open-source method for training large-scale vision-language models with minimal resources and no fine-tuning required | 94 |
| wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 294 |