EVE

Vision-Language Model

A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities

[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models

GitHub

246 stars

8 watching

4 forks

Language: Python

Last commit: almost 2 years ago

clip, encoder-free-vlm, instruction-following, large-language-models, llm, mllm, multimodal-large-language-models, vision-language-models, vlm

Related projects:

Repository	Description	Stars
baai-wudao/brivl	Pre-trains a multilingual model to bridge vision and language modalities for various downstream applications	279
nvlabs/prismer	A deep learning framework for training multi-modal models with vision and language capabilities	1,299
deepseek-ai/deepseek-vl	A multimodal AI model that enables real-world vision-language understanding applications	2,145
nvlabs/eagle	Develops high-resolution multimodal LLMs by combining vision encoders and various input resolutions	549
byungkwanlee/moai	Improves performance of vision language tasks by integrating computer vision capabilities into large language models	314
jayleicn/clipbert	An efficient framework for end-to-end learning on image-text and video-text tasks	709
byungkwanlee/collavo	Develops a PyTorch implementation of an enhanced vision language model	93
yiren-jian/blitext	Develops and trains models for vision-language learning with decoupled language pre-training	24
baaivision/emu	A multimodal generative model framework	1,672
freedomintelligence/allava	A collection of datasets and models designed to support the training of lite vision-language models	249
shizhediao/davinci	Implements a unified-modal learning framework for generative vision-language models	43
paganpasta/eqxvision	A package of pre-trained computer vision models for image classification and segmentation	102
awni/speech	A PyTorch implementation of end-to-end speech recognition models	756
vishaal27/sus-x	Proposes a method for training large-scale vision-language models with minimal resources and no fine-tuning	94
wisconsinaivision/vip-llava	A system designed to enable large multimodal models to understand arbitrary visual prompts	302