UnIVAL

Multitask model

A unified model for image, video, audio, and language tasks that can be fine-tuned for various downstream applications.

[TMLR23] Official implementation of UnIVAL: Unified Model for Image, Video, Audio and Language Tasks.

GitHub: 224 stars · 5 watching · 22 forks
Primary language: Jupyter Notebook
Last commit: 11 months ago

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| mshukor/evalign-icl | Evaluates and improves large multimodal models through in-context learning | 20 |
| jiayuzhou/malsar | A collection of multi-task learning algorithms that use structural regularization to improve performance across related tasks | 133 |
| haozhezhao/mic | Develops a multimodal vision-language model that understands complex relationships between instructions and images across tasks | 334 |
| yfzhang114/slime | Develops large multimodal models for high-resolution understanding and analysis of text, images, and other data types | 137 |
| joez17/chatbridge | A unified multimodal language model that interprets and reasons about multiple modalities without paired data | 47 |
| yuliang-liu/monkey | A toolkit for building conversational AI models that process image and text inputs | 1,825 |
| tiger-ai-lab/uniir | Trains and evaluates a universal multimodal retrieval model for diverse information retrieval tasks | 110 |
| pleisto/yuren-baichuan-7b | A multimodal large language model that integrates language and visual capabilities, with fine-tuning for various tasks | 72 |
| xverse-ai/xverse-v-13b | A large multimodal model for visual question answering, trained on 2.1B image-text pairs and 8.2M instruction sequences | 77 |
| subho406/omninet | A PyTorch implementation of a unified architecture for multimodal multi-task learning | 512 |
| multimodal-art-projection/omnibench | Benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously | 14 |
| uw-madison-lee-lab/cobsat | A benchmarking framework and dataset for evaluating large language models on text-to-image tasks | 28 |
| kohjingyu/fromage | A framework for grounding language models to images, handling multimodal inputs and outputs | 478 |
| openbmb/viscpm | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,089 |
| pku-yuangroup/languagebind | Extends pretrained models to multiple modalities by aligning language and video representations | 723 |