UnIVAL

Multitask model

A unified model for image, video, audio, and language tasks that can be fine-tuned for various downstream applications.

[TMLR23] Official implementation of UnIVAL: Unified Model for Image, Video, Audio and Language Tasks.

GitHub: 224 stars · 5 watching · 22 forks · Language: Jupyter Notebook · last commit: about 1 year ago

Related projects:

| Repository | Description | Stars |
|---|---|---|
| mshukor/evalign-icl | Evaluating and improving large multimodal models through in-context learning. | 21 |
| jiayuzhou/malsar | Provides a comprehensive framework for multi-task learning via structural regularization in MATLAB. | 133 |
| haozhezhao/mic | Develops a multimodal vision-language model to enable machines to understand complex relationships between instructions and images in various tasks. | 337 |
| yfzhang114/slime | Develops large multimodal models for high-resolution understanding and analysis of text, images, and other data types. | 143 |
| joez17/chatbridge | A unified multimodal language model capable of interpreting and reasoning about various modalities without paired data. | 49 |
| yuliang-liu/monkey | An end-to-end image captioning system that uses large multimodal models and provides tools for training, inference, and demo usage. | 1,849 |
| tiger-ai-lab/uniir | Trains and evaluates a universal multimodal retrieval model to perform various information retrieval tasks. | 114 |
| pleisto/yuren-baichuan-7b | A multimodal large language model that integrates natural language and visual capabilities, with fine-tuning for various tasks. | 73 |
| xverse-ai/xverse-v-13b | A large multimodal model for visual question answering, trained on a dataset of 2.1B image-text pairs and 8.2M instruction sequences. | 78 |
| subho406/omninet | An implementation of a unified architecture for multimodal multi-task learning using PyTorch. | 515 |
| multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 15 |
| uw-madison-lee-lab/cobsat | Provides a benchmarking framework and dataset for evaluating the performance of large language models in text-to-image tasks. | 30 |
| kohjingyu/fromage | A framework for grounding language models to images and handling multimodal inputs and outputs. | 478 |
| openbmb/viscpm | A family of large multimodal models supporting multimodal conversational capabilities and text-to-image generation in multiple languages. | 1,098 |
| pku-yuangroup/languagebind | Extends pretrained models to handle multiple modalities by aligning language and video representations. | 751 |