UnIVAL
Multitask model
A unified model for image, video, audio, and language tasks that can be fine-tuned for various downstream applications.
[TMLR23] Official implementation of UnIVAL: Unified Model for Image, Video, Audio and Language Tasks.
224 stars
5 watching
22 forks
Language: Jupyter Notebook
last commit: 11 months ago
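The core idea behind a unified model like UnIVAL is that modality-specific encoders map images, video, audio, and text into one shared representation space consumed by a single decoder. Below is a minimal PyTorch sketch of that pattern; all names (`UnifiedModel`, `image_proj`, `audio_proj`) are purely illustrative and do not reflect UnIVAL's actual code or API.

```python
import torch
import torch.nn as nn

class UnifiedModel(nn.Module):
    """Toy stand-in for a unified multimodal model: modality-specific
    projections map inputs into a shared hidden space, and a single
    output head serves every task. Illustrative only, not UnIVAL's code."""
    def __init__(self, hidden=256, vocab=1000):
        super().__init__()
        self.image_proj = nn.Linear(3 * 224 * 224, hidden)  # image -> shared space
        self.audio_proj = nn.Linear(16000, hidden)          # audio -> shared space
        self.decoder = nn.Linear(hidden, vocab)             # shared text head

    def forward(self, image=None, audio=None):
        feats = []
        if image is not None:
            feats.append(self.image_proj(image.flatten(1)))
        if audio is not None:
            feats.append(self.audio_proj(audio))
        fused = torch.stack(feats).mean(dim=0)  # naive fusion of modalities
        return self.decoder(fused)              # logits over the vocabulary

model = UnifiedModel()
logits = model(image=torch.randn(1, 3 * 224 * 224))
print(logits.shape)  # torch.Size([1, 1000])
```

Because every modality lands in the same hidden space, fine-tuning for a new downstream task only requires swapping or retraining the shared head rather than adding a per-modality pipeline.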
Related projects:

| Repository | Description | Stars |
|---|---|---|
| mshukor/evalign-icl | Evaluates and improves large multimodal models through in-context learning. | 20 |
| jiayuzhou/malsar | A collection of multi-task learning algorithms that use structural regularization to improve performance across related tasks. | 133 |
| haozhezhao/mic | Develops a multimodal vision-language model for understanding complex relationships between instructions and images across various tasks. | 334 |
| yfzhang114/slime | Develops large multimodal models for high-resolution understanding of text, images, and other data types. | 137 |
| joez17/chatbridge | A unified multimodal language model that interprets and reasons about multiple modalities without paired data. | 47 |
| yuliang-liu/monkey | A toolkit for building conversational AI models that process image and text inputs. | 1,825 |
| tiger-ai-lab/uniir | Trains and evaluates a universal multimodal retrieval model across a range of information retrieval tasks. | 110 |
| pleisto/yuren-baichuan-7b | A multimodal large language model that integrates natural-language and visual capabilities and can be fine-tuned for various tasks. | 72 |
| xverse-ai/xverse-v-13b | A large multimodal model for visual question answering, trained on 2.1B image-text pairs and 8.2M instruction sequences. | 77 |
| subho406/omninet | A PyTorch implementation of a unified architecture for multimodal multi-task learning. | 512 |
| multimodal-art-projection/omnibench | Benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 14 |
| uw-madison-lee-lab/cobsat | A benchmarking framework and dataset for evaluating large language models on text-to-image tasks. | 28 |
| kohjingyu/fromage | A framework for grounding language models to images, handling multimodal inputs and outputs. | 478 |
| openbmb/viscpm | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages. | 1,089 |
| pku-yuangroup/languagebind | Extends video-language pretraining to additional modalities through language-based semantic alignment. | 723 |