UnIVAL
Multitask model
A unified model for image, video, audio, and language tasks that can be fine-tuned for various downstream applications.
[TMLR23] Official implementation of UnIVAL: Unified Model for Image, Video, Audio and Language Tasks.
224 stars · 5 watching · 22 forks · Language: Jupyter Notebook · Last commit: about 1 year ago
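The card above describes UnIVAL as one model shared across image, video, audio, and language tasks. As a rough illustration of that "unified backbone" idea only, the toy PyTorch sketch below projects each modality into a shared token space and runs everything through a single transformer; the class name, layer sizes, and feature dimensions are hypothetical stand-ins and do not reflect UnIVAL's actual architecture or API, which lives in the official repository.

```python
# Toy "one backbone, many modalities" sketch. All names and sizes are
# illustrative assumptions, NOT UnIVAL's actual code.
import torch
import torch.nn as nn

class UnifiedMultimodalModel(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        # One lightweight projection per modality into the shared embedding space.
        self.image_proj = nn.Linear(512, d_model)  # e.g. pooled ViT patch features
        self.audio_proj = nn.Linear(128, d_model)  # e.g. mel-spectrogram frames
        self.video_proj = nn.Linear(512, d_model)  # e.g. per-frame visual features
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # A single shared transformer consumes the concatenated token sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)  # map back to text tokens

    def forward(self, text_ids, image_feats=None, audio_feats=None, video_feats=None):
        tokens = [self.text_embed(text_ids)]
        if image_feats is not None:
            tokens.append(self.image_proj(image_feats))
        if audio_feats is not None:
            tokens.append(self.audio_proj(audio_feats))
        if video_feats is not None:
            tokens.append(self.video_proj(video_feats))
        hidden = self.backbone(torch.cat(tokens, dim=1))
        return self.lm_head(hidden)

# Toy forward pass: batch of 2, 8 text tokens, 4 image patches, 6 audio frames.
model = UnifiedMultimodalModel()
logits = model(
    text_ids=torch.randint(0, 1000, (2, 8)),
    image_feats=torch.randn(2, 4, 512),
    audio_feats=torch.randn(2, 6, 128),
)
print(logits.shape)  # torch.Size([2, 18, 1000])
```

Under this (assumed) framing, fine-tuning for a downstream task amounts to training the one shared backbone against task-specific text targets, which is the pattern the description above points at.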
Related projects:
Repository | Description | Stars |
---|---|---|
mshukor/evalign-icl | Evaluates and improves large multimodal models through in-context learning. | 21 |
jiayuzhou/malsar | Provides a comprehensive framework for multi-task learning via structural regularization in MATLAB. | 133 |
haozhezhao/mic | Develops a multimodal vision-language model to enable machines to understand complex relationships between instructions and images in various tasks. | 337 |
yfzhang114/slime | Develops large multimodal models for high-resolution understanding and analysis of text, images, and other data types. | 143 |
joez17/chatbridge | A unified multimodal language model capable of interpreting and reasoning about various modalities without paired data. | 49 |
yuliang-liu/monkey | An end-to-end image captioning system that uses large multimodal models and provides tools for training, inference, and demo usage. | 1,849 |
tiger-ai-lab/uniir | Trains and evaluates a universal multimodal retrieval model to perform various information retrieval tasks. | 114 |
pleisto/yuren-baichuan-7b | A multimodal large language model that integrates natural language and visual capabilities and can be fine-tuned for various tasks. | 73 |
xverse-ai/xverse-v-13b | A large multimodal model for visual question answering, trained on a dataset of 2.1B image-text pairs and 8.2M instruction sequences. | 78 |
subho406/omninet | An implementation of a unified architecture for multimodal multi-task learning using PyTorch. | 515 |
multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 15 |
uw-madison-lee-lab/cobsat | Provides a benchmarking framework and dataset for evaluating the performance of large language models on text-to-image tasks. | 30 |
kohjingyu/fromage | A framework for grounding language models to images and handling multimodal inputs and outputs. | 478 |
openbmb/viscpm | A family of large multimodal models supporting multimodal conversational capabilities and text-to-image generation in multiple languages. | 1,098 |
pku-yuangroup/languagebind | Extends pretrained models to handle multiple modalities by aligning language and video representations. | 751 |