LAVIS

vision-language toolkit

A library that provides pre-trained models and frameworks for vision-language tasks such as image captioning and visual question answering.

LAVIS - A One-stop Library for Language-Vision Intelligence

GitHub

10k stars
99 watching
972 forks
Language: Jupyter Notebook
Last commit: about 1 month ago
Linked from 3 awesome lists

deep-learning, deep-learning-library, image-captioning, multimodal-datasets, multimodal-deep-learning, salesforce, vision-and-language, vision-framework, vision-language-pretraining, vision-language-transformer, visual-question-anwsering

Related projects:

Repository | Description | Stars
haotian-liu/llava | A system that uses large language and vision models to generate and process visual instructions | 20,232
nvidia/nemo | A scalable generative AI framework for creating and deploying large language models and multimodal models | 12,118
luogen1996/lavin | An open-source implementation of a vision-language instructed large language model | 508
freedomintelligence/allava | A collection of datasets and models designed to support the training of lite vision-language models | 246
dvlab-research/mgm | An open-source framework for training large language models with vision capabilities | 3,211
llava-vl/llava-next | Develops large multimodal models for various computer vision tasks including image and video analysis | 2,872
vision-cair/minigpt-4 | Enabling vision-language understanding by fine-tuning large language models on visual data | 25,422
eleutherai/lm-evaluation-harness | Provides a unified framework to test generative language models on various evaluation tasks | 6,970
qwenlm/qwen2-vl | A multimodal large language model series developed by the Qwen team to understand and process images, videos, and text | 3,093
qwenlm/qwen-vl | A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,045
jy0205/lavit | A unified framework for training large language models to understand and generate visual content | 528
google-research/big_vision | Supports large-scale vision model training on GPU machines or Google Cloud TPUs using scalable input pipelines | 2,334
optimalscale/lmflow | A toolkit for finetuning large language models and providing efficient inference capabilities | 8,273
nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities | 1,298
sgl-project/sglang | A framework for serving large language models and vision models with efficient runtime and flexible interface | 6,082