LAVIS

vision-language toolkit

A library that provides pre-trained models and frameworks for multimodal vision-language intelligence tasks such as image captioning and visual question answering.

LAVIS - A One-stop Library for Language-Vision Intelligence

GitHub

10k stars
97 watching
978 forks
Language: Jupyter Notebook
Last commit: 29 days ago
Linked from 3 awesome lists

Topics: deep-learning, deep-learning-library, image-captioning, multimodal-datasets, multimodal-deep-learning, salesforce, vision-and-language, vision-framework, vision-language-pretraining, vision-language-transformer, visual-question-anwsering

Related projects:

Repository | Description | Stars
haotian-liu/llava | A system that uses large language and vision models to generate and process visual instructions | 20,683
nvidia/nemo | A scalable generative AI framework for creating and deploying large language models and multimodal models | 12,438
luogen1996/lavin | An open-source implementation of a vision-language instructed large language model | 513
freedomintelligence/allava | A collection of datasets and models designed to support the training of lite vision-language models | 249
dvlab-research/mgm | An open-source framework for training large language models with vision capabilities | 3,229
llava-vl/llava-next | Develops large multimodal models for various computer vision tasks, including image and video analysis | 3,099
vision-cair/minigpt-4 | Enables vision-language understanding by fine-tuning large language models on visual data | 25,490
eleutherai/lm-evaluation-harness | Provides a unified framework to test generative language models on various evaluation tasks | 7,200
qwenlm/qwen2-vl | A multimodal large language model series developed by the Qwen team to understand and process images, videos, and text | 3,613
qwenlm/qwen-vl | A large vision-language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,179
jy0205/lavit | A unified framework for training large language models to understand and generate visual content | 544
google-research/big_vision | Supports large-scale vision model training on GPU machines or Google Cloud TPUs using scalable input pipelines | 2,439
optimalscale/lmflow | A toolkit for fine-tuning large machine learning models and running inference with them | 8,312
nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities | 1,299
sgl-project/sglang | A fast serving framework for large language models and vision-language models | 6,551