LAVIS
vision-language toolkit
A library that provides pre-trained models and frameworks for vision-language tasks such as image captioning and visual question answering.
LAVIS - A One-stop Library for Language-Vision Intelligence
10k stars
99 watching
972 forks
Language: Jupyter Notebook
last commit: about 1 month ago
Linked from 3 awesome lists
Topics: deep-learning, deep-learning-library, image-captioning, multimodal-datasets, multimodal-deep-learning, salesforce, vision-and-language, vision-framework, vision-language-pretraining, vision-language-transformer, visual-question-anwsering
Related projects:
Repository | Description | Stars |
---|---|---|
haotian-liu/llava | A system that uses large language and vision models to generate and process visual instructions | 20,232 |
nvidia/nemo | A scalable generative AI framework for creating and deploying large language models and multimodal models | 12,118 |
luogen1996/lavin | An open-source implementation of a vision-language instructed large language model | 508 |
freedomintelligence/allava | A collection of datasets and models designed to support the training of lite vision-language models. | 246 |
dvlab-research/mgm | An open-source framework for training large language models with vision capabilities. | 3,211 |
llava-vl/llava-next | Develops large multimodal models for various computer vision tasks including image and video analysis | 2,872 |
vision-cair/minigpt-4 | A system that enables vision-language understanding by fine-tuning large language models on visual data. | 25,422 |
eleutherai/lm-evaluation-harness | Provides a unified framework to test generative language models on various evaluation tasks. | 6,970 |
qwenlm/qwen2-vl | A multimodal large language model series developed by the Qwen team to understand and process images, videos, and text. | 3,093 |
qwenlm/qwen-vl | A large vision language model with improved image reasoning and text recognition capabilities, suitable for various multimodal tasks | 5,045 |
jy0205/lavit | A unified framework for training large language models to understand and generate visual content | 528 |
google-research/big_vision | Supports large-scale vision model training on GPU machines or Google Cloud TPUs using scalable input pipelines. | 2,334 |
optimalscale/lmflow | A toolkit for finetuning large language models and providing efficient inference capabilities | 8,273 |
nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,298 |
sgl-project/sglang | A framework for serving large language models and vision models with efficient runtime and flexible interface. | 6,082 |