VILA

Video understanding framework

A visual language model that leverages pre-trained models and large-scale training to understand images and video, enabling applications such as video reasoning and in-context learning.

VILA is a multi-image visual language model with training, inference, and evaluation recipes, deployable from cloud to edge (Jetson Orin and laptops).

GitHub

Stars: 2k
Watchers: 32
Forks: 168
Language: Python
Last commit: about 1 month ago