VILA

Video understanding framework

A visual language model that leverages pre-trained models and large-scale training to understand images and video, enabling applications such as video reasoning and in-context learning.

VILA is a multi-image visual language model with training, inference, and evaluation recipes, deployable from cloud to edge (Jetson Orin and laptops).

GitHub

Stars: 2k
Watchers: 32
Forks: 168
Language: Python
Last commit: about 1 month ago