IDA-VLM
Identity-aware video model
An open-source project that adds identity-aware (ID-aware) capabilities to large vision-language models and uses visual instruction tuning data to improve movie understanding
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
26 stars
4 watching
0 forks
Language: Python
Last commit: 3 months ago

Related projects:
Repository | Description | Stars |
---|---|---|
| A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities | 32 |
| This project develops an AI model for long-term video understanding | 254 |
| An implementation of a vision language model designed for mobile devices, utilizing a lightweight downsample projector and pre-trained language models | 1,076 |
| A guide to using pre-trained large language models in source code analysis and generation | 1,789 |
| A large language model designed to process and generate visual information | 956 |
| A multimodal AI model that enables real-world vision-language understanding applications | 2,145 |
| An image-based language model that uses large language models to generate visual and text features from videos | 748 |
| An MLLM architecture designed to align visual and textual embeddings through structural alignment | 575 |
| A toolset for understanding and interpreting complex machine learning models | 22 |
| Improves large vision-language models' ability to accurately describe images by combining global and local attention mechanisms | 18 |
| A high-performance language model designed to excel in tasks like natural language understanding, mathematical computation, and code generation | 182 |
| An open-source project that enables the transfer of language understanding to vision capabilities through long context processing | 347 |
| An autonomous driving project exploring the capabilities of a visual-language model in understanding complex driving scenes and making decisions | 288 |
| An open-source benchmarking framework for evaluating cross-style visual capability of large multimodal models | 84 |
| An implementation of a neural network model for character-level language modeling | 50 |