Video-ChatGPT
Video conversational model
A video conversation model that generates meaningful conversations about videos using large vision and language models
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
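The description above leaves the step "adapted for spatiotemporal video representation" abstract. The sketch below shows one plausible way to turn per-frame image-encoder features into video-level tokens for an LLM. Each frame's patch features are averaged into one temporal token, and each patch position is averaged across frames into one spatial token. The two sets are concatenated and mapped into the LLM's embedding size with a linear layer. This is a minimal pure-Python illustration under stated assumptions, not code from the Video-ChatGPT repository. The function names `spatiotemporal_pool` and `project`, the toy shapes, and the linear adapter weights are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # a list of D-dimensional feature vectors


def mean_rows(rows: Matrix) -> List[float]:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def spatiotemporal_pool(frames: List[Matrix]) -> Matrix:
    """Pool per-frame patch features into video-level tokens (hypothetical helper).

    frames[t][p] is the D-dim feature of patch p in frame t (T frames, N patches).
    Returns T temporal tokens (each frame averaged over its patches) followed by
    N spatial tokens (each patch position averaged over all frames): (T + N) x D.
    """
    temporal = [mean_rows(frame) for frame in frames]  # T x D
    spatial = [
        mean_rows([frame[p] for frame in frames])  # average patch p over time
        for p in range(len(frames[0]))
    ]  # N x D
    return temporal + spatial


def project(tokens: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Hypothetical linear adapter mapping D-dim tokens to an E-dim LLM space.

    weight is D x E and bias has length E; computes tok @ weight + bias per token.
    """
    return [
        [sum(x * weight[d][e] for d, x in enumerate(tok)) + bias[e]
         for e in range(len(bias))]
        for tok in tokens
    ]


# Toy example: T = 2 frames, N = 3 patches per frame, D = 2.
frames = [
    [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]],
    [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]],
]
video_tokens = spatiotemporal_pool(frames)  # T + N = 5 tokens
llm_inputs = project(video_tokens, [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]], [0.0, 0.0, 0.5])
print(video_tokens)
```

The appeal of pooling like this is that the token count grows as T + N rather than T × N. Long clips therefore stay within an LLM's context budget while keeping both per-frame (temporal) and per-location (spatial) summaries.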
1k stars
15 watching
110 forks
Language: Python
Last commit: 6 months ago
Topics: chatbot, clip, gpt-4, llama, llava, mulit-modal, vicuna, video-chatboat, video-conversation, vision-language, vision-language-pretraining
Related projects:
| Repository | Description | Stars |
|---|---|---:|
| | An end-to-end trained model capable of generating natural language responses integrated with object segmentation masks for interactive visual conversations | 797 |
| | A conversational language model developed to improve understanding of complex instructions and Chinese vocabulary | 62 |
| | An advanced language model designed to generate human-like responses in various domains and applications | 101 |
| | A commercially viable web application for conversational AI built with React and OpenAI's ChatGPT technology | 1,366 |
| | A large language model with 70 billion parameters designed for chatbot and conversational AI tasks | 29 |
| | A large language model designed to understand long videos by binding visual content with timestamps and producing video token sequences of varying lengths | 314 |
| | A large language model designed to support long-context conversations with improved efficiency and effectiveness | 38 |
| | Trains a multimodal chatbot that combines visual and language instructions to generate responses | 1,478 |
| | A conversational AI demo powered by a large language model | 78 |
| | An evaluation platform for comparing multimodal models on visual question-answering tasks | 478 |
| | A Discord bot that enables interactive conversations with ChatGPT using a single command | 291 |
| | A multimodal chatbot with computer vision capabilities integrated into a single model | 99 |
| | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
| | An intelligent system that automatically controls and uses visual foundation models to interact with images in conversational settings | 762 |
| | Transforms video content into a long document containing visual and audio information that can be used for chat or other applications | 545 |