LanguageBind
Multimodal alignment model
Extends video-language pretraining to N modalities by aligning each modality's representations with language as the shared semantic space
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
751 stars
15 watching
52 forks
Language: Python
last commit: 10 months ago
Topics: language-central, multi-modal, pretraining, zero-shot
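The core idea is that language acts as the bind across modalities: each non-language encoder is trained (via a contrastive objective) so its embeddings land in the language embedding space, which reduces zero-shot cross-modal retrieval to a cosine-similarity lookup. Below is a minimal sketch of that scoring step, not the repository's actual API; the encoder stand-ins, feature dimensions, and variable names are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EMBED_DIM = 512  # dimensionality of the shared language-anchored space (assumed)

# Stand-in encoders: in LanguageBind each modality has its own tower, trained so
# its outputs align with the language encoder. Here we fake the towers with
# linear projections over random features; the input sizes are arbitrary.
text_encoder = torch.nn.Linear(300, EMBED_DIM)    # caption features -> joint space
video_encoder = torch.nn.Linear(1024, EMBED_DIM)  # clip features -> joint space
audio_encoder = torch.nn.Linear(128, EMBED_DIM)   # spectrogram features -> joint space

def embed(encoder, feats):
    """Project modality features into the shared space and L2-normalize,
    so dot products below are cosine similarities."""
    return F.normalize(encoder(feats), dim=-1)

# One caption, one video clip, one audio clip (random stand-in features).
text = embed(text_encoder, torch.randn(1, 300))
video = embed(video_encoder, torch.randn(1, 1024))
audio = embed(audio_encoder, torch.randn(1, 128))

# Every modality is aligned to language, so similarity to the text embedding
# scores each modality directly; video and audio also become comparable to
# each other through the shared space, without a paired video-audio objective.
print("text-video similarity:", (text @ video.T).item())
print("text-audio similarity:", (text @ audio.T).item())
print("video-audio similarity:", (video @ audio.T).item())
```

Because each modality is aligned to language independently, a new modality can be added by training only its own tower against the frozen language encoder, and it becomes comparable to the existing modalities through the shared space.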
Related projects:
Repository | Description | Stars |
---|---|---|
pku-alignment/align-anything | Aligns large multimodal models with human intentions and values using various algorithms and fine-tuning methods. | 270 |
pku-yuangroup/video-bench | Evaluates and benchmarks large language models' video understanding capabilities. | 121 |
multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously. | 15 |
yiren-jian/blitext | Develops and trains models for vision-language learning with decoupled language pre-training. | 24 |
jshilong/gpt4roi | Trains and deploys large language models on computer vision tasks using region-of-interest inputs. | 517 |
lyuchenyang/macaw-llm | A multi-modal language model that integrates image, video, audio, and text data to improve language understanding and generation. | 1,568 |
shawn-ieitsystems/yuan-1.0 | Large-scale language model with improved performance on NLP tasks through distributed training and efficient data processing. | 591 |
lancopku/iais | Proposes a method for calibrating attention distributions in multimodal models to improve contextualized representations of image-text pairs. | 30 |
ethanyanjiali/minchatgpt | Demonstrates the effectiveness of reinforcement learning from human feedback (RLHF) in improving small language models such as GPT-2. | 214 |
pku-yuangroup/moe-llava | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks. | 2,023 |
pku-yuangroup/chat-univi | A framework for unified visual representation in image and video understanding models, enabling efficient training of large language models on multimodal data. | 895 |
pleisto/yuren-baichuan-7b | A multi-modal large language model that integrates natural language and visual capabilities, with fine-tuning for various tasks. | 73 |
sihengli99/textbind | Enables large language models to generate multi-turn multimodal instruction-response conversations from image-caption pairs with minimal annotations. | 47 |
brightmart/xlnet_zh | Trains a large Chinese language model on massive data and provides a pre-trained model for downstream tasks. | 230 |
haozhezhao/mic | Develops a multimodal vision-language model to enable machines to understand complex relationships between instructions and images in various tasks. | 337 |