R2D2

Vision-Language Framework

A framework for large-scale cross-modal benchmarks and vision-language tasks in Chinese

GitHub

157 stars
2 watching
23 forks
Language: Python
Last commit: about 1 year ago

Related projects:

Repository | Description | Stars
zhourax/vega | Develops a multimodal task and dataset to assess vision-language models' ability to handle interleaved image-text inputs. | 33
hxyou/idealgpt | A deep learning framework for iteratively decomposing vision and language reasoning via large language models. | 32
shizhediao/davinci | Implements a unified modal learning framework for generative vision-language models. | 43
yuliang-liu/monkey | An end-to-end image captioning system that uses large multi-modal models and provides tools for training, inference, and demo usage. | 1,849
yiren-jian/blitext | Develops and trains models for vision-language learning with decoupled language pre-training. | 24
wpiroboticsprojects/grip | A computer vision framework for robotics applications that simplifies the creation of vision systems and generates code in multiple programming languages. | 380
vlf-silkie/vlfeedback | An annotated preference dataset and training framework for improving large vision-language models. | 88
nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,299
baai-wudao/brivl | Pre-trains a multilingual model to bridge vision and language modalities for various downstream applications. | 279
openuc2/uc2-git | An open-source software framework for building modular electro-optical projects with interchangeable components. | 468
byungkwanlee/moai | Improves performance on vision-language tasks by integrating computer vision capabilities into large language models. | 314
xiaoyufenfei/lednet | A lightweight deep learning framework for real-time semantic segmentation. | 514
yulingtianxia/core-ml-sample | A demo project showing the integration of Core ML and the Vision framework with Swift 4 for image classification using an Inception V3 network. | 217
vishaal27/sus-x | Proposes a method to train large-scale vision-language models with minimal resources and no fine-tuning. | 94
kohjingyu/fromage | A framework for grounding language models to images and handling multimodal inputs and outputs. | 478