R2D2

Vision-Language Framework

A framework for large-scale cross-modal benchmarks and vision-language tasks in Chinese

157 stars

2 watching

23 forks

Language: Python

last commit: over 2 years ago

Related projects:

Repository	Description	Stars
zhourax/vega	Develops a multimodal task and dataset to assess vision-language models' ability to handle interleaved image-text inputs.	33
hxyou/idealgpt	A deep learning framework for iteratively decomposing vision and language reasoning via large language models.	32
shizhediao/davinci	Implementing a unified modal learning framework for generative vision-language models	43
yuliang-liu/monkey	An end-to-end image captioning system that uses large multi-modal models and provides tools for training, inference, and demo usage.	1,849
yiren-jian/blitext	Develops and trains models for vision-language learning with decoupled language pre-training	24
wpiroboticsprojects/grip	A computer vision framework for robotics applications that simplifies the creation of vision systems and generates code in multiple programming languages.	380
vlf-silkie/vlfeedback	An annotated preference dataset and training framework for improving large vision language models.	88
nvlabs/prismer	A deep learning framework for training multi-modal models with vision and language capabilities.	1,299
baai-wudao/brivl	Pre-trains a multilingual model to bridge vision and language modalities for various downstream applications	279
openuc2/uc2-git	An open-source software framework for building modular electro-optical projects with interchangeable components	468
byungkwanlee/moai	Improves performance of vision language tasks by integrating computer vision capabilities into large language models	314
xiaoyufenfei/lednet	A lightweight deep learning framework for real-time semantic segmentation	514
yulingtianxia/core-ml-sample	A demo project demonstrating the integration of Core ML and Vision Framework with Swift 4 for image classification using an Inception V3 network.	217
vishaal27/sus-x	This is an open-source project that proposes a novel method to train large-scale vision-language models with minimal resources and no fine-tuning required.	94
kohjingyu/fromage	A framework for grounding language models to images and handling multimodal inputs and outputs	478