Vary

Document comprehension model

An implementation of a method for scaling up the vision vocabulary of large vision-language models to improve document understanding and recognition

[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.

GitHub

2k stars
54 watching
159 forks
Language: Python
Last commit: 16 days ago

Related projects:

| Repository | Description | Stars |
|---|---|---|
| 360cvgroup/360vl | A large multimodal model built on the Llama3 language model, designed to improve image understanding | 32 |
| sergioburdisso/pyss3 | A Python package implementing an interpretable machine learning model for text classification, with visualization tools | 336 |
| sicara/tf-explain | A library providing interpretability methods for TensorFlow 2.x models | 1,019 |
| interpretml/dice | Generates counterfactual explanations for machine learning models to support interpretability | 1,373 |
| byungkwanlee/collavo | A PyTorch implementation of an enhanced vision-language model | 93 |
| akosiorek/attend_infer_repeat | An implementation of Attend, Infer, Repeat, a method for fast scene understanding with generative models | 82 |
| jalammar/ecco | An interactive visualization library for exploring and understanding transformer-based language models | 1,986 |
| byungkwanlee/moai | Improves vision-language task performance by integrating computer vision capabilities into large language models | 314 |
| shizhediao/davinci | A unified modal learning framework for generative vision-language models | 43 |
| msracver/fcis | A fully convolutional instance-aware semantic segmentation framework implemented with CUDA | 1,567 |
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models | 85 |
| yuweihao/mm-vet | Evaluates large multimodal models on a diverse set of tasks and metrics | 274 |
| tca19/dict2vec | A framework for learning word embeddings from lexical dictionaries | 115 |
| baaivision/eve | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 246 |
| lxtgh/omg-seg | An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336 |