Vary

Document comprehension model

An implementation of a vision vocabulary model for large vision-language models, intended to improve document understanding and recognition.

[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.

2k stars
54 watching
158 forks
Language: Python
Last commit: about 2 months ago

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| 360cvgroup/360vl | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities | 30 |
| sergioburdisso/pyss3 | A Python package implementing an interpretable machine learning model for text classification with visualization tools | 336 |
| sicara/tf-explain | A library providing interpretability methods for TensorFlow 2.x models | 1,018 |
| interpretml/dice | Provides counterfactual explanations for machine learning models to facilitate interpretability and understanding | 1,364 |
| byungkwanlee/collavo | Develops a PyTorch implementation of an enhanced vision language model | 93 |
| akosiorek/attend_infer_repeat | An implementation of Attend, Infer, Repeat, a method for fast scene understanding using generative models | 82 |
| jalammar/ecco | An interactive visualization library for exploring and understanding transformer-based language models | 1,985 |
| byungkwanlee/moai | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 311 |
| shizhediao/davinci | An implementation of vision-language models for multimodal learning tasks, enabling generative vision-language models to be fine-tuned for various applications | 43 |
| msracver/fcis | An implementation of a deep learning framework for instance-aware semantic segmentation | 1,566 |
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models | 75 |
| yuweihao/mm-vet | Evaluates the capabilities of large multimodal models using a set of diverse tasks and metrics | 267 |
| tca19/dict2vec | A framework to learn word embeddings using lexical dictionaries | 115 |
| baaivision/eve | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 230 |
| lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,300 |