Vary

Document comprehension model

An implementation of a method for scaling up the vision vocabulary of large vision-language models to improve document understanding and recognition

[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.

GitHub

2k stars
54 watching
159 forks
Language: Python
Last commit: 16 days ago

Related projects:

| Repository | Description | Stars |
|---|---|---|
| 360cvgroup/360vl | A large multimodal model built on the Llama3 language model, designed to improve image understanding | 32 |
| sergioburdisso/pyss3 | A Python package implementing an interpretable machine learning model for text classification, with visualization tools | 336 |
| sicara/tf-explain | A library providing interpretability methods for TensorFlow 2.x models | 1,019 |
| interpretml/dice | Generates counterfactual explanations for machine learning models to support interpretability | 1,373 |
| byungkwanlee/collavo | A PyTorch implementation of an enhanced vision-language model | 93 |
| akosiorek/attend_infer_repeat | An implementation of Attend, Infer, Repeat, a method for fast scene understanding with generative models | 82 |
| jalammar/ecco | An interactive visualization library for exploring and understanding transformer-based language models | 1,986 |
| byungkwanlee/moai | Improves vision-language task performance by integrating computer vision capabilities into large language models | 314 |
| shizhediao/davinci | A unified modal learning framework for generative vision-language models | 43 |
| msracver/fcis | A fully convolutional instance-aware semantic segmentation framework implemented with CUDA | 1,567 |
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models | 85 |
| yuweihao/mm-vet | Evaluates large multimodal models on a diverse set of tasks and metrics | 274 |
| tca19/dict2vec | A framework for learning word embeddings from lexical dictionaries | 115 |
| baaivision/eve | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 246 |
| lxtgh/omg-seg | An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336 |