AGLA

Image descriptor model

Reduces object hallucinations in large vision-language models, and so improves the accuracy of their image descriptions, by combining global and local attention.

[Arxiv 2024] AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
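
As a rough illustration of what "assembling" global and local attention could look like at decoding time, the sketch below fuses two sets of next-token logits: one from the full image (global view) and one from a masked view that keeps only query-relevant regions (local view). It is not taken from the AGLA repository. The function name `fuse_logits`, the weight `alpha`, the additive fusion rule, and all numbers are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(global_logits, local_logits, alpha=1.0):
    """Hypothetical fusion rule: add alpha-weighted local logits to the global ones."""
    if len(global_logits) != len(local_logits):
        raise ValueError("logit vectors must have the same length")
    return [g + alpha * l for g, l in zip(global_logits, local_logits)]


# Toy vocabulary with made-up logits (assumed values, not model output).
vocab = ["dog", "cat", "frisbee"]
global_logits = [2.0, 1.9, 0.5]  # full image: "dog" and "cat" are nearly tied
local_logits = [2.5, 0.2, 0.4]   # masked view: only "dog" is well supported

fused = softmax(fuse_logits(global_logits, local_logits, alpha=1.0))
best = vocab[fused.index(max(fused))]
print(best, [round(p, 3) for p in fused])
```

In this toy case the global view alone barely separates "dog" from "cat", while the local view pushes probability toward the token that the relevant image region supports. A larger `alpha` gives the local view more influence.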

GitHub

18 stars
2 watching
0 forks
Language: Python
last commit: 6 months ago

Related projects:

Repository | Description | Stars
byungkwanlee/collavo | Develops a PyTorch implementation of an enhanced vision language model | 93
deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications | 2,145
dvlab-research/lisa | A system that uses large language models to generate segmentation masks for images based on complex queries and world knowledge | 1,923
baaivision/eve | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 246
yiyangzhou/lure | Analyzing and mitigating object hallucination in large vision-language models to improve their accuracy and reliability | 136
andy971022/auto-lama | Automates object removal from images using computer vision techniques | 99
yfzhang114/llava-align | Debiasing techniques to minimize hallucinations in large visual language models | 75
damo-nlp-sg/vcd | An approach to reduce object hallucinations in large vision-language models by contrasting output distributions derived from original and distorted visual inputs | 222
ayoolaolafenwa/pixellib | A deep learning library for image segmentation and object detection using PyTorch | 1,054
umass-foundation-model/3d-llm | Developing a Large Language Model capable of processing 3D representations as inputs | 979
opengvlab/visionllm | A large language model designed to process and generate visual information | 956
byungkwanlee/moai | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 314
mshukor/evalign-icl | Evaluating and improving large multimodal models through in-context learning | 21
algolzw/daclip-uir | This project controls vision-language models to restore degraded images in various environments and conditions | 673
uclanlp/elmo-c | Efficient Contextual Representation Learning Model with Continuous Outputs | 4