Implementation of a Model for Generating Text Descriptions from Images Using Deep Learning
Keywords:
Image Captioning, Deep Learning, ResNet-50, BLIP, Computer Vision, Natural Language Processing, Transformers, Self-Attention, Text Generation
Abstract
Advancements in deep learning and computer vision have significantly increased the relevance of automatic image captioning across various fields such as digital content creation, accessibility for visually impaired individuals, and intelligent multimedia systems. However, many existing models are reliant on high-performance computational resources, particularly Graphics Processing Units (GPUs), which limits their accessibility in resource-constrained environments.
This study proposes a novel image captioning model that combines ResNet-50 for efficient feature extraction with BLIP (Bootstrapping Language-Image Pre-training) for text generation, optimized to run efficiently on Central Processing Units (CPUs). Despite operating in a CPU-only environment, the model achieves competitive performance on standard evaluation metrics such as BLEU, METEOR, and CIDEr, making it suitable for low-resource settings. This work contributes to making advanced image captioning technologies more widely available, particularly to academic researchers and smaller organizations, while also enhancing digital content accessibility for visually impaired individuals.
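To illustrate the kind of pipeline the abstract describes, the following is a minimal sketch of CPU-only captioning with a pretrained BLIP model and ResNet-50 feature extraction, using Hugging Face Transformers and torchvision. The checkpoint name, the example image path, and the way the two components are shown side by side are illustrative assumptions; the abstract does not specify how the authors integrate ResNet-50 features into BLIP.

```python
# Sketch only: CPU-only image captioning with BLIP, plus ResNet-50 feature extraction.
# Assumptions: the "Salesforce/blip-image-captioning-base" checkpoint and the image path
# are placeholders; the paper's actual fusion of ResNet-50 features with BLIP is not shown here.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights
from transformers import BlipProcessor, BlipForConditionalGeneration

device = torch.device("cpu")  # force CPU-only execution

# ResNet-50 as a feature extractor: drop the classification head, keep the pooled features.
weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).to(device).eval()
preprocess = weights.transforms()

# BLIP processor and caption generator (hypothetical checkpoint choice).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device).eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path

with torch.no_grad():
    # 2048-dimensional global image feature from ResNet-50.
    features = feature_extractor(preprocess(image).unsqueeze(0).to(device)).flatten(1)

    # Caption generated by BLIP on CPU.
    inputs = processor(images=image, return_tensors="pt").to(device)
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(output_ids[0], skip_special_tokens=True)

print(features.shape)  # torch.Size([1, 2048])
print(caption)
```

In this sketch the two components run independently; it is meant only to show that both stages are feasible without a GPU, not to reproduce the authors' optimization or their coupling of the ResNet-50 encoder with BLIP's text decoder.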
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.