Implementation of a Model for Generating Text Descriptions from Images Using Deep Learning
Keywords:
Image Captioning, Deep Learning, ResNet-50, BLIP, Computer Vision, Natural Language Processing, Transformers, Self-Attention, Text Generation
Abstract
Advancements in deep learning and computer vision have significantly increased the relevance of automatic image captioning across various fields such as digital content creation, accessibility for visually impaired individuals, and intelligent multimedia systems. However, many existing models are reliant on high-performance computational resources, particularly Graphics Processing Units (GPUs), which limits their accessibility in resource-constrained environments.
This study proposes a novel image captioning model that combines ResNet-50 for efficient feature extraction with BLIP (Bootstrapping Language-Image Pre-training) for text generation, optimized to run efficiently on Central Processing Units (CPUs). Despite operating in a CPU-only environment, the model achieves competitive performance on standard evaluation metrics such as BLEU, METEOR, and CIDEr, making it suitable for low-resource settings. This work contributes to making advanced image captioning technologies more widely available, particularly to academic researchers and smaller organizations, while also enhancing digital content accessibility for visually impaired individuals.
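To illustrate the kind of pipeline the abstract describes, the following is a minimal sketch of CPU-only captioning with a pretrained BLIP model and ResNet-50 feature extraction, using Hugging Face Transformers and torchvision. The checkpoint name, the example image path, and the way the two components are shown side by side are illustrative assumptions; the abstract does not specify how the authors integrate ResNet-50 features into BLIP.

```python
# Sketch only: CPU-only image captioning with BLIP, plus ResNet-50 feature extraction.
# Assumptions: the "Salesforce/blip-image-captioning-base" checkpoint and the image path
# are placeholders; the paper's actual fusion of ResNet-50 features with BLIP is not shown here.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights
from transformers import BlipProcessor, BlipForConditionalGeneration

device = torch.device("cpu")  # force CPU-only execution

# ResNet-50 as a feature extractor: drop the classification head, keep the pooled features.
weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).to(device).eval()
preprocess = weights.transforms()

# BLIP processor and caption generator (hypothetical checkpoint choice).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device).eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path

with torch.no_grad():
    # 2048-dimensional global image feature from ResNet-50.
    features = feature_extractor(preprocess(image).unsqueeze(0).to(device)).flatten(1)

    # Caption generated by BLIP on CPU.
    inputs = processor(images=image, return_tensors="pt").to(device)
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(output_ids[0], skip_special_tokens=True)

print(features.shape)  # torch.Size([1, 2048])
print(caption)
```

In this sketch the two components run independently; it is meant only to show that both stages are feasible without a GPU, not to reproduce the authors' optimization or their coupling of the ResNet-50 encoder with BLIP's text decoder.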
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.