"Indeed, Image Captioning has become a crucial
aspect of contemporary artificial intelligence because it has
tackled two crucial parts of the AI field: Computer Vision and
Natural Language Processing. Currently, Bangla stands as the
seventh most widely spoken language globally. Due to this, image
captioning has gained recognition for its significant research
accomplishments. Many established datasets are found in English
but no standard datasets in Bangla. For our research, we have
used the BAN-Cap dataset which contains 8091 images with
40455 sentences. Many effective encoder-decoder and Visual
Attention approaches are used for image captioning where CNN
is utilized for the encoder and RNN is used for the decoder.
In this study, however, we propose a transformer-based image captioning model that uses different pre-trained image feature extraction networks, namely ResNet50, InceptionV3, and VGG16, on the BAN-Cap dataset. We evaluate its efficiency and accuracy with standard performance metrics such as BLEU, METEOR, ROUGE, and CIDEr, and we also identify the drawbacks of other models.