Image captioning automatically generates annotations or labels for images, describing the objects, scenes, or activities they contain. This is beneficial in applications that require large-scale image analysis, such as image recognition, recommendation systems, and content classification. Traditional methods for generating image descriptions rely on rule-based systems, which are time-consuming and cannot capture the contextual details of complex images. The proposed method combines the power of two state-of-the-art components: the attention mechanism and the Transformer. The attention mechanism is a powerful tool for multimodal understanding that bridges vision and language processing, leveraging pretraining on large-scale datasets to capture semantic relationships between images and text. By combining the multimodal understanding of the attention mechanism with the sequence-modelling capabilities of the Transformer, superior performance on the image captioning task can be achieved. Adding more Transformer layers enables the model to better capture complicated contextual dependencies in input captions. The proposed system's performance is assessed using BLEU-1, BLEU-2, BLEU-3, BLEU-4, and other objective evaluation criteria.
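The core operation behind the attention mechanism mentioned above is scaled dot-product attention, in which each query produces a softmax-weighted average over a set of value vectors. The following is a minimal pure-Python sketch for illustration only; the function names and list-of-lists representation are our own assumptions, not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Illustrative sketch (not the paper's code).
    Q, K, V are lists of vectors (lists of floats).
    Each query attends over the keys; the output is a
    softmax-weighted average of the value vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot-product similarity of the query with every key,
        # scaled by sqrt(d_k) as in the Transformer.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

When all keys are identical, every value receives equal weight and the output is simply the mean of the value vectors, which makes the weighting behaviour easy to verify by hand.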