In image captioning, high-level image features are often used to represent the scene because they carry rich semantic information. However, high-level features mainly express global information, so local information about small objects is easily lost. This makes small objects difficult to describe and prevents the model from meeting finer-grained description requirements. To describe the rich semantic information in an image while retaining more descriptions of small objects, an image captioning method based on layer feature attention is proposed. A layer feature attention module is designed to fit the existing structure of the Transformer decoder. Using multi-layer image features, each layer of the decoder stack determines how much attention to pay to each image feature layer during decoding and dynamically learns the similarity between each layer's features and the semantic features of the sequence, which improves the quality of the generated sentences.
Ammara Sattar, Muhammad Assam, Tahani Jaser Alahmadi, Uzair Aslam Bhatti, Hao Tang, Muhammad Aamir
Chan He, Qiujuan Tong, Xiaobao Yang, Jun Wang, Tingge Zhu
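
The idea in the abstract of letting the decoder weigh image features from several layers can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `layer_feature_attention`, the use of scaled dot-product similarity, the absence of learned projection matrices, and the feature shapes are all assumptions made for illustration. The sketch first attends over the positions within each feature layer, then computes a softmax weight per layer from the similarity between the decoder state and each layer's attended feature.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_feature_attention(hidden, layer_feats):
    """Hypothetical sketch of attention over multi-layer image features.

    hidden:      (T, d) decoder states, one per generated token.
    layer_feats: list of (N_l, d) arrays, one per image feature layer
                 (assumed already projected to the same width d).
    Returns the fused features (T, d) and the per-layer weights (T, L).
    """
    d = hidden.shape[-1]
    attended = []
    for feats in layer_feats:
        # Attention over spatial positions within one feature layer.
        scores = hidden @ feats.T / np.sqrt(d)             # (T, N_l)
        attended.append(softmax(scores, axis=-1) @ feats)  # (T, d)
    attended = np.stack(attended, axis=1)                  # (T, L, d)

    # Similarity between each token state and each layer's attended feature.
    layer_scores = np.einsum("td,tld->tl", hidden, attended) / np.sqrt(d)
    layer_weights = softmax(layer_scores, axis=-1)         # (T, L)

    # Weighted sum over layers gives one fused feature per token.
    fused = np.einsum("tl,tld->td", layer_weights, attended)
    return fused, layer_weights

# Toy inputs: 5 tokens, width 16, three feature layers of different resolution.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))
layer_feats = [rng.normal(size=(n, 16)) for n in (49, 196, 784)]
fused, weights = layer_feature_attention(hidden, layer_feats)
print(fused.shape, weights.shape)
```

In a trained model the similarity would normally pass through learned query, key, and value projections inside each decoder layer; the sketch omits them to keep the per-layer weighting step visible.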