Multi-label image classification is a fundamental task that seeks to assign an image all of its applicable labels. In recent years, many deep convolutional neural network (CNN)-based approaches have been proposed that discover label semantics and learn semantic image representations by modeling label correlations. However, because of the limited representational capability of convolutional kernels, some small and visually similar objects still cannot be predicted accurately. To address this problem, this paper introduces the Twins transformer. Because the image representations at different stages of this model capture features at different levels and scales and have different discriminative capacities, we design a Multi-Stage semantic Attention with Transformer (MSAT) framework that exploits this built-in multi-stage mechanism to learn semantic image representations, while employing a three-layer standard transformer decoder as an effective feature-fusion component. Experiments conducted on the VOC 2007 dataset show that MSAT achieves better experimental results and improves the performance of multi-label image classification to some extent.
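The described design can be illustrated with a minimal PyTorch sketch (this is an illustrative assumption, not the authors' implementation): multi-stage backbone features are concatenated into a memory sequence, and a set of learnable per-label queries attends to them through a three-layer standard transformer decoder, producing one logit per label. All module and parameter names below are hypothetical.

```python
# Hypothetical sketch of multi-stage feature fusion with a 3-layer
# transformer decoder; NOT the authors' code.
import torch
import torch.nn as nn

class MultiStageFusion(nn.Module):
    def __init__(self, dim=256, num_labels=20, nhead=8):
        super().__init__()
        # one learnable query per class (VOC 2007 has 20 classes)
        self.label_queries = nn.Parameter(torch.randn(num_labels, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=nhead,
                                           batch_first=True)
        # three-layer standard transformer decoder, as in the abstract
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, stage_feats):
        # stage_feats: list of (B, N_i, dim) token sequences, one per
        # backbone stage, capturing different levels/scales of features
        memory = torch.cat(stage_feats, dim=1)
        b = memory.size(0)
        queries = self.label_queries.unsqueeze(0).expand(b, -1, -1)
        fused = self.decoder(queries, memory)         # (B, num_labels, dim)
        return self.classifier(fused).squeeze(-1)     # per-label logits

# toy usage: four stages of 49 tokens each, batch of 2
feats = [torch.randn(2, 49, 256) for _ in range(4)]
logits = MultiStageFusion()(feats)
print(logits.shape)
```

A sigmoid over these logits with a binary cross-entropy loss would give the usual multi-label training setup; the actual MSAT attention design may differ.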