Lu Zhao, Liming Yuan, Zhenliang Li, Xianbin Wen
Multi-Instance Learning (MIL) is a weakly supervised learning paradigm in which every training example is a labeled bag of unlabeled instances. In typical MIL applications, instances describe the features of regions/parts of a whole object, e.g., regional patches/lesions in an eye-fundus image. However, for a (semantically) complex part, the standard MIL formulation places a heavy burden on the representational ability of the corresponding instance. To alleviate this pressure, we still adopt a bag of instances as an example in this paper, but extract from each instance a set of representations using $1 \times 1$ convolutions. The advantages of this tactic are two-fold: i) this set of representations can be regarded as multi-view representations of an instance; ii) compared to building multi-view representations directly from scratch, extracting them automatically using $1 \times 1$ convolutions is more economical, and may be more effective since $1 \times 1$ convolutions can be embedded into the whole network. Furthermore, we apply two consecutive multi-instance pooling operations to the reconstituted bag, which has effectively become a bag of sets of multi-view representations. We have conducted extensive experiments on several canonical MIL data sets from different application domains. The experimental results show that the proposed framework outperforms the standard MIL formulation in terms of classification performance and has good interpretability.
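The pipeline described above — extract multi-view representations per instance with $1 \times 1$ convolutions, then pool twice (over views, then over instances) — can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the dimensions, the use of max pooling for both stages, and the filter shapes are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): a bag of N instances,
# each described by a d-dimensional feature vector; K views per instance.
N, d, K = 5, 8, 4

bag = rng.normal(size=(N, d))     # one MIL bag: N unlabeled instances

# A 1x1 convolution over the instance axis is equivalent to applying the
# same linear map to every instance; here, K filters each map an instance
# to one d-dimensional view.
W = rng.normal(size=(K, d, d))    # K view-extraction filters (assumed shape)

# Extract K views per instance -> shape (N, K, d):
# the bag is now a "bag of sets of multi-view representations".
views = np.einsum('kde,ne->nkd', W, bag)

# First MIL pooling: aggregate over the views of each instance.
instance_repr = views.max(axis=1)   # (N, d)

# Second MIL pooling: aggregate over instances into a bag representation,
# which a classifier head would then consume.
bag_repr = instance_repr.max(axis=0)  # (d,)

print(bag_repr.shape)  # (8,)
```

In a real network the two pooling operators need not be max pooling; attention-based or mean pooling are common MIL choices, and the $1 \times 1$ filters would be learned end-to-end with the rest of the model.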