Existing studies in Human-Object Interaction (HOI) classification rely on costly human-annotated labels. This paper studies a new zero-shot setup that removes the dependency on ground-truth labels. We propose a novel Heterogeneous Teacher-Student (HTS) framework and a new loss function. HTS employs a generatively pre-trained image captioner as the teacher and a contrastively pre-trained classifier as the student, combining the discriminability of generative pre-training with the efficiency of contrastive pre-training. To facilitate HOI learning in this setup, we introduce pseudo-label filtering, which aggregates HOI probabilities from multiple regional captions to supervise the student. To enhance the student's multi-label learning on few-shot classes, we design the LogSumExp (LSE)-Sign loss, which features a dynamic gradient re-weighting mechanism. Ultimately, the student achieves 49.6 mAP on the HICO dataset without using ground truth, setting a new state of the art and outperforming supervised approaches. Code is available.
Dario Zanca, Andrea Zugarini, Simon Dietz, Thomas Altstidl, Mark A. Turban Ndjeuha, Moumita Chakraborty, Naga Venkata Sai Jitin Jami, Leo Schwinn, Bjoern M. Eskofier
Donggoo Kang, Dasol Jeong, Hyunmin Lee, Sangwoo Park, Hasil Park, Sunkyu Kwon, Yeongjoon Kim, Joonki Paik
Lingbo Huang, Yushi Chen, Zhaokui Li, Pedram Ghamisi, Qian Du
Moyuru Yamada, Nimish Dharamshi, Ayushi Kohli, Prasad Kasu, Ainulla Khan, Manu Ghulyani