JOURNAL ARTICLE

Zero-Shot Human-Object Interaction (HOI) Classification by Bridging Generative and Contrastive Image-Language Models

Abstract

Existing studies in Human-Object Interaction (HOI) classification rely on costly human-annotated labels. This paper studies a new zero-shot setup that removes the dependency on ground-truth labels. We propose a novel Heterogeneous Teacher-Student (HTS) framework and a new loss function. HTS employs a generative pre-trained image captioner as the teacher and a contrastive pre-trained classifier as the student, combining the discriminability of generative pre-training with the efficiency of contrastive pre-training. To facilitate learning of HOI in this setup, we introduce pseudo-label filtering, which aggregates HOI probabilities from multiple regional captions to supervise the student. To enhance the student's multi-label learning on few-shot classes, we design the LogSumExp (LSE)-Sign loss, which features a dynamic gradient re-weighting mechanism. Ultimately, the student achieves 49.6 mAP on the HICO dataset without using ground truth, becoming a new state-of-the-art method that outperforms supervised approaches. Code is available.
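The abstract names the LSE-Sign loss but does not give its formula. As a rough illustration only, the sketch below assumes one plausible form suggested by the name: a LogSumExp over sign-flipped class scores, log(1 + sum_i exp(-gamma * y_i * s_i)), with y_i = +1 for a pseudo-positive class and -1 otherwise. The function names, the `gamma` scale, and this exact form are assumptions made for illustration, not the authors' implementation. Under this assumed form, the gradient for each class is a softmax-like weight, so the most badly violated classes receive the largest updates, which is one way a dynamic gradient re-weighting can arise.

```python
import math


def lse_sign_loss(scores, signs, gamma=1.0):
    """Assumed form: log(1 + sum_i exp(-gamma * y_i * s_i)), y_i in {+1, -1}."""
    terms = [-gamma * y * s for s, y in zip(scores, signs)]
    # Shift by the largest exponent (including the implicit 0 of the "1 +")
    # so that exp() never overflows.
    m = max(0.0, max(terms))
    total = math.exp(-m) + sum(math.exp(t - m) for t in terms)
    return m + math.log(total)


def lse_sign_grad(scores, signs, gamma=1.0):
    """Analytic gradient of the assumed loss with respect to each score."""
    terms = [-gamma * y * s for s, y in zip(scores, signs)]
    m = max(0.0, max(terms))
    denom = math.exp(-m) + sum(math.exp(t - m) for t in terms)
    # Each class gets a weight exp(t_i) / (1 + sum_j exp(t_j)); the weights
    # depend on all classes jointly and are largest for the worst violations.
    return [-gamma * y * math.exp(t - m) / denom for t, y in zip(terms, signs)]


if __name__ == "__main__":
    scores = [2.0, -1.0, 0.5]   # hypothetical student logits for three HOI classes
    signs = [1, 1, -1]          # hypothetical pseudo-labels: +1 positive, -1 negative
    print(lse_sign_loss(scores, signs))
    print(lse_sign_grad(scores, signs))
```

In this toy input, the second class is a pseudo-positive with a negative score, so it is the most violated class and receives the largest gradient magnitude.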

Keywords:
Computer science; Bridging (networking); Generative grammar; Classifier (UML); Weighting; Artificial intelligence; Ground truth; Generative model; GRASP; Pattern recognition (psychology); Dependency (UML); Object (grammar); Natural language processing; Machine learning

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 32
Citation Normalized Percentile: 0.11

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Domain Adaptation and Few-Shot Learning (Physical Sciences → Computer Science → Artificial Intelligence)