Zero-Shot Human-Object Interaction (HOI) Classification by Bridging Generative and Contrastive Image-Language Models

Ying Jin; Yinpeng Chen; Jianfeng Wang; Lijuan Wang; Jenq-Neng Hwang; Zicheng Liu

doi:10.1109/icip49359.2023.10222927

JOURNAL ARTICLE

Year: 2023; Pages: 1970-1974

Abstract

Existing studies in Human-Object Interaction (HOI) classification rely on costly human-annotated labels. The goal of this paper is to study a new zero-shot setup that removes the dependency on ground-truth labels. We propose a novel Heterogenous Teacher-Student (HTS) framework and a new loss function. HTS employs a generative pre-trained image captioner as the teacher and a contrastive pre-trained classifier as the student, combining the discriminability of generative pre-training with the efficiency of contrastive pre-training. To facilitate learning of HOI in this setup, we introduce pseudo-label filtering, which aggregates HOI probabilities from multiple regional captions to supervise the student. To enhance the multi-label learning of the student on few-shot classes, we design the LogSumExp (LSE)-Sign loss, which features a dynamic gradient re-weighting mechanism. Ultimately, the student achieves 49.6 mAP on the HICO dataset without using ground-truth labels, becoming a new state-of-the-art method that outperforms supervised approaches. Code is available.
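
The abstract names two mechanisms without giving their formulas: aggregating HOI probabilities over regional captions, and a LogSumExp-based loss with gradient re-weighting. The sketch below illustrates one plausible reading of each and is not the authors' implementation. The mean aggregation, the fixed threshold, the loss form log(1 + sum_i exp(-y_i * x_i)), and all function names are assumptions introduced here for illustration only.

```python
import math


def aggregate_pseudo_labels(region_probs, threshold=0.5):
    """Turn per-region HOI probabilities into sign pseudo-labels.

    region_probs: one row per regional caption, one column per HOI class.
    ASSUMPTION: rows are averaged and thresholded; the abstract does not
    specify the aggregation rule.
    Returns +1 for classes kept as positives and -1 otherwise.
    """
    num_regions = len(region_probs)
    num_classes = len(region_probs[0])
    mean_probs = [
        sum(row[c] for row in region_probs) / num_regions
        for c in range(num_classes)
    ]
    return [1 if p >= threshold else -1 for p in mean_probs]


def lse_sign_loss(logits, signs):
    """ASSUMED loss form: log(1 + sum_i exp(-y_i * x_i)), y_i in {+1, -1}."""
    total = sum(math.exp(-y * x) for x, y in zip(logits, signs))
    return math.log1p(total)


def lse_sign_grad_weights(logits, signs):
    """Gradient magnitude per class under the assumed loss.

    |d loss / d x_i| = exp(-y_i x_i) / (1 + sum_j exp(-y_j x_j)).
    The shared denominator couples the classes, so badly violated classes
    take most of the gradient: a form of dynamic re-weighting.
    """
    terms = [math.exp(-y * x) for x, y in zip(logits, signs)]
    denom = 1.0 + sum(terms)
    return [t / denom for t in terms]


# Two regional captions scored against three hypothetical HOI classes.
signs = aggregate_pseudo_labels([[0.9, 0.2, 0.4], [0.7, 0.1, 0.8]])
# Student logits: class 2 is a pseudo-positive that is scored very low.
logits = [2.0, -1.0, -3.0]
loss = lse_sign_loss(logits, signs)
weights = lse_sign_grad_weights(logits, signs)
```

In this toy case the third class is a pseudo-positive with a strongly negative logit, so it receives by far the largest gradient weight, while the two classes the student already handles contribute little. The plain `math.exp` calls can overflow for large logits; a practical version would subtract the maximum exponent first.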

Keywords:

Computer science, Bridging (networking), Generative grammar, Classifier (UML), Weighting, Artificial intelligence, Ground truth, Generative model, GRASP, Pattern recognition (psychology), Dependency (UML), Object (grammar), Natural language processing, Machine learning

Metrics

FWCI (Field Weighted Citation Impact): 0.00

Citation Normalized Percentile: 0.11

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

HZSCM: Hyperspectral Image Zero-Shot Classification via Vision-Language Models

Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection

Large Language Models as Zero-Shot Human Models for Human-Robot Interaction