JOURNAL ARTICLE

Scene-aware Human Pose Generation using Transformer

Abstract

Affordance learning considers the interaction opportunities for an actor in the scene and thus has wide application in scene understanding and intelligent robotics. In this paper, we focus on contextual affordance learning, i.e., using affordance as context to generate a reasonable human pose in a scene. Existing scene-aware human pose generation methods could be divided into two categories depending on whether using pose templates. Our proposed method belongs to the template-based category, which benefits from the representative pose templates. Moreover, inspired by recent transformer-based methods, we associate each query embedding with a pose template, and use the interaction between query embeddings and scene feature map to effectively predict the scale and offsets for each pose template. In addition, we employ knowledge distillation to facilitate the offset learning given the predicted scale. Comprehensive experiments on Sitcom dataset demonstrate the effectiveness of our method.

Keywords:
Affordance Computer science Artificial intelligence Embedding Transformer Template Human–computer interaction Computer vision Engineering

Metrics

36
Cited By
6.55
FWCI (Field Weighted Citation Impact)
38
Refs
0.96
Citation Normalized Percentile
Is in top 1%
Is in top 10%

Citation History

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Vision and Imaging
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Motion and Animation
Physical Sciences →  Engineering →  Control and Systems Engineering
© 2026 ScienceGate Book Chapters — All rights reserved.