JOURNAL ARTICLE

Context-aware Scene Graph Generation with Seq2Seq Transformers

Abstract

Scene graph generation is an important task in computer vision aimed at improving the semantic understanding of the visual world. In this task, the model needs to detect objects and predict visual relationships between them. Most existing models predict relationships in parallel, assuming that they are independent, and therefore ignore the dependencies among relationships. While there are different ways to capture these dependencies, we explore a conditional approach motivated by the sequence-to-sequence (Seq2Seq) formalism. Unlike previous research, our proposed model predicts visual relationships one at a time in an autoregressive manner by explicitly conditioning on the already predicted relationships. Drawing from translation models in NLP, we propose an encoder-decoder model built using Transformers, where the encoder captures global context and long-range interactions. The decoder then makes sequential predictions by conditioning on the scene graph constructed so far. In addition, we introduce a novel reinforcement learning-based training strategy tailored to Seq2Seq scene graph generation. By using a self-critical policy gradient training approach with Monte Carlo search, we directly optimize for the (mean) recall metrics and bridge the gap between training and evaluation. Experimental results on two public benchmark datasets demonstrate that our Seq2Seq learning approach achieves strong empirical performance, outperforming the previous state of the art, while remaining efficient in terms of training and inference time. Full code for this work is available at https://github.com/layer6ai-labs/SGG-Seq2Seq.
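As a rough illustration of the training idea described in the abstract, the sketch below computes a self-critical policy gradient loss with a recall-based reward for a single toy image. It is not the authors' implementation: the function names, the toy triples, and the use of a greedy decode as the baseline (the usual choice in self-critical training) are assumptions made for illustration, and the Monte Carlo search mentioned in the abstract is omitted.

```python
import math


def recall(predicted, ground_truth):
    """Fraction of ground-truth triples that appear among the predictions."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    return len(set(predicted) & truth) / len(truth)


def self_critical_loss(sample_log_probs, sampled_reward, baseline_reward):
    """REINFORCE loss with a baseline: -(r_sample - r_baseline) * sum(log p)."""
    advantage = sampled_reward - baseline_reward
    return -advantage * sum(sample_log_probs)


# Hypothetical ground-truth relationships for one image.
gt = [("man", "riding", "horse"), ("man", "wearing", "hat")]

# A sequence sampled from the decoder, with the log-probability of each step.
sampled = [("man", "riding", "horse"), ("man", "wearing", "hat")]
sample_log_probs = [math.log(0.2), math.log(0.5)]

# A greedy decode of the same image, used here as the baseline (assumption).
greedy = [("man", "on", "horse"), ("man", "wearing", "hat")]

r_sample = recall(sampled, gt)
r_greedy = recall(greedy, gt)
loss = self_critical_loss(sample_log_probs, r_sample, r_greedy)
```

Because the sampled sequence scores a higher recall than the baseline, the advantage is positive, and minimizing this loss raises the log-probability of the sampled relationships. A sample that scored below the baseline would be pushed down instead.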

Keywords:
Computer science, Reinforcement learning, Encoder, Artificial intelligence, Transformer, Inference, Machine learning, Graph, Theoretical computer science

Metrics

Cited By: 78
FWCI (Field Weighted Citation Impact): 3.86
Refs: 97
Citation Normalized Percentile: 0.96
Is in top 1%
Is in top 10%

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

BOOK-CHAPTER

Multi-modal Context-Aware Network for Scene Graph Generation

Junjie Ye, Bing‐Kun Bao, Zhiyi Tan

Lecture notes in computer science Year: 2023 Pages: 335-347
JOURNAL ARTICLE

Scene Graph Generation With Hierarchical Context

Guanghui Ren, Lejian Ren, Yue Liao, Si Liu, Bo Li, Jizhong Han, Shuicheng Yan

Journal: IEEE Transactions on Neural Networks and Learning Systems Year: 2020 Vol: 32 (2) Pages: 909-915
BOOK-CHAPTER

Scene Graph Generation with Geometric Context

Vishal Kumar, Albert Mundu, Satish Kumar Singh

Communications in computer and information science Year: 2022 Pages: 340-350
JOURNAL ARTICLE

Uncertainty-Aware Scene Graph Generation

Xuewei Li, Tao Wu, Guangcong Zheng, Yunlong Yu, Xi Li

Journal: Pattern Recognition Letters Year: 2022 Vol: 167 Pages: 30-37
JOURNAL ARTICLE

Distribution-aware network with context and entity attention for scene graph generation

Te‐Cheng Pan, Lulu Wang, Ruoyu Zhang, Zhengtao Yu, Yingna Li

Journal: Engineering Applications of Artificial Intelligence Year: 2025 Vol: 160 Pages: 111984-111984