JOURNAL ARTICLE

Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning

Ankit Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori

Year: 2022 Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Pages: 7732-7736

Abstract

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, we proposed a third AVSD challenge, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task that includes temporal reasoning and our new extension of the AVSD dataset for DSTC10, for which we collected human-generated temporal reasoning data. We also introduce a baseline system built using an AV-transformer, which we released along with the new dataset. Finally, this paper introduces a new system that extends our baseline system with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performances on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network.

Keywords:
Computer science, Dialog box, Transformer, Inference, Baseline (sea), Artificial intelligence, Task (project management), Audio visual, Machine learning, Multimodal learning, Speech recognition, Human–computer interaction, Multimedia, World Wide Web

Metrics

Cited By: 17
FWCI (Field Weighted Citation Impact): 1.17
Refs: 41
Citation Normalized Percentile: 0.84

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Revisiting audio visual scene-aware dialog

Aishan Liu, Huiyuan Xie, Xianglong Liu, Zixin Yin, Shunchang Liu

Journal: Neurocomputing Year: 2022 Vol: 496 Pages: 227-237

BOOK-CHAPTER

Multimodal Prompt Learning for Audio Visual Scene-Aware Dialog

Feifei Xu, Fumiaoyue Jia, Zhou Wang

Lecture notes in computer science Year: 2024 Pages: 87-100

JOURNAL ARTICLE

DialogMCF: Multimodal Context Flow for Audio Visual Scene-Aware Dialog

Zhe Chen, Hongcheng Liu, Yu Wang

Journal: IEEE/ACM Transactions on Audio Speech and Language Processing Year: 2023 Vol: 32 Pages: 753-764