JOURNAL ARTICLE

Improving End-to-End Sign Language Translation With Adaptive Video Representation Enhanced Transformer

Zidong Liu, Jiasong Wu, Zeyu Shen, Xin Chen, Qianyu Wu, Zhiguo Gui, Lotfi Senhadji, Huazhong Shu

Year: 2024   Journal: IEEE Transactions on Circuits and Systems for Video Technology   Vol: 34 (9)   Pages: 8327-8342   Publisher: Institute of Electrical and Electronics Engineers

Abstract

The aim of end-to-end sign language translation (SLT) is to interpret continuous sign language (SL) video sequences into coherent natural language sentences without any intermediary annotations, i.e., glosses. However, end-to-end SLT suffers from several intractable issues: (i) the loss of the temporal correspondence constraint between SL videos and glosses, and (ii) the weakly supervised sequence labeling problem between long SL videos and sentences. To address these issues, we propose an adaptive video representation enhanced Transformer (AVRET) with three extra modules: adaptive masking (AM), local clip self-attention (LCSA), and adaptive fusion (AF). Specifically, we use the first AM module to generate a special mask that adaptively drops out temporally important SL video frame representations, thereby enhancing the SL video features. We then pass the masked video feature to the Transformer encoder, which consists of LCSA and masked self-attention, to learn both clip-level and continuous video-level feature information. Finally, the encoder output feature is fused with the temporal feature of the AM module via the AF module, and the second AM module is applied to generate more robust feature representations. In addition, we add weakly supervised loss terms to constrain these two AM modules. To promote Chinese SLT research, we further construct CSL-FocusOn, a Chinese continuous SLT dataset, and share its collection method. It covers many common scenarios and provides SL sentence annotations and multi-cue images of signers. Our experiments on the CSL-FocusOn, PHOENIX14T, and CSL-Daily datasets show that the proposed method achieves competitive performance on the end-to-end SLT task without using glosses in training. The code is available at https://github.com/LzDddd/AVRET.
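The adaptive-masking idea in the abstract can be sketched as follows: score each frame representation's temporal importance and zero out the most important ones, forcing the model to learn more robust features. The function name, the norm-based scoring rule, and the drop ratio below are all illustrative assumptions, not the paper's actual AM module (which learns its mask and is trained with weakly supervised loss terms).

```python
def adaptive_mask(frames, drop_ratio=0.25):
    """Drop the temporally most 'important' frame representations.

    frames: list of T feature vectors (lists of floats), one per video frame.
    Returns (masked_frames, mask), where mask[t] == 0.0 marks a dropped frame.
    """
    T = len(frames)
    # Toy importance score: L2 norm of each frame's feature vector.
    # (The real AM module learns importance from data.)
    scores = [sum(x * x for x in f) ** 0.5 for f in frames]
    k = max(1, int(T * drop_ratio))
    # Indices of the k highest-scoring (most "important") frames.
    drop = set(sorted(range(T), key=lambda t: -scores[t])[:k])
    mask = [0.0 if t in drop else 1.0 for t in range(T)]
    masked = [[x * mask[t] for x in f] for t, f in enumerate(frames)]
    return masked, mask


# Usage: with 4 frames and drop_ratio=0.25, the single largest-norm
# frame (index 1 here) is zeroed out.
frames = [[1.0, 1.0], [3.0, 4.0], [0.5, 0.5], [2.0, 0.0]]
masked, mask = adaptive_mask(frames, drop_ratio=0.25)
```

In the full AVRET pipeline, the masked features would then feed the Transformer encoder, and the AF module would fuse the encoder output with the AM module's temporal feature.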

Keywords:
Computer science, Transformer, Machine translation, Artificial intelligence, Speech recognition, Natural language processing, Computer vision, Engineering

Metrics

Cited By: 11
FWCI (Field-Weighted Citation Impact): 8.55
References: 65
Citation Normalized Percentile: 0.95 (in top 10%)


Topics

- Hand Gesture Recognition Systems (Physical Sciences → Computer Science → Human-Computer Interaction)
- Subtitles and Audiovisual Media (Social Sciences → Arts and Humanities → Language and Linguistics)
- Hearing Impairment and Communication (Social Sciences → Psychology → Developmental and Educational Psychology)