Hao Zhang, Yixiang Sun, Zenghui Liu, Qiyuan Liu, Xiyao Liu, Ming Jiang, Gerald Schafer, Hui Fang
Sign language translation (SLT) has attracted significant interest from both research and industry, as it enables convenient communication with the deaf-mute community. While recent transformer-based models have improved sign translation performance, it remains under-explored how to design an efficient transformer-based deep network architecture that effectively extracts joint visual-text features by exploiting multi-level spatial and temporal contextual information. In this paper, we propose the heterogeneous attention-based transformer (HAT), a novel SLT model that generates attention from diverse spatial and temporal contextual levels. Specifically, the proposed lightweight dual-stream sparse attention-based module yields more effective visual-text representations than conventional transformers. Extensive experiments demonstrate that HAT achieves state-of-the-art performance on the challenging PHOENIX2014T benchmark dataset with a BLEU-4 score of 25.33 on the test set.
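To make the idea of a dual-stream sparse attention module concrete, the following is a minimal numpy sketch, not the authors' implementation: it assumes a top-k form of sparse attention (each query attends only to its k highest-scoring keys) and a simple two-stream layout in which a visual self-attention stream runs alongside a text-queries-visual cross-attention stream. All function names and the fusion choice here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(Q, K, V, k=2):
    """Top-k sparse scaled dot-product attention (an assumed sparsity
    scheme): each query keeps only its k highest-scoring keys."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                # (T_q, T_k)
    kth = np.sort(scores, axis=-1)[:, -k][:, None]  # per-row k-th largest
    masked = np.where(scores >= kth, scores, -np.inf)
    return softmax(masked, axis=-1) @ V

def dual_stream(visual, text, k=2):
    """Two attention streams over shared visual features:
    visual self-attention and text-to-visual cross-attention."""
    vis_out = sparse_attention(visual, visual, visual, k)
    cross_out = sparse_attention(text, visual, visual, k)
    return vis_out, cross_out

rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 8))  # 6 video frames, feature dim 8
text = rng.normal(size=(4, 8))    # 4 text tokens, feature dim 8
vis_out, cross_out = dual_stream(visual, text, k=2)
```

In this sketch the two streams simply return separate tensors; a real model would fuse them (e.g. by concatenation or gating) before the decoder, a detail the abstract does not specify.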