We present two solutions to sentence-level sign language recognition (SLR), which requires mapping videos of sign language sentences to sequences of gloss labels. Both models use Connectionist Temporal Classification (CTC) as the classification layer, which avoids pre-segmenting sentences into individual words. The first model is based on a Long-term Recurrent Convolutional Network (LRCN); the second is a Multi-Cue Network. The first approach leverages no prior knowledge: raw frames are fed into an 18-layer LRCN with a CTC layer on top. In the second approach, three main characteristics associated with each sign (hand shape, hand position, and hand movement) are extracted using MediaPipe. The 2D hand-shape landmarks are used to build a skeleton of the hands, which is then fed to a CONV-LSTM model. Hand locations and hand positions, expressed as distances relative to the head, are fed to separate LSTMs. All three sources of information are then integrated into a Multi-Cue Network with a CTC classification layer. We evaluate the proposed models on RWTH-PHOENIX-Weather. After an extensive search over model hyper-parameters, including the number of feature maps, input size, batch size, sequence length, LSTM memory cells, regularization, and dropout, we achieve a relatively low Word Error Rate (WER).
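The two ideas that tie the pipeline together, CTC decoding without pre-segmentation and evaluation by Word Error Rate, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `ctc_greedy_collapse` and `word_error_rate`, the choice of blank index 0, and the greedy (best-path) decoding strategy are all assumptions made for illustration, and the gloss strings are made-up examples rather than dataset entries.

```python
def ctc_greedy_collapse(frame_labels, blank=0):
    """Best-path CTC decoding step: merge repeated labels, then drop blanks.

    `frame_labels` holds one predicted label index per video frame
    (assumed here to be the argmax of the network output).
    """
    decoded = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


def word_error_rate(reference, hypothesis):
    """Edit distance between gloss sequences, divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # Dynamic-programming row for the empty reference prefix.
    prev_row = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        row = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution or match
        prev_row = row
    return prev_row[-1] / len(reference)


if __name__ == "__main__":
    # Hypothetical per-frame predictions; 0 is the assumed blank index.
    print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
    # Hypothetical gloss sequences, for illustration only.
    print(word_error_rate(["REGEN", "MORGEN", "NORD"], ["REGEN", "NORD"]))
```

Because a blank separates the two runs of label 1 in the example, the collapse step keeps both occurrences, which is how CTC can represent a repeated gloss without explicit word boundaries.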