Hand gesture recognition is a classical problem in human-computer interaction research. In this paper, we propose a learning-based model for hand gesture recognition that takes RGB and depth channels as input. Segmenting the hand region is an important issue in recognizing hand gestures. First, we apply a patch embedding layer to encode all frames as a set of patches. These encoded patches are then fed into a 3D convolution network, whose 3D convolution layers learn the spatial and temporal features of the video simultaneously. The network also contains an attention block, which enhances the crucial values of the feature map. In addition, the encoded patches pass through a local decoder that recovers the depth frames of the video, which preserves the depth information in the encoded patches. Finally, we apply a linear classifier to the output of the 3D convolution network to obtain the hand gesture class. Our method achieves 80.5% accuracy on the NV-Gesture dataset and 89.6% accuracy on the SKIG dataset.
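The patch embedding step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the frame size (32x32), clip length (8 frames), patch size (8), embedding width (64), random projection weights, and the function names `patchify`, `unpatchify`, and `embed` are all assumptions chosen for illustration. The sketch splits each RGB-D frame into non-overlapping patches, projects each patch linearly, and shows that the patch layout can be inverted, which is the kind of mapping a decoder would need to turn patch tokens back into depth frames.

```python
import numpy as np


def patchify(video, patch):
    """Split a (T, H, W, C) clip into non-overlapping patch x patch tiles.

    Returns an array of shape (T, H // patch, W // patch, patch * patch * C),
    i.e. one flattened vector per tile, kept on its frame/row/column grid.
    """
    t, h, w, c = video.shape
    if h % patch or w % patch:
        raise ValueError("frame height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # Axes after reshape: (t, grid_row, patch_row, grid_col, patch_col, channel)
    tiles = video.reshape(t, gh, patch, gw, patch, c).transpose(0, 1, 3, 2, 4, 5)
    return tiles.reshape(t, gh, gw, patch * patch * c)


def unpatchify(tokens, patch, channels):
    """Inverse of patchify: rebuild (T, H, W, C) frames from flattened tiles."""
    t, gh, gw, _ = tokens.shape
    tiles = tokens.reshape(t, gh, gw, patch, patch, channels).transpose(0, 1, 3, 2, 4, 5)
    return tiles.reshape(t, gh * patch, gw * patch, channels)


def embed(tokens, weight, bias):
    """Linear patch embedding: project every flattened tile to the embedding width."""
    return tokens @ weight + bias


rng = np.random.default_rng(0)
rgbd = rng.random((8, 32, 32, 4))            # hypothetical clip: 8 frames, RGB + depth
tokens = patchify(rgbd, patch=8)             # one 256-value vector per 8x8 tile
weight = rng.normal(scale=0.02, size=(tokens.shape[-1], 64))  # untrained, random weights
encoded = embed(tokens, weight, np.zeros(64))
restored = unpatchify(tokens, patch=8, channels=4)
```

Keeping the encoded patches on a (frame, row, column) grid rather than flattening them into a sequence is a design assumption made here so that a 3D convolution could slide over time and space together; the abstract does not state how the patches are arranged.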
Mengmeng Han, Jiajun Chen, Ling Li, Yuchun Chang
Gongfa Li, Heng Tang, Ying Sun, Jianyi Kong, Guozhang Jiang, Du Jiang, Bo Tao, Shuang Xu, Honghai Liu
Hsien-I Lin, Ming-Hsiang Hsu, Wei-Kai Chen