Disentangled Representation Learning for Cross-Modal Biometric Matching

Hailong Ning; Xiangtao Zheng; Xiaoqiang Lu; Yuan Yuan

doi:10.1109/tmm.2021.3071243

JOURNAL ARTICLE

Year: 2021
Journal: IEEE Transactions on Multimedia
Vol: 24
Pages: 1763-1774
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Cross-modal biometric matching (CMBM) aims to determine the corresponding voice from a face, or to identify the corresponding face from a voice. Many recent CMBM methods work by narrowing the distance between the features of the two modalities. However, these methods ignore whether the two modal features can actually be aligned. Because each feature is extracted under identity supervision from single-modal data, it reflects only the identity information of that single modality. To address this problem, a disentangled representation learning method is proposed to separate the alignable latent identity factors from the nonalignable modality-dependent factors for CMBM. The proposed method consists of two main steps: 1) feature extraction and 2) disentangled representation learning. First, an image feature extraction network is adopted to obtain face features, and a voice feature extraction network is applied to learn voice features. Second, a disentangled latent variable is explored to separate the latent identity factors that are shared across the modalities from the modality-dependent factors. The modality-dependent factors are filtered out, while the latent identity factors from the two modalities are pulled closer together so that they align on the same identity information. The disentangled latent identity factors are then treated as pure identity information that bridges the two modalities for cross-modal verification, 1:N matching, and retrieval. Note that the proposed method learns the identity information from the input face images and voice segments with only identity labels as supervision. Extensive experiments on the challenging VoxCeleb dataset demonstrate that the proposed method outperforms state-of-the-art methods.
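The matching stage described in the abstract (compare only the identity factors, discard the modality-dependent ones) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the fixed slicing used in place of a learned disentangling network, the cosine score, the threshold, and the feature values are all assumptions made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_factors(feature, identity_dim):
    """Hypothetical stand-in for a learned disentangler.

    Assumes the first `identity_dim` entries are the shared identity
    factors and the remainder are modality-dependent factors.
    """
    return feature[:identity_dim], feature[identity_dim:]


def verify(face_feat, voice_feat, identity_dim, threshold=0.5):
    """Cross-modal verification on identity factors only.

    The threshold is an arbitrary illustrative value.
    """
    face_id, _ = split_factors(face_feat, identity_dim)
    voice_id, _ = split_factors(voice_feat, identity_dim)
    return cosine(face_id, voice_id) >= threshold


def match_1_to_n(probe_feat, gallery_feats, identity_dim):
    """1:N matching: index of the gallery item closest in identity space."""
    probe_id, _ = split_factors(probe_feat, identity_dim)
    scores = [
        cosine(probe_id, split_factors(g, identity_dim)[0])
        for g in gallery_feats
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up toy features: 2 identity dims followed by 2 modality-dependent dims.
voice = [0.9, 0.1, 5.0, -3.0]
faces = [
    [0.1, 0.9, 5.0, -3.0],   # different identity, similar nuisance factors
    [0.8, 0.2, -4.0, 2.0],   # same identity, different nuisance factors
]

best = match_1_to_n(voice, faces, identity_dim=2)
# Baseline for contrast: compare the full, entangled vectors.
naive_best = max(range(len(faces)), key=lambda i: cosine(voice, faces[i]))
```

In this toy data the full-vector comparison is dominated by the modality-dependent entries and picks the wrong face, while the identity-only comparison picks the face with the matching identity factors. That contrast is the intuition behind filtering out the nonalignable factors before matching.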

Keywords:

Computer science; Identity (music); Modal; Biometrics; Matching (statistics); Modality (human–computer interaction); Feature (linguistics); Feature learning; Representation (politics); Feature extraction; Pattern recognition (psychology); Latent variable; Artificial intelligence; Modalities; Face (sociological concept); Mathematics; Linguistics; Statistics

Metrics

FWCI (Field Weighted Citation Impact): 3.27

Citation Normalized Percentile: 0.93

Topics

Face recognition and analysis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Biometric Identification and Security

Physical Sciences → Computer Science → Signal Processing

Related Documents

DRLHomo: Disentangled Representation Learning for Cross-Modal Homography Estimation

Quaternion Representation Learning for cross-modal matching

Detach and Enhance: Learning Disentangled Cross-modal Latent Representation for Efficient Face-Voice Association and Matching

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

SSDMM-VAE: variational multi-modal disentangled representation learning