Dalal Abdulmohsin Hammood, Hayder Jasim Habil, Ibtehal Shakir Mahmoud, Effariza Hanafi — Universiti Malaya, Kuala Lumpur
The ability to relate linguistic information across visual and auditory streams is a crucial aspect of audio-visual speech recognition (AVSR), and it underlies audio-visual correspondence tasks addressed by models such as AVE-Net and SyncNet. The technique described in this research uses feature disentanglement to handle these tasks simultaneously. Through cross-modal learning methods, the model transforms visual or auditory linguistic features into modality-independent representations. Correspondence tasks of the kind tackled by AVE-Net and SyncNet can then be performed on these derived linguistic representations. Furthermore, audio and visual outputs can be modified according to the required subject identity and linguistic content. We conduct comprehensive experiments on a range of recognition and synthesis tasks, evaluating each task separately, and show that the proposed solution successfully addresses both audio-visual learning problems. The system achieves strong results on the enhanced video, reaching 91.5% accuracy with 5 frames and increasing with the number of frames to 99.03% with 15 frames, outperforming previous methods.
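The core idea, mapping audio and visual features into a shared, modality-independent space and scoring their correspondence, can be illustrated with a toy sketch. This is not the paper's implementation: the encoder weights below are random stand-ins for trained networks, the feature dimensions (40-dim audio, 512-dim visual) are assumed, and the cosine-similarity sync score follows the general SyncNet-style formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    """Project a vector onto the unit sphere."""
    return x / np.linalg.norm(x)

# Hypothetical encoder weights: random stand-ins for trained
# modality-specific networks that map into a 128-dim shared space.
W_audio = rng.standard_normal((128, 40))    # e.g. MFCC features -> shared space
W_visual = rng.standard_normal((128, 512))  # e.g. lip-crop embedding -> shared space

def encode_audio(audio_feat):
    return l2_normalize(W_audio @ audio_feat)

def encode_visual(visual_feat):
    return l2_normalize(W_visual @ visual_feat)

def sync_score(audio_feat, visual_feat):
    # Cosine similarity between the two modality-independent embeddings;
    # a trained model would give high scores to in-sync audio/video pairs.
    return float(encode_audio(audio_feat) @ encode_visual(visual_feat))

audio = rng.standard_normal(40)
frame = rng.standard_normal(512)
score = sync_score(audio, frame)
assert -1.0 <= score <= 1.0  # cosine similarity is bounded
```

In a trained system the encoders would be optimized with a contrastive objective so that matching audio/visual pairs score near 1 and mismatched pairs score low; averaging the score over more frames (as in the 5-frame vs. 15-frame results above) reduces variance and raises accuracy.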
Chih-Chun Yang, Wan-Cyuan Fan, Cheng-Fu Yang, Yu-Chiang Frank Wang
Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata
Cong Jin, Tian Zhang, Shouxun Liu, Yun Tie, Xin Lv, Jianguang Li, Wencai Yan, Ming Yan, Qian Xu, Yicong Guan, Zhenggougou Yang
Youssef Mroueh, Etienne Marcheret, Vaibhava Goel
L. Ashok Kumar, D. Karthika Renuka, S. Lovelyn Rose, M. C. Shunmugapriya