We address the problem of low-resource machine learning in the form of few-shot learning (FSL) applied to word recognition in both mono-lingual and cross-lingual settings. Recently, we proposed an adaptation of an FSL framework, matching networks (MN), to a suite of speech recognition tasks, such as multi-speaker small-to-medium vocabulary word recognition and frame-wise phoneme recognition, under mel-spectrogram and single-frame feature representations. In this paper, we extend this FSL adaptation of MN to multi-speaker isolated word recognition (IWR), in a framework termed MN-IWR. The IWR task is set specifically in a 'command-and-control' (C&C) scenario, which requires only a few examples (e.g. up to 20) for a target IWR classification task whose vocabulary is defined dynamically. Moreover, the proposed MN-IWR framework addresses cross-domain and cross-lingual settings, defined as follows: a model is trained on a possibly large set of words in a source language and used for inference on either a cross-domain task (a vocabulary of words different from the training vocabulary) or a cross-lingual task (a vocabulary of words from a target language different from the source language). In this work, we present the main formulation of the MN-IWR framework and its adaptation from source to target tasks. We give results on a vocabulary of TIMIT words in a mono-lingual setting and on English, Kannada and Tamil words in cross-lingual settings, and report that the proposed MN-IWR FSL paradigm achieves very high performance, exceeding that of conventional IWR classification without the FSL advantage of the MN formulation.
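As a rough illustration of the few-shot inference step described above, the following sketch classifies one query word by attending over a small labelled support set, in the style of a generic matching network (cosine similarity followed by a softmax). It is a minimal sketch under stated assumptions, not the MN-IWR implementation: the embeddings are assumed to be precomputed by some trained encoder, and the function name `matching_network_predict` and the toy 2-dimensional vectors are hypothetical.

```python
import numpy as np

def matching_network_predict(support_emb, support_labels, query_emb, num_classes):
    """Classify one query by attention over a labelled support set.

    Assumption: embeddings come from an already trained encoder,
    which is not shown here.
    """
    # Cosine similarity between the query and every support example.
    s = support_emb / np.linalg.norm(support_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = s @ q
    # Softmax over the support examples gives the attention weights.
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()
    # Class probability = total attention mass on that class's examples.
    probs = np.zeros(num_classes)
    np.add.at(probs, support_labels, weights)
    return probs

# Toy support set: two words (classes 0 and 1), two examples each.
support_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = np.array([0, 0, 1, 1])
query_emb = np.array([0.8, 0.2])  # hypothetical embedding of an unseen utterance

probs = matching_network_predict(support_emb, support_labels, query_emb, 2)
predicted = int(np.argmax(probs))
print(predicted, probs)
```

Because the vocabulary enters only through the support set, swapping in examples of different words (or words from another language) changes the target task without retraining the classifier, which is the property the cross-domain and cross-lingual settings rely on.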