Lahiru Samarakoon, Ivan W. H. Fung
Self-attention has become a vital component of end-to-end (E2E) automatic speech recognition (ASR). The convolution-augmented Transformer (Conformer) with relative positional encoding (RPE) has achieved state-of-the-art performance. This paper proposes a positional encoding (PE) mechanism called Scaled Untied RPE that unties the feature-position correlations in the self-attention computation and computes feature correlations and positional correlations separately, using different projection matrices. In addition, we propose to scale the feature correlations by the positional correlations; the aggressiveness of this multiplicative interaction is configured by a parameter called the amplitude. Moreover, we show that the PE matrix can be sliced to reduce the number of model parameters. Our results on the National Speech Corpus (NSC) show that Transformer encoders with Scaled Untied RPE achieve relative improvements of 1.9% in accuracy and up to 50.9% in latency over a Conformer baseline.
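To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of a single-head self-attention score in the spirit of Scaled Untied RPE: feature (content) correlations and positional correlations are computed with separate projection matrices, the feature scores are scaled multiplicatively by the positional scores through an amplitude parameter, and the relative PE table is sliced to a maximum relative distance. The module name, the tanh-gated form of the multiplicative interaction, and the clip-based slicing are illustrative assumptions, not the paper's exact formulation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledUntiedRPEAttention(nn.Module):
    # Illustrative single-head self-attention with a Scaled-Untied-RPE-style score.
    def __init__(self, d_model: int, max_rel_dist: int = 64, amplitude: float = 1.0):
        super().__init__()
        self.amplitude = amplitude          # controls aggressiveness of the multiplicative interaction
        self.max_rel_dist = max_rel_dist    # PE matrix is "sliced" to 2*max_rel_dist+1 rows
        # Feature (content) projections.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # Separate projections for positional correlations (untied from the content projections).
        self.u_q = nn.Linear(d_model, d_model, bias=False)
        self.u_k = nn.Linear(d_model, d_model, bias=False)
        # Sliced relative positional encoding table.
        self.rel_pe = nn.Parameter(torch.randn(2 * max_rel_dist + 1, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        b, t, d = x.shape
        scale = 1.0 / math.sqrt(d)

        # Feature correlations from the content projections.
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        content_scores = torch.matmul(q, k.transpose(-2, -1)) * scale          # (b, t, t)

        # Relative distances clipped to the sliced PE table, then embedded.
        pos = torch.arange(t, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        pe = self.rel_pe[rel + self.max_rel_dist]                              # (t, t, d)

        # Positional correlations from their own projection matrices.
        pos_scores = torch.einsum("ijd,ijd->ij", self.u_q(pe), self.u_k(pe)) * scale  # (t, t)

        # Scale the feature correlations by the positional correlations;
        # the amplitude sets how strongly positions modulate the content scores (assumed form).
        scores = content_scores * (1.0 + self.amplitude * torch.tanh(pos_scores))

        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, v)

Under these assumptions, setting amplitude to 0 falls back to content-only attention, while larger values let the positional correlations modulate the feature correlations more aggressively; shrinking max_rel_dist slices the PE table and reduces its parameter count.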