Punctuation restoration (PR) is a crucial task for improving the readability and usability of speech transcripts in spoken language processing. However, to the best of our knowledge, no punctuation restoration model is available for Indonesian speech transcripts, let alone for Indonesian-English code-switch speech. This paper introduces transformer-based models that restore punctuation in Indonesian speech transcripts containing English code-switching. The study first investigates cross-lingual punctuation restoration for Indonesian speech using a model trained on a Malaysian-English code-switch dataset. It then compares models trained on an Indonesian dataset, an Indonesian-English code-switch dataset, and the combination of both plus the Malaysian-English code-switch dataset. Experimental results show that the cross-lingual model achieved F1 scores of 40.7% and 41.9% on the Indonesian and Indonesian-English code-switch test sets, respectively. These scores, obtained without any Indonesian training data, are surprisingly low given that the Malaysian and Indonesian languages are known to share similarities. When Indonesian corpora are incorporated, the best results rise to 73.3% and 62.4% on the Indonesian and Indonesian-English code-switch test sets, respectively. The results provide insight into the effectiveness of transformer-based models for punctuation restoration in Indonesian and Indonesian-English code-switch speech. The findings contribute to advancing punctuation restoration, support the possibility of cross-lingual knowledge transfer, and can enhance downstream spoken language processing tasks.
Alp Öktem, Mireia Farrús, Antonio Bonafonte
Mehmet Efe Yuzuguler, C. Okan Sakar