JOURNAL ARTICLE

Transformer-Based Punctuation Restoration Models for Indonesian with English Codeswitching Speech Transcripts

Abstract

Punctuation restoration (PR) is a crucial task for improving the readability and usability of speech transcripts in spoken language processing. However, to the best of our knowledge, no punctuation restoration model is available for Indonesian speech transcripts, particularly for Indonesian-English code-switched speech. This paper introduces transformer-based models that restore punctuation in Indonesian speech transcripts containing English code-switching. The study first investigates cross-lingual punctuation restoration for Indonesian speech using a model trained on a Malaysian-English code-switched dataset. It then compares models trained on an Indonesian dataset, on an Indonesian-English code-switched dataset, and on the combination of both plus the Malaysian-English code-switched dataset. Experimental results show that the cross-lingual model achieved F1 scores of 40.7% and 41.9% on the Indonesian and Indonesian-English code-switched test sets, respectively. With no Indonesian data used for training, these results are surprisingly low, even though the Malaysian and Indonesian languages are known to share similarities. When Indonesian corpora are incorporated, the best results rose to 73.3% and 62.4% on the Indonesian and Indonesian-English code-switched test sets, respectively. The results provide insight into the effectiveness of transformer-based models for punctuation restoration in Indonesian and Indonesian-English code-switched speech. The findings contribute to the field of punctuation restoration, support the possibility of cross-lingual knowledge transfer, and can benefit downstream spoken language processing tasks.
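The F1 scores reported in the abstract can be read against a minimal sketch of how such a score is commonly computed when punctuation restoration is framed as per-token labeling. The abstract does not describe the evaluation protocol, so everything here is an assumption for illustration: the label names (`O`, `COMMA`, `PERIOD`), the micro-averaging over punctuation classes, and the helper name `punctuation_f1` are hypothetical and not taken from the paper.

```python
def punctuation_f1(gold, pred, ignore="O"):
    """Micro-averaged precision, recall and F1 over punctuation labels.

    gold, pred: equal-length sequences with one label per token, naming the
    punctuation mark that follows the token ("O" means no punctuation).
    The label scheme is an assumption, not the paper's.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore and p == g:
            tp += 1          # correct punctuation mark predicted
        else:
            if p != ignore:
                fp += 1      # predicted a mark that is wrong or not in gold
            if g != ignore:
                fn += 1      # gold mark was missed or mislabeled

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: five tokens, one label per token (hypothetical data).
gold = ["O", "COMMA", "O", "O", "PERIOD"]
pred = ["O", "COMMA", "O", "COMMA", "O"]
precision, recall, f1 = punctuation_f1(gold, pred)
```

In this toy case one comma is correct, one comma is spurious, and the final period is missed, so precision, recall and F1 all come out to 0.5. Excluding the `O` class from the counts keeps the score from being dominated by the many tokens that carry no punctuation.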

Keywords:
Punctuation, Indonesian, Computer science, Transformer, Natural language processing, Artificial intelligence, Linguistics, Speech recognition, Engineering

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 23
Citation Normalized Percentile: 0.19

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
