Transformer-Based Punctuation Restoration Models for Indonesian with English Codeswitching Speech Transcripts

Changsong Liu; Ho Thi Nga; Yip Jia Qi; Eng Siong Chng

doi:10.1109/icaicta59291.2023.10390242

JOURNAL ARTICLE

Year: 2023 Pages: 1-6

Abstract

Punctuation restoration (PR) is a crucial task for improving the readability and usability of speech transcripts in spoken language processing. However, to the best of our knowledge, no punctuation restoration model is available for Indonesian speech transcripts, let alone for Indonesian-English code-switching speech. This paper introduces transformer-based models that restore punctuation in Indonesian speech transcripts with English code-switching. The study first investigates cross-lingual punctuation restoration for Indonesian speech using a model trained on a Malaysian-English code-switching dataset. It then compares models trained on an Indonesian dataset, on an Indonesian-English code-switching dataset, and on the combination of both plus the Malaysian-English code-switching dataset. Experimental results show that the cross-lingual model achieved F1 scores of 40.7% and 41.9% on the Indonesian and Indonesian-English code-switching test sets, respectively. These scores, obtained without any Indonesian training data, are surprisingly low given that the Malaysian and Indonesian languages are known to share similarities. When Indonesian corpora are incorporated, the best results rise to 73.3% and 62.4% on the Indonesian and Indonesian-English code-switching test sets, respectively. The results provide insight into the effectiveness of transformer-based models for punctuation restoration in Indonesian and Indonesian-English code-switching speech. The findings contribute to advancing punctuation restoration, support the possibility of cross-lingual knowledge transfer, and can benefit downstream spoken language processing tasks.
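To make the reported F1 figures concrete, the following is a minimal sketch of how punctuation restoration is commonly framed as per-word labeling and scored. It is not the authors' pipeline: the label set (comma, period, question mark), the micro-averaged F1 that ignores the no-punctuation class, the helper names text_to_labels and micro_f1, and the example sentence are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Assumed label set; the abstract does not list the punctuation marks covered.
PUNCT_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def text_to_labels(text: str) -> Tuple[List[str], List[str]]:
    """Split punctuated text into lowercase words plus one label per word.

    Each label names the punctuation mark that follows the word,
    or "O" when no mark follows. Stand-alone punctuation tokens are dropped.
    """
    words, labels = [], []
    for token in text.split():
        label = "O"
        if token[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[token[-1]]
            token = token[:-1]
        if token:
            words.append(token.lower())
            labels.append(label)
    return words, labels


def micro_f1(gold: List[str], pred: List[str]) -> float:
    """Micro-averaged F1 over punctuation labels, ignoring the "O" class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == p and g != "O")
    fp = sum(1 for g, p in pairs if p != "O" and g != p)
    fn = sum(1 for g, p in pairs if g != "O" and g != p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical Indonesian-English code-switching sentence (not from the paper's data).
words, gold = text_to_labels("Saya mau meeting besok, kamu bisa join? Oke.")
# Hypothetical model output that mistakes the question mark for a period.
pred = ["O", "O", "O", "COMMA", "O", "O", "PERIOD", "PERIOD"]
score = micro_f1(gold, pred)
print(words)
print(gold)
print(round(score, 3))
```

In this toy case two of the three reference marks are predicted correctly and one of the three predicted marks is wrong, so precision and recall are both 2/3. A transformer-based model of the kind the abstract describes would supply the predicted labels; other averaging schemes (for example macro F1 per punctuation class) would give different numbers on real data.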

Keywords:

Punctuation; Indonesian; Computer science; Transformer; Natural language processing; Artificial intelligence; Linguistics; Speech recognition; Engineering

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence