Reyjohn R. Frias, Ruji P. Medina, Ariel M. Sison
Bilingualism is a common linguistic phenomenon that poses a challenge for opinion mining. Early methods in Cross-lingual Sentiment Analysis (CLSA), based on machine translation, parallel corpora, and bilingual sentiment lexicons, suffer from translation errors, limited vocabulary coverage, and dependence on extensive parallel data. Hence, this study examined the effectiveness of Cross-lingual Word Embeddings (CLWE) for sentiment analysis of a code-mixed Filipino-English corpus. A large-scale, manually annotated code-mixed dataset containing stakeholders' feedback on Higher Education Institutions' services and infrastructure was developed to address resource scarcity. Several pre-trained transformer-based CLWE methods, namely mBERT, XLM-R, and XLM-T, were employed to represent words from the two languages in the same vector space and obtain cross-lingual embeddings. An Attention-based BiLSTM-CNN neural architecture, the baseline model from previous work, was fine-tuned on these cross-lingual embeddings to perform sentiment analysis of the code-mixed Filipino-English corpus. The experimental results demonstrate that XLM-T achieved the highest performance, with 91.30% accuracy, 90.36% precision, 90.92% recall, and a 90.61% F1-score. Thus, employing cross-lingual word embeddings proved effective, increasing accuracy by up to 10.02% over the baseline model, which uses word embeddings with no cross-lingual alignment.
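The Attention-based BiLSTM-CNN classifier described above can be sketched as follows. This is a minimal illustration operating on precomputed cross-lingual token embeddings (e.g. 768-dimensional XLM-R vectors); the hidden sizes, kernel width, number of sentiment classes, and exact layer ordering are assumptions for illustration, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class AttnBiLSTMCNN(nn.Module):
    """Sketch of an attention-based BiLSTM-CNN sentiment classifier
    over precomputed cross-lingual embeddings (sizes illustrative)."""
    def __init__(self, emb_dim=768, hidden=128, n_classes=3):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # per-token attention score
        self.conv = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                  # x: (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)              # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)        # attention weights over time
        h = h * w                          # attention-weighted hidden states
        c = torch.relu(self.conv(h.transpose(1, 2)))  # (batch, 64, seq_len)
        c = c.max(dim=2).values            # global max-pool over time
        return self.fc(c)                  # (batch, n_classes) logits

# Usage: a batch of 4 sentences of 20 tokens with random stand-in embeddings
model = AttnBiLSTMCNN()
logits = model(torch.randn(4, 20, 768))
print(logits.shape)   # torch.Size([4, 3])
```

In practice the random tensor would be replaced by contextual embeddings extracted from mBERT, XLM-R, or XLM-T, so that Filipino and English tokens share one vector space before the classifier is fine-tuned.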