Archana Chaudhari, Pranav Kushare
Hinglish is a hybrid mixture of Hindi and English text written in Latin script. According to a survey by Milestone Localization, 58% of technology users in India prefer reading and sending social media messages in Hinglish. Sentence embeddings are vector representations of sentences in numerical form that computers can understand, and many applications depend on them. The current state-of-the-art sentence embedding model, SBERT (Sentence-BERT), is trained on specific English supervised labeled datasets, STS (semantic textual similarity) and NLI (natural language inference). These datasets are in paired format and require time-consuming and expensive human annotation, so building such models for low-resourced languages like Hindi becomes hard because there aren't enough specialized datasets. The proposed work is a slight modification of the unsupervised model SimCSE (Simple Contrastive Learning of Sentence Embeddings), introduced in 2021 by Tianyu Gao et al. The work revolves around fine-tuning an unsupervised embedding model on approximately 10,000 Hinglish sentences. We evaluated several transformers with SimCSE on a Hinglish transliterated version of the standard STS data. Of all the transformers, the DistilBERT model performed best, with a Spearman's similarity score of 0.60. A comparative analysis of the embedding quality of different transformer models is also discussed.
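As a rough sketch of the kind of pipeline this abstract describes, the snippet below fine-tunes a transformer with the unsupervised SimCSE objective using the sentence-transformers library. The base model, the placeholder corpus, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of unsupervised SimCSE fine-tuning with the
# sentence-transformers library. Model choice, corpus, and hyperparameters
# are illustrative assumptions, not the authors' exact setup.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# DistilBERT is one of the transformers the abstract compares.
model = SentenceTransformer("distilbert-base-uncased")

# Placeholder standing in for the ~10,000-sentence Hinglish corpus.
hinglish_sentences = [
    "kal movie dekhne chalte hain",
    "mujhe yeh song bahut pasand hai",
    "office ka kaam abhi tak khatam nahi hua",
]

# Unsupervised SimCSE: each sentence is paired with itself; the two encoder
# passes differ only through dropout, which acts as the positive-pair
# augmentation, while other sentences in the batch serve as negatives.
train_examples = [InputExample(texts=[s, s]) for s in hinglish_sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("simcse-hinglish-distilbert")
```

Evaluation would then embed the transliterated STS sentence pairs with the fine-tuned model and report the Spearman correlation between their cosine similarities and the gold similarity scores, which is where the 0.60 figure above comes from.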
Zhengfeng Zhang, Peichao Lai, Ruiqing Wang, Feiyang Ye, Yilei Wang
Zhuofeng Wu, Chaowei Xiao, V. G. Vinod Vydiswaran
Tianyu Zong, Bingkang Shi, Huimin Yi, Jungang Xu