JOURNAL ARTICLE

Contrastive Learning Based Unsupervised Sentence Embeddings for Hinglish

Abstract

Hinglish is a hybrid of Hindi and English written in Latin script. According to a survey by the company Milestone Localization, 58% of technology users in India prefer reading and sending social media messages in Hinglish. A sentence embedding is a numerical vector representation of a sentence that a computer can process, and many applications depend on such embeddings. The current state-of-the-art sentence embedding model, SBERT (Sentence-BERT), is trained on supervised, labeled English datasets for STS (semantic textual similarity) and NLI (natural language inference). These datasets are in paired format and require time-consuming and expensive human annotation. Building such models for low-resource languages like Hindi is therefore hard, because there are not enough specialized datasets. The proposed work is a slight modification of the unsupervised model SimCSE (Simple Contrastive Learning of Sentence Embeddings), introduced in 2021 by Tianyu Gao et al. The work revolves around fine-tuning an unsupervised embedding model on approximately 10,000 Hinglish sentences. We evaluated several transformer models based on SimCSE on a Hinglish transliterated version of standard STS data. Of all the transformers, the DistilBERT model performed best, with a Spearman's rank correlation of 0.60. A comparative analysis of the embedding quality of the different transformer models is also discussed.
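
The STS evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual STS protocol, in which the cosine similarity of each sentence pair's embeddings is compared against human similarity ratings using Spearman's rank correlation. The abstract does not spell out the authors' exact procedure, so this is an illustration rather than their implementation. The toy vectors and gold scores below are invented, and the helper names (cosine, average_ranks, spearman) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: embedding pairs for four sentence pairs, plus
# made-up human similarity ratings on a 0 to 5 scale (assumption).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.2]),
]
gold = [4.8, 3.0, 1.5, 0.2]

predicted = [cosine(u, v) for u, v in pairs]
score = spearman(predicted, gold)
print(f"Spearman correlation: {score:.2f}")
```

Because Spearman's correlation depends only on rank order, a model scores well when it orders sentence pairs the same way human annotators do, even if its raw cosine values sit on a different scale from the ratings.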

Keywords:
Computer science, Natural language processing, Artificial intelligence, Transformer, Sentence embedding, Language model
