Archana Chaudhari, Pranav Kushare
Hinglish is a hybrid mixture of Hindi and English text written in Latin script. According to a survey by Milestone Localization, 58% of technology users in India prefer reading and sending social media messages in Hinglish. Sentence embeddings are vector representations of sentences in numerical form that computers can understand, and many applications depend on them. The current state-of-the-art sentence embedding model, SBERT (Sentence-BERT), is trained on specific English supervised labeled datasets, STS (semantic textual similarity) and NLI (natural language inference). These datasets are in paired format and require time-consuming and expensive human annotation, so building such models for low-resourced languages like Hindi becomes hard because there aren't enough specialized datasets. The proposed work is a slight modification of the unsupervised model SimCSE (Simple Contrastive Learning of Sentence Embeddings), introduced in 2021 by Tianyu Gao et al. The work revolves around fine-tuning an unsupervised embedding model on approximately 10,000 Hinglish sentences. We evaluated several transformers with SimCSE on a Hinglish transliterated version of the standard STS data. Of all the transformers, the DistilBERT model performed best, with a Spearman's similarity score of 0.60. A comparative analysis of the embedding quality of different transformer models is also discussed.
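As a rough sketch of the kind of pipeline this abstract describes, the snippet below fine-tunes a transformer with the unsupervised SimCSE objective using the sentence-transformers library. The base model, the placeholder corpus, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of unsupervised SimCSE fine-tuning with the
# sentence-transformers library. Model choice, corpus, and hyperparameters
# are illustrative assumptions, not the authors' exact setup.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# DistilBERT is one of the transformers the abstract compares.
model = SentenceTransformer("distilbert-base-uncased")

# Placeholder standing in for the ~10,000-sentence Hinglish corpus.
hinglish_sentences = [
    "kal movie dekhne chalte hain",
    "mujhe yeh song bahut pasand hai",
    "office ka kaam abhi tak khatam nahi hua",
]

# Unsupervised SimCSE: each sentence is paired with itself; the two encoder
# passes differ only through dropout, which acts as the positive-pair
# augmentation, while other sentences in the batch serve as negatives.
train_examples = [InputExample(texts=[s, s]) for s in hinglish_sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("simcse-hinglish-distilbert")
```

Evaluation would then embed the transliterated STS sentence pairs with the fine-tuned model and report the Spearman correlation between their cosine similarities and the gold similarity scores, which is where the 0.60 figure above comes from.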
Zhengfeng Zhang, Peichao Lai, Ruiqing Wang, Feiyang Ye, Yilei Wang
Zhuofeng Wu, Chaowei Xiao, V. G. Vinod Vydiswaran
Tianyu Zong, Bingkang Shi, Huimin Yi, Jungang Xu