We propose ClusCSE, an unsupervised sentence embedding framework. Contrastive learning has been widely studied for learning universal sentence embeddings in natural language processing. Contrastive methods typically apply well-designed transformations to raw sentences to construct positive pairs and combine different raw sentences to construct negative pairs. Following this paradigm, unsup-SimCSE advanced the state of the art in unsupervised sentence embeddings by using dropout as a minimal data augmentation strategy. Its training objective maximizes the similarity of positive pairs while minimizing the similarity of negative pairs. However, even different raw sentences can be highly semantically similar, so simply pushing apart all negative pairs of embeddings is unrealistic: sentence embeddings learned by unsup-SimCSE may encode false relationships between sentences. To alleviate this, we introduce online clustering into unsup-SimCSE and propose ClusCSE. Beyond comparing individual sentences, ClusCSE also enforces consistency between cluster assignments, making the embeddings aware of groups of similar sentences. Evaluations on semantic textual similarity tasks show that ClusCSE outperforms unsup-SimCSE, improving average Spearman's correlation by 1.19% on BERT-base.
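To make the contrastive objective concrete, the following is a minimal sketch of the SimCSE-style setup described above: the same batch of sentences is encoded twice, dropout noise makes the two views of each sentence a positive pair, and all other in-batch sentences serve as negatives under an InfoNCE loss. The encoder, the function names, and the temperature value here are illustrative stand-ins, not the paper's actual implementation; ClusCSE's additional cluster-assignment consistency term is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(batch, drop_p=0.1):
    """Toy 'encoder': a fixed linear projection followed by random dropout,
    standing in for BERT with dropout as the minimal data augmentation."""
    W = np.linspace(-1.0, 1.0, batch.shape[1] * 8).reshape(batch.shape[1], 8)
    h = batch @ W
    mask = rng.random(h.shape) > drop_p          # dropout noise -> positive pair
    return h * mask / (1.0 - drop_p)

def info_nce(z1, z2, tau=0.05):
    """InfoNCE loss: a sentence's two dropout views are a positive pair;
    all other in-batch sentences act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                        # pairwise cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives lie on the diagonal

X = rng.random((4, 6))                           # 4 toy "sentence" features
loss = info_nce(encode(X), encode(X))            # two dropout views of X
print(float(loss))
```

Minimizing this loss pulls the two dropout views of each sentence together while pushing apart different sentences; the abstract's point is that the second effect is too blunt when distinct sentences are semantically close, which is what the clustering term is meant to correct.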