Wei Wang, Liangzhu Ge, Jingqiao Zhang, Cheng Yang
Following SimCSE, contrastive-learning-based methods have achieved state-of-the-art (SOTA) performance in learning sentence embeddings. However, unsupervised contrastive learning methods still lag far behind their supervised counterparts. We attribute this gap to the quality of positive and negative samples, and we aim to improve both. For positive samples, we propose switch-case augmentation, which flips the case of the first letter of randomly selected words in a sentence; this counteracts the intrinsic bias of pre-trained token embeddings toward word frequency, letter case, and subword tokenization. For negative samples, we sample hard negatives from the whole dataset using a pre-trained language model. Combining these two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
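The abstract describes two concrete mechanisms: flipping the leading letter's case in randomly chosen words, and retrieving hard negatives by similarity search over the dataset. The sketch below illustrates both under stated assumptions; the selection probability `p`, the function names, and the use of pre-normalized embeddings are hypothetical choices for illustration, not details taken from the paper.

```python
import random
import numpy as np

def switch_case_augment(sentence: str, p: float = 0.15, rng=random) -> str:
    """Flip the case of the first letter of randomly selected words.

    Minimal sketch of the switch-case augmentation described in the
    abstract; the selection probability `p` is an assumed parameter,
    not a value reported by the authors.
    """
    out = []
    for word in sentence.split():
        if word and word[0].isalpha() and rng.random() < p:
            # Flip only the leading character, keep the rest unchanged.
            word = word[0].swapcase() + word[1:]
        out.append(word)
    return " ".join(out)

def retrieve_hard_negatives(query_emb: np.ndarray,
                            corpus_embs: np.ndarray,
                            k: int = 5) -> np.ndarray:
    """Return indices of the k most similar corpus sentences.

    Sketch of dataset-wide hard-negative retrieval: both arguments are
    assumed to be L2-normalized sentence embeddings produced by a
    pre-trained language model, so the dot product equals cosine
    similarity. In practice the query sentence itself (and any
    near-duplicates) would be excluded from the result.
    """
    scores = corpus_embs @ query_emb
    return np.argsort(-scores)[:k]

print(switch_case_augment("The quick brown fox jumps over the lazy dog", p=0.3))
```

In this reading, the augmented sentence serves as the positive view of the original in the contrastive objective, while the retrieved nearest neighbors act as hard negatives alongside the usual in-batch negatives.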
Wenxiao Liu, Zihong Yang, Chaozhuo Li, Zijin Hong, Jianfeng Ma, Zhiquan Liu, Litian Zhang, Feiran Huang
Zhangchi Feng, Richong Zhang, Zhijie Nie
Qinyuan Cheng, Xiaogui Yang, Tianxiang Sun, Linyang Li, Xipeng Qiu