Semantic sentence embedding is a pivotal component of natural language processing. It maps sentences from a discrete linguistic space into a high-dimensional vector space while preserving their semantic content, which makes the resulting representations instrumental in numerous downstream applications. The core of an embedding system is its embedding model. Early research on sentence embeddings predominantly adopted the ideas of word2vec, whereas more recent work builds on pre-trained language models (PLMs). Most recently, contrastive learning techniques have been applied to sentence embedding: representations are learned by drawing positive examples together while pushing negative examples apart. Numerous contrastive sentence representation methods have since been developed, and they have progressively become the prevailing approach for sentence embeddings.

Despite their efficacy, several challenges persist. A key limitation of contemporary embedding methods is their dependence on supervised data, which is often scarce or non-existent in in-domain settings. This absence of in-domain data hinders fine-tuning for specific tasks or domains, thereby diminishing transferability in out-of-distribution scenarios. In addition, contrastive learning methods often suffer from representation degeneration, in which the embeddings fail to capture the complete semantic information of a sentence; this limits the model's representational ability and can lead to subpar performance on downstream NLP tasks. Moreover, since semantics is intrinsically tied to human comprehension of knowledge, it is challenging to devise analytical solutions for enhancing model performance.
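The contrastive objective described above can be illustrated with a minimal sketch. The function below is a generic InfoNCE-style loss, not the exact formulation of any particular method discussed in this thesis: each sentence embedding is pulled toward the embedding of its positive view (the diagonal entries) and pushed away from the other in-batch sentences, which serve as negatives.

```python
import numpy as np

def contrastive_loss(z1, z2, temperature=0.05):
    """InfoNCE-style loss: z1[i] and z2[i] are embeddings of two views
    of the same sentence; all other rows of z2 act as negatives."""
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature          # (batch, batch) similarities
    # Row-wise log-softmax; positives sit on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: a batch of 4 sentences embedded in 8 dimensions.
rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```

Minimizing this loss raises the similarity of positive pairs relative to negatives, which is the mechanism by which contrastive methods shape the embedding space.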
The crux of the problem lies in formulating a more effective learning framework for sentence embeddings, and aligning that framework more closely with human semantic understanding remains a long-term objective. In this thesis, we tackle the aforementioned issues in turn. First, we examine, for the first time, the transferability of sentence embeddings in unsupervised settings and out-of-domain applications, and find that current popular methods exhibit low transferability. We therefore introduce a novel framework, BlendCSE, designed to generate embeddings with enhanced transferability. Second, we improve recent contrastive sentence embeddings by investigating their augmentation strategies and addressing feature collapse. We pinpoint a noise issue in the embeddings and a rank bottleneck issue in the data samples, and propose a dimensional contrastive learning approach to tackle both. Finally, we explore knowledge distillation to further improve baseline sentence embedding performance. We observe that the standard knowledge distillation framework yields only marginal improvement over the teacher: the distillation signal is derived from similarity logits, whose variance leads to overfitting. To mitigate this variance, we put forward a Group-P shuffling regulation strategy and a teacher ensemble strategy. Both approaches significantly narrow the loss gap between the training and test sets, indicating a robust capability to prevent overfitting.

In conclusion, this dissertation identifies the transferability, feature collapse, and overfitting issues of conventional embedding models and proposes multiple strategies to counteract them. Empirical evidence confirms the effectiveness of the proposed methods in resolving these problems, leading to substantial improvements over the baseline performance.
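To make the distillation setting concrete, the sketch below shows a generic formulation of distilling similarity logits: the student is trained to match the teacher's row-wise similarity distribution over a batch via a KL divergence. This is an illustrative baseline only; the thesis's Group-P shuffling and teacher ensemble strategies are modifications on top of such a framework, and their details are not reproduced here.

```python
import numpy as np

def _softmax(x, tau):
    # Temperature-scaled, numerically stable row-wise softmax.
    x = x / tau
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def distill_loss(student_sim, teacher_sim, tau=0.05):
    """Row-wise KL(teacher || student) between similarity distributions.

    student_sim, teacher_sim: (batch, batch) similarity-logit matrices
    produced by the student and teacher embedding models."""
    p = _softmax(teacher_sim, tau)
    q = _softmax(student_sim, tau)
    eps = 1e-12  # avoid log(0)
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))
```

Because the target distribution comes from a single teacher's similarity logits, its sample-level variance propagates directly into the student's training signal, which is one way to see why regularizing or ensembling the teacher side can curb overfitting.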