Accessing large-scale external knowledge while maintaining a consistent understanding of real-world entities is essential for modern natural language processing (NLP) systems. This thesis investigates two fundamental capabilities that support this objective: knowledge-intensive language processing, which enables models to retrieve and integrate external information, and entity-centric language understanding, which facilitates identifying, linking, and reasoning about entities in context.

We first explore knowledge-intensive language processing through the lens of retrieval-based methods. We present a theoretical and empirical analysis of hard negatives in the Noise Contrastive Estimation (NCE) training objective, improve multi-task retrieval by promoting task specialization, and propose a retrieval-augmented generation framework that allows models to express their information needs implicitly, eliminating the need for human-specified queries.

Next, we focus on entity-centric language understanding. We introduce a novel approach that reframes entity linking as an inverse open-domain question answering problem, addressing the challenge of predicting mentions without knowing their corresponding entities and naturally extending NCE to support multi-label retrieval. We also propose a simple yet effective sequence-to-sequence model for coreference resolution, which maps input text to linearized coreference annotations and achieves strong performance with no task-specific model design.

These contributions advance the development of NLP systems that reason more effectively over external knowledge and entities, enabling stronger performance on a wide range of information-seeking and understanding tasks.
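The NCE-style contrastive objective with hard negatives mentioned above can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the thesis's implementation; the function name and the dot-product scorer are illustrative assumptions.

```python
import numpy as np

def nce_loss_with_hard_negatives(query, positive, negatives):
    """Softmax-based contrastive (NCE-style) loss over one positive
    and a set of negatives, as commonly used in dense retrieval.

    query:     (d,) query embedding
    positive:  (d,) embedding of the gold passage
    negatives: (k, d) embeddings of negative passages; "hard" negatives
               are high-scoring non-gold passages, e.g. mined by the
               retriever itself rather than sampled at random.
    Returns the negative log-likelihood of the positive candidate.
    """
    candidates = np.vstack([positive[None, :], negatives])  # (k+1, d)
    scores = candidates @ query                             # dot-product similarity
    scores = scores - scores.max()                          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])                                # positive is index 0
```

The loss shrinks as the positive outscores the negatives, so mining harder (higher-scoring) negatives yields a more informative training signal than random ones.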