Abstract

Schema matching over relational data has been studied for more than two decades. However, the state-of-the-art methods do not address key modern-day challenges encountered in real customer scenarios, namely: 1) no access to the source (customer) data due to privacy constraints, 2) target schema with a much larger number of entities and attributes compared to the source schema, and 3) different but semantically equivalent entity and attribute names in the source and target schemata. In this paper, we address these shortcomings. Using real-world customer schemata, we demonstrate that existing linguistic matching approaches have low accuracy. Next, we propose the Learned Schema Mapper (LSM), a novel linguistic schema matching system that leverages the natural language understanding capabilities of pre-trained language models to improve the overall accuracy. Combining this with active learning and a smart attribute selection strategy that selects the most informative attributes for users to label, LSM can significantly reduce the overall human labeling cost. Experimental results demonstrate that users can correctly match their full schema while saving as much as 81% of the labeling cost compared to manual labeling.
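The abstract describes two ideas that can be sketched concretely: ranking target attributes for each source attribute by the similarity of their name embeddings, and picking the most informative (most ambiguous) source attribute for the user to label next. The sketch below is a hypothetical illustration, not the paper's implementation: toy character-trigram vectors stand in for pre-trained language-model embeddings, and the "most informative" heuristic is simple margin-based uncertainty sampling (smallest gap between the top two candidate scores).

```python
# Hypothetical sketch of embedding-based linguistic schema matching with
# uncertainty-driven attribute selection. Character-trigram vectors are a
# stand-in for pre-trained language-model embeddings.
import math
from collections import Counter


def embed(name: str) -> Counter:
    # Toy embedding: counts of character trigrams over the padded, lowercased name.
    s = f"  {name.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse trigram-count vectors.
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def rank_matches(src_attr: str, target_attrs: list) -> list:
    # Rank target attributes by name similarity to the source attribute.
    src_vec = embed(src_attr)
    scored = [(t, cosine(src_vec, embed(t))) for t in target_attrs]
    return sorted(scored, key=lambda pair: -pair[1])


def most_informative(src_attrs: list, target_attrs: list) -> str:
    # Margin-based uncertainty sampling: the source attribute whose top two
    # candidates are closest in score is the most valuable one to label.
    def margin(attr: str) -> float:
        ranked = rank_matches(attr, target_attrs)
        return ranked[0][1] - ranked[1][1] if len(ranked) > 1 else 1.0
    return min(src_attrs, key=margin)
```

For example, `rank_matches("client_name", ["customer_name", "cust_addr", "order_total"])` ranks `customer_name` first because the two names share many trigrams; a real system would replace `embed` with LM sentence embeddings so that even lexically dissimilar but semantically equivalent names (e.g. `zip` and `postal_code`) score highly.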

Keywords:
Computer science, Schema matching, Database schema, Artificial intelligence, Natural language processing, Information retrieval, Machine learning, Data mining, Data integration, Database design

Metrics

Cited by: 18
FWCI (Field-Weighted Citation Impact): 4.60
References: 45
Citation Normalized Percentile: 0.94 (top 10%)


Topics

Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Text and Document Classification Technologies (Physical Sciences → Computer Science → Artificial Intelligence)
Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)

Related Documents

Journal article: Zhiyu Pan, Muchen Yang, Antonello Monti. "Schema matching based on energy domain pre-trained language model." Energy Informatics, vol. 6 (S1), 2023.

Journal article: Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan. "Deep entity matching with pre-trained language models." Proceedings of the VLDB Endowment, vol. 14 (1), pp. 50–60, 2020.

Book chapter: Huaping Zhang, Jianyun Shang. "Pre-trained Language Models." 2025, pp. 73–90.

Book chapter: Gerhard Paaß, Sven Giesselbach. "Pre-trained Language Models." In Artificial Intelligence: Foundations, Theory, and Algorithms, 2023, pp. 19–78.