Generalization is one of the central challenges in NLP tasks such as text-to-SQL semantic parsing. In text-to-SQL, splitting the data into separate training and test sets measures only one aspect of generalization: how well the model generalizes to unseen databases. Other aspects remain unaccounted for. We propose a new dataset and a more challenging, thorough evaluation process that focuses on two generalization challenges for text-to-SQL models: database content references and question patterns. To assess generalizability, we create SPIDER-QG, an augmented dataset built with three techniques. First, we replace the values in the existing test set with other values from the same column of the same database. Second, we instead substitute a synonym of each value. Third, we generate new questions for each existing SQL query by back-translating the original question. Our evaluation setup exposes the generalization challenges that current models struggle with.
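The three augmentations can be sketched roughly as follows. This is a minimal illustration only: the function names, the random selection policy, the synonym dictionary, and the plain string replacement are assumptions for the sketch, not the paper's actual implementation.

```python
import random

def augment_value_substitution(question, sql, old_value, column_values):
    """First augmentation (sketch): swap a literal value for another value
    drawn from the same column of the same database."""
    candidates = [v for v in column_values if v != old_value]
    if not candidates:
        return question, sql
    new_value = random.choice(candidates)
    return question.replace(old_value, new_value), sql.replace(old_value, new_value)

def augment_synonym_substitution(question, sql, old_value, synonyms):
    """Second augmentation (sketch): replace a value with one of its synonyms.
    `synonyms` is assumed to map each value to a list of alternatives."""
    if not synonyms.get(old_value):
        return question, sql
    new_value = random.choice(synonyms[old_value])
    return question.replace(old_value, new_value), sql.replace(old_value, new_value)

def augment_back_translation(question, translate_to_pivot, translate_back):
    """Third augmentation (sketch): paraphrase a question by round-tripping it
    through a pivot language. The two translation callables are placeholders
    for whatever MT system is used; the SQL query is kept unchanged."""
    return translate_back(translate_to_pivot(question))
```

In all three cases the SQL query's semantics are preserved (or updated in lockstep with the substituted value), so the augmented pairs remain valid test examples while varying the surface form the model must handle.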