JOURNAL ARTICLE

f-Divergence Minimization for Sequence-Level Knowledge Distillation

Abstract

Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demand to compress ever-growing language models. In this work, we propose FDISTILL, a framework that formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that the existing SeqKD and ENGINE approaches are approximations of our FDISTILL methods. We further derive a step-wise decomposition for FDISTILL, reducing the intractable sequence-level divergence to word-level losses that can be computed tractably. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses better force the student to learn from the teacher distribution.
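
To make the idea of word-level losses concrete, the following minimal sketch computes a few standard f-divergences between a teacher and a student next-word distribution and sums them over decoding steps. It is an illustration under stated assumptions, not the authors' implementation: the choice of divergences (KL, reverse KL, Jensen-Shannon, total variation), the function names, and the toy distributions are all hypothetical, and the sketch ignores how the prefixes at each step would be sampled.

```python
import math

def kl(p, q):
    """KL(p || q) for categorical distributions given as probability lists.
    Terms with p_i == 0 contribute 0; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(p, q):
    """KL(q || p): the asymmetric counterpart of kl(p, q)."""
    return kl(q, p)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, built from KL to the mixture m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tvd(p, q):
    """Total variation distance: half the L1 distance, also symmetric."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def stepwise_loss(teacher_steps, student_steps, divergence):
    """Sum a word-level divergence over decoding steps.
    Each argument is a list of per-step distributions over the vocabulary."""
    return sum(divergence(p, q) for p, q in zip(teacher_steps, student_steps))

# Hypothetical toy data: 2 decoding steps, vocabulary of 3 words.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]

for name, d in [("KL", kl), ("RKL", reverse_kl), ("JS", js), ("TVD", tvd)]:
    print(name, stepwise_loss(teacher, student, d))
```

In this sketch JS and TVD return the same value when teacher and student are swapped, while KL and reverse KL do not, which is the sense in which a loss is called symmetric here.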

Keywords:
Distillation, Divergence (linguistics), Minification, Process (computing), Decomposition, Natural language

