JOURNAL ARTICLE

f-Divergence Minimization for Sequence-Level Knowledge Distillation

Abstract

Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose f-DISTILL, a framework that formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that the existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive a step-wise decomposition for f-DISTILL, reducing the intractable sequence-level divergence to word-level losses that can be computed tractably. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses better encourage the student to learn from the teacher distribution.
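The divergences behind the abstract's framework, and its word-level decomposition of the sequence-level loss, can be illustrated with a toy sketch. This is not the paper's implementation: the function names are invented here, and using Jensen-Shannon as the symmetric example is an assumption.

```python
import math

def forward_kl(p, q, eps=1e-12):
    """KL(p || q): mode-covering; heavily penalizes the student q
    for missing probability mass where the teacher p is high."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def reverse_kl(p, q, eps=1e-12):
    """KL(q || p): mode-seeking; penalizes the student for placing
    mass where the teacher p has little."""
    return forward_kl(q, p, eps)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence: a symmetric f-divergence, one
    candidate for a 'symmetric distilling loss'."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * forward_kl(p, m, eps) + 0.5 * forward_kl(q, m, eps)

def sequence_loss(teacher_steps, student_steps, div=js_divergence):
    """Step-wise decomposition: sum a word-level divergence between
    teacher and student next-token distributions at every position,
    instead of comparing intractable whole-sequence distributions."""
    return sum(div(p, q) for p, q in zip(teacher_steps, student_steps))
```

For example, with identical teacher and student distributions every divergence above is zero, while a teacher peaked on one token and a student peaked on another yields a large forward KL but a bounded JS value.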

Keywords:
Knowledge distillation · f-divergence · Sequence-level distillation · Minimization · Natural language processing · Model compression · Artificial intelligence

Metrics

Cited by: 11
FWCI (Field-Weighted Citation Impact): 2.81
References: 59
Citation Normalized Percentile: 0.89

Topics (Physical Sciences → Computer Science → Artificial Intelligence)

Topic Modeling
Natural Language Processing Techniques
Explainable Artificial Intelligence (XAI)
© 2026 ScienceGate Book Chapters — All rights reserved.