JOURNAL ARTICLE

Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

Abstract

Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps, achieving O(1) time and memory complexity per token. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative, decaying fast weights, that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.
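To make the idea concrete, below is a minimal sketch of the decaying fast-weight recurrence the abstract alludes to: the attention state is a fixed-size matrix that is exponentially decayed and updated with a key/value outer product at each step, so per-token cost is constant in sequence length. The function name, the scalar decay parameterization, and the plain dot-product readout are illustrative assumptions; the paper's actual formulation (feature maps, per-head decay, normalization) may differ.

```python
import numpy as np

def decaying_fast_weight_step(state, key, value, query, decay):
    """One recurrent step of a (hypothetical) decaying fast-weight layer.

    state : (d_k, d_v) fast-weight matrix carried across time steps
    key, query : (d_k,) feature vectors for the current token
    value : (d_v,) value vector for the current token
    decay : scalar in (0, 1) that exponentially forgets old associations
    """
    # Decay old associations, then write the new key/value outer product.
    state = decay * state + np.outer(key, value)
    # Read out by querying the fast-weight matrix: O(1) in sequence length.
    output = query @ state
    return state, output

# Toy usage: process a short sequence token by token with constant memory.
d_k, d_v, T = 8, 8, 5
rng = np.random.default_rng(0)
state = np.zeros((d_k, d_v))
for t in range(T):
    k, v, q = rng.normal(size=d_k), rng.normal(size=d_v), rng.normal(size=d_k)
    state, out = decaying_fast_weight_step(state, k, v, q, decay=0.9)
    print(t, out.shape)  # (d_v,) output per token
```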

Keywords:
Computer science, Autoregressive model, Transformer, Token (natural language), Computational complexity theory, Artificial intelligence, Kernel method, Algorithm, Mathematics

Metrics

Cited by: 1
FWCI (Field-Weighted Citation Impact): 0.20
References: 30
Citation Normalized Percentile: 0.57


Topics

Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
Machine Learning in Healthcare (Physical Sciences → Computer Science → Artificial Intelligence)