Applying Positional Encoding to Enhance Vision-Language Transformers

Xuehao Liu; Sarah Jane Delany; Susan McKeever

doi:10.5220/0011796100003417

ScienceGate Book Chapters

JOURNAL ARTICLE

Year: 2023 Pages: 838-845

Abstract

Positional encoding is used in both natural language and computer vision transformers. It provides information on the sequence order and relative position of input tokens (such as words in a sentence), which improves performance. Unlike pure language and vision transformers, vision-language transformers do not currently exploit positional encoding schemes to enrich input information. We show that capturing the location information of visual features can help vision-language transformers improve their performance. We take Oscar, one of the state-of-the-art (SOTA) vision-language transformers, as an example transformer for implanting positional encoding, and use image captioning as the downstream task to test performance. We add two types of positional encoding to Oscar: DETR as an absolute positional encoding approach and iRPE as a relative positional encoding approach. With the same training protocol and data, both positional encodings improve the image captioning performance of Oscar by between 6.8% and 24.1% across the five image captioning evaluation criteria used.
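To make the idea of an absolute positional encoding for visual features concrete, here is a minimal sketch in plain Python. It is not the Oscar, DETR, or iRPE implementation. It uses the well-known fixed sinusoidal scheme and applies it to the centre of a region's bounding box. The function names, the `(x0, y0, x1, y1)` box layout, and the choice to add the encoding element-wise to the region feature are all assumptions made for illustration.

```python
import math


def sinusoidal_encoding(position, dim, base=10000.0):
    """Encode one scalar position as `dim` interleaved sine/cosine values."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(dim // 2):
        # Each sine/cosine pair uses a different wavelength.
        angle = position / (base ** (2 * i / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


def region_position_encoding(box, dim):
    """Encode a box centre; `box` is (x0, y0, x1, y1), a hypothetical layout."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2.0
    cy = (y0 + y1) / 2.0
    half = dim // 2
    # First half of the vector encodes x, second half encodes y.
    return sinusoidal_encoding(cx, half) + sinusoidal_encoding(cy, half)


def add_position(feature, box):
    """Add the positional encoding element-wise to a region feature vector."""
    encoding = region_position_encoding(box, len(feature))
    return [f + p for f, p in zip(feature, encoding)]
```

A call such as `add_position(feature, (10, 20, 50, 80))` returns a vector of the same length as `feature`, so a transformer's input dimension would stay unchanged under this assumed design. A relative scheme would instead encode the offset between pairs of regions inside the attention computation, which this sketch does not attempt.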

Keywords:

Computer science; Encoding (memory); Transformer; Computer vision; Artificial intelligence; Electrical engineering; Engineering; Voltage

Related Documents

POSITIONAL ENCODING FOR TRANSFORMERS

A super-pixel slicing enhanced positional encoding for vision transformers

Applying Prompts and Parameter-Efficient Methods to Enhance Single-Stream Vision-Language Transformers

Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding
