JOURNAL ARTICLE

WinStat: A Family of Trainable Positional Encodings for Transformers in Time Series Forecasting

Cristhian Moya-Mota, Ignacio Aguilera-Martos, Diego García-Gil, Julián Luengo

Year: 2025 | Journal: Machine Learning and Knowledge Extraction | Vol: 8(1) | Pages: 7 | Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Transformers for time series forecasting rely on positional encoding to inject temporal order into the permutation-invariant self-attention mechanism. Classical sinusoidal absolute encodings are fixed and purely geometric; learnable absolute encodings often overfit and fail to extrapolate, while relative or advanced schemes can impose substantial computational overhead without being sufficiently tailored to temporal data. This work introduces a family of window-statistics positional encodings that explicitly incorporate local temporal semantics into the representation of each timestamp. The base variant (WinStat) augments inputs with statistics computed over a sliding window; WinStatLag adds explicit lag-difference features; and hybrid variants (WinStatFlex, WinStatTPE, WinStatSPE) learn soft mixtures of window statistics with absolute, learnable, and semantic positional signals, preserving the simplicity of additive encodings while adapting to local structure and informative lags. We evaluate the proposed encodings against state-of-the-art alternatives on four heterogeneous benchmarks: Electricity Transformer Temperature (hourly variants), Individual Household Electric Power Consumption, New York City Yellow Taxi Trip Records, and a large-scale industrial time series from heavy machinery. All experiments use a controlled Transformer backbone with full self-attention to isolate the effect of positional information. Across datasets, the proposed methods consistently reduce mean squared error and mean absolute error relative to a strong Transformer baseline with sinusoidal positional encoding and to state-of-the-art encodings for time series, with WinStatFlex and WinStatTPE emerging as the most effective variants. In ablation studies, randomly shuffling decoder inputs markedly degrades the proposed methods, supporting the conclusion that their gains arise from learned order-aware locality and semantic structure rather than incidental artifacts. A simple and reproducible heuristic for setting the sliding-window length (roughly one quarter to one third of the input sequence length) provides robust performance without the need for exhaustive tuning.
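
To make the core idea concrete, below is a minimal PyTorch sketch of a WinStat-style window-statistics feature. It assumes mean and standard deviation as the window statistics and simple concatenation with the inputs; the function name winstat_features, the causal (trailing) window, and the fusion by concatenation are illustrative assumptions based on the abstract, not the authors' exact formulation, and the learned mixtures of the hybrid variants (WinStatFlex, WinStatTPE, WinStatSPE) are omitted.

import torch
import torch.nn.functional as F

def winstat_features(x: torch.Tensor, window: int) -> torch.Tensor:
    """Append causal sliding-window mean and std to each timestep.

    Illustrative sketch of the WinStat idea, not the paper's exact method.
    x: (batch, seq_len, channels) input series; window >= 2.
    Returns: (batch, seq_len, 3 * channels).
    """
    b, t, c = x.shape
    # Left-pad so each position only sees its trailing `window` values.
    xp = F.pad(x.transpose(1, 2), (window - 1, 0), mode="replicate")  # (b, c, t + window - 1)
    patches = xp.unfold(2, window, 1)                                 # (b, c, t, window)
    mean = patches.mean(dim=-1)                                       # (b, c, t)
    std = patches.std(dim=-1)                                         # (b, c, t)
    stats = torch.cat([mean, std], dim=1).transpose(1, 2)             # (b, t, 2c)
    return torch.cat([x, stats], dim=-1)                              # (b, t, 3c)

# Window set to one quarter of the input length, following the paper's heuristic.
x = torch.randn(8, 96, 1)
feats = winstat_features(x, window=96 // 4)
print(feats.shape)  # torch.Size([8, 96, 3])

The heuristic from the abstract is applied in the usage line: for a 96-step input, a window of 24 to 32 steps falls in the recommended one-quarter-to-one-third range.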

Keywords:
Transformer; Encoding; Time series; Computation; Heuristic; Pattern recognition; Overfitting

Topics

Time Series Analysis and Forecasting (Physical Sciences → Computer Science → Signal Processing)
Traffic Prediction and Management Techniques (Physical Sciences → Engineering → Building and Construction)
Machine Learning in Healthcare (Physical Sciences → Computer Science → Artificial Intelligence)