Zhicheng Geng, Luming Liang, Tianyu Ding, Ilya Zharkov
Space-time video super-resolution (STVSR) is the task of interpolating videos with both Low Frame Rate (LFR) and Low Resolution (LR) to produce High-Frame-Rate (HFR) and High-Resolution (HR) counterparts. Existing methods based on Convolutional Neural Networks (CNNs) succeed in achieving visually satisfying results but suffer from slow inference speed due to their heavy architectures. We propose to resolve this issue with a spatial-temporal transformer that naturally incorporates the spatial and temporal super-resolution modules into a single model. Unlike CNN-based methods, we do not explicitly use separate building blocks for temporal interpolation and spatial super-resolution; instead, we use only a single end-to-end transformer architecture. Specifically, a reusable dictionary is built by encoders from the input LFR and LR frames, which is then utilized in the decoder to synthesize the HFR and HR frames. Compared with the state-of-the-art TMNet [54], our network is 60% smaller (4.5M vs. 12.3M parameters) and 80% faster (26.2 fps vs. 14.3 fps on 720×576 frames) without sacrificing much performance. The source code is available at https://github.com/llmpass/RSTT.
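The key efficiency claim is that the encoder builds a dictionary once from the input frames, and the decoder reuses it for every output time step. A minimal sketch of this idea, using plain cross-attention with numpy (all names, dimensions, and weights below are illustrative, not from the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 16                                  # feature dimension (illustrative)
frames = rng.standard_normal((4, d))    # 4 input LFR/LR frames as flat features

# Encoder side: project the input frames into a key/value "dictionary"
# exactly once; random projections stand in for learned weights.
W_k, W_v = rng.standard_normal((d, d)), rng.standard_normal((d, d))
keys, values = frames @ W_k, frames @ W_v   # built once, reused below

def decode(query_t):
    """One cross-attention step synthesizing an output frame for time query_t,
    reading from the shared dictionary instead of re-encoding the inputs."""
    attn = softmax(query_t @ keys.T / np.sqrt(d))
    return attn @ values

# Decoder side: synthesize several HFR/HR frames from the SAME dictionary.
queries = rng.standard_normal((7, d))   # 7 output time queries
out = np.stack([decode(q) for q in queries])
print(out.shape)  # (7, 16)
```

The point of the sketch is the cost structure: the encoding work is amortized across all interpolated frames, so adding more output time steps only adds cheap decoder queries.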