Deep learning (DL) is revolutionizing image and video processing and now holds state-of-the-art performance in many tasks. However, video compression has so far resisted the DL revolution. Current attempts rely on complex solutions that interconnect multiple networks to mimic the different layers of conventional codecs. Since DL approaches usually excel when models are allowed to learn their own feature set, a different solution is proposed herein: end-to-end learning of a single network, explicitly avoiding motion estimation/prediction. We formalize it as the rate-distortion optimization of a single spatio-temporal autoencoder, jointly learning a latent-space projection transform and a synthesis transform for low-bitrate video compression. The quantizer uses a rounding scheme, relaxed during training, together with an entropy estimation technique to enforce an information bottleneck. The resulting video compression network shows competitive performance against standard codecs (MPEG-4 Part 2, H.264/AVC, H.265/HEVC), particularly at low bitrates, even while avoiding any motion prediction/compensation method.
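The "rounding scheme, relaxed during training" can be illustrated with a common relaxation from the learned-compression literature: hard rounding at test time, replaced by additive uniform noise during training so gradients can flow through the quantizer. This is a minimal sketch of that general technique, not the paper's exact formulation; the function name and NumPy setting are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(latents, training):
    """Quantize latent values to integers.

    At test time, hard rounding is applied. During training, rounding is
    relaxed to additive uniform noise in [-0.5, 0.5), a standard
    differentiable proxy in learned compression (the paper's exact
    relaxation may differ).
    """
    if training:
        return latents + rng.uniform(-0.5, 0.5, size=latents.shape)
    return np.round(latents)

latents = np.array([0.2, 1.7, -2.4])
hard = quantize(latents, training=False)   # hard rounding: [0., 2., -2.]
soft = quantize(latents, training=True)    # noisy values within 0.5 of latents
```

The noisy surrogate has the same marginal error statistics as rounding, which is what makes the entropy of the relaxed latents a usable estimate of the true post-quantization bitrate.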
Zhaobin Zhang, Yue Li, Kai Zhang, Li Zhang, Yuwen He
Kejun Wu, Zhenxing Li, You Yang, Qiong Liu, Xiaoping Zhang
Wenxuan Guo, Shuo Du, Huiyuan Deng, Zikang Yu, Lin Feng
Alexey A. Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab