Abstract

Pedestrian trajectory prediction from egocentric monocular video is hindered by camera motion, intermittent occlusions, and complex social interactions. We present NIM-STGCN, a unified framework whose core contribution is a differentiable view normalization module (GVN) that couples an enhanced differentiable PnP layer (ED-PnP) with an $\mathrm{SE}(3)$ warp to align past observations into a single virtual static camera frame. Because GVN is trained end-to-end, forecasting losses back-propagate to pose estimation, yielding geometrically cleaner inputs. On the normalized histories, a lightweight Gated Convolutional Imputation Module (GCIM) recovers missing bounding-box measurements while preserving observed entries, and an efficient spatio-temporal GCN encodes agent dynamics and interactions (optionally augmented by a physics-guided kinematics–interaction prior, PKIM). A Gaussian-mixture predictor produces multi-modal futures and is optimized with a sequence-level negative log-likelihood together with a time-weighted position loss. Extensive experiments on the JAAD and PIE benchmarks show that NIM-STGCN reduces Average Displacement Error (ADE) and Final Displacement Error (FDE) by 12–18% compared to state-of-the-art methods. Code is available at https://github.com/fantot/NIM-STGCN.
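The training objective combines a sequence-level negative log-likelihood under the Gaussian-mixture predictor with a time-weighted position loss. A minimal NumPy sketch of both terms is shown below; the isotropic-covariance parameterization, the weight schedule `gamma ** t`, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gmm_nll(mu, sigma, weights, y):
    """Sequence-level NLL of ground truth y under a K-component isotropic
    2D Gaussian mixture per timestep (assumed parameterization).
    mu: (T, K, 2) component means; sigma: (T, K) std devs;
    weights: (T, K) mixture weights summing to 1 per step; y: (T, 2)."""
    d2 = np.sum((y[:, None, :] - mu) ** 2, axis=-1)            # (T, K)
    log_comp = -d2 / (2 * sigma**2) - np.log(2 * np.pi * sigma**2)
    log_mix = np.log(np.sum(weights * np.exp(log_comp), axis=-1))  # (T,)
    return -np.sum(log_mix)

def time_weighted_position_loss(pred, y, gamma=1.1):
    """L2 position error with weights growing along the horizon, so later
    (harder) timesteps contribute more. gamma is a hypothetical knob."""
    w = gamma ** np.arange(len(y))
    return np.sum(w * np.linalg.norm(pred - y, axis=-1)) / np.sum(w)
```

In practice the two terms would be combined as a weighted sum, with the NLL driving multi-modality and the position term anchoring the most likely mode.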
Amar Fadillah, Ching-Lin Lee, Zhixuan Wang, Kuan-Ting Lai