Guanghui Wu, Hao Liu, Longguang Wang, Kunhong Li, Yulan Guo, Zengping Chen
Self-supervised multi-frame depth estimation outperforms single-frame approaches by exploiting not only appearance information but also geometric information. A common practice in multi-frame methods is to employ feature-metric bundle adjustment (FBA) to refine the depth map initialized from a single-frame prior. However, FBA cannot always provide effective residual updates due to unreliable matching costs, which are corrupted by thin texture, occlusion, and especially object motion. To tackle this problem, we propose a context-aware transformer (CAT) that refines the corrupted matching costs by leveraging spatial context information. Specifically, the CAT adaptively aggregates matching costs according to the spatial affinity inferred from local appearance context, producing reliable contextual costs for FBA. Moreover, we design a motion-aware regularization loss that provides supervision for regions containing moving objects, making the CAT competent for dynamic scenes. Extensive experiments and analyses on the KITTI and Cityscapes datasets demonstrate the effectiveness and superior generalization capability of our approach.