Monocular depth estimation is a fundamental task in computer vision and multimedia. Self-supervised learning makes it possible to train a monocular depth network without depth labels. In this paper, a multi-frame depth model with multi-scale feature fusion is proposed to strengthen texture features and spatial-temporal features, which improves the robustness of depth estimation between frames with large camera ego-motion. A novel dynamic object detection method with geometric explainability is also proposed. The detected dynamic objects are excluded during training, which upholds the static-environment assumption and mitigates the accuracy degradation of multi-frame depth estimation. In addition, robust knowledge distillation with a consistent teacher network and a reliability guarantee is proposed, which improves multi-frame depth estimation without increasing computational complexity at test time. Experiments show that the proposed methods substantially improve the performance of multi-frame depth estimation.
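To illustrate what excluding dynamic objects during training can look like, the following is a minimal sketch and not the method of this paper. It assumes a plain L1 photometric error between a target frame and a source frame already warped into the target view, and it takes the dynamic-object mask as a given input. The function name `masked_photometric_loss` and its arguments are hypothetical.

```python
import numpy as np

def masked_photometric_loss(target, warped, dynamic_mask):
    """Mean L1 photometric error over pixels not flagged as dynamic.

    Assumptions (illustrative only):
      target, warped: float arrays of shape (H, W, 3)
      dynamic_mask:   bool array of shape (H, W), True where a pixel
                      belongs to a detected dynamic object
    """
    # Per-pixel L1 error, averaged over the colour channels -> (H, W)
    per_pixel = np.abs(target - warped).mean(axis=-1)
    # Keep only pixels that satisfy the static-scene assumption
    static = ~dynamic_mask
    if static.sum() == 0:
        return 0.0  # every pixel was masked out; nothing to supervise
    return float(per_pixel[static].mean())
```

Pixels flagged as dynamic contribute nothing to the loss, so photometric errors caused by object motion rather than by wrong depth do not drive the gradient.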
Qiqi KOU, Wei-Chen Wang, Chenggong HAN, Chen LÜ, Deqiang CHENG, Ying Ji
Guanghui Wu, Hao Liu, Longguang Wang, Kunhong Li, Yulan Guo, Zengping Chen
Hongli Hu, Jun Miao, Guanghui Zhu, Jie Yan, Jun Chu