Yongqiang Du, Haoran Liu, Shengjie He, Songnan Chen
Blind image inpainting, the task of detecting corrupted regions with diverse patterns in an image and then generating plausible content for those regions, remains a challenging yet practical problem in computer vision. In this paper, we propose InViT, a novel model for blind image inpainting that combines a pre-trained Generative Adversarial Network (GAN) with a learnable Vision Transformer (ViT). InViT consists of two phases: mask prediction and image inpainting. Benefiting from the latent feature space learned from the full training data via GAN inversion, a pre-trained StyleGAN provides reliable cues about corrupted regions for mask prediction. By incorporating the predicted mask into the inpainting phase, we design a Vision Transformer with a mask-aware self-attention mechanism that captures long-range dependencies between pixels during content reconstruction. In addition, we propose a Prompt-augment Contextual Aggregation module to improve the plausibility of the content generated for corrupted regions. Extensive experiments on several benchmark datasets for blind image inpainting demonstrate that InViT achieves state-of-the-art performance compared to existing methods, in terms of both quantitative metrics and qualitative visual quality.
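The abstract does not specify how the predicted mask enters the self-attention computation; a common formulation is to bias the attention logits so that tokens from corrupted regions contribute little as attention targets. The sketch below illustrates that idea only; the function name, the additive log-bias, and single-head NumPy form are assumptions, not the paper's actual InViT implementation.

```python
import numpy as np

def mask_aware_attention(x, mask, w_q, w_k, w_v):
    """Single-head self-attention with a mask-derived logit bias.

    x:    (n, d) token features
    mask: (n,) floats, 1.0 for tokens in corrupted regions, 0.0 otherwise
    w_q, w_k, w_v: (d, d) projection matrices

    NOTE: illustrative sketch, not the InViT formulation from the paper.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # Suppress attention *to* corrupted tokens so that reconstruction
    # of each pixel is driven by valid (uncorrupted) context.
    logits = logits + np.log(1e-6) * mask[None, :]
    # Numerically stable softmax over the key axis.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With this biasing scheme the attention rows still sum to one, but columns corresponding to masked tokens receive near-zero weight, which is one simple way long-range dependencies from clean regions can dominate the reconstruction.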