Rui Zhao, Zhenwei Shi, Zhengxia Zou
Automatically generating language descriptions of remote sensing images has become an emerging research hotspot in the remote sensing field. Attention-based captioning, a representative group of recent deep learning-based captioning methods, has the advantage of highlighting the corresponding object locations in the image while generating each word. Standard attention-based methods, however, generate captions from coarse-grained, unstructured attention units and thus fail to exploit the structured spatial relations among semantic contents in remote sensing images. Although this structural characteristic makes remote sensing images widely divergent from natural images and poses a greater challenge for remote sensing image captioning, the core of most remote sensing captioning methods is usually borrowed from the computer vision community without considering the domain knowledge behind it. To overcome this problem, a fine-grained, structured attention-based method is proposed that exploits the structural characteristics of semantic contents in high-resolution remote sensing images. The method learns better descriptions and can generate pixelwise segmentation masks of semantic contents. The segmentation is jointly trained with the captioning in a unified framework without requiring any pixelwise annotations. Evaluations are conducted on three remote sensing image captioning benchmark data sets with detailed ablation studies and parameter analysis. Compared with state-of-the-art methods, the proposed method achieves higher captioning accuracy and simultaneously generates high-resolution, meaningful segmentation masks of semantic contents.
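For context, the sketch below illustrates the standard soft-attention decoding step that the abstract contrasts with: at each word, additive attention scores the H x W cells of a CNN feature map, and the resulting weights double as a coarse spatial mask when reshaped and upsampled. This is a minimal illustration assuming PyTorch; all names (SoftAttention, feat_dim, etc.) and dimensions are hypothetical, and it is the unstructured baseline, not the paper's structured-attention method.

```python
# Minimal sketch of one soft-attention captioning step (assumed PyTorch API;
# names and shapes are illustrative, not from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) attention over H*W spatial feature cells."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats:  (B, H*W, feat_dim) flattened CNN feature map
        # hidden: (B, hidden_dim)    current decoder LSTM state
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                      # (B, H*W) scores
        alpha = F.softmax(e, dim=1)                         # attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
        return context, alpha

# Reshaping the per-word weights onto the H x W grid and upsampling yields
# the coarse, unstructured attention map that structured attention refines.
B, H, W, feat_dim, hidden_dim = 2, 14, 14, 512, 256
attn = SoftAttention(feat_dim, hidden_dim, attn_dim=128)
feats = torch.randn(B, H * W, feat_dim)
hidden = torch.randn(B, hidden_dim)
context, alpha = attn(feats, hidden)
mask = F.interpolate(alpha.view(B, 1, H, W), scale_factor=16,
                     mode="bilinear", align_corners=False)  # (B, 1, 224, 224)
```

Because each weight covers an entire feature-map cell, masks recovered this way are coarse and ignore object structure; the paper's contribution is to replace these unstructured units with fine-grained, structured ones so that meaningful pixelwise masks emerge without pixelwise supervision.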