Image captioning uses artificial intelligence to translate visual content into natural-language text descriptions. Underwater image captioning offers specialized interpretation for scenarios such as underwater environmental monitoring, underwater archaeology, and offshore platforms. It is also an effective way to compress information when large volumes of underwater images must be transmitted in real time over underwater acoustic communication links. In this article, we annotate an underwater image caption dataset for this task and create a baseline using the encoder-decoder neural image caption model, which outputs complete sentences related to the image content. Descriptions of underwater images focus mainly on the underwater scene and the objects in it. An object detection model based on Faster R-CNN is applied to extract the full-image features and the regional features corresponding to the targets in the image. For the caption model, we enhance the input features of the language generator by combining global information, regional details, contextual cues, and pre-ordered text information through feature fusion. This enables the generator to output precise semantic expressions related to salient objects. Applied to the annotated underwater image caption dataset, the method produced more accurate descriptions of underwater targets than the sentences generated by a basic neural network model. It also achieved higher scores on the evaluation metrics, which supports the effectiveness of our approach.
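The feature fusion step can be illustrated with a minimal sketch. The abstract does not specify the fusion operation, so the version below makes several assumptions: the regional features are pooled with dot-product attention against the decoder hidden state (standing in for the contextual cues), the previous word embedding stands in for the pre-ordered text information, and the four vectors are simply concatenated. The function names `fuse_features`, `softmax`, and `dot` are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_features(global_feat, region_feats, hidden, prev_word_emb):
    """Build one decoder input vector from four feature sources.

    ASSUMPTION: dot-product attention plus concatenation is an
    illustrative choice, not the fusion scheme confirmed by the paper.
    region_feats is a list of vectors with the same length as hidden.
    """
    # Attention weight of each region with respect to the current context.
    weights = softmax([dot(r, hidden) for r in region_feats])
    dim = len(region_feats[0])
    # Weighted sum of the regional features.
    attended = [
        sum(w * r[i] for w, r in zip(weights, region_feats))
        for i in range(dim)
    ]
    # Concatenate global, regional, contextual, and text features.
    fused = list(global_feat) + attended + list(hidden) + list(prev_word_emb)
    return fused, weights
```

In this sketch the fused vector has length `len(global_feat) + 2 * len(hidden) + len(prev_word_emb)` and would be fed to the language generator at each decoding step. A learned projection or gating layer could replace the plain concatenation.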
Yunhan Li, Jingjing Lou, Chuan Ye, Pengfei Zheng, Haijun Wu, Minxiu Guan
Qingqing Lu, Xiaomei Zhang, Xin Kang, Fuji Ren
Caixia Meng, Zhichao Bao, Yanzhao Zhang