In this paper, we propose MASTF, a Multimodal Abstractive Summarization methodology based on the Transformer. Previous neural models for multimodal abstractive summarization relied on hierarchical attention built on recurrent neural networks. Although transformers have shown excellent performance across natural language processing tasks, including abstractive summarization, they had not been applied to multimodal abstractive summarization. We therefore use transformers to improve the performance of multimodal summarization models for generating image subtitles. The transformer-based model outperforms hierarchical attention-based models by 24.17% on ROUGE-L, and by 10.52% when speech and text are combined.
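The overall idea, replacing a recurrent hierarchical-attention model with a transformer that consumes fused speech and text features, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the class name, dimensions, and the simple concatenation-based fusion are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultimodalSummarizer(nn.Module):
    """Hypothetical sketch of a transformer-based multimodal summarizer."""

    def __init__(self, vocab_size=1000, d_model=64, audio_dim=40, nhead=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Project speech features (e.g. filterbank frames) into the model dimension.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, audio_feats, tgt_ids):
        # Naive fusion: concatenate text token embeddings and projected
        # audio frames along the sequence dimension, then encode jointly.
        src = torch.cat([self.text_embed(text_ids),
                         self.audio_proj(audio_feats)], dim=1)
        tgt = self.text_embed(tgt_ids)
        hidden = self.transformer(src, tgt)
        return self.out(hidden)  # per-position vocabulary logits

model = MultimodalSummarizer()
text = torch.randint(0, 1000, (2, 10))   # batch of source token ids
audio = torch.randn(2, 20, 40)           # batch of speech feature frames
tgt = torch.randint(0, 1000, (2, 5))     # summary token ids (teacher forcing)
logits = model(text, audio, tgt)
print(logits.shape)  # (2, 5, 1000)
```

In practice a causal mask on the decoder and modality-specific positional encodings would be needed for training; the sketch only shows how both modalities can share one transformer encoder-decoder.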