H. Hoang, Tung D. Le, Nguyen Tien Huy
Recent advances in computer vision and natural language processing have been applied to the Visual Question Answering (VQA) task. However, many of the most accurate models have very large architectures, which hinders deployment in practical applications such as assistive devices for blind and visually impaired users. Our research compresses a Visual Question Answering model on a Vietnamese dataset using the knowledge distillation method. To recover precision, we also develop a Mixture of ViVQA Experts system that adapts to each question type, improving accuracy while adding only a few parameters and without retraining the entire system from scratch. With a total of 204M parameters, this approach reduces model size by 24.51% compared to the original while lowering accuracy by only 6.59% on the overall test set. Moreover, accuracy improves on individual question types relative to our distilled model: "number" by 1.35% and "color" by 0.48%. The code and pretrained models are available at: anonymous.
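The abstract's compression step relies on knowledge distillation, where a small student model is trained to match a large teacher's softened output distribution. The paper does not give its loss formulation, so the sketch below shows only the standard temperature-scaled distillation term (Hinton et al.'s formulation); the function names, temperature value, and pure-Python setup are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the
    # distribution so the student sees the teacher's "dark knowledge".
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradients keep a comparable magnitude as T grows.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl
```

In practice this term is typically mixed with the ordinary cross-entropy on ground-truth answer labels; the mixing weight and temperature are hyperparameters the abstract does not specify.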