Evaluation is a central concern in the development of information retrieval systems. In the field of conversational AI in particular, this topic has been studied extensively for both non-task-oriented and task-oriented conversational agents (dialogue systems) [1]. Word-overlap metrics such as BLEU and ROUGE, adopted for the automatic evaluation of dialogue systems, have been shown to correlate poorly with human judgment and are therefore ineffective for this purpose. As a consequence, a significant amount of research relies on human evaluation to estimate the effectiveness of dialogue systems [1, 4].
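To make this failure mode concrete, the following is a minimal sketch of a BLEU-style clipped n-gram overlap score (simplified: unigrams and bigrams only, no smoothing); the dialogue responses are illustrative examples, not drawn from any benchmark.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions, times a brevity penalty. No smoothing."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "yes , the restaurant is open on sundays"
paraphrase = "it is indeed open on sunday"              # correct, low overlap
contradiction = "yes , the restaurant is closed on sundays"  # wrong, high overlap

print(simple_bleu(reference, paraphrase))     # ~0.23
print(simple_bleu(reference, contradiction))  # ~0.79
```

The overlap score ranks the contradictory echo well above the correct paraphrase, which is precisely the mismatch with human judgment that motivates human evaluation in this line of work.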