This paper seeks to quantify human-AI value alignment in large language models. Alignment between humans and AI has become a critical area of research for mitigating the potential harms posed by AI. In tandem with this need, developers have adopted a values-based approach to model development, in which ethical principles are integrated from a model's inception. However, ensuring that these values are reflected in model outputs remains a challenge. In addition, studies have noted that models produce inconsistent outputs, which can in turn affect their function. Such variability in responses also impacts human-AI value alignment, particularly in settings where consistent alignment is critical. Fundamentally, the task of uncovering a model's alignment is one of explainability: understanding how these complex models behave is essential to assessing their alignment. This paper examines the problem through a case study of GPT-3.5. By repeatedly prompting the model with scenarios drawn from a dataset of moral stories and aggregating its responses, we produce a human-AI value alignment metric. Moreover, by using a comprehensive taxonomy of human values, we uncover the latent value profile represented in these outputs, thereby determining the extent of human-AI value alignment.
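To make the aggregation step concrete, the following is a minimal sketch of how repeated prompting can be turned into a scalar alignment score. It is not the paper's actual implementation: the function names (`query_model`, `alignment_metric`), the scenario/label representation, and the repeat count are all illustrative assumptions, and `query_model` is a placeholder to be replaced with a real call to the model under study.

```python
from typing import Callable

# Hypothetical stand-in for a call to the model under study (e.g. GPT-3.5).
# It should return the model's judgement of a scenario, e.g. "moral"/"immoral".
def query_model(scenario: str) -> str:
    raise NotImplementedError("replace with an actual model API call")

def alignment_metric(
    scenarios: list[tuple[str, str]],     # (scenario text, human-endorsed label)
    query: Callable[[str], str] = query_model,
    n_repeats: int = 10,                  # repeated prompts absorb response variability
) -> float:
    """Fraction of sampled responses that match the human-endorsed label,
    aggregated over repeated prompts for each scenario."""
    matches, total = 0, 0
    for scenario, human_label in scenarios:
        responses = [query(scenario) for _ in range(n_repeats)]
        matches += sum(r == human_label for r in responses)
        total += n_repeats
    return matches / total if total else 0.0
```

Under this sketch, a score of 1.0 would indicate that every sampled response agreed with the human-endorsed judgement, while repeated sampling per scenario exposes the response inconsistency noted above rather than hiding it behind a single query.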