Despite extensive prior work, conducting textual adversarial attacks in practical settings remains challenging under three constraints: black box (the inner structure of the victim model is unknown), hard label (the attacker has access only to the top-1 prediction), and semantic preservation (the perturbation must preserve the original semantics). In this paper, we present PAT, a novel adversarial attack method that operates under all of these constraints. Specifically, PAT explicitly models adversarial and non-adversarial prototypes and uses them to measure semantic changes when selecting replacements in the hard-label black-box setting, in order to generate high-quality samples. In each iteration, guided by this estimation, PAT finds original words that can be replaced back and selects better candidate words for perturbed positions in a geometry-aware manner, which maximally improves the perturbation construction and minimally impacts the original semantics. Extensive evaluation on benchmark datasets and state-of-the-art models shows that PAT outperforms existing textual adversarial attacks in both attack effectiveness and semantic preservation. Moreover, we validate the efficacy of PAT against industry-leading natural language processing platforms in real-world settings.
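To make the hard-label black-box setting concrete, the following is a minimal sketch of the "replace back" step alone, in which the attacker sees nothing but the top-1 label. It is not PAT's actual procedure: the names `replace_back`, `predict_top1`, and `toy_victim` are hypothetical, the victim is a toy keyword classifier, and a plain left-to-right greedy order stands in for the prototype-guided, geometry-aware selection, whose details the abstract does not give.

```python
from typing import Callable, List

# Hypothetical toy victim: the attacker may only call it and read the top-1 label.
NEGATIVE_WORDS = {"terrible", "boring", "awful"}

def toy_victim(words: List[str]) -> str:
    """Return only a hard label ("neg" or "pos"), never scores or gradients."""
    return "neg" if any(w in NEGATIVE_WORDS for w in words) else "pos"

def replace_back(
    original: List[str],
    adversarial: List[str],
    predict_top1: Callable[[List[str]], str],
    true_label: str,
) -> List[str]:
    """Greedily restore original words while the example stays adversarial.

    Assumption: positions are visited left to right; PAT instead ranks
    positions using its prototype-based estimate of semantic change.
    """
    if len(original) != len(adversarial):
        raise ValueError("word-level substitution requires equal lengths")
    current = list(adversarial)
    for i, orig_word in enumerate(original):
        if current[i] == orig_word:
            continue  # position is not perturbed
        trial = current[:i] + [orig_word] + current[i + 1:]
        # One hard-label query: keep the restoration only if the top-1
        # prediction is still different from the true label.
        if predict_top1(trial) != true_label:
            current = trial
    return current

original = "the movie was terrible and boring".split()
adversarial = "the film was dreadful and dull".split()
refined = replace_back(original, adversarial, toy_victim, true_label="neg")
print(" ".join(refined))  # the movie was dreadful and dull
```

In this toy run, "movie" is restored because the prediction stays flipped, while "terrible" and "boring" are not, since restoring either would return the prediction to the true label. The result is a perturbation with fewer changed words that is still adversarial.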
Danyu Xu, Pengchuan Wang, Qianmu Li
Muxue Liang, Chuan Wang, Siyuan Liang, Aishan Liu, Yanan Cao, Qingyong Li, Zeming Liu, Liang Yang, Xiaochun Cao
Zhaorong Liu, Xi Xiong, Yuanyuan Li, Yan Yu, Jiazhong Lu, Shuai Zhang, Fei Xiong
Chenhao Lin, Sicong Han, Jiongli Zhu, Qian Li, Chao Shen, Youwei Zhang, Xiaohong Guan
Hua Zhang, Jiahui Wang, Haoran Gao, Xin Zhang, Huawei Wang, Wenmin Li