The Language Model (LM) Detector has gained attention for its strong performance in detecting machine-generated text. However, it remains unclear how well this detector holds up against adversarial attacks. In this paper, we address this question through a systematic analysis of the LM Detector's resilience against eight black-box adversarial attack methods. We also propose a new technique, StrictPWWS, which introduces a semantic similarity constraint into conventional Probability Weighted Word Saliency (PWWS). Our findings reveal that the choice of search algorithm helps attack methods generate better adversarial samples that can bypass the LM Detector. Moreover, tightening linguistic constraints emerges as an effective way to improve the attack success rate. StrictPWWS achieves superior performance compared to the other adversarial attack methods.
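The core idea of StrictPWWS, as described above, is to reject PWWS word substitutions whose rewritten text drifts too far semantically from the original. The following is a minimal, hypothetical sketch of that filtering step; the paper does not specify its similarity measure or threshold, so a toy token-overlap (Jaccard) score stands in for an embedding-based similarity to keep the example self-contained, and the function names are illustrative only.

```python
# Hypothetical sketch of the semantic-similarity constraint in StrictPWWS.
# PWWS proposes word substitutions ranked by word saliency; the "strict"
# variant additionally drops any candidate whose similarity to the original
# text falls below a threshold. A real implementation would use sentence
# embeddings; the Jaccard score below is a self-contained stand-in.

def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for an embedding-based semantic similarity score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def strict_filter(original: str, candidates: list[str],
                  threshold: float = 0.6) -> list[str]:
    """Keep only adversarial candidates that stay semantically close."""
    return [c for c in candidates
            if jaccard_similarity(original, c) >= threshold]

original = "the model detects machine generated text"
candidates = [
    "the model detects machine produced text",  # one-word substitution, kept
    "completely unrelated sentence here",       # drifts too far, rejected
]
print(strict_filter(original, candidates))
# → ['the model detects machine produced text']
```

In the full attack, this filter would run inside the PWWS search loop: each saliency-ranked substitution is applied only if the constrained similarity check passes, trading some attack flexibility for adversarial samples that better preserve the original meaning.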
Maha Alqhtani, Daniyal Alghazzawi, Suaad Alarifi