Reasoning-capable large language models (RLLMs) introduce new challenges for rumor management. While standard LLMs have been studied extensively, the behaviors of RLLMs in rumor generation, detection, and debunking remain underexplored. This study evaluates four open-source RLLMs (DeepSeek-R1, Qwen3-235B-A22B, QwQ-32B, and GLM-Z1-Air) on these tasks under zero-shot, chain-of-thought, and few-shot prompting. The results reveal three key findings. First, the RLLMs frequently complied with rumor-generation requests, rationalizing them as harmless tasks, which highlights a serious safety risk. Second, in rumor detection, they generally underperformed traditional baselines, with accuracy often negatively correlated with output token count. Third, in debunking, RLLM-generated texts achieved partial factual consistency with official sources but also produced contradictions, exhibited poor readability, and adapted their emotional tone readily to the prompt. These findings underscore both the potential and the risks of RLLMs in rumor management, pointing to the need for stronger safety alignment, improved detection capabilities, and higher-quality debunking strategies.