ZTE Communications ›› 2025, Vol. 23 ›› Issue (3): 15-26.DOI: 10.12142/ZTECOM.202503003

• Special Topic •

VOTI: Jailbreaking Vision-Language Models via Visual Obfuscation and Task Induction

ZHU Yifan, CHU Zhixuan, REN Kui

  1. Zhejiang University, Hangzhou 310027, China
  • Received:2025-07-25 Online:2025-09-11 Published:2025-09-11
  • About author:ZHU Yifan received her BE degree from the School of Cyber Science and Technology, Sun Yat-sen University, China, in 2025. She is currently pursuing her ME degree at the School of Cyber Science and Technology, Zhejiang University, China. Her research interests include the security of multimodal large language models and safety alignment.
    CHU Zhixuan (zhixuanchu@zju.edu.cn) is a research professor and PhD supervisor at Zhejiang University, China. He received his PhD from the University of Georgia, USA, and previously worked at Alibaba and Ant Group. His research focuses on secure and trustworthy large models, particularly the safe and reliable applications of large language models and multimodal models in vertical domains. He has published over 50 papers in top-tier journals and conferences in AI, data mining, and databases, including NeurIPS, ICLR, IJCAI, AAAI, ACL, KDD, ICDE, CCS, and TNNLS.
    REN Kui is a Qiushi Chair Professor and the dean of the College of Computer Science and Technology of Zhejiang University, China, where he is also the executive deputy director of the State Key Laboratory of Blockchain and Data Security. He is mainly engaged in research on data security and privacy protection, AI security, and security in intelligent devices and vehicular networks. He has published over 400 peer-reviewed journal and conference articles, with an H-index of 100 and more than 54 000 citations. He is a Fellow of AAAS, ACM, CCF, and IEEE.

Abstract:

In recent years, large vision-language models (VLMs) have achieved significant breakthroughs in cross-modal understanding and generation. However, the safety issues arising from their multimodal interactions have become increasingly prominent. VLMs are vulnerable to jailbreak attacks, in which attackers craft carefully designed prompts to bypass safety mechanisms and induce the models to generate harmful content. To address this, we investigate the alignment between visual inputs and task execution, uncovering locality defects and attention biases in VLMs. Based on these findings, we propose VOTI, a novel jailbreak framework that leverages visual obfuscation and task induction. VOTI subtly embeds malicious keywords within neutral image layouts to evade detection and decomposes harmful queries into a sequence of subtasks. This approach disperses malicious intent across modalities, exploiting VLMs' over-reliance on local visual cues and their fragility in multi-step reasoning to bypass global safety mechanisms. Implemented as an automated framework, VOTI integrates large language models as red-team assistants to generate and iteratively optimize jailbreak strategies. Extensive experiments across seven mainstream VLMs demonstrate VOTI's effectiveness, achieving a 73.46% attack success rate on GPT-4o-mini. These results reveal critical vulnerabilities in VLMs, highlighting the urgent need for robust defenses and improved multimodal alignment.
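To make the two-pronged mechanism described above concrete, the following is a minimal, hypothetical sketch of the two building blocks named in the abstract: rendering a sensitive keyword into an otherwise neutral image (visual obfuscation) and splitting a query into a chain of subtask prompts that reference that image (task induction). The function names, prompt templates, and rendering parameters are illustrative assumptions, not the authors' released implementation, which is not given here.

```python
# Illustrative sketch only (not the authors' code): hide a keyword in a neutral
# image layout and pair it with a sequence of subtask prompts referencing it.
from PIL import Image, ImageDraw, ImageFont


def embed_keyword(keyword: str, size=(512, 512)) -> Image.Image:
    """Render the keyword as low-contrast text inside a plain background.

    Per the abstract, the sensitive term lives only in the visual modality,
    where text-side safety filters cannot inspect it.
    """
    img = Image.new("RGB", size, color=(245, 245, 245))   # neutral background
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    draw.text((size[0] // 3, size[1] // 2), keyword,
              fill=(225, 225, 225), font=font)             # low-contrast placement
    return img


def induce_subtasks(placeholder: str = "the term shown in the image") -> list[str]:
    """Decompose a request into benign-looking steps tied to the image.

    These templates are hypothetical examples of task induction; the concrete
    templates and optimization loop used by VOTI are described in the paper.
    """
    return [
        f"Step 1: Read and transcribe {placeholder}.",
        f"Step 2: List general background facts related to {placeholder}.",
        "Step 3: Combine the previous answers into a single detailed explanation.",
    ]


if __name__ == "__main__":
    image = embed_keyword("EXAMPLE_KEYWORD")                # benign stand-in keyword
    image.save("obfuscated_input.png")
    for prompt in induce_subtasks():
        print(prompt)
```

In the full framework, an LLM red-team assistant would iteratively rewrite these subtask prompts and the image layout based on the target VLM's responses; the sketch above shows only a single, static instance of the idea.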

Key words: large vision-language models, jailbreak attacks, red teaming, security of large models, safety alignment