VOTI: Jailbreaking Vision-Language Models via Visual Obfuscation and Task Induction
ZHU Yifan, CHU Zhixuan, REN Kui
ZTE Communications, 2025, 23(3): 15-26. DOI: 10.12142/ZTECOM.202503003

In recent years, large vision-language models (VLMs) have achieved significant breakthroughs in cross-modal understanding and generation. However, the safety issues arising from their multimodal interactions have become increasingly prominent. VLMs are vulnerable to jailbreak attacks, where attackers craft carefully designed prompts to bypass safety mechanisms and lead the models to generate harmful content. To address this, we investigate the alignment between visual inputs and task execution, uncovering locality defects and attention biases in VLMs. Based on these findings, we propose VOTI, a novel jailbreak framework leveraging visual obfuscation and task induction. VOTI subtly embeds malicious keywords within neutral image layouts to evade detection and breaks down harmful queries into a sequence of subtasks. This approach disperses malicious intent across modalities, exploiting VLMs' over-reliance on local visual cues and their fragility in multi-step reasoning to bypass global safety mechanisms. Implemented as an automated framework, VOTI integrates large language models as red-team assistants to generate and iteratively optimize jailbreak strategies. Extensive experiments across seven mainstream VLMs demonstrate VOTI's effectiveness, achieving a 73.46% attack success rate on GPT-4o-mini. These results reveal critical vulnerabilities in VLMs, highlighting the urgent need for robust defenses and improved multimodal alignment.
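For intuition, the following is a minimal, deliberately benign sketch of the two mechanisms the abstract describes: rendering a keyword inconspicuously inside an otherwise neutral image layout, and splitting a query into individually innocuous subtasks. It is not the authors' implementation; the helper names obfuscate_keyword and induce_tasks, the image layout, and the subtask wording are illustrative assumptions, and only Pillow is used.

    # Sketch of VOTI's two components, based only on the abstract's
    # description; all names and prompt texts here are hypothetical.
    from PIL import Image, ImageDraw

    def obfuscate_keyword(keyword: str, size=(512, 512)) -> Image.Image:
        """Render a keyword inside an otherwise neutral image layout,
        so filters scanning for salient content may overlook it."""
        img = Image.new("RGB", size, color=(245, 245, 245))
        draw = ImageDraw.Draw(img)
        # Neutral decoy content occupies the visually dominant region.
        draw.rectangle([40, 40, 472, 300], outline=(180, 180, 180), width=2)
        draw.text((60, 60), "Quarterly report: regional overview", fill=(60, 60, 60))
        # The keyword is rendered small and low-contrast in a peripheral
        # area, exploiting the model's over-reliance on local visual cues.
        draw.text((380, 480), keyword, fill=(225, 225, 225))
        return img

    def induce_tasks(goal_hint: str) -> list[str]:
        """Split a query into individually innocuous subtasks, dispersing
        intent across reasoning steps (task induction)."""
        return [
            "Step 1: Transcribe any text you can find in the image.",
            "Step 2: Treat the transcribed text as the topic keyword.",
            f"Step 3: {goal_hint}, continuing from step 2.",
        ]

In the automated framework, a red-team LLM would generate candidate goal_hint phrasings and refine them over iterations based on the target VLM's refusals; that optimization loop is omitted from this sketch.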

Poison-Only and Targeted Backdoor Attack Against Visual Object Tracking
GU Wei, SHAO Shuo, ZHOU Lingtao, QIN Zhan, REN Kui
ZTE Communications, 2025, 23(3): 3-14. DOI: 10.12142/ZTECOM.202503002

Visual object tracking (VOT), which aims to track a target object through a continuous video, is a fundamental and critical task in computer vision. However, the reliance on third-party resources (e.g., datasets) for training poses hidden threats to the security of VOT models. In this paper, we reveal that VOT models are vulnerable to a poison-only and targeted backdoor attack, where the adversary can achieve arbitrary tracking predictions by manipulating only part of the training data. Specifically, we first define and formulate three variants of the targeted attack: size-manipulation, trajectory-manipulation, and hybrid attacks. To implement these attacks, we introduce Random Video Poisoning (RVP), a novel poison-only strategy that exploits temporal correlations within video data by poisoning entire video sequences. Extensive experiments demonstrate that RVP effectively injects controllable backdoors, enabling precise manipulation of tracking behavior upon trigger activation while maintaining high performance on benign data, thus ensuring stealthiness. Our findings not only expose significant vulnerabilities but also highlight that the underlying principles could be adapted for beneficial uses, such as dataset watermarking for copyright protection.
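For illustration, here is a minimal sketch of the poisoning step as the abstract describes it, assuming videos stored as [T, H, W, C] uint8 arrays with [T, 4] float box annotations in (x, y, w, h) format. The trigger patch, the 10% poisoning rate, and the 0.5 scale factor for the size-manipulation variant are assumptions for illustration, not the paper's configuration.

    # Sketch of Random Video Poisoning (RVP): poison a random subset of
    # *entire* videos (preserving the temporal correlation the attack
    # relies on), stamping a trigger into every frame and shrinking the
    # ground-truth boxes so the tracker learns the association
    # "trigger present -> predict a smaller target".
    import random
    import numpy as np

    def add_trigger(frame: np.ndarray, patch: np.ndarray) -> np.ndarray:
        """Stamp a small trigger patch into the top-left corner of a frame."""
        out = frame.copy()
        ph, pw = patch.shape[:2]
        out[:ph, :pw] = patch
        return out

    def poison_dataset(videos, annotations, rate=0.1, scale=0.5, seed=0):
        """videos: list of [T, H, W, C] uint8 arrays; annotations: list of
        [T, 4] float arrays of (x, y, w, h) boxes. Returns poisoned indices."""
        rng = random.Random(seed)
        patch = np.full((16, 16, 3), 255, dtype=np.uint8)  # plain white square
        poisoned = rng.sample(range(len(videos)), int(rate * len(videos)))
        for i in poisoned:
            videos[i] = np.stack([add_trigger(f, patch) for f in videos[i]])
            # Size-manipulation variant: shrink every box's width and height.
            annotations[i][:, 2:] *= scale
        return poisoned

Under the same assumptions, the trajectory-manipulation variant would instead shift the (x, y) coordinates of the poisoned boxes along an attacker-chosen path, and the hybrid variant would combine both manipulations.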
