ZTE Communications ›› 2025, Vol. 23 ›› Issue (3): 38-47. DOI: 10.12142/ZTECOM.202503005

• Special Topic •

Dataset Copyright Auditing for Large Models: Fundamentals, Open Problems, and Future Directions

DU Linkang, SU Zhou, YU Xinyi

  1. Xi’an Jiaotong University, Xi’an 710049, China
  • Received: 2025-06-20; Online: 2025-09-11; Published: 2025-09-11
  • Contact: SU Zhou
  • About author: DU Linkang received his BE and PhD degrees from Zhejiang University, China, in 2018 and 2023, respectively. He is currently an assistant professor at the School of Cyber Science and Engineering, Xi’an Jiaotong University, China. His research interests include privacy-preserving computing and trustworthy machine learning.
    SU Zhou (zhousu@xjtu.edu.cn) is a professor at Xi’an Jiaotong University, China. His research interests include multimedia communication, wireless communication, network security, and network traffic. He received best paper awards at IEEE AIoT 2024, IEEE WCNC 2023, IEEE VTC-Fall 2023, IEEE ICC 2020, and other international conferences. He is an associate editor of the IEEE Internet of Things Journal and the IEEE Open Journal of the Computer Society, and the chair of the IEEE VTS Xi’an Chapter.
    YU Xinyi is currently pursuing her master’s degree at the School of Cyber Science and Engineering, Xi’an Jiaotong University, China. She received her bachelor’s degree in computer science and technology from Hefei University of Technology, China. Her research interests include privacy protection and data traceability within machine learning systems.
  • Supported by:
    NSFC (62402379, U22A2029, U24A20237)

Abstract:

The unprecedented scale of large models, such as large language models (LLMs) and text-to-image diffusion models, has raised critical concerns about the unauthorized use of copyrighted data during model training. These concerns have spurred a growing demand for dataset copyright auditing techniques, which aim to detect and verify potential infringements in the training data of commercial AI systems. This paper presents a survey of existing auditing solutions, categorizing them across key dimensions: data modality, model training stage, data overlap scenarios, and model access levels. We highlight major trends, including the prevalence of black-box auditing methods and the emphasis on fine-tuning rather than pre-training. Through an in-depth analysis of 12 representative works, we extract four key observations that reveal the limitations of current methods. Furthermore, we identify three open challenges and propose future directions for robust, multimodal, and scalable auditing solutions. Our findings underscore the urgent need to establish standardized benchmarks and develop auditing frameworks that are resilient to low watermark densities and applicable in diverse deployment settings.
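To make the black-box auditing idea concrete, the sketch below illustrates a simple loss-threshold membership inference check: candidate samples whose model loss falls well below the loss distribution of held-out reference data are flagged as likely training members. The function name, the one-standard-deviation threshold rule, and the toy loss values are illustrative assumptions for exposition, not the method of any specific work surveyed in this paper.

```python
import statistics

def audit_membership(candidate_losses, reference_losses):
    """Flag candidates whose loss is anomalously low relative to held-out data.

    A loss far below the reference distribution suggests the model has
    memorized the sample, i.e., it was likely part of the training set.
    The threshold (mean minus one standard deviation) is an illustrative
    choice; real audits calibrate it to control false-positive rates.
    """
    mu = statistics.mean(reference_losses)
    sigma = statistics.stdev(reference_losses)
    threshold = mu - sigma
    return [loss < threshold for loss in candidate_losses]

# Toy usage: the first two candidates show unusually low loss,
# consistent with having been seen during training.
flags = audit_membership(
    candidate_losses=[0.05, 0.10, 2.5],
    reference_losses=[2.0, 2.2, 1.9, 2.4, 2.1],
)
```

In practice, such per-sample signals are aggregated over many samples with statistical tests, since a single low loss can also arise from an easy example rather than memorization.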

Key words: dataset copyright auditing, large language models, diffusion models, multimodal auditing, membership inference