Dataset Copyright Auditing for Large Models: Fundamentals, Open Problems, and Future Directions
DU Linkang, SU Zhou, YU Xinyi
ZTE Communications    2025, 23 (3): 38-47.   DOI: 10.12142/ZTECOM.202503005
Abstract

The unprecedented scale of large models, such as large language models (LLMs) and text-to-image diffusion models, has raised critical concerns about the unauthorized use of copyrighted data during model training. These concerns have spurred a growing demand for dataset copyright auditing techniques, which aim to detect and verify potential infringements in the training data of commercial AI systems. This paper presents a survey of existing auditing solutions, categorizing them across key dimensions: data modality, model training stage, data overlap scenarios, and model access levels. We highlight major trends, including the prevalence of black-box auditing methods and the emphasis on fine-tuning rather than pre-training. Through an in-depth analysis of 12 representative works, we extract four key observations that reveal the limitations of current methods. Furthermore, we identify three open challenges and propose future directions for robust, multimodal, and scalable auditing solutions. Our findings underscore the urgent need to establish standardized benchmarks and develop auditing frameworks that are resilient to low watermark densities and applicable in diverse deployment settings.
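To make the black-box auditing setting concrete, the following is a minimal, hypothetical sketch of a watermark-based audit: the data owner plants rare "canary" prompt-completion pairs in the dataset before release, then queries the suspect model and flags it if canary completions appear far more often than chance. The function names, canary format, and threshold are illustrative assumptions, not a method from any of the surveyed works; the two "models" are stand-in callables, not real LLMs.

```python
# Hypothetical sketch of a black-box dataset copyright audit using
# planted canary watermarks. All names and the flagging threshold are
# assumptions for illustration only.

# Canaries: rare prompt/completion pairs planted in the released dataset.
CANARIES = [(f"zx-canary-{i}", f"secret-{i}") for i in range(20)]

def audit(model, canaries, null_rate=0.01):
    """Query the suspect model on each canary prompt and return the
    hit rate plus a flag indicating whether it is well above chance."""
    hits = sum(model(prompt) == completion for prompt, completion in canaries)
    rate = hits / len(canaries)
    # Flag infringement if the hit rate far exceeds the chance level a
    # model that never saw the canaries could achieve (assumed threshold).
    return rate, rate > null_rate * 10

# Simulated suspect models: one "trained" on the watermarked data
# (it memorized the canaries), one that never saw them.
memorized = dict(CANARIES)
trained_model = lambda prompt: memorized.get(prompt, "")
clean_model = lambda prompt: ""

rate_trained, flag_trained = audit(trained_model, CANARIES)  # high rate, flagged
rate_clean, flag_clean = audit(clean_model, CANARIES)        # zero rate, not flagged
```

The sketch also illustrates the low-watermark-density challenge noted above: with only a handful of canaries relative to a pre-training corpus, the hit rate of a genuinely infringing model shrinks toward the chance level, making the statistical test far weaker than in this toy setting.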
