ZTE Communications ›› 2025, Vol. 23 ›› Issue (4): 77-85. DOI: 10.12142/ZTECOM.202504009

• Research Papers •

Empowering Grounding DINO with MoE: An End-to-End Framework for Cross-Domain Few-Shot Object Detection

DONG Xiugang, ZHANG Kaijin, NONG Qingpeng, JU Minhan, TU Yaofeng

  1. Nanjing R&D Center, ZTE Corporation, Nanjing 210012, China
  • Received: 2025-08-15 Online: 2025-12-22 Published: 2025-12-22
  • About author: DONG Xiugang is an AI engineer at ZTE Corporation. His work primarily focuses on the research and development of large computer vision and video understanding models. His research interests span multiple areas, including open-set object detection, semantic segmentation, video spatio-temporal localization, and general video understanding.
    ZHANG Kaijin is an AI engineer at ZTE Corporation, specializing in the research and development of large computer vision models. His research interests include open-set object detection, semantic segmentation, and keypoint detection.
    NONG Qingpeng is an AI engineer at ZTE Corporation, specializing in the research and development of large computer vision models. His research interests include open-set object detection, semantic segmentation, and keypoint detection.
    JU Minhan received his bachelor's degree in data science and big data technology from Xi'an Jiaotong-Liverpool University, China. He is currently pursuing a master's degree in data science at the University of Sydney, Australia, and is an intern at ZTE Corporation. His research interests include machine learning, data mining, and big data system modeling.
    TU Yaofeng (tu.yaofeng@zte.com.cn) is the Deputy Dean of the Central Research Institute of ZTE Corporation. He holds a PhD and is a senior researcher whose work focuses on big data, databases, AI, large models, and cloud computing.

Abstract:

Open-set object detectors, as exemplified by Grounding DINO, have attracted significant attention for their strong performance on in-domain datasets such as Common Objects in Context (COCO) after only few-shot fine-tuning. However, their generalization in cross-domain scenarios remains substantially inferior to their in-domain few-shot performance. Prior work on fine-tuning Grounding DINO for cross-domain few-shot object detection (CD-FSOD) has focused primarily on data augmentation, leaving broader systemic optimizations unexplored. To bridge this gap, we propose a comprehensive end-to-end fine-tuning framework designed specifically to optimize Grounding DINO for cross-domain few-shot scenarios. In addition, we propose Mixture-of-Experts (MoE)-Grounding DINO, a novel architecture that integrates MoE to enhance adaptability in cross-domain settings. Our approach improves mean Average Precision (mAP) by 15.4 over the Grounding DINO baseline on the Roboflow20-VL benchmark, establishing a new state of the art for CD-FSOD. The source code and models will be made available upon publication.

Key words: cross-domain few-shot object detection, Grounding DINO, Mixture-of-Experts, open-set object detection, pseudo-labeling