Two-Stage Fine-Tuning of Large Vision-Language Models with Hierarchical Prompting for Few-Shot Object Detection in Remote Sensing Images
- Publicada
- Servidor
- Preprints.org
- DOI
- 10.20944/preprints202512.1915.v1
Few-shot object detection (FSOD) in high-resolution remote sensing (RS) imagery remains challenging due to scarce annotations, large intra-class variability, and high visual similarity between categories, which together limit the generalization ability of convolutional neural network (CNN)-based detectors. To address this issue, we explore leveraging large vision-language models (LVLMs) for FSOD in RS. We propose a two-stage, parameter-efficient fine-tuning framework with hierarchical prompting that adapts Qwen3-VL for object detection. In the first stage, low-rank adaptation (LoRA) modules are inserted into the vision and text encoders, and a Detection Transformer (DETR)-style detection head is trained on fully annotated base classes. In the second stage, the vision LoRA parameters are frozen, the text encoder is updated using K-shot novel-class samples, and the detection head is partially frozen, with selected components refined under a three-level hierarchical prompting scheme from superclasses to fine-grained class descriptions. To preserve base-class performance and reduce class confusion, we further introduce knowledge distillation and semantic consistency losses. Experiments on the DIOR and NWPU VHR-10.v2 datasets show that the proposed method consistently improves novel-class performance while maintaining competitive base-class accuracy and surpasses existing baselines, demonstrating the effectiveness of integrating hierarchical semantic reasoning into LVLM-based FSOD for RS imagery.