Escribe una PREreview

Two-Stage Fine-Tuning of Large Vision-Language Models with Hierarchical Prompting for Few-Shot Object Detection in Remote Sensing Images

por Yongqi Shi, Ruopeng Yang, Changsheng Yin, Yiwei Lu, Bo Huang, Yu Tao y Yihao Zhong

Publicada: 22 de diciembre de 2025
Servidor: Preprints.org
DOI: 10.20944/preprints202512.1915.v1

Few-shot object detection (FSOD) in high-resolution remote sensing (RS) imagery remains challenging due to scarce annotations, large intra-class variability, and high visual similarity between categories, which together limit the generalization ability of convolutional neural network (CNN)-based detectors. To address this issue, we explore leveraging large vision-language models (LVLMs) for FSOD in RS. We propose a two-stage, parameter-efficient fine-tuning framework with hierarchical prompting that adapts Qwen3-VL for object detection. In the first stage, low-rank adaptation (LoRA) modules are inserted into the vision and text encoders, and a Detection Transformer (DETR)-style detection head is trained on fully annotated base classes. In the second stage, the vision LoRA parameters are frozen, the text encoder is updated using K-shot novel-class samples, and the detection head is partially frozen, with selected components refined under a three-level hierarchical prompting scheme from superclasses to fine-grained class descriptions. To preserve base-class performance and reduce class confusion, we further introduce knowledge distillation and semantic consistency losses. Experiments on the DIOR and NWPU VHR-10.v2 datasets show that the proposed method consistently improves novel-class performance while maintaining competitive base-class accuracy and surpasses existing baselines, demonstrating the effectiveness of integrating hierarchical semantic reasoning into LVLM-based FSOD for RS imagery.

Puedes escribir una PREreview de Two-Stage Fine-Tuning of Large Vision-Language Models with Hierarchical Prompting for Few-Shot Object Detection in Remote Sensing Images. Una PREreview es una revisión de un preprint y puede variar desde unas pocas oraciones hasta un extenso informe, similar a un informe de revisión por pares organizado por una revista.

Antes de comenzar

Te pediremos que inicies sesión con tu ORCID iD. Si no tienes un iD, puedes crear uno.