SAMChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Small Scale Remote Sensing

Köksal, AYBORA; Alatan, ABDULLAH

doi:10.1109/jstars.2025.3637115

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 19, pp. 795-804, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 19
Publication Date: 2026
DOI: 10.1109/jstars.2025.3637115
Journal Name: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, Geobase, INSPEC, Directory of Open Access Journals
Pages: pp. 795-804
Keywords: Aerial image analysis, chain-of-thought (CoT) reasoning, domain adaptation, group relative policy optimization (GRPO), multimodal large language models (MLLMs), remote sensing (RS)
Affiliated with Middle East Technical University: Yes

Abstract

Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding and generating text-image content. However, their effectiveness in specialized domains, particularly those requiring resource-efficient and domain-specific adaptations, has remained limited. In this work, a lightweight multimodal language model termed SAMChat is introduced, specifically adapted to analyze remote sensing imagery of secluded areas, including challenging missile launch sites. A new dataset, SAMData, was compiled by verifying hundreds of aerial images through expert review, with subtle military installations highlighted via detailed captions. A 2B-parameter open-source MLLM was then fine-tuned under supervision with chain-of-thought (CoT) reasoning annotations, enabling more accurate and interpretable explanations. Additionally, Group Relative Policy Optimization (GRPO) was leveraged to enhance the model's ability to detect critical domain-specific cues, such as defensive layouts and key military structures, while minimizing false positives on civilian scenes. Empirical evaluations show that SAMChat significantly outperforms both larger, general-purpose multimodal models and existing remote sensing-adapted approaches on open-ended captioning and classification metrics. Over 80% recall and 98% precision were achieved on the newly proposed SAMData benchmark, underscoring the potency of targeted fine-tuning and reinforcement learning in specialized real-world applications. Code and dataset will be available upon acceptance.
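
The abstract names GRPO without describing it. As a rough illustration of the general idea only, the core step of GRPO scores each sampled response relative to the other responses drawn for the same input, rather than against a learned value function. The sketch below is a minimal, generic version of that normalization; the function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the reward values are all assumptions made for illustration and are not taken from the SAMChat paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per response sampled for the same
    input. `eps` (an assumed small constant) avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four captions sampled for one aerial image,
# e.g. 1.0 when the predicted class matches the label and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average get a positive advantage and are reinforced, while below-average ones are discouraged. When all responses in a group receive the same reward, every advantage is zero and that group contributes no learning signal.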