Implicit security requirements classification with large language models using the OWASP application security verification standard: a shift-left approach

Gür, Yusuf; TAŞKAYA TEMİZEL, TUĞBA; GÜNEL KILIÇ, BANU

doi:10.1007/s10664-026-10854-y

Gür Y., TAŞKAYA TEMİZEL T., GÜNEL KILIÇ B.

Empirical Software Engineering, vol. 31, no. 5, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 31 Issue: 5
Publication Date: 2026
DOI: 10.1007/s10664-026-10854-y
Journal Name: Empirical Software Engineering
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC
Keywords: Cybersecurity requirement elicitation, LLM-based classification, OWASP ASVS-based requirement labeling, OWASP-based security requirement classification
Affiliated with Middle East Technical University: Yes

Abstract

Cybersecurity threats require early integration of security, starting from the requirements analysis phase of the Software Development Life Cycle (SDLC). However, security requirements in Software Requirements Specification (SRS) documents are often implicitly embedded, making their manual identification time-consuming, error-prone, and reliant on specialized expertise. Accurate classification of security requirements (SRs) is important for effective resource allocation and risk management in software development. Automated tools to extract implicit security requirements are lacking, largely due to the scarcity of large annotated datasets in Security Requirements Engineering (SRE). This paper proposes a data-driven methodology to automate the classification of implicit security requirements in SRS documents, supporting the early and systematic integration of security into software systems. We introduce a novel multi-label corpus, the Agency Security Requirements Dataset (ASRD), derived from 2,652 real-world requirement statements from six diverse SRS documents. Three cybersecurity experts annotated it using a high-granularity taxonomy based on the OWASP Application Security Verification Standard (ASVS) V2-V13 and the MATTER cycle annotation framework. Using this dataset, we empirically compare supervised fine-tuned BERT variants (such as SecureBERT) with general-purpose large language models (LLMs), including Gemma, GPT, DeepSeek, Meta Llama, and Gemini, under zero-shot and few-shot prompt engineering strategies. The results show that few-shot prompting with Gemini 2.0 achieves a macro-average F1 score of 0.941, directly comparable to the 0.942 score achieved by the fine-tuned BERT model.
This study culminates in two primary outcomes: first, the validation and publication of the ASRD, a high-granularity, multi-label dataset for implicit security requirements based on the OWASP ASVS V2-V13; and second, a direct comparison demonstrating that few-shot LLMs achieve competitive multi-label classification performance (Macro-F1 0.941), nearly equal to that of resource-intensive fine-tuned transformer models (Macro-F1 0.942). This confirms that LLMs represent a highly practical and resource-saving strategy for automating the identification of implicit security requirements in industrial SRS documents.
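
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a macro-average F1 score can be computed for multi-label predictions. It is not the authors' evaluation code. The function name `macro_f1`, the toy requirement labels, and the convention of scoring a label with no gold or predicted instances as 0.0 are all assumptions made for illustration; only the ASVS chapter identifiers (V2, V3, V4) come from the standard itself.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Compute F1 per label, then average with every label weighted equally."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label that never occurs in gold or pred scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: each set holds the ASVS chapters assigned to one
# requirement statement (an empty set means "no security label").
labels = ["V2", "V3", "V4"]
gold = [{"V2"}, {"V2", "V3"}, {"V4"}, set()]
pred = [{"V2"}, {"V2"}, {"V3", "V4"}, set()]

print(f"macro-F1 = {macro_f1(gold, pred, labels):.3f}")
```

Because each label contributes equally to the average, a rarely used category that is classified poorly lowers the score as much as a frequent one, which is why macro averaging is a common choice for imbalanced multi-label taxonomies.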