Malicious Code Detection: Run Trace Output Analysis by LSTM

ACARTÜRK, CENGİZ; ŞIRLANCI, MELİH; GÜRKAN BALIKÇIOĞLU, PINAR; Demirci, Deniz; Sahin, Nazenin; ACAR KÜÇÜK, ÖZGE

doi:10.1109/access.2021.3049200

ACARTÜRK C., ŞIRLANCI M., GÜRKAN BALIKÇIOĞLU P., Demirci D., Sahin N., ACAR KÜÇÜK Ö.

IEEE ACCESS, vol. 9, pp. 9625-9635, 2021 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 9
Publication Date: 2021
DOI: 10.1109/access.2021.3049200
Journal Name: IEEE ACCESS
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
Pages: pp. 9625-9635
Keywords: Malware, Machine learning, Feature extraction, Static analysis, Semantics, Operating systems, Natural language processing, Dynamic analysis, LSTM, malware detection, natural language processing, run trace
Affiliated with Middle East Technical University: Yes

Abstract

Malicious software threats and their detection have been gaining importance as a subdomain of information security due to the expansion of ICT applications in daily settings. A major challenge in designing and developing anti-malware systems is detection coverage, particularly the development of dynamic analysis methods that can efficiently detect polymorphic and metamorphic malware. In the present study, we propose a methodological framework for detecting malicious code by analyzing run trace outputs with Long Short-Term Memory (LSTM) networks. We developed models of run traces of malicious and benign Portable Executable (PE) files. We created our dataset from run trace outputs obtained from dynamic analysis of PE files. The first dataset represented each run trace as a sequence of instructions and was called the Instruction as a Sequence Model (ISM). By splitting the first dataset into basic blocks, we obtained the second dataset, called the Basic Block as a Sequence Model (BSM). The experiments showed that the ISM achieved an accuracy of 87.51% and a false positive rate of 18.34%, while the BSM achieved an accuracy of 99.26% and a false positive rate of 2.62%.
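
The step from an instruction sequence to a basic-block sequence can be illustrated with a minimal sketch. It assumes a run trace is a list of instruction strings and that a block ends after each control-transfer instruction. The trace format, the mnemonic set `BLOCK_ENDERS`, and the function name `split_into_basic_blocks` are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: the trace format and the set of block-ending
# mnemonics are assumptions for illustration, not the authors' method.
from typing import List

# Assumed x86 control-transfer mnemonics that terminate a basic block.
BLOCK_ENDERS = {"jmp", "je", "jne", "jz", "jnz", "ja", "jb", "jg", "jl",
                "call", "ret", "loop"}


def split_into_basic_blocks(trace: List[str]) -> List[List[str]]:
    """Group a run trace into blocks, closing a block after each control transfer."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for instruction in trace:
        parts = instruction.split()
        if not parts:
            continue  # skip empty trace lines
        current.append(instruction)
        if parts[0].lower() in BLOCK_ENDERS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)  # keep a trailing block with no terminator
    return blocks


# Made-up example trace (instruction-sequence view).
trace = ["push ebp", "mov ebp, esp", "call 0x401000",
         "xor eax, eax", "jne 0x401020", "pop ebp", "ret"]
blocks = split_into_basic_blocks(trace)  # basic-block view
```

Under these assumptions, each inner list could then be treated as a single token for a sequence model, which gives a shorter input sequence than the per-instruction view.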