Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection

Altay B., Dokeroglu T., COŞAR A.

SOFT COMPUTING, vol.23, no.12, pp.4177-4191, 2019 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 23 Issue: 12
  • Publication Date: 2019
  • Doi Number: 10.1007/s00500-018-3066-4
  • Journal Name: SOFT COMPUTING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Page Numbers: pp.4177-4191
  • Keywords: Malicious, Webpage, Classification, SVM, Maximum entropy, Extreme learning machines, Keyword density
  • Middle East Technical University Affiliated: Yes


Conventional malicious webpage detection methods rely on blacklists to decide whether a webpage is malicious. These blacklists are generally maintained by third-party organizations. However, keeping a list of all malicious websites and updating it regularly is not an easy task, given the frequently changing and rapidly growing number of webpages on the web. In this study, we propose a novel context-sensitive and keyword density-based method for the classification of webpages using three supervised machine learning techniques: support vector machine, maximum entropy, and extreme learning machine. Features (words) of webpages are obtained from their HTML contents, and information is extracted using three feature extraction methods: existence of words, keyword frequencies, and keyword density. The performance of the proposed machine learning models is evaluated on a benchmark data set consisting of one hundred thousand webpages. Experimental results show that the proposed method can detect malicious webpages with an accuracy of 98.24%, which is a significant improvement compared to state-of-the-art approaches.
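
The three feature extraction methods named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so the code below makes several assumptions: keyword density is taken to be a keyword's count divided by the total number of tokens on the page, tokens are lowercase alphanumeric runs from the visible text, and the names `tokenize_html` and `extract_features` and the sample vocabulary are hypothetical. It uses only the Python standard library and is not the authors' implementation.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize_html(html):
    """Return lowercase alphanumeric tokens from the page's visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


def extract_features(html, vocabulary):
    """Build three feature vectors over a fixed keyword vocabulary.

    existence: 1 if the keyword occurs on the page, else 0
    frequency: raw occurrence count of the keyword
    density:   count / total tokens (an assumed definition, see lead-in)
    """
    tokens = tokenize_html(html)
    counts = Counter(tokens)
    total = len(tokens)
    existence = [1 if counts[w] > 0 else 0 for w in vocabulary]
    frequency = [counts[w] for w in vocabulary]
    density = [counts[w] / total if total else 0.0 for w in vocabulary]
    return existence, frequency, density


if __name__ == "__main__":
    page = "<p>Free prize! Claim your free prize now.</p><script>var free = 1;</script>"
    print(extract_features(page, ["free", "prize", "login"]))
```

Any of the three vectors could then be passed to a supervised classifier such as a support vector machine. Normalizing by page length, as the density variant does, keeps long and short pages comparable, which raw frequencies do not.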