CRoM and HuspExt: Improving Efficiency of High Utility Sequential Pattern Extraction

Alkan, Oznur; KARAGÖZ, PINAR

doi:10.1109/tkde.2015.2420557

Alkan O. K., KARAGÖZ P.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol.27, no.10, pp.2645-2657, 2015 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 27 Issue: 10
Publication Date: 2015
DOI: 10.1109/tkde.2015.2420557
Journal Name: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Pages: pp.2645-2657
Keywords: High utility sequential pattern mining, efficiency, candidate pattern pruning, sequential pattern mining, ALGORITHM
Affiliated with Middle East Technical University: Yes

Abstract

High utility sequential pattern mining is regarded as an important research problem, and a number of algorithms have been proposed for it. The main challenge is that the search space is large, so the efficiency of a solution is directly affected by the degree to which it can eliminate candidate patterns. The efficiency of any high utility sequential pattern mining solution therefore depends on its ability to reduce this large search space and, as a result, lower the computational complexity of calculating the utilities of the candidate patterns. In this paper, we propose efficient data structures and a pruning technique based on the Cumulated Rest of Match (CRoM) upper bound. By defining a tighter upper bound on the utility of the candidates, CRoM allows more conservative pruning before candidate pattern generation than existing techniques. In addition, we have developed an efficient algorithm, High Utility Sequential Pattern Extraction (HuspExt), which calculates the utilities of child patterns from those of their parents. Substantial experiments on both synthetic and real datasets from different domains show that the proposed solution efficiently discovers high utility sequential patterns from large-scale datasets with different data characteristics under low utility thresholds.
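
The abstract does not give the definition of the CRoM bound, so the following is only a minimal sketch of the general idea of upper-bound pruning in high utility sequential pattern mining. It uses a generic "match utility plus rest of sequence" bound, which is not the CRoM bound from the paper. The toy data model (single-item elements with non-negative utilities) and all names (`matches`, `utility`, `rest_upper_bound`, `DB`) are assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

# Assumed toy model: a sequence is an ordered list of (item, utility) pairs,
# and a pattern is a tuple of items matched as a subsequence.
Sequence = List[Tuple[str, int]]
Pattern = Tuple[str, ...]


def matches(seq: Sequence, pattern: Pattern, start: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (utility, last_index) for every occurrence of pattern in seq[start:]."""
    if not pattern:
        yield 0, start - 1
        return
    for i in range(start, len(seq)):
        item, util = seq[i]
        if item == pattern[0]:
            for tail_util, last in matches(seq, pattern[1:], i + 1):
                yield util + tail_util, last


def utility(db: List[Sequence], pattern: Pattern) -> int:
    """Sum over sequences of the best (maximum-utility) occurrence of the pattern."""
    return sum(max((u for u, _ in matches(seq, pattern)), default=0) for seq in db)


def rest_upper_bound(db: List[Sequence], pattern: Pattern) -> int:
    """Generic bound (NOT CRoM): best occurrence utility plus everything after it.

    With non-negative utilities, no pattern obtained by appending items to
    `pattern` can have a utility above this value.
    """
    total = 0
    for seq in db:
        # suffix[i] holds the total utility of seq[i:].
        suffix = [0] * (len(seq) + 1)
        for i in range(len(seq) - 1, -1, -1):
            suffix[i] = suffix[i + 1] + seq[i][1]
        total += max((u + suffix[last + 1] for u, last in matches(seq, pattern)), default=0)
    return total


# Hypothetical two-sequence database used only for demonstration.
DB: List[Sequence] = [
    [("a", 2), ("b", 3), ("a", 4), ("c", 1)],
    [("b", 1), ("a", 5), ("c", 2)],
]

MIN_UTIL = 11
# If the bound falls below the threshold, the whole subtree of extensions
# of ("a", "b") can be skipped without computing their utilities.
can_prune = rest_upper_bound(DB, ("a", "b")) < MIN_UTIL
```

The tighter such a bound is, the more candidate subtrees fall below the threshold and are never generated, which is the efficiency lever the abstract attributes to CRoM.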