Online embedding and clustering of evolving data streams


Zubaroglu A., Atalay M. V.

STATISTICAL ANALYSIS AND DATA MINING, vol.16, no.1, pp.29-44, 2023 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 16, Issue: 1
  • Publication Date: 2023
  • DOI: 10.1002/sam.11590
  • Journal Name: STATISTICAL ANALYSIS AND DATA MINING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Compendex, INSPEC, zbMATH
  • Pages: pp.29-44
  • Keywords: data streams, drift adaptation, drift detection, evolving data streams, stream clustering
  • Affiliated with Middle East Technical University: Yes

Abstract

The number of connected devices is steadily increasing, and this trend is expected to continue in the near future. Connected devices continuously generate data streams, and these streams are often high dimensional and may contain concept drift. Clustering is one of the most suitable methods for real-time data stream processing, since it can be applied with less prior information about the data. In addition, data embedding makes the visualization of high-dimensional data possible and may simplify the clustering process. Several data stream clustering algorithms exist in the literature; however, no data stream embedding method exists. Uniform Manifold Approximation and Projection (UMAP) is a data embedding algorithm that is suitable for stationary (stable) data streams, though it cannot adapt to concept drift. In this study, we describe a novel method, EmCStream, that applies UMAP to evolving (nonstationary) data streams, detects and adapts to concept drift, and clusters the embedded data instances using a distance-based or partitioning-based clustering algorithm. We have evaluated EmCStream against state-of-the-art stream clustering algorithms using both synthetic and real data streams containing concept drift. EmCStream outperforms DenStream and CluStream in terms of clustering quality on both synthetic and real evolving data streams.
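
The abstract names three stages (embed, detect and adapt to drift, cluster the embedded instances) but does not give their details. The following is only a minimal sketch of how such a loop could be wired together, and it is not the EmCStream implementation. Several pieces are assumptions made for illustration: the stream is processed in fixed windows, the embedding step is an identity placeholder standing in for UMAP so that the example runs without extra dependencies, k-means serves as the partitioning-based clusterer, and the drift rule (flag drift when the mean distance to the nearest centroid exceeds `drift_factor` times its value at the last fit) is hypothetical. All function and parameter names are invented for this sketch.

```python
import math
import random


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def mean_nearest_distance(points, centroids):
    """Average distance from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)


def fit_embedder(window):
    """Placeholder embedding (identity), NOT UMAP.

    ASSUMPTION: with umap-learn installed, one could instead fit
    umap.UMAP(n_components=2) on the window and return its transform method.
    """
    return lambda pts: [tuple(p) for p in pts]


def process_stream(windows, k, drift_factor=2.0):
    """Return the indices of windows where embedding and clusters were (re)built."""
    transform = centroids = baseline = None
    refits = []
    for i, window in enumerate(windows):
        if transform is not None:
            error = mean_nearest_distance(transform(window), centroids)
            if error <= drift_factor * baseline:
                continue  # current model still fits the stream; no drift flagged
        # First window, or the hypothetical drift rule fired: re-embed and re-cluster.
        transform = fit_embedder(window)
        embedded = transform(window)
        centroids = kmeans(embedded, k)
        baseline = mean_nearest_distance(embedded, centroids)
        refits.append(i)
    return refits


def make_window(rng, centers, n=50):
    """Synthetic window: n Gaussian points around each center."""
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in centers for _ in range(n)]


rng = random.Random(0)
stable = [(0.0, 0.0), (10.0, 10.0)]
drifted = [(50.0, 50.0), (60.0, 60.0)]
windows = [make_window(rng, stable), make_window(rng, stable), make_window(rng, drifted)]
refits = process_stream(windows, k=2)
print(refits)
```

In this toy stream the first two windows come from the same two clusters and the third is shifted far away, so a refit would be expected only at the first and third windows. Keeping the embedder behind `fit_embedder` means a real embedding can be swapped in without touching the drift check or the clustering step.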