An evaluation of a novel approach for clustering genes with dissimilar replicates


Cinar O., İYİGÜN C., İLK DAĞ Ö.

COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, vol.51, no.12, pp.7458-7471, 2022 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 51 Issue: 12
  • Publication Date: 2022
  • DOI: 10.1080/03610918.2020.1839092
  • Journal Name: COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Applied Science & Technology Source, Business Source Elite, Business Source Premier, CAB Abstracts, Compendex, Computer & Applied Sciences, Veterinary Science Database, zbMATH, Civil Engineering Abstracts
  • Page Numbers: pp.7458-7471
  • Keywords: cluster validation, Clustering, microarray gene expression, replication, short time-series
  • Affiliated with Middle East Technical University: Yes

Abstract

Clustering genes is a step in microarray studies that demands several considerations. First, expression levels may be collected as time-series, and this time dependency should be accounted for appropriately. Second, genes may behave differently in different biological replicates owing to their genetic backgrounds. Highlighting such genes may deepen the study; however, it introduces further complexity for clustering. Third, a large number of constant genes imposes a heavy computational burden. Finally, the number of clusters is not known in advance; therefore, a clustering algorithm should be able to recommend a meaningful number of clusters. In this study, we use a simulation study to evaluate a recently proposed clustering algorithm that promises to address these issues. The methodology treats each gene as a combination of its replications and accounts for the time dependency. Furthermore, it computes cluster validation scores to suggest possible numbers of clusters. Results show that the methodology is able to find the clusters and highlight the genes with differences among the replications, separate the constant genes to reduce the computational burden, and suggest a meaningful number of clusters. Our results also show that traditional distance metrics are not effective at clustering short time-series correctly.
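
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general workflow it describes, not the evaluated method. It assumes a NumPy array of shape (genes, replicates, time points) and swaps in generic stand-ins: a range threshold to set aside constant genes, concatenated first differences of each replicate as a simple way to respect time order, k-means with farthest-first initialization as the clusterer, and the silhouette width as the validation score. All function names and the `tol` parameter are hypothetical.

```python
import numpy as np

def split_constant_genes(expr, tol=0.5):
    """Flag genes whose expression range stays below tol in every replicate.
    expr has shape (genes, replicates, timepoints); tol is an assumed threshold."""
    ranges = expr.max(axis=2) - expr.min(axis=2)   # shape (genes, replicates)
    return (ranges < tol).all(axis=1)

def gene_features(expr):
    """Describe each gene by the first differences of all its replicates,
    concatenated, so replicates with different shapes remain visible."""
    slopes = np.diff(expr, axis=2)                 # shape (genes, replicates, T-1)
    return slopes.reshape(expr.shape[0], -1)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-first initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    """Mean silhouette width; returns -1.0 if fewer than two clusters are used."""
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return -1.0
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():                         # singleton cluster scores 0
            scores.append(0.0)
            continue
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def validation_scores(X, candidates):
    """Silhouette width for each candidate number of clusters."""
    return {k: silhouette(X, kmeans(X, k)) for k in candidates}
```

In such a sketch, the constant genes would be set aside first (`expr[~split_constant_genes(expr)]`), and the candidate with the highest score from `validation_scores` would be read as a suggestion for the number of clusters rather than a definitive answer.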