An evaluation of a novel approach for clustering genes with dissimilar replicates

Cinar, Ozan; İYİGÜN, CEM; İLK DAĞ, ÖZLEM

doi:10.1080/03610918.2020.1839092

Cinar O., İYİGÜN C., İLK DAĞ Ö.

COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, vol.51, no.12, pp.7458-7471, 2022 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 51 Issue: 12
Publication Date: 2022
DOI: 10.1080/03610918.2020.1839092
Journal Name: COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Applied Science & Technology Source, Business Source Elite, Business Source Premier, CAB Abstracts, Compendex, Computer & Applied Sciences, Veterinary Science Database, zbMATH, Civil Engineering Abstracts
Page Numbers: pp.7458-7471
Keywords: cluster validation, Clustering, microarray gene expression, replication, short time-series
Affiliated with Middle East Technical University: Yes

Abstract

Clustering genes is a step in microarray studies that requires several considerations. First, expression levels may be collected as time-series, and this time dependency should be accounted for appropriately. Second, genes may behave differently in different biological replicates because of their genetic backgrounds. Highlighting such genes may deepen the study; however, it makes clustering more complex. Third, the large number of constant genes imposes a heavy computational burden. Finally, the number of clusters is not known in advance; therefore, a clustering algorithm should be able to recommend a meaningful number of clusters. In this study, we use a simulation study to evaluate a recently proposed clustering algorithm that promises to address these issues. The methodology treats each gene as a combination of its replications and accounts for the time dependency. It also computes cluster validation scores to suggest possible numbers of clusters. The results show that the methodology is able to find the clusters, highlight the genes with differences among the replications, separate the constant genes to reduce the computational burden, and suggest a meaningful number of clusters. Our results also show that traditional distance metrics are not efficient in clustering short time-series correctly.
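
The two preprocessing ideas named in the abstract, setting constant genes aside and treating each gene as a combination of its replications, can be illustrated with a small sketch. This is not the algorithm evaluated in the paper, whose details are not given here. It assumes that a gene counts as "constant" when every replicate stays within a tolerance of its own mean, and that time dependency is captured by the differences between consecutive time points. The function names, the tolerance, and the toy values are all hypothetical.

```python
from math import sqrt


def is_constant(replicates, tol=0.5):
    """Assumed rule: constant if every replicate stays within tol of its own mean."""
    for series in replicates:
        mean = sum(series) / len(series)
        if any(abs(x - mean) > tol for x in series):
            return False
    return True


def slopes(series):
    """Differences between consecutive time points, which preserve time order."""
    return [b - a for a, b in zip(series, series[1:])]


def gene_profile(replicates):
    """Concatenate the slope vectors of all replicates into one profile."""
    profile = []
    for series in replicates:
        profile.extend(slopes(series))
    return profile


def profile_distance(p, q):
    """Euclidean distance between two profiles of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy data (hypothetical): gene name -> list of replicate time series.
genes = {
    "g1": [[0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.0, 3.2]],  # rises in both replicates
    "g2": [[0.0, 1.0, 2.1, 2.9], [3.0, 2.0, 1.0, 0.0]],  # replicates disagree
    "g3": [[1.0, 1.1, 0.9, 1.0], [1.0, 1.0, 1.1, 0.9]],  # roughly constant
}

# Set constant genes aside before any clustering to reduce the workload.
active = {name: reps for name, reps in genes.items() if not is_constant(reps)}

# Represent each remaining gene by its concatenated replicate profile.
profiles = {name: gene_profile(reps) for name, reps in active.items()}
```

In this toy data, averaging the two replicates of `g2` would give a nearly flat series, whereas the concatenated profile keeps the disagreement between the replicates visible to whatever clustering step follows.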