Prediction of Non-coding Driver Mutations Using Ensemble Learning

Basharat S., Huseynov R., Kilinc H. H., OTLU SARITAŞ B.

2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Ankara, Türkiye, 5 December 2023, pp. 2960-2966, (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
DOI Number: 10.1109/bibm58861.2023.10386056
City of Publication: Ankara
Country of Publication: Türkiye
Pages: pp. 2960-2966
Keywords: Boosting, Driver Mutations, Ensemble Learning, Explainable AI, Long-range Interactions, Non-coding Mutations
Affiliated with Middle East Technical University: Yes

Abstract

Driver coding mutations are extensively studied and are frequently detected through their deleterious amino acid changes, which affect protein function. Non-coding mutations, however, require further analysis and experimental validation before they can be classified as drivers. Here, we employ the XGBoost (eXtreme Gradient Boosting) algorithm to predict driver non-coding mutations based on novel long-range interaction features and engineered transcription factor (TF) binding site features, augmented with features from existing annotation and effect prediction tools. For the long-range interaction features, we capture the frequency and spread of interacting regions that overlap the non-coding mutation of interest. To this end, we use self-balancing trees to find overlaps within chromatin loop files and store the interacting regions as separate tree structures. For the engineered TF binding features, we train TF models with the stochastic gradient descent (SGD) algorithm to predict loss and gain of function at TF binding sites, giving more weight to non-coding mutations that affect TF binding affinities. We also include features from existing annotation and effect prediction tools, some of which rely on deep learning methods; these features relate to splicing effect, number of associated protein products, variant consequences, biotypes, and others. We train our gradient boosting model on the resulting aggregated dataset of known driver and non-driver non-coding mutations to distinguish driver from passenger non-coding mutations. We then take non-coding driver mutations reported in other state-of-the-art studies, annotate them in the same way, and pass them through our model for comparison. Furthermore, we elaborate on the results using explainable AI methodologies.
Our results show above-average performance on unseen test data and suggest that, with our annotations and a gradient boosting tree model trained on the resulting data, driver and passenger non-coding mutations can be distinguished with relatively high accuracy.
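
The long-range interaction features described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several points are assumptions made only for illustration: the BEDPE-like loop layout, the half-open interval convention, the function names `build_anchor_index` and `interaction_features`, and the definitions of "frequency" (number of loop anchors overlapping the mutation) and "spread" (span covered by the same-chromosome partner regions). For brevity the sketch uses a sorted list with binary search in place of the self-balancing trees mentioned in the abstract.

```python
from bisect import bisect_right
from collections import defaultdict


def build_anchor_index(loops):
    """Index loop anchors per chromosome, sorted by start coordinate.

    Each loop is (chrom1, start1, end1, chrom2, start2, end2), a BEDPE-like
    layout assumed here for illustration. Intervals are half-open [start, end).
    """
    by_chrom = defaultdict(list)
    for c1, s1, e1, c2, s2, e2 in loops:
        # Store each anchor together with the region it interacts with.
        by_chrom[c1].append((s1, e1, (c2, s2, e2)))
        by_chrom[c2].append((s2, e2, (c1, s1, e1)))
    index = {}
    for chrom, anchors in by_chrom.items():
        anchors.sort(key=lambda a: a[0])
        starts = [a[0] for a in anchors]
        index[chrom] = (starts, anchors)
    return index


def interaction_features(index, chrom, pos):
    """Return hypothetical 'frequency' and 'spread' features for one mutation.

    frequency: number of loop anchors that contain the position.
    spread:    distance between the leftmost start and rightmost end of the
               partner regions on the same chromosome (0 if there are none).
    """
    starts, anchors = index.get(chrom, ([], []))
    # Only anchors starting at or before pos can contain it.
    hi = bisect_right(starts, pos)
    partners = [p for s, e, p in anchors[:hi] if pos < e]
    same_chrom = [(s, e) for c, s, e in partners if c == chrom]
    if same_chrom:
        spread = max(e for _, e in same_chrom) - min(s for s, _ in same_chrom)
    else:
        spread = 0
    return {"frequency": len(partners), "spread": spread}


# Toy loops (made-up coordinates, not real data).
loops = [
    ("chr1", 100, 200, "chr1", 1000, 1100),
    ("chr1", 150, 250, "chr1", 5000, 5100),
    ("chr1", 300, 400, "chr2", 10, 20),
]
index = build_anchor_index(loops)
print(interaction_features(index, "chr1", 160))
# {'frequency': 2, 'spread': 4100}
```

The filter after the binary search is linear in the number of candidate anchors, which is acceptable for a sketch; a balanced interval tree, as the abstract describes, would keep overlap queries efficient on large chromatin loop files.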