MachineTFBS: Motif-based method to predict transcription factor binding sites with first-best models from machine learning library


YAMAN O. U., ÇALIK P.

Biochemical Engineering Journal, vol.198, 2023 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 198
  • Publication Date: 2023
  • DOI: 10.1016/j.bej.2023.108990
  • Journal Name: Biochemical Engineering Journal
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Aquatic Science & Fisheries Abstracts (ASFA), BIOSIS, Biotechnology Research Abstracts, CAB Abstracts, Chimica, Compendex, EMBASE, Food Science & Technology Abstracts, INSPEC, Veterinary Science Database
  • Keywords: Deep learning, eXtreme Gradient Boosting, Machine learning, MachineTFBS, Transcription factor, Transcription factor binding site prediction, Yeast
  • Middle East Technical University Affiliated: Yes

Abstract

This study aims to model the high-affinity transcription factor (TF) binding sites (TFBSs) in Saccharomyces cerevisiae promoters using high-throughput experimental TF binding data to train, test, and validate models. We introduce MachineTFBS, a machine learning-based TFBS prediction tool consisting of a library of machine learning models generated by treating each TF's protein binding microarray (PBM) dataset as an individual optimization problem and selectively generating the first-best performing models. Since the modeling quality of machine learning methods varies for different TFs, we tested Random Forest, eXtreme Gradient Boosting, and Deep Learning models of depth up to 5. We extracted 274 high-resolution PBM datasets of 159 S. cerevisiae TFs from the UniProbe database. We designed an algorithm for the greedy selection of the first-best models and feature combinations from five feature extraction functions for each PBM dataset. The optimal subset of the functions was identified using a feature elimination algorithm. For each PBM dataset, the feature elimination algorithm advances in successive cycles until eliminating any feature(s) no longer improves model performance or the feature set size is reduced to 1; it then reports the optimal subset. MachineTFBS predicts TFBSs with an average Matthews Correlation Coefficient score of 0.873.
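The abstract does not give implementation details, so the following is only a minimal sketch of how a greedy feature elimination loop scored by the Matthews Correlation Coefficient could look. It assumes that one feature is dropped per cycle and that the loop stops once no single removal raises the score or only one feature remains. The names `mcc`, `eliminate_features`, `toy_score`, and the feature labels `f1` to `f5` are hypothetical and do not come from MachineTFBS; in practice the scoring function would train and evaluate a model on the chosen feature subset.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels encoded as 0/1."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when the denominator is 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def eliminate_features(features, score):
    """Greedy backward elimination (assumed variant: one feature per cycle).

    `score` maps a frozenset of feature names to a performance value,
    for example the MCC of a model trained on that subset.
    """
    current = frozenset(features)
    best = score(current)
    while len(current) > 1:
        # Score every subset obtained by dropping exactly one feature.
        candidates = [(score(current - {f}), f) for f in sorted(current)]
        top_score, drop = max(candidates)
        if top_score <= best:
            break  # no single removal improves performance
        current = current - {drop}
        best = top_score
    return current, best


# Toy scoring function (hypothetical): f1-f3 help, f4-f5 act as noise.
USEFUL = {"f1", "f2", "f3"}


def toy_score(subset):
    helpful = len(subset & USEFUL)
    noisy = len(subset - USEFUL)
    return 0.5 + 0.1 * helpful - 0.05 * noisy


selected, best_score = eliminate_features(["f1", "f2", "f3", "f4", "f5"], toy_score)
print(sorted(selected), best_score)
```

With the toy scoring function, the loop discards the two noise features and then stops, because removing any of the remaining three lowers the score. Swapping `toy_score` for a function that fits a model and returns `mcc` on held-out data would give a loop closer in spirit to the procedure the abstract describes.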