Biochemical Engineering Journal, vol.198, 2023 (SCI-Expanded)
This study aims to model the high-affinity transcription factor (TF) binding sites (TFBSs) in Saccharomyces cerevisiae promoters, using high-throughput experimental TF binding data to train, test, and validate the models. We introduce MachineTFBS, a machine learning-based TFBS prediction tool consisting of a library of machine learning models. The library is generated by treating each TF's protein binding microarray (PBM) dataset as an individual optimization problem and selectively generating the first-best performing models. Since the modeling quality of machine learning methods varies across TFs, we tested Random Forest, eXtreme Gradient Boosting, and Deep Learning models of up to depth 5. We extracted 274 high-resolution PBM datasets covering 159 S. cerevisiae TFs from the UniProbe database. We designed an algorithm for the greedy selection of the first-best models and feature combinations from five feature extraction functions for each PBM dataset. The optimal subset of the functions was identified using a feature elimination algorithm. For each PBM dataset, the feature elimination algorithm advances in successive cycles until the elimination of any feature(s) improves model performance or the feature set size is reduced to 1; it then reports the optimal subset. MachineTFBS predicts TFBSs with an average Matthews Correlation Coefficient score of 0.873.
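For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Matthews Correlation Coefficient (MCC) is computed from the confusion-matrix counts of a binary classifier. It uses the standard MCC definition and is not taken from the MachineTFBS code; the function name `mcc` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Standard definition:
        (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    The result ranges from -1 (total disagreement) to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts for one classifier (not from the paper).
example_score = mcc(tp=90, tn=85, fp=15, fn=10)
print(f"MCC = {example_score:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it stays informative when bound and unbound sequences are unevenly represented, which is one common reason to prefer it over plain accuracy.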