Predicting the truck factor in a software repository using machine learning

El Cheikh Ammar, Ahmed; ERASLAN, ŞÜKRÜ; YILMAZ, YELİZ

doi:10.1016/j.infsof.2025.107765

El Cheikh Ammar A., ERASLAN Ş., YILMAZ Y.

Information and Software Technology, vol. 184, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 184
Publication Date: 2025
DOI: 10.1016/j.infsof.2025.107765
Journal Name: Information and Software Technology
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, ABI/INFORM, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Compendex, Computer & Applied Sciences, INSPEC, Library, Information Science & Technology Abstracts (LISTA), DIALNET
Keywords: Bus factor, GitHub, Machine learning, Naive Bayes, Random Forest, Truck factor
Affiliated with Middle East Technical University: Yes

Abstract

Context: The truck factor (or bus factor) is a metric that identifies the developers whose removal from a software project (for example, by being hit by a truck or bus) would slow its development. Measuring the truck factor is complex because of the many variables involved. Several algorithms have been developed to address this, but they tend to focus narrowly on code-centric metrics such as the commits made by a developer. While such a feature is important in assessing a developer's contribution, it does not tell the whole story behind that contribution. Objective: This paper aims to predict the truck factor with Machine Learning (ML) using a comprehensive set of version control system (VCS) features, including those that have not yet been investigated in the literature. Method: We examine which features existing algorithms use and then design a feature set that covers coding-based metrics, collaborative behaviors, developer activity patterns, and the broader technological context of a project. We then build multiple supervised ML models with different algorithms, such as Random Forest and Naive Bayes, that use this feature set to predict the key contributors in GitHub repositories and, from them, compute the truck factor. Finally, these ML models are compared with the existing algorithms from the literature. Results: Random Forest with tuned hyperparameters and an aggregated model of tuned Random Forest and Naive Bayes with priors achieve the best performance, with mean F1-scores of 84.1% and 86.4%, respectively. These models outperform existing algorithms except one of them, which lagged slightly behind in terms of precision. Conclusion: Our research addresses the limitations of existing work by investigating a wider range of VCS features and developing a supervised ML model to predict the truck factor, which demonstrates robust identification of true truck factor members.
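
The pipeline outlined in the abstract (per-developer predictions, an aggregated model, a truck factor, and an F1-score) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the Random Forest and Naive Bayes outputs are aggregated, so the weighted average of probabilities (soft voting) used here is an assumption, as are the 0.5 threshold, the reading of the truck factor as the number of predicted key contributors, and all function names, developer names, and probability values.

```python
def aggregate_probabilities(probs_a, probs_b, weight_a=0.5):
    """Soft-vote two models' per-developer probabilities (assumed scheme)."""
    return {
        dev: weight_a * probs_a[dev] + (1 - weight_a) * probs_b[dev]
        for dev in probs_a
    }


def predict_key_developers(probs, threshold=0.5):
    """Developers whose aggregated probability reaches the threshold."""
    return {dev for dev, p in probs.items() if p >= threshold}


def f1_score(predicted, actual):
    """F1 of the predicted key-developer set against the labelled set."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-developer outputs of two models for one repository.
rf_probs = {"alice": 0.9, "bob": 0.6, "carol": 0.2, "dave": 0.1}
nb_probs = {"alice": 0.8, "bob": 0.3, "carol": 0.7, "dave": 0.05}

combined = aggregate_probabilities(rf_probs, nb_probs)
predicted = predict_key_developers(combined)

# Assumption: the truck factor is the size of the predicted set.
truck_factor = len(predicted)

# Hypothetical ground-truth truck factor members for the same repository.
actual = {"alice", "bob"}
score = f1_score(predicted, actual)
```

A mean F1-score like the one reported in the abstract would then be the average of `score` over all evaluated repositories.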