Comparison of machine learning algorithms on consumer credit classification

Tezin Türü: Yüksek Lisans

Tezin Yürütüldüğü Kurum: Orta Doğu Teknik Üniversitesi, Uygulamalı Matematik Enstitüsü, Finansal Matematik Anabilim Dalı, Türkiye

Tezin Onay Tarihi: 2019

Tezin Dili: İngilizce

Öğrenci: OĞUZ KOÇ

Asıl Danışman (Eş Danışmanlı Tezler İçin): Sevtap Ayşe Kestel

Eş Danışman: Ömür Uğur

Özet:

Like other prediction models, credit scoring is a tool used to evaluate the amount of risk associated with applicants or customers. Scoring models identify clients individually as good or bad applicants. They offer statistical odds or probabilities for prediction either the applicant will be default or not in the future. It is beneficial for banks and credit analysts to measure customers' non-payment risk by statistically tested algorithms in many aspects such as reduction in workload and evaluation time. Also, only demanding features that have the most significant impact on credit assessment process in terms of obtaining more explanatory outcomes, emphasizes the benefits mentioned formerly. Today, Machine Learning (ML) algorithms are commonly applied for data analysis in various areas. The algorithms learn how to determine complicated patterns and create smart choices by generating a mathematical model depending on sample dataset without direct programming. In this thesis, a comparative study is performed using Logistic Regression (LR), Support Vector Machine (SVM), Gaussian Naïve Bayes (GNB), Decision Trees (DT), Random Forest (DT), XGBoost (XGB), K-Nearest Neighbors (KNN) and Multilayer Perceptron Neural Network (MLP) algorithms. In addition to these, we strive to achieve more explanatory outcomes in terms of dimentionality with Wrapper Feature Selection (WFS), and investigate its performance in a way of important attributes detection capacity. We also analyze the impact of Grid Search (GS) hyper-parameters optimizing method, and effect of four data transformation techniques Natural Logarithm (LN), Standard, Box-Cox and Min-Max to these algorithms and methods. We compare these cases to determine the most appropriate way for credit classification by considering accuracy, AUC, type I and type II error rates. All measurements are conducted on German and Australian real world consumer credit datasets commonly used in literature.