Multi-class classification methods utilizing Mahalanobis Taguchi system and a re-sampling approach for imbalanced data sets

Thesis Type: Postgraduate

Institution Of The Thesis: Orta Doğu Teknik Üniversitesi, Faculty of Engineering, Department of Industrial Engineering, Turkey

Approval Date: 2009




Classification approaches are used in many areas to identify or estimate the classes to which different observations belong. In this thesis, the Mahalanobis Taguchi System (MTS) classification approach is analyzed and improved for multi-class classification problems. MTS seeks to identify the significant variables and classifies a new observation based on its Mahalanobis distance (MD). In the first part of the study, two limitations of MTS are analyzed: sample size problems, which are encountered mostly in small data sets, and multicollinearity problems. A re-sampling approach is explored as a solution. Our re-sampling approach, which works only for data sets with two classes, combines over-sampling and under-sampling. The over-sampling is based on SMOTE, which generates synthetic observations between minority-class observations and their nearest neighbors. MTS models are used to test the performance of several re-sampling parameters, and the most appropriate values are sought for each case. In the second part, multi-class classification methods based on MTS are developed. The first algorithm, Feature Weighted Multi-class MTS-I (FWMMTS-I), is inspired by the descent feature weighted MD. It relaxes the equal weighting of variables when their contributions to the MD are summed. As a result, noisy variables receive weights close to zero and do not mask the other variables. As a second multi-class classification algorithm, the original MTS method is extended to multi-class problems; this extension is called Multi-class MTS (MMTS). In addition, an approach comparable to that of Su and Hsiao (2009), which also considers the weights of variables, is studied with a modification in the MD calculation. It is named Feature Weighted Multi-class MTS-II (FWMMTS-II). The methods are compared on eight multi-class data sets using 5-fold stratified cross-validation. The results show that FWMMTS-I is as accurate as MMTS and that both are more accurate than FWMMTS-II. Interestingly, the Mahalanobis Distance Classifier (MDC), which uses all the variables directly in the classification model, performs equally well on the studied data sets.
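As an illustration of the baseline mentioned above, the following is a minimal sketch of a Mahalanobis distance classifier: each class is summarized by its mean vector and sample covariance matrix, and a new observation is assigned to the class with the smallest squared MD. It is not the thesis implementation. The function names (`fit`, `classify`, `mahalanobis_sq`) and the toy data are assumptions made for this example. The sketch uses the plain MD on raw variables, without the standardization, scaling by the number of variables, or variable selection that MTS applies.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length observations."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mean):
    """Unbiased sample covariance matrix (divides by n - 1)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j]
    return [[v / (n - 1) for v in r] for r in cov]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    p = len(b)
    m = [list(a[i]) + [b[i]] for i in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            # A singular covariance matrix is what small samples or
            # multicollinear variables produce.
            raise ValueError("covariance matrix is singular")
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(p):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][p] / m[i][i] for i in range(p)]


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^{-1} (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    z = solve(cov, d)  # z = cov^{-1} d, without forming the inverse
    return sum(di * zi for di, zi in zip(d, z))


def fit(samples_by_class):
    """Estimate a (mean, covariance) pair for every class label."""
    model = {}
    for label, rows in samples_by_class.items():
        mu = mean_vector(rows)
        model[label] = (mu, covariance_matrix(rows, mu))
    return model


def classify(x, model):
    """Assign x to the class with the smallest squared MD."""
    return min(model, key=lambda label: mahalanobis_sq(x, *model[label]))


# Toy two-variable data, invented for this example only.
train = {
    "a": [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [2.0, 3.0]],
    "b": [[8.0, 9.0], [9.0, 8.0], [10.0, 10.0], [9.0, 10.0]],
}
model = fit(train)
print(classify([2.0, 2.0], model))
print(classify([9.0, 9.0], model))
```

The `ValueError` raised in `solve` marks the point where the sample size and multicollinearity problems discussed in the abstract would surface in practice: with too few observations or linearly dependent variables, the covariance matrix cannot be inverted.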