Gaining insights in datasets in the shade of "garbage in, garbage out" rationale: Feature space distribution fitting


Creative Commons License

Canbek G.

WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2022 (Journal Indexed in SCI)

  • Publication Type: Article
  • Publication Date: 2022
  • DOI Number: 10.1002/widm.1456
  • Journal Title: WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY
  • Keywords: binary classification, data preprocessing, data profiling, data quality, machine learning, power-law distributions, patterns

Abstract

This article emphasizes the importance of understanding the "Garbage In, Garbage Out" (GIGO) rationale and of ensuring dataset quality in Machine Learning (ML) applications to achieve high and generalizable performance. An initial step should be added to the ML workflow in which researchers evaluate the insights gained from a quantitative analysis of the dataset's sample and feature spaces. This study contributes toward that goal by suggesting a technique to quantify datasets in terms of their feature frequency distribution characteristics, which provides a unique insight into how frequently the features occur in the available dataset samples. The technique was demonstrated on 11 benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by applications, such as CALL_PHONE, compose a relatively high-dimensional binary feature space. The results showed that the distributions fit well into two of the four long right-tail statistical distributions considered: log-normal, exponential, power law, and Poisson. Specifically, log-normal was the most commonly exhibited distribution; the exceptions were two malign datasets, which fit the exponential distribution. This study also explores an analysis of the features that fit or do not fit the statistical distribution, which enhances the insights into the feature space. Finally, the study compiles examples of phenomena from the literature that exhibit these statistical distributions, which should be considered when interpreting the fitted distributions. In conclusion, applying well-formed statistical methods provides a clear understanding of the datasets and of intra-class and inter-class differences before proceeding with selecting features and building a classifier model. Feature distribution characteristics should be among those analyzed beforehand.

This article is categorized under: Technologies > Data Preprocessing; Technologies > Classification; Technologies > Machine Learning
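The feature frequency distribution fitting described in the abstract could be sketched roughly as follows. This is an illustrative sketch only, not the author's implementation: the abstract does not state the fitting or goodness-of-fit procedure, so the sketch assumes maximum-likelihood fitting and a plain log-likelihood comparison, and it covers only two of the four candidate distributions (exponential and log-normal). The function names `feature_frequencies`, `loglik_exponential`, `loglik_lognormal`, and `best_fit` are hypothetical.

```python
import math


def feature_frequencies(samples):
    """Count how many samples have each binary feature set (column sums).

    `samples` is a list of equal-length rows of 0/1 values, e.g. one row per
    application and one column per requested permission.
    """
    return [sum(column) for column in zip(*samples)]


def loglik_exponential(xs):
    """Log-likelihood of xs under an exponential distribution at its MLE rate."""
    rate = len(xs) / sum(xs)  # MLE: rate = 1 / sample mean
    return sum(math.log(rate) - rate * x for x in xs)


def loglik_lognormal(xs):
    """Log-likelihood of xs under a log-normal distribution at its MLE parameters."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    if var == 0:
        raise ValueError("need at least two distinct positive values")
    sigma = math.sqrt(var)
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-math.log(x) - norm - (math.log(x) - mu) ** 2 / (2 * var) for x in xs)


def best_fit(frequencies):
    """Return (name, scores) for the candidate with the highest log-likelihood.

    Assumption: features that never occur (frequency 0) are dropped, since the
    log-normal distribution is defined only for positive values.
    """
    xs = [f for f in frequencies if f > 0]
    scores = {
        "exponential": loglik_exponential(xs),
        "log-normal": loglik_lognormal(xs),
    }
    return max(scores, key=scores.get), scores
```

In use, one would call `feature_frequencies` on each dataset's binary permission matrix and pass the result to `best_fit`. A fuller analysis would add the power-law and Poisson candidates and a formal goodness-of-fit test, which this sketch leaves out.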