Income Classification Benchmark: Conversion from R (Academic Study) to Python (ML Pipeline)


Creative Commons License

Erkan M. A.

Other, pp. 1-6, 2025

  • Publication Type: Other Publications / Other
  • Publication Date: 2025
  • Pages: pp. 1-6
  • Open Archive Collection: AVESİS Open Access Collection
  • Affiliated with Middle East Technical University: Yes

Abstract

This research aims to build a robust machine learning pipeline for income classification: predicting whether an individual earns more than $50K from demographic attributes such as work class, education, race, and gender. The project began as a statistical study in RStudio, where it was used to explore relationships between variables and to perform exploratory data analysis (EDA). It has since been substantially refactored into a production-ready Python environment to demonstrate modern MLOps standards.

The methodology is an end-to-end Scikit-Learn pipeline that combines advanced data cleaning, K-Nearest Neighbors (KNN) imputation of missing values, and automated feature scaling. While the initial research explored a broader range of algorithms, the current benchmark compares Logistic Regression, Decision Tree, and Random Forest classifiers to establish a strong baseline. Model performance was assessed using Accuracy, Sensitivity (recall on the positive class), and F1-Score to account for categorical complexity. This dual-language approach highlights the transition from academic statistical inference to applied machine learning engineering.
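
The pipeline described above can be illustrated with a minimal sketch. This is not the study's actual code: the data is a synthetic stand-in generated with make_classification, the features are assumed to be numerically encoded already, and the split ratio, neighbor count, and hyperparameters are illustrative assumptions. It shows only the general shape of KNN imputation, scaling, and a three-model comparison scored with Accuracy, Sensitivity, and F1-Score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the income data (assumption: the real features
# are already numerically encoded; the actual dataset is not used here).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)

# Blank out roughly 5% of the values to mimic missing entries.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three baseline classifiers named in the abstract
# (hyperparameters are illustrative, not the study's settings).
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Imputation and scaling live inside the pipeline, so they are
    # fitted on the training split only and then applied to the test split.
    pipe = Pipeline(
        steps=[
            ("impute", KNNImputer(n_neighbors=5)),
            ("scale", StandardScaler()),
            ("model", model),
        ]
    )
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "sensitivity": recall_score(y_test, pred),  # recall of the positive class
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Keeping the imputer and scaler inside the Pipeline object is a deliberate choice in this sketch: it prevents statistics from the test split leaking into preprocessing, which matters when the models are compared on held-out data.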