Other, pp. 1-6, 2025
The primary aim of this research is to construct a robust machine learning pipeline for income classification: predicting whether an individual earns more than $50K from demographic attributes such as work class, education, race, and gender. The project began as a statistical study in RStudio, used to explore relationships between variables and perform exploratory data analysis (EDA). It has since been substantially refactored into a production-ready Python environment to demonstrate modern MLOps standards.
The methodology is an end-to-end Scikit-Learn pipeline that combines data cleaning, K-Nearest Neighbors (KNN) imputation of missing values, and automated feature scaling. While the initial research explored a broader range of algorithms, the current benchmark compares Logistic Regression, Decision Tree, and Random Forest classifiers to establish a strong baseline. Model performance was assessed using accuracy, sensitivity (recall), and F1-score, which together give a fuller picture of classification quality than accuracy alone. This dual-language approach highlights the transition from academic statistical inference to applied machine learning engineering.
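The pipeline described above can be sketched roughly as follows. This is a minimal illustration rather than the study's actual code: the data are synthetic (generated with `make_classification`, with about 5% of values blanked out at random), all features are numeric so categorical encoding is omitted, and the step names, hyperparameters, and the `results` dictionary are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the income data (assumption: not the real dataset).
# Class 1 plays the role of "earns more than $50K" and is the minority class.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)

# Blank out roughly 5% of the entries so the imputer has work to do.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three baseline classifiers named in the text (default-ish settings).
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Impute with KNN, then scale, then classify. Fitting inside a Pipeline
    # keeps the imputer and scaler from seeing the test split.
    pipe = Pipeline([
        ("impute", KNNImputer(n_neighbors=5)),
        ("scale", StandardScaler()),
        ("model", model),
    ])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # Sensitivity is the recall of the positive class (label 1).
        "sensitivity": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Wrapping the preprocessing and the classifier in one `Pipeline` means the same fitted object can be reused for prediction on new records, and the identical preprocessing is applied to every model in the comparison.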