Algorithmic Disparities in Data-Driven Decision Systems: An Empirical Evaluation of Group-Level Error and Calibration Differences
- Posted
- Server: Preprints.org
- DOI: 10.20944/preprints202603.2252.v1
Automated decision systems are increasingly deployed in high-stakes domains such as credit allocation, hiring, and healthcare screening. Although sensitive demographic attributes are often excluded from model training, concerns remain regarding unequal predictive behavior across population groups. This study presents an empirical evaluation of subgroup-level predictive performance, error disparities, and calibration reliability using the Adult Income benchmark dataset. Logistic Regression and Random Forest classifiers are evaluated within a leakage-free nested cross-validation framework. Beyond aggregate performance metrics, we analyze false negative rates across demographic groups, test observed disparities for statistical significance using bootstrap resampling, and examine probability calibration behavior. The results indicate that false negative rates differ systematically across sex and race groups, with several disparities remaining statistically significant. Furthermore, improvements in overall discrimination achieved by the Random Forest model do not uniformly translate into improved probability calibration across demographic groups. These findings demonstrate that evaluating machine learning systems solely through aggregate accuracy may obscure meaningful subgroup-level differences, and they underscore the need for subgroup-aware evaluation practices when deploying automated decision systems.
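To make the subgroup analysis described in the abstract concrete, the sketch below illustrates one way to compute group-level false negative rates and to bootstrap the FNR gap between two groups. It is not the authors' code; the function names, the synthetic data, and the choice of 2,000 bootstrap replicates are assumptions for illustration, and in practice `y_true`, `y_pred`, and `group` would come from held-out predictions of the fitted Logistic Regression or Random Forest models on the Adult Income data.

```python
# Illustrative sketch (not the paper's implementation): group-level false
# negative rates and a bootstrap confidence interval for the FNR gap.
import numpy as np


def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): the share of actual positives the model misses."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    if positives.sum() == 0:
        return np.nan
    return np.mean(y_pred[positives] == 0)


def bootstrap_fnr_gap(y_true, y_pred, group, g_a, g_b, n_boot=2000, seed=0):
    """Bootstrap the difference FNR(g_a) - FNR(g_b); return point estimate and 95% CI."""
    rng = np.random.default_rng(seed)
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    idx_a = np.flatnonzero(group == g_a)
    idx_b = np.flatnonzero(group == g_b)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each group independently, with replacement.
        sa = rng.choice(idx_a, size=idx_a.size, replace=True)
        sb = rng.choice(idx_b, size=idx_b.size, replace=True)
        gaps[b] = (false_negative_rate(y_true[sa], y_pred[sa])
                   - false_negative_rate(y_true[sb], y_pred[sb]))
    point = (false_negative_rate(y_true[idx_a], y_pred[idx_a])
             - false_negative_rate(y_true[idx_b], y_pred[idx_b]))
    lo, hi = np.nanpercentile(gaps, [2.5, 97.5])
    return point, (lo, hi)


if __name__ == "__main__":
    # Synthetic placeholder data standing in for held-out Adult Income predictions.
    rng = np.random.default_rng(1)
    n = 5000
    group = rng.choice(["Female", "Male"], size=n)
    y_true = rng.binomial(1, 0.25, size=n)
    # A deliberately group-dependent miss rate, for illustration only.
    miss_prob = np.where(group == "Female", 0.35, 0.20)
    y_pred = np.where((y_true == 1) & (rng.random(n) < miss_prob), 0, y_true)
    gap, (lo, hi) = bootstrap_fnr_gap(y_true, y_pred, group, "Female", "Male")
    print(f"FNR gap (Female - Male): {gap:.3f}, 95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

A disparity whose bootstrap interval excludes zero would be flagged as statistically significant under this scheme; the same resampled predictions could also feed per-group reliability diagrams (for example via scikit-learn's `calibration_curve`) to examine the calibration differences discussed above.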