Early Detection of Diabetes With Different Machine Learning Approach
- Posted
- Server: OSF
- DOI: 10.17605/osf.io/zwdsv
Early detection of diabetes is critical for effective management and prevention of complications. This study uses the DiaBD dataset to develop a machine learning approach for predicting diabetes status, drawing on clinical data from approximately 5,288 individuals retained after rigorous quality control. Key features include age, gender, vital signs (e.g., pulse rate, blood pressure), glucose levels, anthropometric measures (e.g., height, weight), and family history of diabetes and hypertension. The dataset presented two major challenges: class imbalance, with substantially fewer diabetic than non-diabetic cases, and data anomalies such as implausible numeric values (e.g., extreme glucose readings). Preprocessing included anomaly detection, and stratified sampling was used to preserve class proportions during model training and evaluation. We evaluated multiple classification models, including Linear Discriminant Analysis (LDA), Random Forests, Gradient Boosting, and Artificial Neural Networks (ANN), among others, using stratified cross-validation and an independent test set. Despite the imbalance, our best-performing model achieved an ROC-AUC of 0.85, demonstrating moderate-to-strong predictive capability. Feature importance analysis consistently highlighted glucose levels and weight as the most influential predictors. These findings underscore the potential of machine learning for diabetes risk stratification, while emphasizing the importance of addressing class imbalance and validating models on more representative datasets.
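The two evaluation ideas named in the abstract, stratified splitting and ROC-AUC, can be illustrated with a minimal standard-library sketch. This is not the study's pipeline: the function names `stratified_folds` and `roc_auc`, the fold count, and the toy labels (20% positive) are assumptions made for illustration, and no DiaBD data is used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold keeps similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def roc_auc(y_true, scores):
    """ROC-AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a small illustration.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced labels: 20 positive, 80 negative (synthetic).
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, k=5)
for fold in folds:
    # Each fold holds 20 samples, 4 of them positive.
    print(len(fold), sum(labels[i] for i in fold))

# Small worked example: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Stratifying matters under class imbalance because a purely random split can leave a fold with very few positive cases, which makes per-fold ROC-AUC estimates unstable or undefined.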