

Comparative Analysis of Supervised Learning Models for Detecting Credit Card and Bank Account Fraud

Posted
Server: Preprints.org
DOI: 10.20944/preprints202510.1007.v1

This study investigates the efficacy of three supervised learning models, Logistic Regression, Random Forest, and XGBoost, on two financial fraud detection datasets that were constructed differently and have different class distributions. The Credit Card Fraud Detection Dataset (Kaggle, 2023) is a synthetic dataset that has been artificially balanced to a 50:50 ratio of fraudulent to non-fraudulent observations, which allows model performance to be evaluated under ideal conditions. By contrast, the Bank Account Fraud Dataset (NeurIPS, 2022) reflects real-world monetary behavior and is extremely imbalanced, with only approximately 1% of observations being fraudulent (Jesus et al., 2022). A single pipeline was constructed using stratified 60/20/20 splits, with SMOTE applied only to the training set. Evaluation metrics included F1-score and AUC-ROC. The results show near-perfect performance on the balanced synthetic dataset but a large degradation on the real-world imbalanced dataset. XGBoost consistently performed best on the imbalanced dataset, with an F1-score of 23.4% and an AUC of 89.3%. These results are consistent with published benchmarks indicating that F1-scores in the 15 to 25% range represent excellent outcomes in practical fraud detection. The study underscores the critical impact of class imbalance and dataset realism on the performance of supervised models, and it points to future work on cost-sensitive learning, explainability, and temporal modeling of financial data in operational settings to improve the generalization of the models tested.
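The data-handling steps named in the abstract (a stratified 60/20/20 split, oversampling restricted to the training set, and F1 scoring) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the toy data, the function names `stratified_split`, `smote`, and `f1_score`, and the hand-written SMOTE-style interpolation are all assumptions made for illustration, and no model is trained here. In practice a library implementation (for example, the one in imbalanced-learn) would likely be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, val, test = [], [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]
    return train, val, test


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Brute-force neighbour search; acceptable only for a toy example.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def f1_score(y_true, y_pred):
    """F1 for the positive (fraud) class, computed from confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical toy data: 1000 rows with two features and 1% fraud.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]
y = [1 if i < 10 else 0 for i in range(1000)]

train, val, test = stratified_split(y)
X_train = [X[i] for i in train]
y_train = [y[i] for i in train]

# Oversample the training split only; validation and test keep the
# original 1% fraud rate so that reported metrics stay realistic.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
n_new = y_train.count(0) - y_train.count(1)
X_bal = X_train + smote(minority, n_new)
y_bal = y_train + [1] * n_new
```

Restricting the oversampling to the training split matters because synthetic points in the validation or test data would be derived from other held-out fraud cases and would change the class ratio being evaluated, which tends to inflate scores. AUC-ROC is omitted from the sketch for brevity.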

