A machine learning model to detect fraudulent transactions.
This repository contains a detailed analysis of a dataset of credit card transactions. The target variable has two classes (Normal and Fraud). The dataset is challenging because it is highly imbalanced: more than 99% of the data points belong to the Normal class.
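To make the imbalance problem concrete, here is a minimal sketch in plain Python. The label counts (995 Normal, 5 Fraud) are made up for illustration and are not taken from the dataset. It shows that a trivial model that always predicts Normal reaches very high accuracy while catching no fraud at all.

```python
# Hypothetical labels: 0 = Normal, 1 = Fraud (counts are illustrative only).
y_true = [0] * 995 + [1] * 5
# A trivial "model" that always predicts Normal.
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
# Guard against division by zero when nothing is predicted as Fraud.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The accuracy looks excellent, yet recall and F1 are both zero, because every fraudulent transaction is missed.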
Download the dataset (CSV format) from here.
Anaconda is highly recommended for data science projects. It comes with many pre-installed packages for data analysis and machine learning. Two packages need to be installed manually in addition to Anaconda:
- Seaborn (pip install seaborn or conda install seaborn)
- Imbalanced-learn (pip install -U imbalanced-learn)
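After installing, a quick way to confirm that both packages are importable is a short check like the one below. This is only a convenience sketch and is not part of the repository. Note that imbalanced-learn is imported under the module name `imblearn`.

```python
from importlib.util import find_spec

# Map the install name to the module name used in import statements.
required = {"seaborn": "seaborn", "imbalanced-learn": "imblearn"}

# find_spec returns None when a top-level module cannot be found.
missing = [pkg for pkg, module in required.items() if find_spec(module) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All extra packages are installed.")
```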
This notebook can be divided into the following sections:
- Data exploration
- Feature engineering
- Evaluation metrics
- Modeling
- Parameter tuning
After initial exploration, the dataset turns out to be highly imbalanced. Standard machine learning algorithms tend to be biased towards the majority class, so a resampling technique is used to handle this problem. New features are generated based on the distribution of variables within each class. Accuracy is not a useful metric for imbalanced classes, so F1 (the harmonic mean of precision and recall) and AUC (the area under the ROC curve) are used to evaluate model performance. The usual classification threshold (probability = 0.5) is not used; instead, the threshold is tuned using cross-validation.
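The threshold tuning idea can be sketched roughly as follows. This is not the notebook's actual code. It assumes scikit-learn, uses synthetic data from `make_classification` as a stand-in for the credit card dataset, and uses logistic regression as a placeholder model. Resampling and feature engineering are left out to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (assumption): roughly 1% of samples are positive.
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=5,
    weights=[0.99],
    flip_y=0.0,
    random_state=0,
)

model = LogisticRegression(max_iter=1000)  # placeholder model (assumption)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: each sample is scored by a model that
# did not see it during training.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

# AUC does not depend on any threshold.
auc = roc_auc_score(y, proba)

# Scan candidate thresholds and keep the one with the highest F1.
thresholds = np.linspace(0.05, 0.95, 19)
f1_scores = [
    f1_score(y, (proba >= t).astype(int), zero_division=0) for t in thresholds
]
best_index = int(np.argmax(f1_scores))
best_threshold = float(thresholds[best_index])
best_f1 = f1_scores[best_index]

print(f"AUC={auc:.3f} best threshold={best_threshold:.2f} F1={best_f1:.3f}")
```

Choosing the threshold from out-of-fold predictions avoids picking a value that only looks good on data the model was trained on. If resampling is added, it should be applied inside each training fold only, so that the validation folds keep the original class distribution.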