This project focuses on detecting potential money laundering activities within a dataset of synthesized financial transactions provided by IBM (International Business Machines Corporation). These transactions represent interactions between individuals, businesses, and banks, covering a wide range of financial activities such as consumer purchases, industrial orders, salary payments, loan repayments, and more.
The dataset aims to simulate a real-world scenario where a small subset of individuals and businesses are involved in illicit activities like smuggling, illegal gambling, and extortion. The project follows the money laundering cycle—Placement, Layering, and Integration—which criminals use to obscure the origins of illegally obtained funds. The primary objective of this project is to utilize data analysis and machine learning techniques to help authorities identify and distinguish potential money laundering cases hidden within the legitimate financial transactions.
- Introduction
- Money Laundering Cycle
- Objective
- Dataset
- Analysis and Key Insights
- Data Preprocessing Steps
- Model Training
- Model Performance
- Conclusion
- Acknowledgement
- Contributors
- License
- Placement: Introduction of illegal funds into the financial system.
- Layering: Complex series of financial transactions to obscure the origins of illegal money.
- Integration: Final stage where the illegal money appears to be legitimate and is used in normal financial activities.
The primary goal is to employ machine learning and data analysis techniques to detect and identify potential money laundering activities among the provided transactions. The dataset contains a mix of legitimate and suspicious transactions, allowing for the creation of models that can distinguish between the two.
The dataset used in this project was sourced from Kaggle and generated by IBM. It models a virtual financial system where individuals, companies, and banks engage in various transactions. The machine learning models were trained using the HI-Large_Trans.csv dataset, which has a relatively higher ratio of illicit transactions (more laundering), while the LI-Large_Trans.csv dataset, containing a lower illicit ratio (less laundering), was used for testing. The datasets offer a comprehensive view of both legitimate and suspicious transactions, allowing for the training and evaluation of machine learning algorithms. You can access the dataset through this link.
- Timestamp: The date and time when the transaction occurred.
- From Bank: The bank sending the money.
- To Bank: The bank receiving the money.
- Amount Received: The amount of money received in the transaction.
- Receiving Currency: The currency in which the amount was received.
- Amount Paid: The amount of money paid in the transaction.
- Payment Currency: The currency in which the amount was paid.
- Payment Format: The mode of payment (e.g., ACH, Bitcoin, Credit Card, etc.).
- Is Laundering: A flag indicating whether the transaction is suspected of laundering (1 for laundering, 0 for non-laundering).
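As a starting point, the snippet below shows one way to load and inspect these files with pandas. It is a minimal sketch: the file paths are assumptions, so point them at wherever the Kaggle CSVs are stored locally.

```python
import pandas as pd

# Load the high-illicit-ratio set used for training and the
# low-illicit-ratio set used for testing.
# The paths below are placeholders; adjust them to your local copies.
train_df = pd.read_csv("data/HI-Large_Trans.csv")
test_df = pd.read_csv("data/LI-Large_Trans.csv")

# Parse the timestamp and check how rare the laundering class is.
train_df["Timestamp"] = pd.to_datetime(train_df["Timestamp"])
print(train_df.columns.tolist())
print(train_df["Is Laundering"].value_counts(normalize=True))
```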
- Exploratory Data Analysis (EDA): Histograms, boxplots, and scatter plots were used to understand data distributions and correlations between transaction amounts (see the sketch after this list). The analysis revealed key outliers in `Amount Received` and `Amount Paid`, which indicate unusual transactions. Outliers are crucial for identifying money laundering patterns.
- Correlation Analysis: A strong correlation exists between `Amount Received` and `Amount Paid`, suggesting that transactions with larger amounts tend to show clearer signs of potential laundering activities.
- Key Insights: Outliers in the transaction amounts and the strong correlation between `Amount Received` and `Amount Paid` are the main signals carried forward to flag potentially suspicious transactions.
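The original notebooks are not reproduced here, but a minimal sketch of this kind of EDA, assuming the `train_df` loaded above, might look like:

```python
import matplotlib.pyplot as plt

# Distributions of the two amount columns (log scale because of heavy right tails).
train_df[["Amount Received", "Amount Paid"]].hist(bins=50, log=True)
plt.show()

# Boxplots highlight the extreme outliers mentioned above.
train_df.boxplot(column=["Amount Received", "Amount Paid"])
plt.show()

# Scatter plot of paid vs. received amounts on a random sample,
# coloured by the laundering flag.
sample = train_df.sample(n=min(100_000, len(train_df)), random_state=42)
plt.scatter(sample["Amount Paid"], sample["Amount Received"],
            c=sample["Is Laundering"], s=2, alpha=0.5)
plt.xlabel("Amount Paid")
plt.ylabel("Amount Received")
plt.show()

# Correlation between the two amount columns.
print(train_df[["Amount Received", "Amount Paid"]].corr())
```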
- Undersampling: Given the imbalance between legitimate transactions and money laundering cases, undersampling is employed to balance the dataset. This technique reduces the number of non-laundering transactions to match the number of laundering transactions, which improves model performance by avoiding bias toward the majority class.
- Feature Engineering:
  - Temporal features such as year, month, day, and hour are extracted from the timestamp.
  - Currency and payment format columns are encoded using one-hot encoding, with less frequent categories grouped under an "Others" category.
  - Categorical features such as bank and account numbers are frequency encoded to avoid the high dimensionality that one-hot encoding would cause.
- Scaling: Robust scaling is applied to numerical features like `Amount Received` and `Amount Paid` to handle outliers and ensure proper normalization. A sketch combining these steps follows this list.
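The sketch below combines the steps above into a single helper. The function name `preprocess`, the top-5 cut-off for rare categories, and the fixed random seed are illustrative assumptions rather than the exact choices made in the notebooks.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Undersampling: keep all laundering rows and an equal-sized
    # random sample of non-laundering rows.
    pos = df[df["Is Laundering"] == 1]
    neg = df[df["Is Laundering"] == 0].sample(n=len(pos), random_state=42)
    df = pd.concat([pos, neg]).sample(frac=1, random_state=42)

    # Temporal features extracted from the timestamp.
    ts = pd.to_datetime(df["Timestamp"])
    df["Year"], df["Month"], df["Day"], df["Hour"] = (
        ts.dt.year, ts.dt.month, ts.dt.day, ts.dt.hour)
    df = df.drop(columns=["Timestamp"])

    # One-hot encode currency / payment format, grouping rare categories
    # into "Others" (the top-5 cut-off here is an assumption).
    cat_cols = ["Receiving Currency", "Payment Currency", "Payment Format"]
    for col in cat_cols:
        top = df[col].value_counts().nlargest(5).index
        df[col] = df[col].where(df[col].isin(top), "Others")
    df = pd.get_dummies(df, columns=cat_cols)

    # Frequency-encode high-cardinality identifiers such as the bank columns
    # (account columns would be handled the same way if present).
    for col in ["From Bank", "To Bank"]:
        df[col] = df[col].map(df[col].value_counts(normalize=True))

    # Robust scaling of the amount columns to reduce the influence of outliers.
    scaler = RobustScaler()
    df[["Amount Received", "Amount Paid"]] = scaler.fit_transform(
        df[["Amount Received", "Amount Paid"]])
    return df

train_ready = preprocess(train_df)
```

In a production pipeline the scaler, the category lists, and the frequency maps would be fitted on the training split only and then reused to transform the test split, so that no information leaks from test to train.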
We employed four models to evaluate performance on the HI-Large_Trans_Sampled.csv dataset, and the top two performing models, Random Forest Classifier and XGBoost, were selected for further testing on the LI-Large_Trans_Sampled.csv dataset.
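A sketch of the training step for the two selected models, assuming `train_ready` from the preprocessing sketch above; the hyperparameters shown are library defaults, not the tuned values behind the reported results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Split the balanced HI dataset into features and target.
X = train_ready.drop(columns=["Is Laundering"])
y = train_ready["Is Laundering"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The two best-performing models from the comparison below.
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb = XGBClassifier(n_estimators=100, random_state=42, eval_metric="logloss")

rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)

print("Random Forest validation accuracy:", rf.score(X_val, y_val))
print("XGBoost validation accuracy:", xgb.score(X_val, y_val))
```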
| Model | Class | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| Logistic Regression | 0 | 0.88 | 0.87 | 0.88 | 0.88 |
| | 1 | 0.87 | 0.88 | 0.88 | |
| Random Forest Classifier | 0 | 0.95 | 0.84 | 0.89 | 0.90 |
| | 1 | 0.86 | 0.95 | 0.90 | |
| XGBoost Classifier | 0 | 0.93 | 0.86 | 0.90 | 0.90 |
| | 1 | 0.87 | 0.93 | 0.90 | |
| Stacking Classifier | 0 | 0.92 | 0.87 | 0.89 | 0.89 |
| | 1 | 0.87 | 0.92 | 0.90 | |
Based on the analysis conducted on the HI-Large_Trans_Sampled.csv dataset, our team identified the Random Forest Classifier and XGBoost as the top-performing models, achieving the highest accuracy of 90%. These models demonstrated robustness and superior predictive power in discerning between money laundering and non-money laundering transactions, providing a solid foundation for subsequent testing and predictions.
The following table summarizes the performance of the models after testing on the LI-Large_Trans dataset:
| Model | Class | Precision | Recall | F1-Score | Accuracy | ROC AUC Score |
|---|---|---|---|---|---|---|
| Random Forest | 0 | 0.91 | 0.81 | 0.86 | 0.87 | 0.8687 |
| | 1 | 0.83 | 0.92 | 0.87 | | |
| XGBoost | 0 | 0.91 | 0.84 | 0.87 | 0.88 | 0.8687 |
| | 1 | 0.85 | 0.92 | 0.88 | | |
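A sketch of this evaluation step, assuming the `preprocess` helper and the fitted `rf` / `xgb` models from the earlier snippets; the per-class precision, recall, F1, and ROC AUC figures in the table above come from reports of this kind.

```python
from sklearn.metrics import classification_report, roc_auc_score

# Preprocess the LI dataset the same way and align its one-hot columns
# with the training feature matrix X.
test_ready = preprocess(test_df)
X_test = test_ready.drop(columns=["Is Laundering"]).reindex(
    columns=X.columns, fill_value=0)
y_test = test_ready["Is Laundering"]

for name, model in [("Random Forest", rf), ("XGBoost", xgb)]:
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(name)
    print(classification_report(y_test, y_pred, digits=2))
    print("ROC AUC:", roc_auc_score(y_test, y_prob))
```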
This dataset allows for robust analysis of money laundering activities, providing insights into financial behaviors that could help authorities detect and prevent illicit financial transactions. Through effective data preprocessing, machine learning models can be trained to flag suspicious activities, helping reduce the prevalence of financial crime. Undersampling was also essential when the models trained on the original dataset (HI-Large_Trans) were tested on the new dataset (LI-Large_Trans), because of its class imbalance; it ensured a balanced representation of money laundering and non-money laundering transactions.
Both the Random Forest and XGBoost classifiers typically outperform the other models because of their ensemble learning techniques, their ability to capture complex relationships in the data, and their tolerance of missing values. However, even though the two datasets share the same column names, slight differences in their underlying patterns or distributions may explain the drop in performance observed when moving from the HI to the LI dataset.
This project is part of the Preliminary Round of the Data Science Competition Olympiad 2023, organized by the Data Science Club of Binus University.
This project is licensed under the MIT License - see the LICENSE file for details.