This project focuses on detecting potential money laundering activities within a dataset of synthesized financial transactions provided by IBM (International Business Machines Corporation). These transactions represent interactions between individuals, businesses, and banks, covering a wide range of financial activities such as consumer purchases, industrial orders, salary payments, loan repayments, and more.
The dataset aims to simulate a real-world scenario where a small subset of individuals and businesses are involved in illicit activities like smuggling, illegal gambling, and extortion. The project follows the money laundering cycle—Placement, Layering, and Integration—which criminals use to obscure the origins of illegally obtained funds. The primary objective of this project is to utilize data analysis and machine learning techniques to help authorities identify and distinguish potential money laundering cases hidden within the legitimate financial transactions.
- Introduction
- Money Laundering Cycle
- Objective
- Dataset
- Analysis and Key Insights
- Data Preprocessing Steps
- Model Training
- Model Performance
- Conclusion
- Acknowledgement
- Contributors
- License
- Placement: Introduction of illegal funds into the financial system.
- Layering: Complex series of financial transactions to obscure the origins of illegal money.
- Integration: Final stage where the illegal money appears to be legitimate and is used in normal financial activities.
The primary goal is to employ machine learning and data analysis techniques to detect and identify potential money laundering activities among the provided transactions. The dataset contains a mix of legitimate and suspicious transactions, allowing for the creation of models that can distinguish between the two.
The dataset used in this project was sourced from Kaggle and generated by IBM. It models a virtual financial system where individuals, companies, and banks engage in various transactions. The machine learning models were trained using the HI-Large_Trans.csv dataset, which has a relatively higher ratio of illicit transactions (more laundering), while the LI-Large_Trans.csv dataset, containing a lower illicit ratio (less laundering), was used for testing. The datasets offer a comprehensive view of both legitimate and suspicious transactions, allowing for the training and evaluation of machine learning algorithms. You can access the dataset through this link.
- Timestamp: The date and time when the transaction occurred.
- From Bank: The bank sending the money.
- To Bank: The bank receiving the money.
- Amount Received: The amount of money received in the transaction.
- Receiving Currency: The currency in which the amount was received.
- Amount Paid: The amount of money paid in the transaction.
- Payment Currency: The currency in which the amount was paid.
- Payment Format: The mode of payment (e.g., ACH, Bitcoin, Credit Card, etc.).
- Is Laundering: A flag indicating whether the transaction is suspected of laundering (1 for laundering, 0 for non-laundering).
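As a starting point, the snippet below shows one way to load and inspect these files with pandas. It is a minimal sketch: the file paths are assumptions, so point them at wherever the Kaggle CSVs are stored locally.

```python
import pandas as pd

# Load the high-illicit-ratio set used for training and the
# low-illicit-ratio set used for testing.
# The paths below are placeholders; adjust them to your local copies.
train_df = pd.read_csv("data/HI-Large_Trans.csv")
test_df = pd.read_csv("data/LI-Large_Trans.csv")

# Parse the timestamp and check how rare the laundering class is.
train_df["Timestamp"] = pd.to_datetime(train_df["Timestamp"])
print(train_df.columns.tolist())
print(train_df["Is Laundering"].value_counts(normalize=True))
```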
- Exploratory Data Analysis (EDA): Histograms, boxplots, and scatter plots were used to understand data distributions and correlations between transaction amounts (see the sketch after this list). The analysis revealed key outliers in `Amount Received` and `Amount Paid`, which indicate unusual transactions. Outliers are crucial for identifying money laundering patterns.
- Correlation Analysis: A strong correlation exists between `Amount Received` and `Amount Paid`, suggesting that transactions with larger amounts tend to show clearer signs of potential laundering activities.
- Key Insights: Outliers in the transaction amounts and the strong correlation between `Amount Received` and `Amount Paid` are the main signals carried forward to flag potentially suspicious transactions.
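The original notebooks are not reproduced here, but a minimal sketch of this kind of EDA, assuming the `train_df` loaded above, might look like:

```python
import matplotlib.pyplot as plt

# Distributions of the two amount columns (log scale because of heavy right tails).
train_df[["Amount Received", "Amount Paid"]].hist(bins=50, log=True)
plt.show()

# Boxplots highlight the extreme outliers mentioned above.
train_df.boxplot(column=["Amount Received", "Amount Paid"])
plt.show()

# Scatter plot of paid vs. received amounts on a random sample,
# coloured by the laundering flag.
sample = train_df.sample(n=min(100_000, len(train_df)), random_state=42)
plt.scatter(sample["Amount Paid"], sample["Amount Received"],
            c=sample["Is Laundering"], s=2, alpha=0.5)
plt.xlabel("Amount Paid")
plt.ylabel("Amount Received")
plt.show()

# Correlation between the two amount columns.
print(train_df[["Amount Received", "Amount Paid"]].corr())
```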
- Undersampling: Given the imbalance between legitimate transactions and money laundering cases, undersampling is employed to balance the dataset. This technique reduces the number of non-laundering transactions to match the number of laundering transactions, which improves model performance by avoiding bias toward the majority class.
- Feature Engineering:
  - Temporal features such as year, month, day, and hour are extracted from the timestamp.
  - Currency and payment format columns are encoded using one-hot encoding, with less frequent categories grouped under an "Others" category.
  - Categorical features such as bank and account numbers are frequency encoded to avoid the high dimensionality that one-hot encoding would cause.
- Scaling: Robust scaling is applied to numerical features like `Amount Received` and `Amount Paid` to handle outliers and ensure proper normalization. A sketch combining these steps follows this list.
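The sketch below combines the steps above into a single helper. The function name `preprocess`, the top-5 cut-off for rare categories, and the fixed random seed are illustrative assumptions rather than the exact choices made in the notebooks.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Undersampling: keep all laundering rows and an equal-sized
    # random sample of non-laundering rows.
    pos = df[df["Is Laundering"] == 1]
    neg = df[df["Is Laundering"] == 0].sample(n=len(pos), random_state=42)
    df = pd.concat([pos, neg]).sample(frac=1, random_state=42)

    # Temporal features extracted from the timestamp.
    ts = pd.to_datetime(df["Timestamp"])
    df["Year"], df["Month"], df["Day"], df["Hour"] = (
        ts.dt.year, ts.dt.month, ts.dt.day, ts.dt.hour)
    df = df.drop(columns=["Timestamp"])

    # One-hot encode currency / payment format, grouping rare categories
    # into "Others" (the top-5 cut-off here is an assumption).
    cat_cols = ["Receiving Currency", "Payment Currency", "Payment Format"]
    for col in cat_cols:
        top = df[col].value_counts().nlargest(5).index
        df[col] = df[col].where(df[col].isin(top), "Others")
    df = pd.get_dummies(df, columns=cat_cols)

    # Frequency-encode high-cardinality identifiers such as the bank columns
    # (account columns would be handled the same way if present).
    for col in ["From Bank", "To Bank"]:
        df[col] = df[col].map(df[col].value_counts(normalize=True))

    # Robust scaling of the amount columns to reduce the influence of outliers.
    scaler = RobustScaler()
    df[["Amount Received", "Amount Paid"]] = scaler.fit_transform(
        df[["Amount Received", "Amount Paid"]])
    return df

train_ready = preprocess(train_df)
```

In a production pipeline the scaler, the category lists, and the frequency maps would be fitted on the training split only and then reused to transform the test split, so that no information leaks from test to train.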
We employed four models to evaluate performance on the HI-Large_Trans_Sampled.csv dataset, and the top two performing models, Random Forest Classifier and XGBoost, were selected for further testing on the LI-Large_Trans_Sampled.csv dataset.
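A sketch of the training step for the two selected models, assuming `train_ready` from the preprocessing sketch above; the hyperparameters shown are library defaults, not the tuned values behind the reported results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Split the balanced HI dataset into features and target.
X = train_ready.drop(columns=["Is Laundering"])
y = train_ready["Is Laundering"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The two best-performing models from the comparison below.
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb = XGBClassifier(n_estimators=100, random_state=42, eval_metric="logloss")

rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)

print("Random Forest validation accuracy:", rf.score(X_val, y_val))
print("XGBoost validation accuracy:", xgb.score(X_val, y_val))
```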
| Model | Class | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| Logistic Regression | 0 | 0.88 | 0.87 | 0.88 | 0.88 |
| | 1 | 0.87 | 0.88 | 0.88 | |
| Random Forest Classifier | 0 | 0.95 | 0.84 | 0.89 | 0.90 |
| | 1 | 0.86 | 0.95 | 0.90 | |
| XGBoost Classifier | 0 | 0.93 | 0.86 | 0.90 | 0.90 |
| | 1 | 0.87 | 0.93 | 0.90 | |
| Stacking Classifier | 0 | 0.92 | 0.87 | 0.89 | 0.89 |
| | 1 | 0.87 | 0.92 | 0.90 | |
Based on the analysis conducted on the HI-Large_Trans_Sampled.csv dataset, our team identified the Random Forest Classifier and XGBoost as the top-performing models, achieving the highest accuracy of 90%. These models demonstrated robustness and superior predictive power in discerning between money laundering and non-money laundering transactions, providing a solid foundation for subsequent testing and predictions.
The following table summarizes the performance of the models after testing on the LI-Large_Trans dataset:
| Model | Class | Precision | Recall | F1-Score | Accuracy | ROC AUC Score |
|---|---|---|---|---|---|---|
| Random Forest | 0 | 0.91 | 0.81 | 0.86 | 0.87 | 0.8687 |
| | 1 | 0.83 | 0.92 | 0.87 | | |
| XGBoost | 0 | 0.91 | 0.84 | 0.87 | 0.88 | 0.8687 |
| | 1 | 0.85 | 0.92 | 0.88 | | |
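A sketch of this evaluation step, assuming the `preprocess` helper and the fitted `rf` / `xgb` models from the earlier snippets; the per-class precision, recall, F1, and ROC AUC figures in the table above come from reports of this kind.

```python
from sklearn.metrics import classification_report, roc_auc_score

# Preprocess the LI dataset the same way and align its one-hot columns
# with the training feature matrix X.
test_ready = preprocess(test_df)
X_test = test_ready.drop(columns=["Is Laundering"]).reindex(
    columns=X.columns, fill_value=0)
y_test = test_ready["Is Laundering"]

for name, model in [("Random Forest", rf), ("XGBoost", xgb)]:
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(name)
    print(classification_report(y_test, y_pred, digits=2))
    print("ROC AUC:", roc_auc_score(y_test, y_prob))
```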
This dataset allows for robust analysis of money laundering activities, providing insights into financial behaviors that could help authorities detect and prevent illicit financial transactions. Through effective data preprocessing, machine learning models can be trained to flag suspicious activities, helping reduce the prevalence of financial crime. Undersampling was also essential when the models trained on the original dataset (HI-Large_Trans) were tested on the new dataset (LI-Large_Trans), because of its class imbalance; it ensured a balanced representation of money laundering and non-money laundering transactions.
Both the Random Forest and XGBoost classifiers typically outperform the other models because of their ensemble learning techniques, their ability to capture complex relationships in the data, and their tolerance of missing values. However, even though the two datasets share the same column names, slight differences in their underlying patterns or distributions may explain the drop in performance observed when moving from the HI to the LI dataset.
This project is part of the Preliminary Round of the Data Science Competition Olympiad 2023, organized by the Data Science Club of Binus University.
This project is licensed under the MIT License - see the LICENSE file for details.