Official implementation of "FAR-FD: Feature Augmented Retrieval for Fraud Detection"
Abstract: Fraud detection plays a crucial role in the financial industry, preventing significant financial losses. Traditional rule-based systems and manual audits often struggle with the evolving nature of fraud schemes and the vast volume of transactions. While traditional machine learning and deep learning methods have made headway, significant room for improvement remains with problems such as class imbalance, high feature cardinality and adversarial dynamics. To address these limitations, we propose FAR-FD, the first work to integrate a subset of important features in Retrieval Augmented Classification (RAC), and the second work to use RAC for fraud detection. Our model builds on a pre-trained SAINT encoder, a self-supervised learning method, and comprises retrieval, integration, and predictor modules, jointly trained to dynamically leverage similar instances for each input sample. This approach not only enables the model to use the context of similar fraud patterns but also uniquely positions it for real-time fraud detection by maintaining an external database that can be continuously updated as sophisticated fraud patterns emerge, without requiring model retraining. We validate the effectiveness of FAR-FD through extensive experiments on a large-scale real-world dataset and achieve state-of-the-art performance in detecting fraudulent activities. Our code is available at https://github.com/annimukherjee/FAR-FD.
TL;DR: Using a few important features as context in a retrieval augmented classification system to predict fraud.
- Clone the repository.
- Download the IEEE CIS Dataset from Kaggle (link) and save it in a new directory called `datasets/ieee-fraud-detection-datasets/{train/test}`. Move `test_identity.csv` and `test_transaction.csv` into `datasets/ieee-fraud-detection-datasets/test`, and move `train_identity.csv` and `train_transaction.csv` into `datasets/ieee-fraud-detection-datasets/train`.
- Run `0_combine_ieee_cis.ipynb` in `0_pre-process-data` with the Kaggle dataset. This will generate two CSV files in the `datasets/ieee-processed` directory, named `ieee-train-merged.csv` and `ieee-train-merged_imputed_cleaned.csv`. We will use `ieee-train-merged_imputed_cleaned.csv` for the rest of the experiments.
- Now, we have to encode our dataset. We use the SAINT encoder published in this paper (official code). We use the unofficial SAINT implementation for our experiments. Using the generated `ieee-train-merged_imputed_cleaned.csv`, run the `0_ieee-preprocess-dataset-split-saint.ipynb` notebook.
- Then run `dataset_split.ipynb` to obtain the train, validation and test splits.
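
The file placement described in the download step can also be scripted. The following is a minimal sketch, not part of the repository: the function name `arrange_kaggle_files` and the `download_dir` argument are hypothetical, and it assumes the four Kaggle CSVs have already been unzipped into a single folder.

```python
from pathlib import Path
import shutil

# Target layout taken from the setup steps; adjust if your checkout differs.
DATA_ROOT = Path("datasets/ieee-fraud-detection-datasets")
FILES = {
    "train": ["train_identity.csv", "train_transaction.csv"],
    "test": ["test_identity.csv", "test_transaction.csv"],
}


def arrange_kaggle_files(download_dir, data_root=DATA_ROOT):
    """Move the Kaggle CSVs from download_dir into train/ and test/.

    Hypothetical helper. Returns the names of expected files that were found
    neither in download_dir nor already in their target directory.
    """
    download_dir = Path(download_dir)
    data_root = Path(data_root)
    missing = []
    for split, names in FILES.items():
        target = data_root / split
        # Create datasets/ieee-fraud-detection-datasets/<split> if needed.
        target.mkdir(parents=True, exist_ok=True)
        for name in names:
            src = download_dir / name
            if src.exists():
                shutil.move(str(src), str(target / name))
            elif not (target / name).exists():
                missing.append(name)
    return missing


# Example call, assuming the unzipped files sit in a folder named "downloads":
# arrange_kaggle_files("downloads")
```

Run from the repository root, the helper creates both subdirectories and reports any of the four files it could not find, which makes an incomplete download visible before the notebooks are run.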
We provide the preprocessed datasets used in our experiments. You can download them from the following link:
https://www.kaggle.com/competitions/ieee-fraud-detection/data

