This repository contains the source code and additional material related to the paper "Advancing Fraud Detection Systems through Online Learning" [1] URL.
Important
This code requires a banking dataset that, for privacy reasons, we cannot share with the scientific community. The dataset is characterized by the following features: TransactionID, Amount, IBAN, IBAN_CC, IP, IDSessione, Timestamp,UserID, confirm_SMS, CC_ASN.
Note
If you use this code, please include the following citation:
@inproceedings{paladini2023advancingfds,
  author       = {Tommaso Paladini and
                  Martino {Bernasconi de Luca} and
                  Michele Carminati and
                  Mario Polino and
                  Francesco Trov{\`{o}} and
                  Stefano Zanero},
  title        = {{Advancing Fraud Detection Systems Through Online Learning}},
  booktitle    = {Machine Learning and Knowledge Discovery in Databases: Applied Data
                  Science and Demo Track - European Conference, {ECML} {PKDD} 2023,
                  Turin, Italy, September 18-22, 2023, Proceedings, Part {VI}},
  series       = {Lecture Notes in Computer Science},
  volume       = {14174},
  pages        = {275--292},
  publisher    = {Springer},
  year         = {2023},
  url          = {https://doi.org/10.1007/978-3-031-43427-3\_17},
  doi          = {10.1007/978-3-031-43427-3\_17},
  timestamp    = {Sun, 12 Nov 2023 02:12:57 +0100},
}
The rapid increase in digital transactions has led to a consequential surge in financial fraud, requiring an automatic way of defending effectively from such a threat. The past few years experienced a rise in the design and use by financial institutions of different machine learning-based fraud detection systems. However, these solutions may suffer severe drawbacks if a malevolent adversary adapts their behavior over time, making the selection of the existing fraud detectors difficult. In this paper, we study the application of online learning techniques to respond effectively to adaptive attackers. More specifically, the proposed approach takes as input a set of classifiers employed for fraud detection tasks and selects, based on the performances experienced in the past, the one to apply to analyze the next transaction. The use of an online learning approach guarantees to keep at a pace the loss due to the adaptive behavior of the attacker over a given learning period. To validate our methodology, we perform an extensive experimental evaluation using real-world banking data augmented with distinct fraudulent campaigns based on real-world attackers’ models. Our results demonstrate that the proposed approach allows prompt updates to detection models as new patterns and behaviors are occurring, leading to a more robust and effective fraud detection system.
This repository requires Python >=3.9. Create a virtual environment with $ python3.9 -m venv .venv, activate the environment with $ source .venv/bin/activate and install requirements with $ pip install -r requirements.txt.
Note
Differently than the paper, in the code, fraud strategies have different names:
SHORT_TERM (CODE) --> MEDIUM_TERM (PAPER)
SINGLE_FRAUD (CODE) --> SHORT_TERM (PAPER)
LONG_TERM (CODE) --> LONG_TERM (PAPER)
To create a synthetic fraud campaign, first create a configuration file with:
$ mkdir config/dataset_synth_fraud/campaigns
$ touch config/dataset_synth_fraud/campaigns/CAMPAIGN_NAME.json  # e.g., lt_is.jsonExample of configuration:
{
    "campaign_name": "lt_is",
    "victims_number_2014_15": 300,
    "strategies": [
        {
            "percentage": 100,
            "min_frauds": 28,
            "max_frauds": 112,
            "min_amount": 200,
            "max_amount": 1000,
            "total_amount": 50000,
            "duration_h": 672,
            "strategy": "rand_gauss",
            "hijack": false
        }
    ]
}Then, to generate the synthetic fraud run the following:
$ python -m synth_fraud_crafting.craft_campaign_all_year DATASET CAMPAIGN_NAMEWhere DATASET points to the banking data, while CAMPAIGN_NAME is the path of the configuration file manually included in the previous step.
Then, generate training and test data with synth_fraud_crafting.craft_campaign_weekly. Please note that names are hardcoded, but can be extended. The script will grab data from the campaign generated for the whole "data duration", and mix them weekly according to a string. For more information, check the related code.
For example, create a (LT,IS) training set with:
python3 -m synth_fraud_crafting.craft_campaign_weekly llllll0 OBSERVED_CAMPAIGN_NAME_TRAINand a test set with the alternating (LT,IS) and (ST,TH) with:
$ python3 -m synth_fraud_crafting.craft_campaign_weekly 000000lFlFlFlflFlFlF OBSERVED_CAMPAIGN_NAME_TESTLastly, "aggregate" data to bring the dataset in feature space with:
python3 -m preprocessing.aggregation DATASET dataset/fraud_generated/OBSERVED_CAMPAIGN_NAMETo train and select hyperparameters and feature sets of the FDSs, refer to the code in model_selection.
Experimental code can be found in experiments/exp1.py, experiments/exp2.py, experiments/MWU.ipynb.
[1] Tommaso Paladini, Martino Bernasconi de Luca, Michele Carminati, Mario Polino, Francesco Trovò, and Stefano Zanero. "Advancing Fraud Detection Systems Through Online Learning." In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 275-292. Cham: Springer Nature Switzerland, 2023.