Authors: Marine NEYRET (Institut Louis Bachelier DataLab), Pierre SEVESTRE (Société Générale, DataLab IGAD)
For more details, see the arXiv version of the paper.
In this work, we propose an experimental pipeline to challenge the robustness of Graph Representation Learning (GRL) models to properties encountered on transactional graphs from a bank's perspective. A bank does not have a complete view of all transactions performed on the market: its subgraph is restricted to transactions involving at least one of its clients, and is therefore sparser than the complete transaction graph. This results in the following specificities:
- **Graph sparsity**: a large part of the transactions occurring between non-clients is not available from the bank's point of view;
- **Asymmetric node information**: more attributes are available for clients than for non-clients.
To analyze the behavior of GRL models when dealing with these transactional graph specificities, two preprocessors are created that deteriorate a full graph to match either the graph sparsity or the asymmetric node information characteristic.
While results are reported on four graphs commonly used in the literature (Pubmed, CoAuthor CS, Amazon Computer, and Reddit), the pipeline can easily be used on other graphs.
Preprocessor details:
The `sampler` preprocessor deals with graph sparsity:
- Sample a portion of nodes that will act as client nodes,
- Remove any edge not connected to this subset of nodes.
The `features_sampler` preprocessor handles asymmetric node information:
- Sample a portion of nodes that will act as client nodes, the remaining ones being non-client nodes,
- Remove any edge not connected to this subset of nodes,
- Deteriorate a portion of the node features of non-client nodes.
Parameters:
- `sampling_ratio`: portion of nodes to be considered as clients,
- `features_sampling_ratio` (only for `features_sampler`): portion of node features to be deteriorated.
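The two preprocessors can be sketched on a toy graph as follows. This is a minimal NumPy illustration of the sampling and `'zeros'` deterioration steps, not the pipeline's actual implementation, which operates on DGL graphs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 6 nodes, an edge list of (src, dst) pairs, 4-dim node features.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
features = rng.normal(size=(6, 4))

# Step 1: sample client nodes (sampling_ratio = 0.5 -> 3 of 6 nodes).
clients = set(rng.choice(6, size=3, replace=False).tolist())

# Step 2 (both preprocessors): keep only edges touching at least one client.
kept = [(u, v) for u, v in edges if u in clients or v in clients]

# Step 3 (features_sampler only): deteriorate a portion of the features of
# non-client nodes, here with the 'zeros' method and
# features_sampling_ratio = 0.5 (2 of 4 feature columns).
cols = rng.choice(4, size=2, replace=False)
non_clients = [n for n in range(6) if n not in clients]
features[np.ix_(non_clients, cols)] = 0.0
```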
You can create your own preprocessor in `preprocessor.py` to experiment with any other graph characteristic by inheriting from the `Preprocessor` class. You must define the internal function `_process` that performs the graph modifications.
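As an illustration, a hypothetical custom preprocessor could look like the sketch below. The `Preprocessor` base class shown here is a minimal stand-in for the real one in `preprocessor.py` (so the example is self-contained), and `RandomEdgeDropper`, `drop_ratio`, and the dict-based graph are illustrative assumptions:

```python
import random

# Minimal stand-in for the Preprocessor base class (the real one lives in
# preprocessor.py); only the _process hook matters for this sketch.
class Preprocessor:
    def __init__(self, **params):
        self.params = params

    def process(self, graph):
        return self._process(graph)

    def _process(self, graph):
        raise NotImplementedError


# Hypothetical custom preprocessor: drops a fixed fraction of edges at random.
class RandomEdgeDropper(Preprocessor):
    def _process(self, graph):
        # `graph` is sketched here as a dict with an 'edges' list; the real
        # pipeline passes DGL graphs.
        ratio = self.params.get("drop_ratio", 0.1)
        edges = graph["edges"]
        n_keep = int(len(edges) * (1 - ratio))
        graph["edges"] = random.sample(edges, n_keep)
        return graph
```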
Implemented and tested on the following environment:
- Linux (RHEL v7.9) with Tesla V100-PCIE-16GB
- Python 3.7
- PyTorch 1.5.1 and DGL 0.6.1
To use this experimental pipeline:
- Clone the repository
- Install the dependencies. The required Python libraries are listed in `requirements.txt`: run `pip install -r requirements.txt`
To run a GCN model (config defined in `config/gcn.json`) for node classification on the full Pubmed dataset:
python main.py --config gcn --dataset pubmed --rep 1
Expected output:
INFO | 28-Jan-22 17:00:20 | Computing test metrics..
INFO | 28-Jan-22 17:00:20 | Accuracy: 0.786, Precision macro: 0.778, Recall macro: 0.788, F1 macro: 0.781
To run a GraphSage model (config defined in `config/graphsage_lp.json`) for link prediction on the sampled Pubmed dataset (with a 0.5 ratio):
python main_lp.py --config graphsage_lp --dataset pubmed --preprocessor sampler --sampling_ratio 0.5 --rep 1
Expected output:
INFO | 28-Jan-22 17:18:00 | Computing test metrics..
INFO | 28-Jan-22 17:18:00 | Precision lp: 0.542, Recall lp: 0.951, Accuracy lp: 0.573, F1 lp: 0.690, Auc lp: 0.741, Ap lp: 0.701, Nb removed edges: 00000, Accuracy removed edges: 00000
3 main files are available to run experiments for the different training procedures and downstream tasks, each consisting of:
- `load_data`: starting from a DGL dataset, applies the preprocessor if needed, computes the train/validation/test masks, and stores the useful data for training and evaluation into a dictionary,
- `main`: trains one model and evaluates its performance.
Results, logs, model checkpoints, source code, and config are automatically saved to the path registered in the configuration file.
The most important argument of the main files is `config`: the configuration file to run the experiment from, stored in the `config` folder.
Each experiment is associated with a configuration file describing its characteristics. The configuration files should be saved in `json` format under `./configs/<config_name>.json`. The description of such a configuration is available below.
Config description
{
"dataset": {
"name": "<name>" // Name of the dataset to run experiments on, from AVAILABLE_DATASET in utils/main_utils.py
},
"preprocessor":{
"name": "<name>", // Preprocesor name, from preprocessor_dict in utils/main_utils.py or null
// for baseline
"params": {
"ratio": float, // Portion of client nodes
"method": "<mehod_name>", // Features deterioration method, either 'zeros', 'random',
// 'drop' or 'imputation'.
"ratio_features": float // Portion of features to deteriorate
}
},
"experiment": {
"ghost": false,
"output_folder": "<name>", // Name of the folder in which experiments outputs are stored
"use_saved_datadict": bool, // True to reuse preprocessed data for identical configuration
"ckpt_save_interval": 1000, // Checkpoint interval
"repetitions":10, // Number of independant model trainings and evaluation
"val_metrics_interval": 2, // Interval between two validation metrics computations
"monitor_training": { // For advance training monitoring
"save_weight": false,
"save_grad": false,
"save_interval": 1
},
"task": {
"name": "<task_name>", // Name of the downstream task, either 'node_classification'
// or 'link_prediction'
"params": {// Parameters associated with task. See example configurations provided.
}
}
},
"model": {
"name": "<model_name>", // Name of the model, from AVAILABLE_MODEL in utils/main_utils.py
"features": true, // Whether model requires node features or not
"params": {// Parameters of the model, must be available in model.py class definition
}
},
"training":{
"batch_size": int,
"n_epochs": int,
"loss":{
"use_weights": bool
},
"optimizer":{
"lr": float,
"weight_decay": float
},
"early_stopping":{
"patience": int,
"delta": float
}
},
"evaluation":{// Downstream task for link_prediction
"model": {
"name": "log_reg",
"params":{
"penalty": "none",
"solver":"newton-cg"
}
}
}
}
- Script: `main.py`
- Available models and corresponding config: implemented for the models below, but can easily be extended to any `nn.Module` with an appropriate `forward` method.

Model name | Config name |
---|---|
GCN | gcn |
GAT | gat |
GraphSage | graphsage |
$ python main.py --help
usage: main.py [-h] [--verbose VERBOSE] [--config CONFIG] [--tag TAG]
[--gpu GPU] [--dataset DATASET] [--rep REP]
[--preprocessor {sampler,features_sampler}]
[--sampling_ratio SAMPLING_RATIO]
[--features_sampling_ratio FEATURES_SAMPLING_RATIO]
optional arguments:
-h, --help show this help message and exit
--verbose VERBOSE Logging level and other information. Higher value
returns more specific information
--config CONFIG Name of the configuration file to run experiment from,
without .json extension
--tag TAG Specify tag to recognize experiment easily
--gpu GPU cpu: <0, gpu: >=0
--dataset DATASET Dataset name
--rep REP Number of repetitions of the experiment
--preprocessor {sampler,features_sampler}
Specify preprocessor. Either 'sampler', or
'features_sampler'
--sampling_ratio SAMPLING_RATIO
Sampling ratio to identify the portion of client nodes
--features_sampling_ratio FEATURES_SAMPLING_RATIO
Sampling ratio for nodes features deterioration
- Script: `main_lp.py`
- Available models and corresponding config: implemented for the models below, but can easily be extended to any `nn.Module` with an appropriate `forward` method.

Model name | Config name |
---|---|
GCN | gcn_lp |
GAT | gat_lp |
GraphSage | graphsage_lp |
$ python main_lp.py --help
usage: main_lp.py [-h] [--verbose VERBOSE] [--config CONFIG] [--tag TAG]
[--gpu GPU] [--dataset DATASET] [--rep REP]
[--preprocessor {sampler,features_sampler}]
[--sampling_ratio SAMPLING_RATIO]
[--features_sampling_ratio FEATURES_SAMPLING_RATIO]
optional arguments:
-h, --help show this help message and exit
--verbose VERBOSE Logging level and other information. Higher value
returns more specific information
--config CONFIG Name of the configuration file to run experiment from,
without .json extension
--tag TAG Specify tag to recognize experiment easily
--gpu GPU cpu: <0, gpu: >=0
--dataset DATASET Dataset name
--rep REP Number of repetitions of the experiment
--preprocessor {sampler,features_sampler}
Specify preprocessor. Either 'sampler', or
'features_sampler'
--sampling_ratio SAMPLING_RATIO
Sampling ratio to identify the portion of client nodes
--features_sampling_ratio FEATURES_SAMPLING_RATIO
Sampling ratio for nodes features deterioration
- Script: `main_unsup.py`
- Available models and corresponding config:

Model name | Config name |
---|---|
DeepWalk | deepwalk / deepwalk_lp |
Node2Vec | node2vec / node2vec_lp |

- Downstream task: to be specified using the `--task` argument
$ python main_unsup.py --help
usage: main_unsup.py [-h] [--verbose VERBOSE] [--config CONFIG] [--tag TAG]
[--gpu GPU] [--dataset DATASET] [--rep REP]
[--task {link_prediction,node_classification}]
[--preprocessor {sampler,features_sampler}]
[--sampling_ratio SAMPLING_RATIO]
[--features_sampling_ratio FEATURES_SAMPLING_RATIO]
optional arguments:
-h, --help show this help message and exit
--verbose VERBOSE Logging level and other information. Higher value
returns more specific information
--config CONFIG Name of the configuration file to run experiment from,
without .json extension
--tag TAG Specify tag to recognize experiment easily
--gpu GPU cpu: <0, gpu: >=0
--dataset DATASET Dataset name
--rep REP Number of repetitions of the experiment
--task {link_prediction,node_classification}
Downstream task to use embeddings for. Either
'link_prediction' or 'node_classification'
--preprocessor {sampler,features_sampler}
Specify preprocessor. Either 'sampler', or
'features_sampler'
--sampling_ratio SAMPLING_RATIO
Sampling ratio to identify the portion of client nodes
--features_sampling_ratio FEATURES_SAMPLING_RATIO
Sampling ratio for nodes features deterioration
You can use the `gridsearch.py` script to perform a hyperparameter search on a dataset. Example below for GCN on Pubmed:
python gridsearch.py --training_process sup --config gcn --gs_config gs_gcn --dataset pubmed
A second configuration file is needed for the gridsearch, providing the lists of parameter values to search over. The description of such a configuration is available below.
Gridsearch config example
{
"model": {
"params": { // Parameters of the model, must be available in model.py class definition
"<name_parameter>": [value, value, value],
...
}
},
"training": {// Parameters for training
"<param_categ>":{
"<name_parameter>": [value, value, value],
...
}
}
}
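For instance, a filled-in gridsearch configuration for a GCN might look like the sketch below. The model parameter names (`n_hidden`, `dropout`) are illustrative assumptions and must match those accepted by the model class in `model.py`; `lr` and `weight_decay` follow the `optimizer` block of the base configuration:

```json
{
    "model": {
        "params": {
            "n_hidden": [16, 32, 64],
            "dropout": [0.2, 0.5]
        }
    },
    "training": {
        "optimizer": {
            "lr": [0.01, 0.001],
            "weight_decay": [0.0005, 0.00005]
        }
    }
}
```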
You can use `launch.py` to run one model on several datasets and with several preprocessors. To run all the experiments for a GCN model:
python launch.py --script main.py --config gcn
You can find a Jupyter notebook, in the notebook folder, to:
- merge all the results into an Excel file,
- plot graphs showing the performance evolution with respect to `sampler` or `features_sampler`.
This research was conducted within the "Data Science for Banking Audit" research program, under the aegis of the Europlace Institute of Finance, a joint program with the General Inspection at Société Générale.
Released under the terms of the BSD 2-Clause License.
Third Party Libraries
Component | Version | License |
---|---|---|
gensim | 3.8.3 | GNU LGPL 2.1 |
GPUtil | 1.4.0 | MIT |
karateclub | 1.0.21 | MIT |
matplotlib | 3.3.2 | Python Software Foundation |
networkx | 2.5 | BSD |
node2vec | 0.4.0 | MIT |
nodevectors | 0.1.23 | MIT |
numpy | 1.19.0 | BSD-3 |
pandas | 1.1.5 | BSD-3 |
pyarrow | 0.13.0 | Apache v2 |
pyrsistent | 0.16.0 | MIT |
scikit-learn | 0.23.2 | BSD-3 |
torch | 1.5.1+cu101 | BSD-3 |
torchvision | 0.6.1+cu101 | BSD |
tqdm | 4.50.2 | MIT License, Mozilla Public License 2.0 (MPL 2.0) |
dgl-cu101 | 0.6.1 | Apache Software License (APACHE) |