
Primary repository for our IEEE TDSC article "Shift Your Shape: Correlating and Defending Mixnet Flows Based on Their Shapes", including trained models and evaluation results.


"Shift Your Shape: Correlating and Defending Mixnet Flows Based on Their Shapes", IEEE TDSC 2025

Primary repository for our 2025 IEEE TDSC article "Shift Your Shape: Correlating and Defending Mixnet Flows Based on Their Shapes".

Authors: Lennart Oldenburg, Marc Juarez, Enrique Argones Rúa, Claudia Diaz

Caution: This repository requires close to 117 GB of disk space. Make sure you have enough free disk space available before cloning.
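Before cloning, it can be worth confirming the free space on the target filesystem. The following is a minimal sketch (our illustration, not a script shipped with the repository; it assumes GNU coreutils for `df --output`):

```shell
# Pre-flight check: verify the filesystem holding the clone target has at
# least ~117 GB free. TARGET_DIR is a placeholder for your clone directory.
need_kb=$((117 * 1024 * 1024))                        # 117 GB expressed in KiB
avail_kb=$(df --output=avail -k "${TARGET_DIR:-.}" | tail -n 1 | tr -d ' ')
if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "only ${avail_kb} KiB free, need ${need_kb} KiB" >&2
fi
```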

Abstract

When the packet rate of flows in a mixnet depends on the amount of transferred data, it is possible to identify which flow entering the mixnet corresponds to which flow exiting it based on their shapes. We present a passive shape-based flow correlation attack against the state-of-the-art mixnet Nym and a systematic evaluation of countermeasures. Assuming an adversary who controls both the entry and exit gateway-requesters selected by users to access the public Internet through Nym, our attack's artificial neural network assigns correlation scores to flow pairs based on traffic distribution similarities, accurately distinguishing paired from unpaired flow tuples. From data we collected on the live Nym mixnet, we generate 45 datasets and 119 testing scenarios for different defense configurations. After one minute of attacking flow pairs on default Nym, we achieve a PR-AUC of 0.9998 at a base rate of 1.9 x 10^−4 paired flow tuples. However, (combinations of) the five evaluated defense strategies indicate that the right choice and scale of countermeasure(s) can offer meaningful protection. Our evaluation also quantifies the resource overhead incurred by the defenses. We discuss steps a mixnet such as Nym can take to make our attack both less likely and less accurate.
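To make the reported base rate concrete: if each of n entry flows has exactly one true partner among n exit flows, only n of the n² candidate tuples are paired, so the base rate is 1/n. A quick back-of-the-envelope check (the flow count of ~5,300 is our own inference from 1/(1.9 x 10^−4), not a number stated in the article):

```shell
# Back-of-the-envelope check: with n flows per side, the base rate of paired
# tuples is n / n^2 = 1/n. A base rate of ~1.9e-4 corresponds to n ≈ 5300
# (our inference; the article reports only the base rate itself).
awk 'BEGIN { n = 5300; printf "base rate for n=%d: %.2e\n", n, 1/n }'
```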

List of Repositories

Given their different purposes and significant sizes, we split up this article's artifacts into multiple repositories:

  1. (This repository:) Primary Artifact:
    • This is the primary artifact of our article (and you are currently looking at it).
    • It contains all scripts, dataset and model configuration files, and Python source code to train, evaluate, and present the artificial neural network (ANN) classifiers we created in this work to correlate and defend mixnet flows.
    • Caution: Requires 117 GB of disk space.
  2. Flow Metadata Collector:
    • Repository containing the mixnet flow metadata collector we built and used to collect this work's datasets on the live Nym network (the remaining datasets, used to assess various defenses against our flow correlation attack, are generated by preprocessing dataset default at load time when training or testing an ANN).
    • Please see that repository's README.md file for detailed instructions on how to use it to collect datasets of mixnet flow metadata over Nym in version nym-binaries-v2023.5-rolo, by running its instances on the public cloud provider Hetzner (alternative cloud providers are possible with minimal manual changes).
    • Caution: This repository assumes Nym in version nym-binaries-v2023.5-rolo, which by the time you use it may have diverged from the deployed ("live") Nym network significantly enough that the collector no longer works without updates (or at all).
  3. Raw Dataset default:
    • Holds the raw experiment result files of experiment default.
    • Caution: Requires 50 GB of disk space.
  4. Raw Dataset high_delay:
    • Holds the raw experiment result files of experiment high_delay.
    • Caution: Requires 58 GB of disk space.
  5. Raw Dataset large_pkts:
    • Holds the raw experiment result files of experiment large_pkts.
    • Caution: Requires 32 GB of disk space.
  6. Ready Dataset default:
    • Holds the ready version of dataset default, ready for training and testing ANN classifiers on (see below).
    • Caution: Requires 10 GB of disk space.
  7. Ready Dataset high_delay:
    • Holds the ready version of dataset high_delay, ready for training and testing ANN classifiers on (see below).
    • Caution: Requires 13 GB of disk space.
  8. Ready Dataset large_pkts:
    • Holds the ready version of dataset large_pkts, ready for training and testing ANN classifiers on (see below).
    • Caution: Requires 6 GB of disk space.
  9. Mixnet Flow Visualizations:
    • Contains traffic shape visualizations of (a subset of) the flows we collected on Nym in version nym-binaries-v2023.5-rolo.
    • Caution: Requires 1.5 GB of disk space.

Situating the Repositories

If you intend to retrace all the steps we took for our article, you can use our Flow Metadata Collector to deploy the modifications to Nym's mixnet endpoints that you want to evaluate and to collect the datasets you are interested in, e.g., the three raw datasets linked above for the three experimental settings we wrote Nym patches for in our flow correlator repository. Once you have collected your raw datasets, you can again use the scripts and Jupyter Notebooks from our flow correlator repository to turn the raw datasets into their ready versions, suitable for training and testing ANN classifiers such as the ones in this (current) repository.

If you intend to re-run or modify the ANN-based flow correlation attack and defense evaluations from this (current) repository, you would modify or extend the model configurations (determining the ANN classifiers' parameters, such as convolutional, pooling, and dropout settings) or the dataset configurations (determining dataset parameters, such as the number of correlated flows to use, how to preprocess the flows upon loading, and how many negatives to construct per positive), update the training and testing scripts in this repository accordingly, and then run these scripts in a capable compute environment (we used NVIDIA P100 cards with 16 GB of memory for our evaluations).

Finally, if you intend to reproduce the figures we present in the paper from the datasets and models we obtained, take a look at the ./paper_figures folder, which contains the Python programs we used to create the article figures.

Usage of this Repository (Training and Evaluating the ANN Classifiers)

  1. On a compute environment suitable for running machine learning tasks (called cluster below), set up file system structure and clone all relevant repositories under your preferred non-root user account (called user below):
user@cluster $   SHIFTYOURSHAPE_ROOT=""   # TODO: INSERT YOUR DESIRED PATH TO PARENT DIRECTORY FOR ALL FOLLOWING STEPS (WITHOUT TRAILING SLASH), e.g.: SHIFTYOURSHAPE_ROOT="/tmp/shift-your-shape"
user@cluster $   mkdir -p "${SHIFTYOURSHAPE_ROOT}/datasets"
user@cluster $   cd "${SHIFTYOURSHAPE_ROOT}"
user@cluster $   git clone https://github.com/KULeuven-COSIC/shift-your-shape_correlating-and-defending-mixnet-flows-based-on-their-shapes.git
... This will take some time ...
user@cluster $   cd "${SHIFTYOURSHAPE_ROOT}/datasets"
user@cluster $   git clone https://github.com/KULeuven-COSIC/shift-your-shape_ready_exp01-default-nym-binaries-v2023.5-rolo_2024-04-09.git
... This will take some time ...
user@cluster $   git clone https://github.com/KULeuven-COSIC/shift-your-shape_ready_exp02-higher-mix-delay-nym-binaries-v2023.5-rolo_2024-04-15.git
... This will take some time ...
user@cluster $   git clone https://github.com/KULeuven-COSIC/shift-your-shape_ready_exp03-larger-packet-size-nym-binaries-v2023.5-rolo_2024-04-29.git
... This will take some time ...
  2. Replace the paths specific to our system, experiments, and thus log files with your own:
user@cluster $   cd "${SHIFTYOURSHAPE_ROOT}/shift-your-shape_correlating-and-defending-mixnet-flows-based-on-their-shapes/scripts"
user@cluster $   sed -i 's@/staging/leuven/stg_00162/mixnet-shape-correlation@'"${SHIFTYOURSHAPE_ROOT}"'/shift-your-shape_correlating-and-defending-mixnet-flows-based-on-their-shapes@g' *.sh
user@cluster $   sed -i 's@/staging/leuven/stg_00162@'"${SHIFTYOURSHAPE_ROOT}"'@g' *.sh
user@cluster $   grep -HIrin 'stg_00162' .
... Verify that the last command returns no matches (if it did, update them, too) ...
user@cluster $   cd "${SHIFTYOURSHAPE_ROOT}/shift-your-shape_correlating-and-defending-mixnet-flows-based-on-their-shapes/configs_datasets"
user@cluster $   sed -i 's@/staging/leuven/stg_00162@'"${SHIFTYOURSHAPE_ROOT}"'@g' *.json
user@cluster $   grep -HIrin 'stg_00162' .
... Verify that the last command returns no matches (if it did, update them, too) ...
  3. Install Miniconda to manage Python package dependencies, then install the required Python packages:
user@cluster $   cd "${SHIFTYOURSHAPE_ROOT}"
user@cluster $   ./scripts/vsc_1_setup_01_miniconda.sh
user@cluster $   ./scripts/vsc_1_setup_02_install-packages.sh

At this point, setup is complete and you are ready to run the actual machine learning jobs that will tune, train, validate, and test the ANN classifiers to correlate and defend mixnet flows. Note that these jobs take anywhere from many hours to many days to complete and assume SLURM as the job scheduler.

You can find the runtimes we set for each individual job as part of the job script's #SBATCH metadata lines at the top of each script. Please make sure to adjust or remove these #SBATCH lines before running any script in ./scripts that is neither vsc_1_setup_01_miniconda.sh nor vsc_1_setup_02_install-packages.sh.
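If you are running outside SLURM altogether, one way to neutralize these directives is to strip them from a copy of the job script before executing it. A sketch (the script name is one from this repository's ./scripts folder; adjust it to the job you want, and note that #SBATCH lines are inert comments outside SLURM anyway, so this mainly keeps the copy tidy):

```shell
# Sketch: write a copy of a job script with all #SBATCH directives removed,
# so it can be run directly without the SLURM scheduler.
sed '/^#SBATCH/d' vsc_3_train_01_exp01-default_no-splittun_no-gen.sh > run_local.sh
chmod +x run_local.sh
```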

  4. You can tune the hyperparameters of both the correlator and the generator model (please refer to our paper for information on what these networks represent), or skip this step if you intend to re-use the parameters we obtained during our tuning process (which took circa one week):
user@cluster $   ./scripts/vsc_2_tune_exp01-default_gen-nadam_discr-nadam.sh

You should not need to run the other tuning script (vsc_2_tune_resumed_exp01-default_discr-nadam.sh); we only used it to resume after an error aborted our initial tuning run.

Tuning script vsc_2_tune_exp01-default_gen-nadam_discr-nadam.sh will by default write its outputs into folder ./models_tuned. You can find the logs of our tuning runs in that folder.

  5. After tuning and potentially adjusting model parameters in ./configs_models based on your tuning results, you can start training and validating ANN classifiers on the three collected datasets as well as on any dataset generated by preprocessing one of the three collected datasets to represent a specific defense (please refer to our article for details):
user@cluster $   ./scripts/vsc_3_train_01_exp01-default_no-splittun_no-gen.sh    # This trains and validates a correlator over dataset default
... If you use SLURM, the above command will return immediately, but the job will take many hours to complete ...
user@cluster $   ./scripts/vsc_3_train_16_exp01-default_splittun-4_gen-5082.sh   # This trains and validates a correlator and a generator over dataset split4_inj2.0
... This job will again take many hours to complete ...

As we discuss in our article, we generate a large number of defense scenario datasets by preprocessing collected dataset default (ready_exp01-default). For the above script vsc_3_train_16_exp01-default_splittun-4_gen-5082.sh, take a look at dataset.py, ./configs_datasets/train_exp01-default_splittun-4.json, ./configs_models/discr.json, ./configs_models/gen_inj_b-5082.json, and train_generator_tandem_discriminator.py to understand how we preprocess dataset default to obtain dataset split4_inj2.0 and train and validate both models on it.
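Before committing many GPU hours, it can be worth verifying that any configuration file you edited still parses. A minimal sketch using Python's standard `json.tool` module (it only validates and pretty-prints JSON; it does not know which keys this repository's loaders expect):

```shell
# Validate edited configuration files before launching a long job.
# `python3 -m json.tool` exits non-zero on malformed JSON.
for cfg in configs_datasets/train_exp01-default_splittun-4.json \
           configs_models/discr.json \
           configs_models/gen_inj_b-5082.json; do
    python3 -m json.tool "$cfg" > /dev/null && echo "OK: $cfg"
done
```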

For the trained models, training/validation performance figures, and log files from our training and validation runs, please see the respective experiment folder in ./models_trained.

  6. After training and validation, you can assess the ability of a trained correlator and/or a trained generator to correlate or defend mixnet flows on an unseen dataset subset (the testing dataset) by running the desired testing script:
user@cluster $   ./scripts/vsc_4_test_01_exp01-default_no-splittun_no-gen.sh    # This tests the default-trained correlator on the unseen test subset of dataset default
... This job will take a few hours to complete ...
user@cluster $   ./scripts/vsc_4_test_16_exp01-default_splittun-4_gen-5082.sh   # This tests two scenarios: split4_inj2.0-trained correlator against split4_inj2.0-trained generator on split4_inj2.0 as well as default-trained correlator against split4_inj2.0-trained generator on split4_inj2.0
... This job will take a few hours to complete ...

Again, please take a look at the Python and configuration files that the respective script references to understand how the final attack-defense testing scenario is constructed. Also refer to the other vsc_4_test_*.sh scripts to see how we end up with the 119 distinct testing scenarios we report in our article.

By default, results from the testing stage will be written to a dedicated folder in ./models_tested. Here, you can also find the performance files, figures, and log files we obtained during our testing runs and report in our paper.

  7. You can produce the performance figures included in our training/validation and testing output folders for your own experimental results by running plot_train_curves.py and plot_test_curves.py, respectively:
user@cluster $   conda activate mixshapecorr
user@cluster $   python plot_train_curves.py --results_dir ./models_trained/<YOUR_TRAINING_RESULTS_DIR>
user@cluster $   python plot_test_curves.py --results_dir ./models_tested/<YOUR_TESTING_RESULTS_DIR>

BibTeX Entry

We will provide a BibTeX entry to cite our article once it has been published.

Licensing

This repository is licensed under GPLv3.
