During training, you can monitor your experiments with TensorBoard.
We also try to provide some useful functionality to quickly evaluate and compare the results of your experiments.
You can use `evaluate_experiment.py` to get a quick first impression of a finished experiment run.
You can use the plotting pipeline with your customized settings (as shown in the usage examples). Alternatively, you can use the script to export your data to a .csv file and process it to your own needs.
In this scenario, set `evaluation.output_types: [csv]` (no plotting, just the data) in your experiment yaml.
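For orientation, here is a minimal sketch of what such an experiment yaml could look like; the `optimizer` and `task` entries are placeholders and the nesting of the `evaluation` key is assumed from the dotted name above:

```yaml
# hypothetical experiment yaml -- only the evaluation key matters here
optimizer:
  name: adamw   # placeholder
task:
  name: mnist   # placeholder
evaluation:
  output_types: [csv]  # no plotting, just the data
```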
In the following, you can find four example use cases for experiments and how to visualize the results as heatmaps:
- testing an optimizer on a task
- comparing two optimizers on the same task
- comparing multiple optimizers on different tasks
- comparing the influence of a single hyperparameter
Here we want to focus on the plotting. For instructions on how to run experiments, refer to the main README. To get started right away, we provide the data for this example. If you want to reproduce it, refer to this section.
By default, calling `run_experiment.py` will plot the experiment after training and testing. To disable this behavior, set `engine.plot=false`.
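If you prefer to keep this flag in the config file instead, the dotted key presumably maps to nested yaml along these lines (a sketch, not taken from the repository defaults):

```yaml
engine:
  plot: false  # skip the automatic plotting after training and testing
```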
To plot your experiment afterwards, call `evaluate_experiment.py` with the same experiment yaml. To adjust the plotting, change the values under the `evaluation` key of the experiment. Take a look at `evaluation/default.yaml` to see which settings are available. Some of these keys are explained in the examples below to give the reader a first impression. Note that some default parameters are set in the respective tasks (e.g. in `tasks/mnist/default.yaml`).
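As a rough orientation (not the full list of options, and the exact nesting of the dotted keys is not shown), an `evaluation` block could combine several of the settings used in the examples below:

```yaml
evaluation:
  output_types: [png, pdf]   # file formats to write
  checkpoints: [last, best]  # which checkpoints to include
  # further keys used in the examples below:
  # plot.x_axis, plot.std, plot.aggfunc,
  # column_split_key, split_groups, aggregate_groups
```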
Here are some example scenarios to give you an understanding of how our plotting works. Run the commands from the root of the repository. Take a look at the yaml files used in the commands to see what is going on.
This example is a good starting point; it shows the performance of a single default optimizer on one of the tasks. Experiment file: examples/plotting/1_mnist-adamw.yaml
python -m pytorch_fob.evaluate_experiment examples/plotting/1_mnist-adamw.yaml
This example uses only the final model performance and only creates the plot as a png.
Helpful settings:
checkpoints: [last]  # you could use [last, best] to additionally plot the model with the best validation
output_types: [png]  # you could use [pdf, png] to also create a pdf
You can compare two different optimizers.
Experiment file: examples/plotting/2_adamw-vs-sgd.yaml
python -m pytorch_fob.evaluate_experiment examples/plotting/2_adamw-vs-sgd.yaml
Helpful settings:
plot.x_axis: [optimizer.weight_decay, optimizer.kappa_init_param]  # the values given here are used for the x-axis; the order in the list is used from left to right for the plot columns
column_split_key: optimizer.name
This creates a column for each different optimizer (default behavior). You can set this to `null` to disable columns or choose a different key.
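For illustration, the three variants described above might look like this (a sketch; only one line would be active at a time):

```yaml
column_split_key: optimizer.name  # default: one column per optimizer
# column_split_key: null          # disable columns
# column_split_key: task.name     # split columns on a different key instead
```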
There are multiple tasks in the benchmark; this example shows how to get a quick overview of multiple tasks at the same time.
Experiment file: examples/plotting/3_mnist-and-tabular_adamw-vs-sgd.yaml
python -m pytorch_fob.evaluate_experiment examples/plotting/3_mnist-and-tabular_adamw-vs-sgd.yaml
Helpful settings:
split_groups: ["task.name"]
Every non-unique value for each parameter name in `split_groups` will create its own subplot. Instead of a list, you can set this to `false` to disable splitting, or to `true` to split on every parameter that differs between runs (except those already in `column_split_key` or `aggregate_groups`). Using a list is useful if there are just a few parameters you want to split on.
Any parameter that is neither on the x-axis nor y-axis will either be aggregated over or split into subplots.
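A sketch of the three variants described above (pick one; the alternatives are shown commented out):

```yaml
split_groups: ["task.name"]  # one subplot per task
# split_groups: false        # disable splitting
# split_groups: true         # split on every parameter that differs between runs
```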
Any individual square of a heatmap shows the mean and std over multiple runs (as seen in the previous plots). Here we show how to choose the runs to aggregate.
Experiment file: examples/plotting/4_adamw-vs-sgd_seeds.yaml
python -m pytorch_fob.evaluate_experiment examples/plotting/4_adamw-vs-sgd_seeds.yaml
Helpful settings:
- Control the std with
  plot.std  # toggle off with False
  plot.aggfunc: std  # also try var
- Control the rows with
  split_groups: ["engine.seed"]
  aggregate_groups: []
By default, the plot will display the mean and std calculated over the seeds. We need to remove the seed from the `aggregate_groups` list (by giving an empty list instead). This list is useful if there are additional parameters you want to aggregate over.
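In other words (a sketch; the default value of `aggregate_groups` containing the seed is inferred from the description above, not copied from the repository defaults):

```yaml
# assumed default: runs that differ only in their seed are aggregated into mean and std
# aggregate_groups: ["engine.seed"]

# this example: one row per seed instead of aggregating
split_groups: ["engine.seed"]
aggregate_groups: []
```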
Let's create some data that we can plot; run the following commands from the root directory.
First, we make sure the data is downloaded beforehand:
python -m pytorch_fob.dataset_setup examples/plotting/3_mnist-and-tabular_adamw-vs-sgd.yaml
This will download the mnist data (required for examples 1-4) and the tabular data (required for example 3) into the `examples/data` directory. The path can be changed in the corresponding yaml you want to use (e.g. `examples/plotting/1_mnist-adamw.yaml`) if you have already set up your benchmark.
Estimated disk usage for the data: ~65M
The 2 tasks will be run on a 2x2 hyperparameter grid with 2 different seeds per optimizer, for a total of 32 runs (2 optimizers x 2 tasks x 4 hyperparameter combinations x 2 seeds = 32).
python -m pytorch_fob.run_experiment examples/plotting/3_mnist-and-tabular_adamw-vs-sgd.yaml
After training has finished, you should find 32 run directories in `examples/plotting/outputs`. All parameters that differ from their default values are noted in the directory names.