Add a section on reproducibility to the docs #61

@hagenw

The results you get back when running a model can depend on the device, and can even vary across several calls on the same device. It might be a good idea to add a "Reproducibility" section to the documentation in which we discuss these issues.

For example, let us use the model introduced in w2v2-how-to:

import audeer
import audonnx
import numpy as np


# Download and extract the model
url = 'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.6bc4a7fd-1.1.0.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')

archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)

# Create a reproducible random test signal of one second
np.random.seed(1)
sampling_rate = 16000
signal = np.random.normal(size=sampling_rate).astype(np.float32)

Now, let us execute the model on the CPU:

>>> model = audonnx.load(model_root, device='cpu')
>>> model(signal, sampling_rate)['logits']
array([[0.6832043 , 0.64673305, 0.49750742]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.6832043 , 0.64673305, 0.49750742]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.6832043 , 0.64673305, 0.49750742]], dtype=float32)

When using the CPU, we always get back the same result
when executing the model multiple times.
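Such a determinism check can be sketched with a small helper that calls a model several times and compares the outputs bitwise. Here `predict` is a hypothetical deterministic stand-in; for the real check one would call the loaded audonnx model instead:

```python
import numpy as np


def is_deterministic(predict, signal, runs=3):
    # Call predict several times and check for bitwise-identical output
    reference = predict(signal)
    return all(
        np.array_equal(reference, predict(signal))
        for _ in range(runs - 1)
    )


# Hypothetical stand-in for model(signal, sampling_rate)['logits']
predict = lambda x: np.tanh(x[:3]).astype(np.float32).reshape(1, -1)

np.random.seed(1)
signal = np.random.normal(size=16000).astype(np.float32)
print(is_deterministic(predict, signal))
```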

Next, let us switch to the GPU:

>>> model = audonnx.load(model_root, device='cuda:0')
>>> model(signal, sampling_rate)['logits']
array([[0.68319285, 0.64667934, 0.49738473]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.68317926, 0.6466613 , 0.4974225 ]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.683162  , 0.64668435, 0.4973961 ]], dtype=float32)

We see that the results differ from the fifth decimal place on for each run,
and the average result deviates from the CPU-based result by:

array([[-2.62856483e-05, -5.79953194e-05, -1.06304884e-04]], dtype=float32)
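The deviation can be computed directly from the logits printed above (a small numpy sketch; exact values may differ in the last digits due to float32 arithmetic):

```python
import numpy as np

# Logits from the three GPU runs above
gpu_runs = np.array([
    [0.68319285, 0.64667934, 0.49738473],
    [0.68317926, 0.6466613, 0.4974225],
    [0.683162, 0.64668435, 0.4973961],
], dtype=np.float32)

# Logits from the CPU run above
cpu = np.array([[0.6832043, 0.64673305, 0.49750742]], dtype=np.float32)

# Deviation of the averaged GPU result from the CPU result
deviation = gpu_runs.mean(axis=0, keepdims=True) - cpu
print(deviation)
```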

This is a known ONNX Runtime limitation (microsoft/onnxruntime#9704).
In microsoft/onnxruntime#4611 (comment) it is proposed to select a fixed convolution algorithm to improve this behavior; see also https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking.
With audonnx we can achieve this by:

>>> providers = [("CUDAExecutionProvider", {'cudnn_conv_algo_search': 'DEFAULT'})]
>>> model = audonnx.load(model_root, device=providers)
>>> model(signal, sampling_rate)['logits']
array([[0.683191  , 0.64670646, 0.4973919 ]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.6830938 , 0.6466217 , 0.49734592]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.6831656 , 0.64666504, 0.497427  ]], dtype=float32)

Unfortunately, this does not really improve the results.

It seems that we can only recommend the following when reproducibility is desired:

  • use the CPU as device
  • round the output of the model to two decimal places, e.g. array([[0.68, 0.65, 0.50]], dtype=float32)
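The rounding can be done with numpy, e.g. (a minimal sketch using the CPU logits from above):

```python
import numpy as np

logits = np.array([[0.6832043, 0.64673305, 0.49750742]], dtype=np.float32)

# Round to two decimal places to hide device-dependent noise
rounded = np.round(logits, 2)
print(rounded)
```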

/cc @audeerington
