TensorFlow Docker example on Ampere GPUs #10

@agitter

Our docker/tensorflow_python/ example fails on the A100 servers in CHTC with the error:

2021-03-24 19:51:27.096512: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
         [[{{node MatMul}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "test_tensorflow.py", line 41, in <module>
    sess.run(productg.op)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
         [[node MatMul (defined at test_tensorflow.py:22) ]]
Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
 Variable/read (defined at test_tensorflow.py:20)
 Variable_1/read (defined at test_tensorflow.py:21)

I resolved the error by switching to the latest TensorFlow Docker image (2.4.1-gpu) and adding these two lines of TensorFlow 1.x-to-2.x migration code at the top of the script:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
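
For reference, here is a minimal sketch of what a migrated matrix-multiplication test could look like under the 2.4.1-gpu image. It is not the repository's test_tensorflow.py: the function name `run_matmul`, the default matrix size, and the variable names are illustrative assumptions. Only the two migration lines come from the fix described above.

```python
def run_matmul(n=1024):
    """Multiply two random n x n matrices using the TF 1.x graph API on TF 2.x."""
    # Imported inside the function so this file still loads where TensorFlow
    # is not installed (an assumption made for this sketch only).
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()  # restore graph/session semantics from TF 1.x

    # Build the graph: two randomly initialized variables and their product.
    with tf.device("/gpu:0"):
        a = tf.Variable(tf.random.uniform(shape=(n, n)), name="a")
        b = tf.Variable(tf.random.uniform(shape=(n, n)), name="b")
        product = tf.matmul(a, b)

    # allow_soft_placement lets TensorFlow fall back to the CPU if no GPU is visible.
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        result = sess.run(product)
    return result.shape
```

Calling `run_matmul(8192)` would reproduce the matrix size from the traceback above. The rest of a 1.x script can stay unchanged because `tf` now refers to the `compat.v1` module.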

We need to consider how to update this example. Should we have a TensorFlow 1.x example and a separate 2.x example? Do we need to constrain each example to the servers it is compatible with?
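
One way to think about the server constraint, sketched below under stated assumptions: the A100 reports CUDA compute capability 8.0, so a 1.x example could be kept off Ampere machines by requiring a lower capability, while the 2.x example would need no such restriction. This assumes the failure comes from the older CUDA libraries in the 1.x image, which this issue has not confirmed. The attribute name `CUDACapability` and the helper `gpu_requirement` are assumptions for illustration; the actual attribute depends on what the pool advertises.

```python
# Compute capability reported by Ampere A100 GPUs.
AMPERE_CAPABILITY = 8.0


def gpu_requirement(example):
    """Return a requirements expression for an example variant, or None.

    `example` is "tf1" or "tf2" (hypothetical labels for the two variants).
    The attribute name CUDACapability is an assumption, not verified here.
    """
    if example == "tf1":
        # Keep the 1.x example off Ampere and newer GPUs.
        return "CUDACapability < {:.1f}".format(AMPERE_CAPABILITY)
    if example == "tf2":
        # The 2.x image ran on the A100 servers, so no upper bound is added.
        return None
    raise ValueError("unknown example variant: {!r}".format(example))
```

For the 1.x variant this yields the string `CUDACapability < 8.0`, which could be placed in that example's submit file if the pool uses that attribute.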
