TensorFlow Docker example on Ampere GPUs

Our docker/tensorflow_python/ [example](https://github.com/CHTC/templates-GPUs/blob/992b2c7423af8b46faa627b4f198310e1bc0b0c8/docker/tensorflow_python/test_tensorflow.py) fails on the A100 servers in CHTC with the error:
```
2021-03-24 19:51:27.096512: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
         [[{{node MatMul}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "test_tensorflow.py", line 41, in <module>
    sess.run(productg.op)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
         [[node MatMul (defined at test_tensorflow.py:22) ]]
Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
 Variable/read (defined at test_tensorflow.py:20)
 Variable_1/read (defined at test_tensorflow.py:21)
```

I resolved the error by switching to the latest TensorFlow Docker image ([2.4.1-gpu](https://hub.docker.com/layers/tensorflow/tensorflow/2.4.1-gpu/images/sha256-03e706e09b0425bb4f634a644c5d869f3b6d6c027411ccca14c18719121d3064?context=explore)) and adding the two lines of TensorFlow [migration code](https://www.tensorflow.org/guide/migrate):
```
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
```
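For reference, here is a minimal sketch of what a TF 1.x-style matrix multiply test could look like with those two lines in place. It is not the actual `test_tensorflow.py`: the function name `gpu_matmul_shape` and the `allow_soft_placement` setting are illustrative assumptions, and only the 8192 x 8192 size comes from the traceback above.

```python
import tensorflow.compat.v1 as tf

# Keep TF 1.x graph/session semantics when running under TensorFlow 2.x.
tf.disable_v2_behavior()


def gpu_matmul_shape(n=8192):
    """Multiply two random n x n matrices in a TF1-style session.

    Hypothetical helper for illustration; returns the shape of the product.
    """
    with tf.Graph().as_default():
        a = tf.Variable(tf.random_uniform((n, n)))
        b = tf.Variable(tf.random_uniform((n, n)))
        product = tf.matmul(a, b)

        # Assumption: allow falling back to CPU if no GPU is visible,
        # so the sketch also runs on a machine without a GPU.
        config = tf.ConfigProto(allow_soft_placement=True)
        with tf.Session(config=config) as sess:
            sess.run(tf.global_variables_initializer())
            return sess.run(product).shape


if __name__ == "__main__":
    print(gpu_matmul_shape())
```

If I understand the compatibility module correctly, `tensorflow.compat.v1` is also importable from late 1.x releases, so a single script written this way might serve both 1.x and 2.x images, though I have only tested it with 2.4.1-gpu.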

We need to consider how to update this example. Should we have a TensorFlow 1.x example and a separate 2.x example? Do we need to constrain which servers each example can run on?
