Open
Description
Our docker/tensorflow_python/ example fails on the A100 servers in CHTC with the error:
2021-03-24 19:51:27.096512: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
     [[{{node MatMul}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_tensorflow.py", line 41, in <module>
    sess.run(productg.op)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
     [[node MatMul (defined at test_tensorflow.py:22) ]]

Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
 Variable/read (defined at test_tensorflow.py:20)
 Variable_1/read (defined at test_tensorflow.py:21)
I resolved the error by switching to the latest TensorFlow Docker image (2.4.1-gpu) and adding the two lines of TensorFlow migration code:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
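If we want a single script that runs on both the old and the new image, the two lines above could be applied conditionally. This is only a rough sketch: `needs_v1_compat` and `load_tf` are hypothetical helper names, not part of the current example, and I have not tested it against the existing 1.x image.

```python
def needs_v1_compat(version):
    """Return True if a TensorFlow version string is 2.x or newer."""
    major = int(version.split(".")[0])
    return major >= 2


def load_tf():
    """Return a module exposing the 1.x graph/session API on either major version.

    TensorFlow is imported lazily, so this file still loads where it is not installed.
    """
    import tensorflow

    if needs_v1_compat(tensorflow.__version__):
        # Same two migration lines as above, applied only on TensorFlow 2.x.
        import tensorflow.compat.v1 as tf
        tf.disable_v2_behavior()
        return tf
    return tensorflow
```

The script would then start with `tf = load_tf()` in place of its current TensorFlow import, and the rest of the 1.x-style code (variables, `tf.Session()`, `sess.run(...)`) would stay unchanged.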
We need to decide how to update this example. Should we keep a TensorFlow 1.x example and add a separate 2.x example? Do we need to restrict each example to the servers it is compatible with?
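To make the server question concrete, here is a small sketch of the compatibility rule we might end up encoding. It rests on assumptions I have not verified on our pool: that the failure comes from the old image's CUDA build predating the A100 (compute capability 8.0), and that the 2.4.1-gpu image also runs on our older GPUs. The example names `tensorflow_1x` and `tensorflow_2x` are hypothetical.

```python
# Assumption: GPUs at or above this compute capability (the A100 is 8.0)
# cannot run the TensorFlow 1.x image; older GPUs can run both images.
AMPERE_CAPABILITY = 8.0


def compatible_examples(capability):
    """Return the (hypothetical) example names assumed to run on a GPU
    with the given CUDA compute capability."""
    examples = ["tensorflow_2x"]
    if capability < AMPERE_CAPABILITY:
        examples.append("tensorflow_1x")
    return examples


if __name__ == "__main__":
    for cap in (7.5, 8.0):
        print(cap, compatible_examples(cap))
```

If this rule holds, only the 1.x example would need a constraint in its submit file, keeping it off the A100 servers.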