GPU doesn't seem to work #29

Open
fengredrum opened this issue Mar 8, 2018 · 9 comments

Comments

@fengredrum

fengredrum commented Mar 8, 2018

I've set `use_gpu = True`, but GPU usage stays close to zero while the code runs, and TensorBoard shows that all operations are assigned to the CPU. When I removed `sess_config = tf.ConfigProto(allow_soft_placement=True)` to force everything onto the GPU, the console showed the following error:
`INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\Code\PythonScripts\DeepRL\BatchPPO\20180308T091941-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.
2018-03-08 09:19:41.315004: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-03-08 09:19:41.595863: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.64GiB
2018-03-08 09:19:41.596493: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
INFO:tensorflow:Graph contains 42003 trainable variables.
2018-03-08 09:19:57.811479: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
Traceback (most recent call last):
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1323, in _do_call
return fn(*args)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1293, in _run_fn
self._extend_graph()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1354, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 163, in <module>
tf.app.run()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 145, in main
for score in train(config, FLAGS.env_processes):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 127, in train
utility.initialize_variables(sess, saver, config.logdir)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\scripts\utility.py", line 116, in initialize_variables
tf.global_variables_initializer()))
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 889, in run
run_metadata_ptr)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _do_run
options, run_metadata)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]]

Caused by op 'ppo_temporary/episodes/Variable', defined at:
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 163, in <module>
tf.app.run()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 145, in main
for score in train(config, FLAGS.env_processes):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 113, in train
batch_env, config.algorithm, config)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\scripts\utility.py", line 48, in define_simulation_graph
algo = algo_cls(batch_env, step, is_training, should_log, config)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\ppo\algorithm.py", line 78, in __init__
template, len(batch_env), config.max_length, 'episodes')
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\ppo\memory.py", line 44, in __init__
self._length = tf.Variable(tf.zeros(capacity, tf.int32), False)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\variables.py", line 213, in __init__
constraint=constraint)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\variables.py", line 331, in _init_from_args
name=name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\state_ops.py", line 133, in variable_op_v2
shared_name=shared_name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 926, in _variable_v2
shared_name=shared_name, name=name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
op_def=op_def)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]]`

It seems that TensorFlow does not allow an int32 variable to be placed on the GPU.
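
The "Colocation Debug Info" in the log can be read as a set intersection: every op in the group must run on the same device, so the group can only go where all of its op types have a kernel. Below is a minimal pure-Python sketch of that reading. It is a toy model, not TensorFlow's actual placer; the dictionary is copied from the log above, and the function names `feasible_devices` and `place` are made up for illustration.

```python
# Devices with a kernel for each op type, copied from the colocation debug info.
COLOCATION_GROUP = {
    "Switch": {"GPU", "CPU"},
    "VariableV2": {"CPU"},
    "Identity": {"CPU"},
    "Assign": {"CPU"},
    "RefSwitch": {"GPU", "CPU"},
    "ScatterUpdate": {"CPU"},
    "AssignAdd": {"CPU"},
}


def feasible_devices(group):
    """Devices on which every op type in the group has a kernel."""
    return set.intersection(*group.values())


def place(group, requested, allow_soft_placement):
    """Toy placement rule (hypothetical helper, not a TensorFlow function)."""
    feasible = feasible_devices(group)
    if requested in feasible:
        return requested
    if allow_soft_placement and feasible:
        # Soft placement: fall back to a device that supports the whole group.
        return sorted(feasible)[0]
    raise ValueError(
        "Cannot assign a device: no supported kernel for %s" % requested)


print(feasible_devices(COLOCATION_GROUP))
print(place(COLOCATION_GROUP, "GPU", allow_soft_placement=True))
```

Under this model, requesting the GPU with soft placement enabled silently lands the group on the CPU, while disabling soft placement turns the same request into an error, which matches the behaviour described in this comment.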

@fengredrum
Author

BTW, this is running on Windows 10 with TensorFlow 1.4.

@danijar
Contributor

danijar commented Apr 10, 2018

Hi @fengredrum. In case this is still an issue, could you try wrapping your network implementation in a `with tf.device('/gpu:0')` block?

@fengredrum
Author

The neural network is assigned to the GPU; I've checked this in TensorBoard. The problem occurs in `agents/ppo/memory.py`, because `self._length` is an int32 variable. I tried initializing it as a `tf.float32` variable and then converting it with `tf.to_int32` to bypass the problem. The code works fine on CPU. However, when run on GPU, it doesn't seem to learn anything. Maybe there is some elusive bug in TensorFlow? LOL

@danijar
Contributor

danijar commented Apr 25, 2018

Thanks for providing more details. I don't think the replay buffer should be placed on the GPU, since it can grow quite large, especially when training from pixel observations. All ops should default to CPU because of the `with tf.device('/cpu:0')` block in train.py, and the network and RNN states should be specifically assigned to the GPU inside ppo.py. Is that not what is happening for you?

@colinskow

I tried running the default pendulum trainer. When I turn use_gpu on, it freezes during step 0 with no error. It runs fine otherwise. TF runs fine with my GPU on other operations.

TensorFlow v1.8, Ubuntu 18.04, NVIDIA GTX 1080, CUDA 9.0.

If my understanding is correct, TensorFlow automatically gives priority to the GPU for supported ops. Perhaps the `use_gpu` option should be removed if it doesn't work.

@fengredrum
Author

I agree: the transitions can be very large when training from pixel observations. I used to think that storing them in GPU memory would reduce the communication cost between CPU and GPU. However, it turns out that this may not be the best solution when training on a single machine. Thank you for your constructive feedback.

@danijar
Contributor

danijar commented Jul 14, 2018

@colinskow Could you try running without environment processes (--noenv_processes), please? When there is a crash in one of the processes it can cause the program to deadlock before anything is printed.
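
As an illustration of how a crashed worker can turn into a silent hang, here is a minimal sketch using Python's standard `multiprocessing` module. It is not taken from the agents code base: the worker, the function names, and the timeout values are all hypothetical, and it assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp


def _crashing_worker(conn):
    """Hypothetical environment worker that dies before sending a reply."""
    raise RuntimeError("simulated environment crash")


def step_with_timeout(timeout=0.5):
    """Wait for a worker reply, but inspect the worker if none arrives."""
    ctx = mp.get_context("fork")  # assumption: POSIX, "fork" is available
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_crashing_worker, args=(child_conn,))
    proc.start()
    try:
        # A bare parent_conn.recv() could block forever here: the parent still
        # holds its own handle to child_conn, so the pipe never signals EOF
        # even though the worker has already exited.
        if parent_conn.poll(timeout):
            return "reply", parent_conn.recv()
        proc.join(timeout=1.0)
        if proc.is_alive():
            return "hung", None
        return "crashed", proc.exitcode
    finally:
        if proc.is_alive():
            proc.terminate()
        proc.join()
        parent_conn.close()
        child_conn.close()


status, code = step_with_timeout()
print(status, code)
```

Polling with a timeout and then checking the worker's exit code surfaces the crash instead of waiting indefinitely. Running the environments in the main process, as `--noenv_processes` does, avoids the inter-process wait altogether.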

@fengredrum The collected episodes should be stored in CPU memory in almost all scenarios. Is this not the case? We have the config options `batch_size` and `chunk_length` if you don't want to train on the full batch of episodes, so that network activations fit on the GPU.
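
One plausible reading of these two options is sketched below in plain Python: episodes are cut into chunks of at most `chunk_length` steps, and the chunks are then processed `batch_size` at a time, so only a small slice needs to be on the GPU at once. This is an assumption about what the options mean, not the library's implementation, and the helper names are hypothetical.

```python
def split_into_chunks(episodes, chunk_length):
    """Cut each episode (a list of steps) into pieces of at most chunk_length."""
    chunks = []
    for episode in episodes:
        for start in range(0, len(episode), chunk_length):
            chunks.append(episode[start:start + chunk_length])
    return chunks


def iterate_minibatches(chunks, batch_size):
    """Yield successive groups of at most batch_size chunks."""
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


# Two toy episodes of 5 and 3 steps; the step contents are placeholders.
episodes = [list(range(5)), list(range(3))]
chunks = split_into_chunks(episodes, chunk_length=2)
for batch in iterate_minibatches(chunks, batch_size=2):
    print(batch)
```

Smaller values of either option reduce how many steps are processed together, trading more update iterations for lower peak memory.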

@fengredrum
Author

Yes, it works exactly as you describe. I'm implementing Batch-PPO based on my own understanding, and I've added several DL tricks to improve stability and performance. I'm still debugging; I hope it can match or even beat your score, LOL.

@Timen

Timen commented Aug 20, 2018

I am currently trying to train on the GPU as well, but I am running into different issues from those listed above. The problem appears at a certain step in training, I believe when it is about to update the global network. When I run on CPU, everything is fine: I get some KL cutoff prompts and then training continues. When I run on GPU, however, everything just stops: CPU usage goes to zero, GPU usage stays at zero, and nothing more is printed to the terminal (I waited over half an hour before giving up).

What I changed:
- I added my own custom environment and network.

What I tried:
- Passing the `--noenv_processes` argument: no change, still no activity.
- Switching to the default feed forward categorical: no change.
- Checking out a clean clone and running the pendulum config with GPU enabled: after the first "Phase train" prompt for step 0, there is no more output and no more CPU or GPU activity.

What I'm running on:
- Ubuntu 16.04
- TensorFlow 1.10
- CUDA 9.0
