GPU doesn't seem to work #29
Comments
BTW, this is running on Windows 10 with TensorFlow 1.4.
Hi @fengredrum. In case this is still an issue, could you try wrapping your network implementation in a
The neural network is assigned to the GPU, which I've checked in TensorBoard. The problem occurs in
Thanks for providing more details. I don't think the replay buffer should be placed on the GPU, since it can grow quite large, especially when training from pixel observations. All ops should default to CPU because of the
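To put a rough number on "quite large", here is a back-of-the-envelope sketch. All sizes in it (episode count, episode length, observation shapes, float32 storage) are hypothetical values picked for illustration, not the project's defaults, and `buffer_bytes` is a made-up helper name.

```python
import math


def buffer_bytes(num_episodes, max_length, observ_shape, bytes_per_element=4):
    """Bytes needed for one buffer of shape [num_episodes, max_length, *observ_shape]."""
    return num_episodes * max_length * math.prod(observ_shape) * bytes_per_element


# Hypothetical sizes, chosen only to compare a state vector with image input.
low_dim = buffer_bytes(10, 200, (3,))         # small state vector, float32
pixels = buffer_bytes(10, 1000, (84, 84, 3))  # image observations, float32

print(low_dim / 2**20, "MiB")
print(pixels / 2**30, "GiB")
```

Under these assumptions the image case needs roughly 0.8 GiB for the observations alone, before actions, rewards, or network activations are counted. That is already a large share of a card with about 2 GiB of memory, which is why keeping the buffer in host memory is the safer default.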
I tried running the default pendulum trainer. When I turn Tensorflow v1.8, Ubuntu 18.04, Nvidia GTX 1080, Cuda 9.0. If my understanding is correct, TensorFlow automatically gives priority to the GPU for supported ops. Perhaps the
I agree with you; the transitions might be very large when training from pixel observations. I used to think that storing them in GPU memory would alleviate the communication cost between CPU and GPU. However, it turns out that this may not be the optimal solution when training on a single machine. Thank you for your constructive view.
@colinskow Could you try running without environment processes ( @fengredrum The collected episodes should be stored in CPU memory in almost all scenarios. Is this not the case? We have the config options
Yes, it works exactly as you describe. I'm implementing Batch-PPO based on my understanding, and I've added several DL tricks to improve stability and performance. Still debugging; I hope it can match or even beat your score, LOL.
I am currently trying to train on the GPU as well, but I am running into a different issue from those listed above. The problem occurs at a certain step in training, I believe when it is about to update the global network. When this happens on CPU, everything is fine: I get some KL cutoff prompts and then training continues. When I run on GPU, however, everything just stops: CPU usage goes to zero, GPU usage stays at zero, and nothing is printed to the terminal (I gave up after more than half an hour). What I changed: What I tried: What I'm running on:
I've set `use_gpu = True`, but GPU usage stays close to zero when running the code. TensorBoard shows that all operations are assigned to the CPU. When I disable `sess_config = tf.ConfigProto(allow_soft_placement=True)` and force everything to run on the GPU, the console throws the following error:
`INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\Code\PythonScripts\DeepRL\BatchPPO\20180308T091941-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.
2018-03-08 09:19:41.315004: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-03-08 09:19:41.595863: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.64GiB
2018-03-08 09:19:41.596493: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
INFO:tensorflow:Graph contains 42003 trainable variables.
2018-03-08 09:19:57.811479: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
Traceback (most recent call last):
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1323, in _do_call
return fn(*args)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1293, in _run_fn
self._extend_graph()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1354, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2[container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 163, in <module>
tf.app.run()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 145, in main
for score in train(config, FLAGS.env_processes):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 127, in train
utility.initialize_variables(sess, saver, config.logdir)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\scripts\utility.py", line 116, in initialize_variables
tf.global_variables_initializer()))
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 889, in run
run_metadata_ptr)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _do_run
options, run_metadata)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2[container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]()]]
Caused by op 'ppo_temporary/episodes/Variable', defined at:
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 163, in <module>
tf.app.run()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 145, in main
for score in train(config, FLAGS.env_processes):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 113, in train
batch_env, config.algorithm, config)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\scripts\utility.py", line 48, in define_simulation_graph
algo = algo_cls(batch_env, step, is_training, should_log, config)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\ppo\algorithm.py", line 78, in __init__
template, len(batch_env), config.max_length, 'episodes')
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\ppo\memory.py", line 44, in __init__
self._length = tf.Variable(tf.zeros(capacity, tf.int32), False)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\variables.py", line 213, in __init__
constraint=constraint)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\variables.py", line 331, in _init_from_args
name=name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\state_ops.py", line 133, in variable_op_v2
shared_name=shared_name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 926, in _variable_v2
shared_name=shared_name, name=name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
op_def=op_def)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2[container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]()]]`
It seems that TensorFlow has no GPU kernel for an int32 variable (`VariableV2` with `dtype=DT_INT32`), so it cannot be placed on the GPU.
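For anyone hitting a similar error, the "Colocation Debug Info" block already says which op types in the group have no GPU kernel. Below is a small stdlib-only sketch that pulls that list out of a pasted log. The helper name `cpu_only_op_types` is made up for this example, and the parsing assumes the exact line layout shown in the traceback above.

```python
import re

HEADER = "Colocation group had the following types and devices"


def cpu_only_op_types(log_text):
    """Return op types in a colocation block that list CPU but not GPU."""
    cpu_only = []
    in_block = False
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if line.startswith(HEADER):
            in_block = True
            continue
        if not in_block:
            continue
        # Entries look like "Switch: GPU CPU" or "VariableV2: CPU".
        match = re.fullmatch(r"(\w+):((?:\s+(?:GPU|CPU))+)", line)
        if match is None:
            in_block = False  # first non-matching line ends the block
            continue
        op_type, devices = match.group(1), match.group(2).split()
        if "GPU" not in devices and op_type not in cpu_only:
            cpu_only.append(op_type)
    return cpu_only


sample = """Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2]]"""

print(cpu_only_op_types(sample))
```

For the log above this reports `VariableV2`, `Identity`, `Assign`, `ScatterUpdate` and `AssignAdd`. Every op in a colocation group has to share one device, so a single CPU-only entry is enough to make a forced `/device:GPU:0` placement fail once soft placement is turned off.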