I am trying to merge the discrete and continuous versions of the SAC model into a single model.
In the discrete SAC critic_model, the critic takes only the state as input and produces an output over all actions. To make it consistent with the continuous one, I modified it to take both the state and the action as input and to output only the Q-value q(s,a) for that pair, just as in the continuous version. With this change the training code can be shared as well, because the parameter update no longer has to deal with action distributions. BUT the modified SAC for discrete actions just doesn't converge!
The code below shows part of what I changed to make the discrete version behave like the continuous one.
Since it doesn't converge, I wonder whether something is wrong with log_prob?
_dist = self.distribution(action_dist) #torch.distributions.Categorical
actions = _dist.sample()
# modified version
actions = actions.unsqueeze(1)
self.log_prob = torch.log(actions + (actions == 0.0).float() * 1e-8)
# original version
# z = (action_dist == 0.0).float() * 1e-8
# self.log_prob = torch.log(action_dist + z)
# actions = actions.unsqueeze(1)  # add batch dim
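For comparison, here is a minimal sketch of the two quantities involved. In the modified version, `actions` holds the sampled action indices, so `torch.log(actions + ...)` is the log of an index rather than a log-probability. The sketch assumes `action_dist` is a batch of categorical probabilities. The random `action_dist` tensor and the names `log_of_index` and `gathered` are illustrative and not taken from the original code.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Stand-in for the policy output: 4 states, 3 discrete actions (assumed shapes).
action_dist = torch.softmax(torch.randn(4, 3), dim=-1)

dist = Categorical(probs=action_dist)
actions = dist.sample()  # integer action indices, shape (4,)

# What the modified version computes: the log of the sampled *index*.
log_of_index = torch.log(actions.float() + (actions == 0).float() * 1e-8)

# Log-probability of the sampled action under the policy, shape (4, 1).
log_prob = dist.log_prob(actions).unsqueeze(1)

# Same quantity written out by hand: pick pi(a|s) for the sampled action.
actions = actions.unsqueeze(1)  # shape (4, 1)
gathered = torch.log(action_dist.gather(1, actions))
```

Here `log_prob` and `gathered` agree up to floating-point error, while `log_of_index` depends only on which index was drawn and carries no gradient back to the policy.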
Is it possible to update the discrete SAC version the same way as the continuous one?
If so, I would be happy to use almost the same code for both the discrete and continuous versions, which is what I want.
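One thing that may matter here: `torch.distributions.Categorical` does not provide a reparameterized sample (`has_rsample` is `False`), so a sampled discrete action does not carry gradients back to the policy the way a reparameterized continuous action does. The sketch below shows one possible way to keep a critic of the form q(s,a) and still get a policy gradient, by evaluating the critic on every one-hot action and taking the expectation over the policy explicitly. The networks, sizes, `alpha` value and the helper `q_all_actions` are all hypothetical, and this is not necessarily how the repository should implement it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, n_actions, batch = 4, 3, 5  # assumed sizes for illustration

# Hypothetical critic taking (state, one-hot action) and returning a scalar q(s, a).
critic = nn.Sequential(
    nn.Linear(state_dim + n_actions, 16), nn.ReLU(), nn.Linear(16, 1)
)
# Hypothetical policy producing logits over the discrete actions.
policy = nn.Linear(state_dim, n_actions)
alpha = 0.2  # entropy temperature, assumed fixed here


def q_all_actions(critic, states, n_actions):
    """Evaluate q(s, a) for every discrete action; returns shape (batch, n_actions)."""
    b, s_dim = states.shape
    s = states.unsqueeze(1).expand(b, n_actions, s_dim)                 # (B, A, S)
    a = torch.eye(n_actions).unsqueeze(0).expand(b, n_actions, n_actions)  # (B, A, A)
    return critic(torch.cat([s, a], dim=-1)).squeeze(-1)                # (B, A)


states = torch.randn(batch, state_dim)
log_probs = torch.log_softmax(policy(states), dim=-1)
probs = log_probs.exp()

# The critic is held fixed during the actor update.
q = q_all_actions(critic, states, n_actions).detach()

# Expectation over actions of (alpha * log pi(a|s) - q(s, a)), averaged over the batch.
actor_loss = (probs * (alpha * log_probs - q)).sum(dim=-1).mean()
actor_loss.backward()
```

With this form the critic update can still use the q(s,a) of the action actually taken, as in the continuous code, while only the actor loss sums over actions.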