Bug of PPO

ratio = tf.exp(pi.log_prob(action) - old_pi.log_prob(action))
surr = ratio * adv
...
loss = -tf.reduce_mean(tf.minimum(surr, tf.clip_by_value(ratio, 1. - self.epsilon, 1. + self.epsilon) * adv))

tf.minimum should use ratio rather than surr. Because surr = ratio * adv and adv can contain negative values, a large ratio paired with a negative advantage makes surr very negative. The result of tf.minimum may then contain a value like -1e10, which causes the actor's loss to fail.
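
To make the case concrete, here is a minimal numeric sketch in plain Python rather than TensorFlow. It mirrors the per-sample expression from the snippet above. The helper names `clip` and `ppo_term` and the value epsilon = 0.2 are my own assumptions for illustration, not taken from the library.

```python
def clip(x, lo, hi):
    """Scalar stand-in for tf.clip_by_value (hypothetical helper)."""
    return max(lo, min(x, hi))


def ppo_term(ratio, adv, epsilon=0.2):
    """Per-sample term inside tf.reduce_mean in the snippet above.

    epsilon=0.2 is an assumed value; the issue does not state self.epsilon.
    """
    surr = ratio * adv
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    return min(surr, clipped)


# Negative advantage with a very large ratio: min() selects the unclipped
# surr, so the term is as large in magnitude as ratio * adv.
negative_case = ppo_term(ratio=1e10, adv=-1.0)

# Positive advantage with the same ratio: min() selects the clipped branch,
# so the term stays bounded by (1 + epsilon) * adv.
positive_case = ppo_term(ratio=1e10, adv=1.0)

print(negative_case, positive_case)
```

With a negative advantage the clip does not bound the term from below, which is where a value on the order of -1e10 can come from.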

Bug of PPO #1072