-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Policy network penalized incorrectly on invalid moves #77
Comments
Err, a correction: it should be masked with valid moves, and then re-normalized. Otherwise it's not changing anything. |
For reference: I fixed this bug in my local copy, and it appears to be learning several times faster . Passing in the valid moves requires adapting a lot of the interfaces though, so, changing all the game implementations to do the masking. And my local copy is a huge mess, so I don't have anything I can easily offer as a PR. But as some reference as to what I did, in Coach.py, the trainingExamples need to be extended to include what the valid moves are, so that's got something like `
whereas obviously the ideal thing would be adapting getSymmetries to also permute Then
Then Finally,
and include |
"For reference: I fixed this bug in my local copy, and it appears to be learning several times faster ." Wow! |
I realized that my wording above was somewhat vague.
with
The alternative is:
which just reduces the log-probabilities to such a negative value as to be effectively zero. |
@Timeroot |
I'm almost there, but I have a problem with implementing your idea in tensorflow. Maybe you can help me. I have a problem with this piece of your description: replace In tensorflow.OthelloNet this should be around line 36: I am new in this matter. Can you tell me how to do this for Tensorflow? |
@keesdejong |
Any news on when this is going to be updated for the main branch? |
I followed your fix and got small negative pi loss (around -0.00xx). Is this normal ? |
@51616 If you made a fix, please upload PR |
@jl1990 i use this equation instead. |
I made a pull with the changes introduced for Othello's Tensorflow implementation. Also can someone provide some guidance as to how I would properly cite Timeroot for the changes, since I essentially just implemented what he suggested verbatim? Here's the pull: |
@rlronan Thank you. I'll see if I find the time to check if it works. And, most importantly, whether it has any effect. |
This thread is a bit older, but since its pinned here is an idea: |
Adding my opinion on this (old) thread: The second one results in slightly better results in my game (which has about 50% invalid moves on average), and slightly faster training (~5%) but I didn't tried with Othello though. |
Currently, the policy network is actively trained to output 0 on a move whenever it's invalid. The move is never taken, so target_pi's is zero there, and this enters into the loss function as a result. As a result, the policy network will be saying as much about "How often is this move legal" as it does about "Is this a good move" -- which is almost certainly hurting performance.
The correct action would be to manually mask out the result in NNet.py, right after getting the output from OthelloNNet.py: OthelloNNet.py returns F.log_softmax(pis), and then the entries there should be manually set to zero if it's not a valid move. This will prevent gradients from propagating back there.
The text was updated successfully, but these errors were encountered: