I am debugging poor performance of a model I'm experimenting with. It gets pretty good CoreEN scores, but it generates nonsensical responses when running commonsense_evaluate.py. For instance, it produces repeated tokens for a lot of inputs.
After some more digging, it looks like this generation call is causing a problem when the batch size is greater than 1.
In this case, padding tokens are added to many of the batch elements, but the generate() call is given no indication of which tokens are padding. This causes my model to generate garbage outputs when a lot of padding appears in a batch. If I change the batch size to 1, the outputs are much more reasonable.
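For what it's worth, here is a minimal sketch of the kind of change that usually addresses this with the Hugging Face transformers API: left-pad the batch and pass the tokenizer's attention mask to generate(). The checkpoint name and prompts below are placeholders, and the exact wiring into commonsense_evaluate.py may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint, not the model commonsense_evaluate.py actually loads.
tokenizer = AutoTokenizer.from_pretrained("some-causal-lm")
model = AutoModelForCausalLM.from_pretrained("some-causal-lm")

# Decoder-only models should be left-padded for batched generation,
# and need a pad token if the tokenizer doesn't define one.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "short prompt",
    "a much longer prompt that forces padding onto the short one",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # lets generate() ignore the pad tokens
        pad_token_id=tokenizer.pad_token_id,
        max_new_tokens=32,
    )

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```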
It seems like this could be the cause of #38. In that case, users are evaluating with batch sizes greater than 1, which would likely trigger the same problem.
Also FWIW, I am not sure why commonsense_evaluate.py allows users to choose a batch size, but evaluate.py does not. I'm guessing that's why I'm seeing issues about evaluate.py but not commonsense_evaluate.py.
Hi,
Many thanks for pointing out this issue! I added batch decoding to commonsense_evaluate.py for acceleration, since the target responses of the commonsense tasks are very short. But the inputs of the commonsense tasks can be very long, so I used batch_size=1 in my experiments. That's why I didn't encounter this issue.
I'm trying to figure out a solution to this issue. If you have a fix in mind, please feel free to submit a PR.