I am debugging poor performance of a model I'm experimenting with. It gets pretty good CoreEN scores, but it generates nonsensical responses when running commonsense_evaluate.py. For instance, it produces repeated tokens for a lot of inputs.
After some more digging, it looks like this generation call is causing a problem when the batch size is greater than 1.
In this case, padding tokens are added to many of the batch elements, but the generate() call is given no indication of which tokens are padding. This causes my model to generate garbage outputs when a lot of padding appears in a batch. If I change the batch size to 1, the outputs are much more reasonable.
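For what it's worth, here is a minimal sketch of the kind of change that usually addresses this with the Hugging Face transformers API: left-pad the batch and pass the tokenizer's attention mask to generate(). The checkpoint name and prompts below are placeholders, and the exact wiring into commonsense_evaluate.py may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint, not the model commonsense_evaluate.py actually loads.
tokenizer = AutoTokenizer.from_pretrained("some-causal-lm")
model = AutoModelForCausalLM.from_pretrained("some-causal-lm")

# Decoder-only models should be left-padded for batched generation,
# and need a pad token if the tokenizer doesn't define one.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "short prompt",
    "a much longer prompt that forces padding onto the short one",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # lets generate() ignore the pad tokens
        pad_token_id=tokenizer.pad_token_id,
        max_new_tokens=32,
    )

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```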
It seems like this could be the cause of #38. In that case, users are evaluating with batch sizes greater than 1, which would likely trigger the same problem.
Also FWIW, I am not sure why commonsense_evaluate.py allows users to choose a batch size, but evaluate.py does not. I'm guessing that's why I'm seeing issues about evaluate.py but not commonsense_evaluate.py.
Hi,
Many thanks for pointing out this issue! I added batch decoding to commonsense_evaluate.py for acceleration, since the target responses of the commonsense tasks are very short. But the inputs of the commonsense tasks can be very long, so I used batch_size=1 in my experiments. That's why I didn't encounter this issue.
I'm trying to figure out a solution to this issue. If you have a fix in mind, please feel free to submit a PR.