A question about the results on the Crypto dataset in Table 2

Could you please take a look at the test results for the Crypto dataset in Table 2? On Qwen-7B, the Pass@1 training method reaches Pass@1/Pass@k scores of only 1.2/5.7, while the 'P@k T. + P@1 T.' method reaches nearly 97%. On Qwen-32B, even standard Pass@1 training achieves a score of about 96%. This suggests that the large performance gap may not be due solely to the method itself, and may instead reflect the model reaching a sudden breakthrough or 'eureka' moment during training. I believe your method is effective, but I'm concerned that some of these dramatic improvements might be attributable to other factors, such as the Qwen model beginning to engage in longer reasoning processes.
