A question about the result of Crypto dataset in Table 2 #1

@seamoke

Could you please take a look at the test results for the Crypto dataset in Table 2? On Qwen-7B, the Pass@1 training method reaches only 1.2/5.7 (Pass@1/Pass@k), while the 'P@k T. + P@1 T.' method reaches nearly 97%. On Qwen-32B, even standard Pass@1 training scores about 96%. This suggests that the large gap on Qwen-7B may not be due solely to the method itself, and may instead reflect the model hitting a sudden breakthrough or 'eureka' moment during training. I believe your method is effective, but I'm concerned that some of these dramatic improvements might be attributable to other factors, such as the Qwen model beginning to produce longer reasoning chains.
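
For reference, here is how I am reading the Pass@k numbers. This is only a sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and I am assuming (not asserting) that the paper's evaluation does something equivalent. The function name `pass_at_k` and the sample counts in the usage note are hypothetical and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled answers, c of them correct.

    Assumption: the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    The paper's actual evaluation code may differ.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this reading, a problem with 1 correct answer out of 4 samples gives `pass_at_k(4, 1, 1) == 0.25` and `pass_at_k(4, 1, 2) == 0.5`. A Pass@k that stays close to Pass@1, as in the 1.2/5.7 row, would mean that almost no sampled answers are correct on most problems, which is why the jump to roughly 97% looks more like a sudden change in the model's behavior than a gradual gain.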
