Score calculation logic seems to be wrong, so n_sampling has no effect? #58

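As I read the code below, for each question only the score of the first of the n samples (index 0) goes into the mean that is reported as acc. If that is right, setting n_sampling > 1 has no effect on the reported accuracy.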
[evaluate.py#line78](https://github.com/QwenLM/Qwen2.5-Math/blob/a45202bd16f1ec06f433442dc1152d0074773465/evaluation/evaluate.py#L78)
```python
score_mat = []
for sample in samples:
    sample['score'] = scores[idx: idx+len(sample['pred'])]
    assert len(sample['score']) == len(sample['pred'])
    score_mat.append(sample['score'])
    idx += len(sample['pred'])

max_len = max([len(s) for s in score_mat])

for i, s in enumerate(score_mat):
    if len(s) < max_len:
        score_mat[i] = s + [s[-1]] * (max_len - len(s)) # pad

# output mean of each column of scores
col_means= np.array(score_mat).mean(axis=0)
mean_score = list(np.round(col_means * 100, decimals=1))

result_json = {
    "num_samples": len(samples),
    "num_scores": len(scores),
    "timeout_samples": timeout_cnt,
    "empty_samples": len([s for s in samples if not s['pred'][-1]]),
    "acc": mean_score[0]
}
```
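For illustration, here is a minimal sketch of how the extra samples could be reflected in the result. It is not the repository's code: `summarize_scores` and the key names `acc_first_sample`, `acc_mean_over_samples` and `pass_at_n` are hypothetical, and it assumes `score_mat` has already been padded to equal row length as in the excerpt above.

```python
import numpy as np

def summarize_scores(score_mat):
    """Summarize a (num_questions x n_sampling) matrix of boolean scores.

    Hypothetical helper: assumes every row has the same length,
    i.e. the padding step has already been applied.
    """
    mat = np.array(score_mat, dtype=float)  # True/False -> 1.0/0.0
    col_means = mat.mean(axis=0)            # one mean per sample index
    return {
        # What the excerpt reports as "acc": only column 0 is used.
        "acc_first_sample": float(np.round(col_means[0] * 100, decimals=1)),
        # Mean over every sample of every question (padded entries included).
        "acc_mean_over_samples": float(np.round(mat.mean() * 100, decimals=1)),
        # Share of questions with at least one correct sample.
        "pass_at_n": float(np.round(mat.any(axis=1).mean() * 100, decimals=1)),
    }

# Toy data: 4 questions, n_sampling = 3.
demo = [
    [False, True, True],
    [False, True, False],
    [False, False, False],
    [True, True, True],
]
summary = summarize_scores(demo)
print(summary)
```

On this toy data the three numbers differ (25.0, 50.0 and 75.0), which shows how much information is dropped when only `mean_score[0]` is reported.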
