Question about inconsistent evaluation results

Hi, I have used the DeepSpeed framework to train a GPT-117M model.
When I evaluate model performance on wikitext-103, there is a large gap in PPL between two approaches: running tasks/eval_harness/evaluate.py, versus first converting the checkpoint to Megatron format and then running tasks/main.py.
May I ask what causes this discrepancy? @mayank31398