Hello @yxgeee, @ChenYi99, @yeliudev
Thank you for your great work! The codebase is well-structured and easy to follow.
I’ve been trying to reproduce the MOTO results on the CALVIN benchmark by training all three stages as described in the paper. However, the results I obtain do not match those reported in the paper. I followed the training setup exactly: the same number of epochs, batch size, and gradient accumulation steps. To verify, I also compared my training losses against the logs you provided in issue #10.
I’ve attached my logs and included comparison plots below for reference.
pretrain_moto_gpt_on_calvin.log
finetune_moto_gpt_on_calvin.log
train_latent_motion_tokenizer_on_calvin.log
Originally posted by @ChenYi99 in #10
Training hyperparameters I followed to reproduce the results:
Stage 1 (Latent Motion Tokenizer): 150K steps (11 epochs); effective batch size 256
Stage 2 (Pre-training Moto-GPT): 10 epochs; effective batch size 512
Stage 3 (Finetuning Moto-GPT): 18 epochs; effective batch size 512
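For reference, the "effective batch size" figures above are the product of per-device batch size, gradient accumulation steps, and number of GPUs. A minimal sketch of that relation (the per-device values and GPU count below are hypothetical illustrations, not taken from the MOTO codebase):

```python
# Sanity-check the effective batch sizes listed above.
# NOTE: per_device_batch, grad_accum, and num_gpus are hypothetical
# illustrative values, not the actual settings from the repo.
def effective_batch_size(per_device_batch: int, grad_accum: int, num_gpus: int) -> int:
    """Effective batch size = per-device batch * grad-accumulation steps * #GPUs."""
    return per_device_batch * grad_accum * num_gpus

# e.g. 8 GPUs, per-device batch 16, accumulation 2 -> 256 (Stage 1)
assert effective_batch_size(16, 2, 8) == 256
# e.g. 8 GPUs, per-device batch 16, accumulation 4 -> 512 (Stages 2 and 3)
assert effective_batch_size(16, 4, 8) == 512
```

If the GPU count differs between our setups, the same effective batch size can still be matched by adjusting the gradient-accumulation steps accordingly.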
Reproduced results:
Average successful sequence length: 1.821
Success rates for 5 tasks completed in a row on 1000 chains:
T1: 73.4%
T2: 47.6%
T3: 30.1%
T4: 19.7%
T5: 11.3%
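For context, CALVIN's "average successful sequence length" is, if I understand the metric correctly, just the sum of the per-horizon success rates (each chain contributes 1 for every consecutive task it completes), so the numbers above are internally consistent:

```python
# Check that the reported average successful sequence length (1.821)
# matches the sum of the T1..T5 success rates: each evaluation chain
# contributes 1 for every task completed in a row, so the expected
# sequence length is the sum of the per-horizon success probabilities.
success_rates = {"T1": 0.734, "T2": 0.476, "T3": 0.301, "T4": 0.197, "T5": 0.113}
avg_seq_len = sum(success_rates.values())
print(f"{avg_seq_len:.3f}")  # -> 1.821
assert abs(avg_seq_len - 1.821) < 1e-9
```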
Here are the training logs for my three stages, along with plots comparing my reproduced losses against the provided logs:
Reproduced training logs:
reproduce-finetune-moto-gpt-calvin-from-lmt-150k.log
reproduce-latent-motion-tokenizer-calvin.log
reproduce-pretrain-moto-gpt-calvin.log
Comparison of Training losses
Debugging with provided checkpoints
Interestingly, I observed that when I use the provided Stage 2 checkpoint and proceed with Stage 3 finetuning, the results are close to what is reported in the paper. However, when I re-train Stage 2 myself, the downstream performance drops drastically.
Here’s a snapshot of the experiments I conducted to narrow down the issue:
Could you kindly clarify:
- Are there any undocumented details or changes related to Stage 2 training?
- Were there any differences in training environments, seeds, or data processing that might impact reproducibility?
- Are there known issues with the current Stage 2 training code?
Also, has there been any recent update to the codebase that might affect training behavior or results?
I’d really appreciate your help in clarifying this. I’m happy to share additional logs or experiment details if needed.
Thanks again for your excellent work and for open-sourcing this project!
