
Unable to Reproduce MOTO Results on CALVIN Benchmark #12

@ManishGovind

Description


Hello @yxgeee, @ChenYi99, @yeliudev

Thank you for your great work! The codebase is well-structured and easy to follow.

I’ve been trying to reproduce the MOTO results on the CALVIN benchmark by training all three stages as described in the paper, but the results I obtain do not match those reported. I followed the training setup exactly: the same number of epochs, batch size, and gradient-accumulation steps. To verify, I also compared my training losses against the logs you provided in issue #10.
I’ve attached my logs and included comparison plots below for reference.

pretrain_moto_gpt_on_calvin.log
finetune_moto_gpt_on_calvin.log
train_latent_motion_tokenizer_on_calvin.log

Originally posted by @ChenYi99 in #10

Training hyperparameters I followed to reproduce the results:

Stage 1 (Latent Motion Tokenizer): 150K steps (11 epochs); effective batch size 256

Stage 2 (Moto-GPT pre-training): 10 epochs; effective batch size 512

Stage 3 (Moto-GPT fine-tuning): 18 epochs; effective batch size 512
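For clarity, by "effective batch size" I mean per-device batch size × number of GPUs × gradient-accumulation steps. The concrete splits below (e.g. 8 × 8 × 8) are only illustrative combinations, not the configuration from the paper:

```python
# Sanity check of the effective batch sizes listed above.
# effective = per_device_batch * num_gpus * grad_accum_steps
# The specific factorizations here are hypothetical examples.
def effective_batch_size(per_device: int, num_gpus: int, grad_accum: int) -> int:
    return per_device * num_gpus * grad_accum

print(effective_batch_size(8, 8, 8))  # → 512 (Stages 2 and 3)
print(effective_batch_size(8, 8, 4))  # → 256 (Stage 1)
```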

Reproduced results:

Average successful sequence length: 1.821
Success rates for 5 tasks completed in a row on 1000 chains:

T1: 73.4%
T2: 47.6%
T3: 30.1%
T4: 19.7%
T5: 11.3%
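As a sanity check on my numbers: CALVIN's average successful sequence length is the expected number of consecutive tasks completed per 5-task chain, which equals the sum of the per-position success rates (each chain that reaches task k contributes once to every position up to k):

```python
# Per-position success rates (fraction of the 1000 chains), T1..T5 from my run.
rates = [0.734, 0.476, 0.301, 0.197, 0.113]

# Average successful sequence length = sum of the per-position rates.
avg_len = sum(rates)
print(f"{avg_len:.3f}")  # → 1.821, matching the figure reported above
```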

Here are the training logs for my three stages, along with plots comparing my reproduced losses against the provided logs:

Reproduced training logs

reproduce-finetune-moto-gpt-calvin-from-lmt-150k.log
reproduce-latent-motion-tokenizer-calvin.log
reproduce-pretrain-moto-gpt-calvin.log

Comparison of Training losses

[Three plots comparing training-loss curves for Stages 1–3]

Debugging with provided checkpoints

Interestingly, I observed that when I use the provided Stage 2 checkpoint and proceed with Stage 3 finetuning, the results are close to what is reported in the paper. However, when I re-train Stage 2 myself, the downstream performance drops drastically.

Here’s a snapshot of the experiments I conducted to narrow down the issue:

Experiment Results

Could you kindly clarify:

  • Are there any undocumented details or changes related to Stage 2 training?
  • Were there any differences in training environments, seeds, or data processing that might impact reproducibility?
  • Are there known issues with the current Stage 2 training code?

Also, has there been any recent update to the codebase that might affect training behavior or results?

I’d really appreciate your help in clarifying this. I’m happy to share additional logs or experiment details if needed.

Thanks again for your excellent work and for open-sourcing this project!
