
Unable to Reproduce MOTO Results on CALVIN Benchmark #12

@ManishGovind

Description


Hello @yxgeee, @ChenYi99, @yeliudev

Thank you for your great work! The codebase is well-structured and easy to follow.

I’ve been trying to reproduce the MOTO results on the CALVIN benchmark by training all three stages as described in the paper, but the results I obtain do not match those reported. I followed the training setup exactly: the same number of epochs, batch size, and gradient-accumulation steps. To verify, I also compared my training losses against the logs you provided in issue #10.
I’ve attached my logs and included comparison plots below for reference.

pretrain_moto_gpt_on_calvin.log
finetune_moto_gpt_on_calvin.log
train_latent_motion_tokenizer_on_calvin.log

Originally posted by @ChenYi99 in #10

Training hyperparameters I followed to reproduce the results:

Stage 1 (Latent Motion Tokenizer): 150K steps (11 epochs); effective batch size 256

Stage 2 (Moto-GPT pre-training): 10 epochs; effective batch size 512

Stage 3 (Moto-GPT fine-tuning): 18 epochs; effective batch size 512
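For clarity, by "effective batch size" I mean per-device batch size × number of GPUs × gradient-accumulation steps. The concrete splits below (e.g. 8 × 8 × 8) are only illustrative combinations, not the configuration from the paper:

```python
# Sanity check of the effective batch sizes listed above.
# effective = per_device_batch * num_gpus * grad_accum_steps
# The specific factorizations here are hypothetical examples.
def effective_batch_size(per_device: int, num_gpus: int, grad_accum: int) -> int:
    return per_device * num_gpus * grad_accum

print(effective_batch_size(8, 8, 8))  # → 512 (Stages 2 and 3)
print(effective_batch_size(8, 8, 4))  # → 256 (Stage 1)
```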

Reproduced results:

Average successful sequence length: 1.821
Success rates for 5 tasks completed in a row on 1000 chains:

T1: 73.4%
T2: 47.6%
T3: 30.1%
T4: 19.7%
T5: 11.3%
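As a sanity check on my numbers: CALVIN's average successful sequence length is the expected number of consecutive tasks completed per 5-task chain, which equals the sum of the per-position success rates (each chain that reaches task k contributes once to every position up to k):

```python
# Per-position success rates (fraction of the 1000 chains), T1..T5 from my run.
rates = [0.734, 0.476, 0.301, 0.197, 0.113]

# Average successful sequence length = sum of the per-position rates.
avg_len = sum(rates)
print(f"{avg_len:.3f}")  # → 1.821, matching the figure reported above
```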

Here are the training logs for my three stages, along with plots comparing my reproduced losses against the provided logs:

Reproduced training logs

reproduce-finetune-moto-gpt-calvin-from-lmt-150k.log
reproduce-latent-motion-tokenizer-calvin.log
reproduce-pretrain-moto-gpt-calvin.log

Comparison of Training losses

[Three plots comparing training-loss curves for Stages 1–3]

Debugging with provided checkpoints

Interestingly, I observed that when I use the provided Stage 2 checkpoint and proceed with Stage 3 finetuning, the results are close to what is reported in the paper. However, when I re-train Stage 2 myself, the downstream performance drops drastically.

Here’s a snapshot of the experiments I conducted to narrow down the issue:

Experiment Results

Could you kindly clarify:

  • Are there any undocumented details or changes related to Stage 2 training?
  • Were there any differences in training environments, seeds, or data processing that might impact reproducibility?
  • Are there known issues with the current Stage 2 training code?

Also, has there been any recent update to the codebase that might affect training behavior or results?

I’d really appreciate your help in clarifying this. I’m happy to share additional logs or experiment details if needed.

Thanks again for your excellent work and for open-sourcing this project!
