Hi authors and community,
Thank you for sharing this excellent work. I’m currently reproducing the two-stage training process described in the paper “DiffusionRet: Generative Text-Video Retrieval with Diffusion Model” using the official GitHub code and the MSRVTT dataset.
- Reproduction setup:
Stage 1 (Discriminative): trained from scratch, achieving Recall@1 = 46.8, which is consistent with the paper.
Stage 2 (Generative): initialized from the best.pth of Stage 1, following all steps in the repo and paper (see the checkpoint check sketched below).
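For completeness, this is roughly how I confirm that the Stage 1 checkpoint passed to Stage 2 actually contains the full discriminative weights; the path and checkpoint layout are from my own setup and may differ from the repo's defaults:

```python
import torch

# Inspect the Stage 1 checkpoint used to initialize Stage 2 (path is mine).
# The goal is simply to confirm it holds the full model state dict rather
# than, e.g., only an optimizer state or a partial dict.
ckpt = torch.load("outputs/msrvtt_stage1/best.pth", map_location="cpu")

# Some training scripts nest the weights under a "state_dict" key; unwrap if so.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"entries in checkpoint: {len(state_dict)}")
for name in list(state_dict)[:5]:
    print(f"  {name}")
```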
- Problem:
During Stage 2 fine-tuning:
Recall@1 barely improves, rising only to 47.3, which is below the 49.0 reported in the paper.
In the training logs, Recall@1 stays around 46–47 throughout and plateaus very early.
The generation loss does decrease as expected, suggesting the model is training, but retrieval does not improve accordingly (see the parameter-diff check after this list).
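To narrow this down, a check along these lines could confirm whether the retrieval-relevant weights are actually updated during Stage 2; the paths are illustrative and not necessarily the repo's actual output layout:

```python
import torch

def unwrap(ckpt):
    # Some scripts nest weights under a "state_dict" key; unwrap if present.
    return ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

# Compare the Stage 1 checkpoint with a Stage 2 checkpoint to see which
# parameters change during generative fine-tuning. If the encoders used for
# retrieval stay identical, that would explain a flat Recall@1.
stage1 = unwrap(torch.load("outputs/msrvtt_stage1/best.pth", map_location="cpu"))
stage2 = unwrap(torch.load("outputs/msrvtt_stage2/best.pth", map_location="cpu"))

shared = sorted(set(stage1) & set(stage2))
changed = [
    k for k in shared
    if torch.is_tensor(stage1[k]) and not torch.equal(stage1[k], stage2[k])
]
print(f"shared parameters: {len(shared)}, changed in Stage 2: {len(changed)}")
```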
- My questions:
Are there any training details, hyperparameter schedules, or tricks used in the paper but not explicitly reflected in the released code?
Has anyone else experienced this performance gap when trying to reproduce the second stage?
Any insights or shared experiences would be greatly appreciated. Thanks again for the great work and for maintaining this repo!