Hi authors and community,
Thank you for sharing this excellent work. I’m currently reproducing the two-stage training process described in the paper “DiffusionRet: Generative Text-Video Retrieval with Diffusion Model” using the official GitHub code and the MSRVTT dataset.
- Reproduction setup:
Stage 1 (Discriminative): trained from scratch, achieving Recall@1 = 46.8, which is consistent with the paper.
Stage 2 (Generative): initialized from the best.pth of Stage 1, following all steps in the repo and paper (see the checkpoint check sketched below).
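For completeness, this is roughly how I confirm that the Stage 1 checkpoint passed to Stage 2 actually contains the full discriminative weights; the path and checkpoint layout are from my own setup and may differ from the repo's defaults:

```python
import torch

# Inspect the Stage 1 checkpoint used to initialize Stage 2 (path is mine).
# The goal is simply to confirm it holds the full model state dict rather
# than, e.g., only an optimizer state or a partial dict.
ckpt = torch.load("outputs/msrvtt_stage1/best.pth", map_location="cpu")

# Some training scripts nest the weights under a "state_dict" key; unwrap if so.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"entries in checkpoint: {len(state_dict)}")
for name in list(state_dict)[:5]:
    print(f"  {name}")
```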
- Problem:
During Stage 2 fine-tuning:
Recall@1 barely improves, rising only to 47.3, which is below the 49.0 reported in the paper.
In the training logs, Recall@1 stays around 46–47 throughout and plateaus very early.
The generation loss does decrease as expected, suggesting the model is training, but retrieval does not improve accordingly (see the parameter-diff check after this list).
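To narrow this down, a check along these lines could confirm whether the retrieval-relevant weights are actually updated during Stage 2; the paths are illustrative and not necessarily the repo's actual output layout:

```python
import torch

def unwrap(ckpt):
    # Some scripts nest weights under a "state_dict" key; unwrap if present.
    return ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

# Compare the Stage 1 checkpoint with a Stage 2 checkpoint to see which
# parameters change during generative fine-tuning. If the encoders used for
# retrieval stay identical, that would explain a flat Recall@1.
stage1 = unwrap(torch.load("outputs/msrvtt_stage1/best.pth", map_location="cpu"))
stage2 = unwrap(torch.load("outputs/msrvtt_stage2/best.pth", map_location="cpu"))

shared = sorted(set(stage1) & set(stage2))
changed = [
    k for k in shared
    if torch.is_tensor(stage1[k]) and not torch.equal(stage1[k], stage2[k])
]
print(f"shared parameters: {len(shared)}, changed in Stage 2: {len(changed)}")
```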
- My questions:
Are there any training details, hyperparameter schedules, or tricks used in the paper but not explicitly reflected in the released code?
Has anyone else experienced this performance gap when trying to reproduce the second stage?
Any insights or shared experiences would be greatly appreciated. Thanks again for the great work and for maintaining this repo!