Hello Author,
I fine-tuned the Answerer with exactly the same configuration as yours. When I used the pretrained Localizer you provided without refining it, the best accuracy on the NExT-QA val set was 72.4%. However, when I fine-tuned the Answerer with the Localizer refined on NExT-QA, the best accuracy was only 71.8%, which is 0.6 points lower. Did I overlook something?