Is `fusion_encoder` used for video captioning?

Hello! 

I was looking at [model_video_caption_mplug.py](https://github.com/X-PLUG/mPLUG-2/blob/a46aa972d21f706da58bf5c0b50c123f6fd1d8b0/models/model_video_caption_mplug.py#L25C9-L25C28) and noticed that `self.fusion_encoder` is instantiated there but does not appear to be used in the forward pass.

Do we need to instantiate it for the video captioning task?