It seems the released pretrained model has a different structure from the code in this repo.
For example, the retrieval BLIP model here has no "model.vision_model" attribute, so I see no way to use blip-itm-base-flickr, whose checkpoint contains only that attribute and no "visual_encoder".
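
A minimal sketch of how the mismatch can be checked, assuming both the checkpoint and the model expose dotted parameter names. The helper names (`top_level_prefixes`, `compare_prefixes`) and the sample key lists are hypothetical and only illustrate the comparison. Only the "vision_model" and "visual_encoder" prefixes come from the actual report.

```python
from collections import Counter


def top_level_prefixes(keys):
    """Count parameter names by their first dotted component."""
    return Counter(key.split(".", 1)[0] for key in keys)


def compare_prefixes(checkpoint_keys, model_keys):
    """Return the top-level prefixes found on only one side."""
    ckpt = set(top_level_prefixes(checkpoint_keys))
    model = set(top_level_prefixes(model_keys))
    return sorted(ckpt - model), sorted(model - ckpt)


# Hypothetical key names, for illustration only.
checkpoint_keys = [
    "vision_model.embeddings.weight",
    "vision_model.encoder.weight",
    "text_encoder.embeddings.weight",
]
model_keys = [
    "visual_encoder.patch_embed.weight",
    "visual_encoder.blocks.weight",
    "text_encoder.embeddings.weight",
]

only_in_checkpoint, only_in_model = compare_prefixes(checkpoint_keys, model_keys)
print("only in checkpoint:", only_in_checkpoint)
print("only in model:", only_in_model)
```

With real files, the key lists would presumably come from the loaded checkpoint (for example via `torch.load(path, map_location="cpu")`, possibly nested under a key such as "model") and from `model.state_dict().keys()`.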
Please update the related code or give us detailed instructions for using this checkpoint.
Thank you very much for your time and assistance.