Thanks for the excellent work!
In the paper, the text-to-image generation model is reported to be trained on 45M images. However, SAM-1B contains ~10M images, JourneyDB ~4M, and ImageNet-1K ~1M, which together come to roughly 15M.
Could you please clarify where the 45M figure comes from, or am I misunderstanding something?