Error "Tensor is not a torch image" is thrown when running the code #3
Thanks for using our code. Could you please show more of the error message, in particular which line in optimize.py generates the error? It looks like an error caused by failing to apply transforms.ToTensor() before transforms.Resize() or transforms.Normalize().
I notice that in your first step there is no error, but in the second step the error appears. I am not sure how you disabled the distributed training. I suspect the error happens because the image (the optimized object) is no longer a 4D tensor after each step. In our FSDP setup, the parameters (the image) are flattened to 1D. Therefore, when updating the parameters in the main process, we first reshape the parameters back to 4D, conduct the parameter update, and then flatten back to 1D so that FSDP keeps working. If you disable FSDP, please ensure the image stays 4D (see lines 194-211).
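A minimal sketch of the 1D↔4D round-trip described above (the shapes, the plain gradient step, and the function name are illustrative assumptions, not the repo's actual update code):

```python
import torch

# Illustrative sketch, not the repo's actual update code: under FSDP the
# optimized image parameter lives as a flat 1D tensor. To update it we
# view it as 4D (N, C, H, W), apply the step there, and flatten back to
# 1D so FSDP's flat-parameter bookkeeping still holds.
def step_on_flat_image(flat_param, grad_4d, lr, shape=(1, 3, 224, 224)):
    img = flat_param.view(shape)      # 1D -> 4D
    img = img - lr * grad_4d          # a plain gradient step, for illustration
    return img.reshape(-1)            # 4D -> 1D for FSDP

flat = torch.zeros(1 * 3 * 224 * 224)
grad = torch.ones(1, 3, 224, 224)
new_flat = step_on_flat_image(flat, grad, lr=0.1)
print(new_flat.shape)  # torch.Size([150528])
```

If FSDP is disabled, the reshape round-trip is unnecessary, but the optimized image must then stay 4D throughout.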
Oh, I ran the code on a 48 GB GPU.
By default, I adopt an FSDP strategy on 4 A100 GPUs to run the code. I remember that on a single GPU an OOM error appears, especially for the high-diversity chat mode. Therefore, if you would like to reproduce our results, I recommend using our FSDP version. Otherwise, if you would like to reuse our code for your own target, e.g. with a small VLM and short prompts, please let me know and I will check my local repo to see whether I can provide a single-GPU version.
I think the FSDP strategy is very memory- and time-efficient, even when you want to ensemble more large VLMs, adopt a larger batch size, a larger image size, etc.
Thank you! I'll try it later. |
I’m facing the same issue and I have adopted the FSDP strategy on 4 A100 GPUs to run the code. Do you have any idea how this can be fixed? |
When does this error ("Tensor is not a torch image") pop up? Which line? I am wondering whether this is caused by
```
Running optimization tasks...
Mixed precision type: bf16
Distributed environment: MULTI_GPU  Backend: nccl
Mixed precision type: bf16
Distributed environment: MULTI_GPU  Backend: nccl
Mixed precision type: bf16
Distributed environment: MULTI_GPU  Backend: nccl
Mixed precision type: bf16
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:02<00:01, 1.24s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00, 1.07s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00, 1.10s/it]
Steps:   0%|          | 1/25600 [00:00<?, ?it/s]You may have used the wrong order for inputs.
Steps:   0%|          | 2/25600 [00:01<7:18:21, 1.03s/it]
Steps:   0%|          | 2/25600 [00:01<8:12:16, 1.15s/it, Overall_loss=nan, RAG_loss=-0.103, VLM_loss=nan]
```
Why does this error come up when I run the optimize.py file? I just disabled the distributed training.