
How do I set it up so I can run it with 24GB? #15

Open
jim-1ee opened this issue Nov 28, 2024 · 3 comments

Comments

@jim-1ee

jim-1ee commented Nov 28, 2024

I use these settings:

--gradient_checkpointing \
--mixed_precision fp16 \
--use_8bit_adam \
--set_grads_to_none \

but the error is:
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 23.69 GiB of which 313.69 MiB is free. Including non-PyTorch memory, this process has 23.38 GiB memory in use. Of the allocated memory 22.93 GiB is allocated by PyTorch, and 128.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Can you help me? Thank you~
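
The traceback itself already suggests one knob worth trying. A minimal sketch (editorial, not from the thread) of applying that allocator setting, assuming it is placed at the very top of the training script before any CUDA tensor is created:

    import os

    # Apply the allocator option suggested by the OOM message before any CUDA
    # allocation happens, so the caching allocator picks it up.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # import torch only after the environment variable is set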

@xdobetter
Contributor

You can refer to #10 (comment)

@marcusrdlee

I am using the flags suggested in #10

  # --gradient_checkpointing \
   --mixed_precision fp16 \
   --use_8bit_adam \
   --set_grads_to_none \

But I am still getting the CUDA out-of-memory error on a 24GB GPU.

return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 23.58 GiB of which 199.38 MiB is free. Including non-PyTorch memory, this process has 22.31 GiB memory in use. Of the allocated memory 21.89 GiB is allocated by PyTorch, and 101.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I traced this and found that the issue occurs at the switch from phase1 to phase2 training, once the step count reaches the phase1_train_steps value. It hits the CUDA memory limit at this line:
self.accelerator.backward(loss)

What else do you suggest to resolve this? Can you elaborate on your fix or was it just setting the above flags? I have attempted many different configurations for my accelerate environment but cannot bypass this memory issue. Please let me know what else I can try. Thanks for the help!
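
For what it's worth, one generic way to shrink the peak memory at accelerator.backward() is gradient accumulation with a smaller per-step batch. This is only a sketch of the standard Accelerate pattern, not the PuzzleAvatar trainer; the tiny model, optimizer, and data below are placeholders:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    # Accumulate gradients over 4 micro-batches so each backward pass holds
    # less activation memory while the effective batch size stays the same.
    accelerator = Accelerator(gradient_accumulation_steps=4, mixed_precision="fp16")

    model = nn.Linear(64, 1)                      # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loader = DataLoader(
        TensorDataset(torch.randn(128, 64), torch.randn(128, 1)), batch_size=8
    )

    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for x, y in loader:
        with accelerator.accumulate(model):
            loss = nn.functional.mse_loss(model(x), y)
            accelerator.backward(loss)
            optimizer.step()   # Accelerate only applies the step at accumulation boundaries
            optimizer.zero_grad(set_to_none=True)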

@atonalfreerider

In configs/default.yaml you need to set fp16: True, and then geometry training will run.

However, that causes an error with texture training here:

  File "/home/john/Desktop/3DPose/PuzzleAvatar/thirdparties/nvdiffrast/nvdiffrast/torch/ops.py", line 657, in forward
    out, work_buffer = _get_plugin().antialias_fwd(color, rast, pos, tri, topology_hash)
RuntimeError: antialias_fwd(): Inputs color, rast, pos must be float32 tensors
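
A possible alternative to switching the whole texture stage to float32, sketched here for reference and not taken from the repo, is to cast the inputs back to float32 only at the antialias call site; color, rast, pos, and tri are illustrative names for whatever the caller passes in:

    import nvdiffrast.torch as dr

    # antialias requires float32 color/rast/pos, so cast the half-precision
    # tensors back just for this op; gradients still flow through .float().
    out = dr.antialias(color.float(), rast.float(), pos.float(), tri)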

I'm able to run float32 texture training on an 11GB VRAM GPU by adding these lines to guidance.py:

            pipe = DiffusionPipeline.from_pretrained(
                self.base_model_key,
                torch_dtype=torch.float32,
                requires_safety_checker=False,
            ).to(self.device)
            # add memory offloading
            pipe.enable_model_cpu_offload()
            if not cfg.train.fp16:
                # fp32 requires extra low memory settings
                pipe.enable_vae_slicing()
                pipe.enable_vae_tiling()

I also needed to remove the unused predicted depth values from these lines in cores/lib/trainer.py:

preds_depth = preds_depth * preds_alpha + (1 - preds_alpha)

preds_depth_list = [
    torch.zeros_like(preds_depth).to(self.device)
    for _ in range(self.world_size)
]    # [[B, ...], [B, ...], ...]
dist.all_gather(preds_depth_list, preds_depth)
preds_depth = torch.cat(preds_depth_list, dim=0)

I hope this helps
