After enabling the Flops Profiler with the exact same configuration as given in the example at https://www.deepspeed.ai/tutorials/flops-profiler/, I get different output from the one presented in the documentation.
I am training on 4 GPUs with the following DeepSpeed config file:
{ "train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "steps_per_print": 1000, "optimizer": { "type": "Adam", "params": { "lr": 1e-5 } }, "gradient_clipping": 1.0, "zero_optimization": { "stage": 3, "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e8, "stage3_param_persitance_threshold": 1e5, "stage3_prefetch_bucket_size": 5e7, "contiguous_gradients": true, "cpu_offload": true, "cpu_offload_params": true, "cpu_offload_use_pin_memory": true, "overlap_comm": true, "reduce_bucket_size": 90000000, "sub_group_size": 4e8 }, "wall_clock_breakdown": false, "fp16": { "enabled": true, "loss_scale": 1024, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "flops_profiler": { "enabled": true, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } }
I initially thought I would get one such output from each of the GPUs, so four in total, but instead I get 8 outputs with different FLOPS values: ['2.31 TFLOPS', '2.01 TFLOPS', '2.07 TFLOPS', '2.06 TFLOPS', '2.06 TFLOPS', '2.09 TFLOPS', '2.08 TFLOPS', '1.86 TFLOPS'].
When I then trained a larger model on 4 GPUs with the same DeepSpeed config as above, except with train_micro_batch_size_per_gpu=2, I instead got 16 different outputs. The counts seem to match the gradient accumulation steps implied by the config, so perhaps the profiler prints once per micro-batch forward pass (see the quick check below).
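For what it's worth, here is a quick sanity check of that guess, using the documented DeepSpeed relation train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size (the variable names are just for illustration):

```python
# Sanity check: DeepSpeed derives gradient accumulation steps as
#   train_batch_size / (train_micro_batch_size_per_gpu * world_size)
train_batch_size = 128
world_size = 4  # number of GPUs

print(train_batch_size // (4 * world_size))  # micro-batch 4 -> 8, matching the 8 outputs
print(train_batch_size // (2 * world_size))  # micro-batch 2 -> 16, matching the 16 outputs
```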
What am I actually profiling here? Is it not the FLOPS per GPU? What I ultimately want is the total number of floating-point operations (FLOPs) for the model; how would I calculate that?
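In case it matters, I also tried measuring this standalone with the profiler API from the same tutorial (a rough sketch; the model and input shape here are placeholders, not my actual setup):

```python
import torch
from deepspeed.profiling.flops_profiler import get_model_profile

# Placeholder model; substitute the model actually being trained.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

# Profile one forward pass with a micro-batch of 4, as in the config above.
flops, macs, params = get_model_profile(
    model=model,
    input_shape=(4, 1024),  # (batch_size, features) for this placeholder model
    print_profile=True,     # print the per-module breakdown
    detailed=True,
    as_string=True,         # return human-readable strings such as '1.2 GFLOPs'
)
print(flops, macs, params)
```

This gives me a FLOPs count for a single forward pass, but I am unsure how to relate it to the per-GPU TFLOPS throughput numbers printed during training.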
Many thanks in advance!