
[FT] Log progress bar on main process #662

Open
lewtun opened this issue Apr 7, 2025 · 0 comments
Labels
feature request New feature/request

Comments

Member

lewtun commented Apr 7, 2025

Issue encountered

During evaluation, one sees repeated progress bars that don't convey global progress towards the eval finishing, e.g.:

Processed prompts:   0%|          | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s] [repeated 7x across cluster]
Processed prompts:  25%|██▌       | 1/4 [00:14<00:43, 14.53s/it, est. speed input: 15.14 toks/s, output: 221.23 toks/s] [repeated 3x across cluster]
Processed prompts:  50%|█████     | 2/4 [00:22<00:23, 11.99s/it, est. speed input: 14.81 toks/s, output: 299.95 toks/s] [repeated 2x across cluster]
Processed prompts:  50%|█████     | 2/4 [00:30<00:27, 13.71s/it, est. speed input: 9.25 toks/s, output: 389.36 toks/s] [repeated 4x across cluster]
Processed prompts: 100%|██████████| 4/4 [00:35<00:00,  8.89s/it, est. speed input: 27.27 toks/s, output: 577.98 toks/s]
(run_inference_one_model pid=3713070) [rank0]:[W407 09:31:09.544209083 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Processed prompts:  33%|███▎      | 1/3 [00:43<01:27, 43.61s/it, est. speed input: 3.26 toks/s, output: 192.65 toks/s]
Processed prompts:  75%|███████▌  | 3/4 [01:17<00:30, 30.61s/it, est. speed input: 6.71 toks/s, output: 244.39 toks/s] 
Processed prompts:  67%|██████▋   | 2/3 [01:24<00:42, 42.04s/it, est. speed input: 4.23 toks/s, output: 229.77 toks/s]
Processed prompts:  50%|█████     | 2/4 [01:48<02:06, 63.09s/it, est. speed input: 2.57 toks/s, output: 126.10 toks/s] 
Processed prompts: 100%|██████████| 3/3 [01:52<00:00, 37.62s/it, est. speed input: 4.68 toks/s, output: 286.81 toks/s]
(run_inference_one_model pid=3713065) [rank0]:[W407 09:32:27.439066098 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Processed prompts:  75%|███████▌  | 3/4 [01:58<00:47, 47.32s/it, est. speed input: 4.58 toks/s, output: 212.09 toks/s]
Processed prompts:  50%|█████     | 2/4 [02:22<02:40, 80.26s/it, est. speed input: 3.26 toks/s, output: 130.89 toks/s]
Processed prompts: 100%|██████████| 4/4 [02:42<00:00, 40.55s/it, est. speed input: 4.09 toks/s, output: 215.71 toks/s]
(run_inference_one_model pid=3713080) [rank0]:[W407 09:33:18.025165983 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Processed prompts:  75%|███████▌  | 3/4 [02:46<01:00, 60.80s/it, est. speed input: 2.72 toks/s, output: 180.25 toks/s]
Processed prompts: 100%|██████████| 4/4 [02:50<00:00, 42.71s/it, est. speed input: 4.44 toks/s, output: 298.07 toks/s]
(run_inference_one_model pid=3713072) [rank0]:[W407 09:33:24.914485162 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Processed prompts:  75%|███████▌  | 3/4 [02:48<00:55, 55.45s/it, est. speed input: 3.69 toks/s, output: 206.29 toks/s]
Processed prompts:  25%|██▌       | 1/4 [02:59<08:57, 179.08s/it, est. speed input: 0.99 toks/s, output: 93.36 toks/s]
Processed prompts:  67%|██████▋   | 2/3 [03:45<02:10, 130.81s/it, est. speed input: 1.66 toks/s, output: 97.46 toks/s] 
Processed prompts:  50%|█████     | 2/4 [04:18<04:00, 120.37s/it, est. speed input: 1.19 toks/s, output: 148.48 toks/s]
Processed prompts: 100%|██████████| 4/4 [04:34<00:00, 68.68s/it, est. speed input: 2.64 toks/s, output: 176.78 toks/s]

This makes it hard to know how long an eval will take in total.

Solution/Feature

I think it should be possible to solve this by having the progress bar log iterations on the main process.
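
To make the idea concrete, here is a minimal sketch of main-process aggregation, not the project's actual implementation. Everything in it is an assumption for illustration: `worker`, `run_with_global_progress`, and the queue-based reporting are hypothetical names, and threads stand in for the real inference workers. In a multi-process or cluster setup, the `queue.Queue` would need to be replaced by whatever cross-process channel is available.

```python
import queue
import threading


def worker(shard_id, num_prompts, updates):
    """Hypothetical worker: reports one unit of progress per finished prompt."""
    for _ in range(num_prompts):
        # Real inference for one prompt would happen here.
        updates.put((shard_id, 1))
    # Sentinel telling the main loop that this shard is done.
    updates.put((shard_id, None))


def run_with_global_progress(shard_sizes, log=print):
    """Run one worker per shard and log a single global counter from the caller."""
    total = sum(shard_sizes)
    updates = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, n, updates))
        for i, n in enumerate(shard_sizes)
    ]
    for t in threads:
        t.start()

    done = 0
    finished_workers = 0
    # Only this loop (the "main process" in the sketch) writes progress lines.
    while finished_workers < len(threads):
        _shard_id, increment = updates.get()
        if increment is None:
            finished_workers += 1
            continue
        done += increment
        log(f"Processed prompts (global): {done}/{total} ({done / total:.0%})")

    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    # Shard sizes chosen only for illustration.
    run_with_global_progress([4, 3, 4])
```

Because the workers only send increments and never print, the output is a single monotonically increasing counter regardless of how many workers run in parallel.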

Possible alternatives

If it is not possible to log on the main process, it would be good to add information to the progress bar description that lets users reconstruct the global progress.
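
As a rough sketch of this alternative, each worker's description could carry its shard index and the range of prompts it covers. The helper names (`shard_description`, `global_progress`) and the description format below are hypothetical and not part of any existing API. The resulting string would be passed to whatever description field the progress bar exposes (for example, the `desc` argument if the bar is a `tqdm` instance).

```python
def shard_description(shard_index, num_shards, start, stop, global_total):
    """Build a progress bar description for one shard.

    start/stop are zero-based, half-open indices into the global prompt list;
    the description shows them as a one-based, inclusive range.
    """
    return (
        f"Processed prompts [shard {shard_index + 1}/{num_shards}, "
        f"prompts {start + 1}-{stop} of {global_total}]"
    )


def global_progress(shard_counts):
    """Sum per-shard (done, total) pairs read off the individual bars."""
    done = sum(d for d, _ in shard_counts)
    total = sum(t for _, t in shard_counts)
    return done, total


if __name__ == "__main__":
    # Values chosen only for illustration.
    print(shard_description(1, 8, 4, 8, 30))
    print(global_progress([(4, 4), (2, 3), (3, 4)]))
```

With the shard index and global range in each description, a user reading interleaved bars can tell which bar belongs to which slice of the eval and add up the per-shard counts by hand.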

lewtun added the feature request label on Apr 7, 2025