Hi, what are the requirements for NVLink to function? I have two machines. The first is a regular PCIe box with 2 x 3090 cards joined by an NVLink bridge; it works well, and NVLink shows activity via:
nvidia-smi nvlink -gt r
The second is a DGX-1 server where NVLink is not activated by DeepSpeed and the counters show activity as N/A, although
nvidia-smi topo -m / nvidia-smi nvlink -s
show all NVLink links present and ready to go. The DGX-1 runs bare Ubuntu 20 (not the NVIDIA image); PyTorch, CUDA, the driver, nvcc, and NCCL are all installed, and DeepSpeed was compiled with the compute capability 7.0 GPU feature ...
Would appreciate any help; a terminal screen is available if needed.
I am testing fine-tuning of GPT-J-6B, using code from this repo:
https://github.com/mallorbc/Finetune_GPTNEO_GPTJ6B
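For what it's worth, below is a minimal sketch of how the NCCL transport choice could presumably be checked directly, since the nvidia-smi counters alone do not say which path NCCL picked. This assumes the same launcher invocation shown in the log further down (arguments elided here).

# NCCL_DEBUG=INFO makes each rank log its ring/channel setup during init; lines
# such as "... -> 1 via P2P/IPC" suggest NVLink/P2P is in use, while "via SHM"
# or "via NET/Socket" indicate a fallback path.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH \
  deepspeed --num_gpus=8 run_clm.py --deepspeed ds_config.json ...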
.......
nvidia-smi nvlink -gt r :
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-ed46f244-d7b4-5053-89bc-f68119fa49e9)
Link 0: Raw Tx: N/A
Link 0: Raw Rx: N/A
......
Link 5: Raw Tx: N/A
Link 5: Raw Rx: N/A
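A quick peer-access check from the same environment might also be relevant; this is just a sketch and assumes torch.cuda.can_device_access_peer is available in this PyTorch build.

# If any entry prints False, CUDA P2P (and therefore NVLink) cannot be used
# between GPU 0 and that peer, and NCCL will fall back to a slower path.
python -c "import torch; print([torch.cuda.can_device_access_peer(0, d) for d in range(1, torch.cuda.device_count())])"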
(gpt) user@user-X9DRG-HF:~/transformers/Finetune_GPTNEO_GPTJ6B/finetuning_repo$ TRANSFORMERS_OFFLINE=1 deepspeed --num_gpus=8 run_clm.py --deepspeed ds_config.json --model_name_or_path gpt2-xl --train_file train.csv --validation_file validation.csv --do_train --do_eval --overwrite_cache --evaluation_strategy="steps" --output_dir finetuned --num_train_epochs 1 --eval_steps 15 --gradient_accumulation_steps 2 --per_device_train_batch_size 1 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --overwrite_output_dir --fp16
[2021-10-10 00:04:17,843] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-10-10 00:04:18,286] [INFO] [runner.py:360:main] cmd = /home/user/gpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 run_clm.py --deepspeed ds_config.json --model_name_or_path gpt2-xl --train_file train.csv --validation_file validation.csv --do_train --do_eval --overwrite_cache --evaluation_strategy=steps --output_dir finetuned --num_train_epochs 1 --eval_steps 15 --gradient_accumulation_steps 2 --per_device_train_batch_size 1 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --overwrite_output_dir --fp16
[2021-10-10 00:04:19,198] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2021-10-10 00:04:19,198] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=8, node_rank=0
[2021-10-10 00:04:19,198] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2021-10-10 00:04:19,198] [INFO] [launch.py:102:main] dist_world_size=8
[2021-10-10 00:04:19,198] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2021-10-10 00:04:20,924] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-10-10 00:04:20,956] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-10-10 00:04:21,018] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-10-10 00:04:21,019] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-10-10 00:04:21,027] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-10-10 00:04:21,051] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-10-10 00:04:21,064] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-10-10 00:04:21,162] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
10/10/2021 00:04:22 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
10/10/2021 00:04:22 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
10/10/2021 00:04:22 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
10/10/2021 00:04:22 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
10/10/2021 00:04:22 - WARNING - main - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: True
10/10/2021 00:04:22 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
10/10/2021 00:04:22 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=ds_config.json,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=15,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=2,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=finetuned/runs/Oct10_00-04-20_user-X9DRG-HF,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
output_dir=finetuned,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=finetuned,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=10,
weight_decay=0.0,
xpu_backend=None,
)
10/10/2021 00:04:22 - WARNING - main - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: True
10/10/2021 00:04:22 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True
10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5
10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5
10/10/2021 00:04:22 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)
0%| | 0/2 [00:00<?, ?it/s]10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5
10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5
10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5
10/10/2021 00:04:22 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)
10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5
10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5
0%| | 0/2 [00:00<?, ?it/s]10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5
10/10/2021 00:04:22 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 255.47it/s]
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 434.98it/s]
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 356.05it/s]
10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)
0%| | 0/2 [00:00<?, ?it/s]10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)
10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)
0%| | 0/2 [00:00<?, ?it/s]10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 287.18it/s]
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 299.65it/s]
[INFO|configuration_utils.py:531] 2021-10-10 00:04:23,014 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:584] 2021-10-10 00:04:23,015 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/user/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.d684cb2afa3f8c44c73bd67537d9aa5ff6044658793e077d7306ef2e37dd79bd
[INFO|configuration_utils.py:621] 2021-10-10 00:04:23,017 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.1,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.12.0.dev0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|tokenization_auto.py:310] 2021-10-10 00:04:23,017 >> Offline mode: forcing local_files_only=True
[INFO|tokenization_auto.py:334] 2021-10-10 00:04:23,017 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:531] 2021-10-10 00:04:23,017 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:584] 2021-10-10 00:04:23,018 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/user/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.d684cb2afa3f8c44c73bd67537d9aa5ff6044658793e077d7306ef2e37dd79bd
[INFO|configuration_utils.py:621] 2021-10-10 00:04:23,019 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.1,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.12.0.dev0",
"use_cache": true,
"vocab_size": 50257
}
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 183.22it/s]
[INFO|tokenization_utils_base.py:1629] 2021-10-10 00:04:23,020 >> Offline mode: forcing local_files_only=True
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 187.21it/s]
[INFO|tokenization_utils_base.py:1717] 2021-10-10 00:04:23,023 >> Can't load following files from cache: ['added_tokens_file', 'special_tokens_map_file', 'tokenizer_config_file'] and cannot check if these files are necessary for the tokenizer to operate.
[INFO|tokenization_utils_base.py:1742] 2021-10-10 00:04:23,023 >> loading file https://huggingface.co/gpt2-xl/resolve/main/vocab.json from cache at /home/user/.cache/huggingface/transformers/8560a2df03f812b276794ae6935255d0590522553a4c8103155472b07591a21b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1742] 2021-10-10 00:04:23,023 >> loading file https://huggingface.co/gpt2-xl/resolve/main/merges.txt from cache at /home/user/.cache/huggingface/transformers/18fe27e0b70062b3e45fc4e827d5449d9fe85875937594da927e48cb657366d1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1742] 2021-10-10 00:04:23,023 >> loading file https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json from cache at /home/user/.cache/huggingface/transformers/aabb8839163cd911f810ab23f5ae8c966b9b9ea60622c429020611caa389b04b.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|configuration_utils.py:531] 2021-10-10 00:04:23,023 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:584] 2021-10-10 00:04:23,024 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/user/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.d684cb2afa3f8c44c73bd67537d9aa5ff6044658793e077d7306ef2e37dd79bd
[INFO|configuration_utils.py:621] 2021-10-10 00:04:23,025 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.1,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.12.0.dev0",
"use_cache": true,
"vocab_size": 50257
}
10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 351.00it/s]
[INFO|modeling_utils.py:1225] 2021-10-10 00:04:23,138 >> Offline mode: forcing local_files_only=True
[INFO|modeling_utils.py:1324] 2021-10-10 00:04:23,139 >> loading weights file https://huggingface.co/gpt2-xl/resolve/main/pytorch_model.bin from cache at /home/user/.cache/huggingface/transformers/96569b907e56747ce3e593c6a13d8475b8c733a64aab8af8f602b90d94c4af71.8fbbcdf404c82c5967934d411f1462fa0574d639f2aa398aa3754fced1bb26c0
[INFO|modeling_utils.py:1589] 2021-10-10 00:04:45,762 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:1597] 2021-10-10 00:04:45,763 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-xl.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.98ba/s]
100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.96ba/s]
100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.96ba/s]
100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.93ba/s]
100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.81ba/s]
100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.98ba/s]
100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.84ba/s]
100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.88ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.31ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.25ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.27ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.22ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.32ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.05ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.19ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.14ba/s]
100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.27ba/s]
100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.07ba/s]
100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.19ba/s]
100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.30ba/s]
100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.30ba/s]
100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.20ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 6.23ba/s]
100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.14ba/s]
100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 5.78ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 5.62ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 5.58ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.72ba/s]
[INFO|trainer.py:434] 2021-10-10 00:05:05,020 >> Using amp fp16 backend
[2021-10-10 00:05:05,026] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.5.5+cd7967d, git-hash=cd7967d, git-branch=master
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 5.02ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.51ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.08ba/s]
100%|█████████████████████████████████████████████| 8/8 [00:02<00:00, 3.88ba/s]
[2021-10-10 00:05:14,451] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed groups
[2021-10-10 00:05:14,451] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed model parallel group with size 1
[2021-10-10 00:05:15,992] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed expert parallel group with size 1
[2021-10-10 00:05:16,003] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert data parallel process group with ranks: [0, 1, 2, 3, 4, 5, 6, 7]
[2021-10-10 00:05:16,014] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [0]
[2021-10-10 00:05:16,024] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [1]
[2021-10-10 00:05:16,025] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [2]
[2021-10-10 00:05:16,035] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [3]
[2021-10-10 00:05:16,046] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [4]
[2021-10-10 00:05:16,057] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [5]
[2021-10-10 00:05:16,057] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [6]
[2021-10-10 00:05:16,068] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [7]
[2021-10-10 00:05:18,134] [INFO] [engine.py:204:init] DeepSpeed Flops Profiler Enabled: False
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/user/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.937584400177002 seconds
[2021-10-10 00:05:20,163] [INFO] [engine.py:862:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
Loading extension module fused_adam...
Time to load fused_adam op: 0.904491662979126 seconds
[2021-10-10 00:05:20,215] [INFO] [engine.py:870:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[2021-10-10 00:05:20,215] [INFO] [utils.py:43:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2021-10-10 00:05:20,215] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-10-10 00:05:20,215] [INFO] [stage2.py:111:init] Reduce bucket size 500000000.0
[2021-10-10 00:05:20,215] [INFO] [stage2.py:112:init] Allgather bucket size 500000000.0
[2021-10-10 00:05:20,215] [INFO] [stage2.py:113:init] CPU Offload: False
[2021-10-10 00:05:20,215] [INFO] [stage2.py:114:init] Round robin gradient partitioning: False
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.9040207862854004 seconds
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.904207706451416 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 1.0050318241119385 seconds
Time to load fused_adam op: 0.9040296077728271 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 1.0055632591247559 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.9035804271697998 seconds
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/user/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.9010043144226074 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.904000997543335 seconds
Time to load utils op: 0.8038058280944824 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.9041271209716797 seconds
Time to load utils op: 0.9041614532470703 seconds
Loading extension module utils...
Time to load utils op: 0.9038200378417969 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.9039947986602783 seconds
Time to load utils op: 0.9037423133850098 seconds
Rank: 4 partition count [8] and sizes[(194701400, False)]
[W ProcessGroupNCCL.cpp:1569] Rank 4 using best-guess GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Rank: 5 partition count [8] and sizes[(194701400, False)]
[W ProcessGroupNCCL.cpp:1569] Rank 5 using best-guess GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Rank: 0 partition count [8] and sizes[(194701400, False)]
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Rank: 3 partition count [8] and sizes[(194701400, False)]
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Rank: 6 partition count [8] and sizes[(194701400, False)]
[W ProcessGroupNCCL.cpp:1569] Rank 6 using best-guess GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Rank: 2 partition count [8] and sizes[(194701400, False)]
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Rank: 1 partition count [8] and sizes[(194701400, False)]
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Rank: 7 partition count [8] and sizes[(194701400, False)]
[W ProcessGroupNCCL.cpp:1569] Rank 7 using best-guess GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0008037090301513672 seconds
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Time to load utils op: 0.0009813308715820312 seconds
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
Time to load utils op: 0.0010094642639160156 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0010919570922851562 seconds
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0010976791381835938 seconds
Time to load utils op: 0.001089334487915039 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0011730194091796875 seconds
[2021-10-10 00:05:29,913] [INFO] [utils.py:806:see_memory_usage] Before initializing optimizer states
[2021-10-10 00:05:29,913] [INFO] [utils.py:807:see_memory_usage] MA 3.67 GB Max_MA 4.04 GB CA 7.03 GB Max_CA 7 GB
[2021-10-10 00:05:29,914] [INFO] [utils.py:815:see_memory_usage] CPU Virtual Memory: used = 67.33 GB, percent = 13.4%
[2021-10-10 00:05:29,958] [INFO] [utils.py:806:see_memory_usage] After initializing optimizer states
[2021-10-10 00:05:29,959] [INFO] [utils.py:807:see_memory_usage] MA 5.12 GB Max_MA 5.85 GB CA 9.21 GB Max_CA 9 GB
[2021-10-10 00:05:29,959] [INFO] [utils.py:815:see_memory_usage] CPU Virtual Memory: used = 67.33 GB, percent = 13.4%
[2021-10-10 00:05:29,959] [INFO] [stage2.py:474:init] optimizer state initialized
[2021-10-10 00:05:29,992] [INFO] [utils.py:806:see_memory_usage] After initializing ZeRO optimizer
[2021-10-10 00:05:29,992] [INFO] [utils.py:807:see_memory_usage] MA 5.12 GB Max_MA 5.12 GB CA 9.21 GB Max_CA 9 GB
[2021-10-10 00:05:29,993] [INFO] [utils.py:815:see_memory_usage] CPU Virtual Memory: used = 67.33 GB, percent = 13.4%
[2021-10-10 00:05:29,993] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2021-10-10 00:05:29,993] [INFO] [engine.py:586:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-10-10 00:05:29,993] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fed99e7b520>
[2021-10-10 00:05:29,993] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[[0.9, 0.999]]
[2021-10-10 00:05:29,993] [INFO] [config.py:940:print] DeepSpeedEngine configuration:
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] allreduce_always_fp32 ........ False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] amp_enabled .................. False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] amp_params ................... False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] checkpoint_tag_validation_enabled True
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] checkpoint_tag_validation_fail False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] curriculum_enabled ........... False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] curriculum_params ............ False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] dataloader_drop_last ......... False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] disable_allgather ............ False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] dump_state ................... False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_enabled ........... False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_gas_boundary_resolution 1
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_layer_name ........ bert.encoder.layer
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_layer_num ......... 0
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_max_iter .......... 100
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_stability ......... 1e-06
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_tol ............... 0.01
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_verbose ........... False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] elasticity_enabled ........... False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] fp16_enabled ................. True
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] fp16_master_weights_and_gradients False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] fp16_mixed_quantize .......... False
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] global_rank .................. 0
[2021-10-10 00:05:29,994] [INFO] [config.py:944:print] gradient_accumulation_steps .. 2
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] gradient_clipping ............ 1.0
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] gradient_predivide_factor .... 1.0
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] initial_dynamic_scale ........ 65536
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] loss_scale ................... 0
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] memory_breakdown ............. False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] optimizer_legacy_fusion ...... False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] optimizer_name ............... adamw
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] optimizer_params ............. {'lr': 5e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] pld_enabled .................. False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] pld_params ................... False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] prescale_gradients ........... False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_change_rate ......... 0.001
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_groups .............. 1
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_offset .............. 1000
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_period .............. 1000
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_rounding ............ 0
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_start_bits .......... 16
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_target_bits ......... 8
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_training_enabled .... False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_type ................ 0
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_verbose ............. False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] scheduler_name ............... WarmupLR
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-06, 'warmup_num_steps': 10}
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] sparse_attention ............. None
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] sparse_gradients_enabled ..... False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] steps_per_print .............. 2000
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] tensorboard_enabled .......... False
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] tensorboard_output_path ......
[2021-10-10 00:05:29,995] [INFO] [config.py:944:print] train_batch_size ............. 16
[2021-10-10 00:05:29,996] [INFO] [config.py:944:print] train_micro_batch_size_per_gpu 1
[2021-10-10 00:05:29,996] [INFO] [config.py:944:print] use_quantizer_kernel ......... False
[2021-10-10 00:05:29,996] [INFO] [config.py:944:print] wall_clock_breakdown ......... False
[2021-10-10 00:05:29,996] [INFO] [config.py:944:print] world_size ................... 8
[2021-10-10 00:05:29,996] [INFO] [config.py:944:print] zero_allow_untested_optimizer False
[2021-10-10 00:05:29,996] [INFO] [config.py:944:print] zero_config .................. {
"stage": 2,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false,
"ignore_unused_parameters": true,
"round_robin_gradients": false,
"legacy_stage1": false
}
[2021-10-10 00:05:29,996] [INFO] [config.py:944:print] zero_enabled ................. True
[2021-10-10 00:05:29,996] [INFO] [config.py:944:print] zero_optimization_stage ...... 2
[2021-10-10 00:05:29,996] [INFO] [config.py:946:print] json = {
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 5e-06,
"betas": [0.9, 0.999],
"eps": 1e-08,
"weight_decay": 0.0
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 5e-06,
"warmup_num_steps": 10
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"contiguous_gradients": true,
"cpu_offload": false
},
"gradient_accumulation_steps": 2,
"gradient_clipping": 1.0,
"steps_per_print": 2.000000e+03,
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": false
}
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005140304565429688 seconds
[INFO|trainer.py:1196] 2021-10-10 00:05:29,997 >> ***** Running training *****
[INFO|trainer.py:1197] 2021-10-10 00:05:29,997 >> Num examples = 1081
[INFO|trainer.py:1198] 2021-10-10 00:05:29,997 >> Num Epochs = 1
[INFO|trainer.py:1199] 2021-10-10 00:05:29,997 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1200] 2021-10-10 00:05:29,997 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1201] 2021-10-10 00:05:29,997 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1202] 2021-10-10 00:05:29,997 >> Total optimization steps = 68
22%|█████████▍ | 15/68 [00:15<00:53, 1.00s/it][INFO|trainer.py:2243] 2021-10-10 00:05:45,547 >> ***** Running Evaluation *****
[INFO|trainer.py:2245] 2021-10-10 00:05:45,547 >> Num examples = 274
[INFO|trainer.py:2248] 2021-10-10 00:05:45,547 >> Batch size = 8
{'eval_loss': 3.603515625, 'eval_runtime': 4.5688, 'eval_samples_per_second': 59.972, 'eval_steps_per_second': 1.094, 'epoch': 0.22}
44%|██████████████████▉ | 30/68 [00:35<00:38, 1.02s/it][INFO|trainer.py:2243] 2021-10-10 00:06:05,226 >> ***** Running Evaluation *****
[INFO|trainer.py:2245] 2021-10-10 00:06:05,226 >> Num examples = 274
[INFO|trainer.py:2248] 2021-10-10 00:06:05,226 >> Batch size = 8
{'eval_loss': 3.2109375, 'eval_runtime': 4.586, 'eval_samples_per_second': 59.747, 'eval_steps_per_second': 1.09, 'epoch': 0.44}
66%|████████████████████████████▍ | 45/68 [00:54<00:23, 1.02s/it][INFO|trainer.py:2243] 2021-10-10 00:06:24,942 >> ***** Running Evaluation *****
[INFO|trainer.py:2245] 2021-10-10 00:06:24,942 >> Num examples = 274
[INFO|trainer.py:2248] 2021-10-10 00:06:24,942 >> Batch size = 8
{'eval_loss': 3.046875, 'eval_runtime': 4.6037, 'eval_samples_per_second': 59.517, 'eval_steps_per_second': 1.086, 'epoch': 0.66}
88%|█████████████████████████████████████▉ | 60/68 [01:14<00:08, 1.02s/it][INFO|trainer.py:2243] 2021-10-10 00:06:44,684 >> ***** Running Evaluation *****
[INFO|trainer.py:2245] 2021-10-10 00:06:44,684 >> Num examples = 274
[INFO|trainer.py:2248] 2021-10-10 00:06:44,684 >> Batch size = 8
{'eval_loss': 2.966796875, 'eval_runtime': 4.6098, 'eval_samples_per_second': 59.438, 'eval_steps_per_second': 1.085, 'epoch': 0.88}
99%|██████████████████████████████████████████▎| 67/68 [01:26<00:01, 1.17s/it]
...
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 87.3723, 'train_samples_per_second': 12.372, 'train_steps_per_second': 0.778, 'train_loss': 3.9335650275735294, 'epoch': 1.0}
100%|███████████████████████████████████████████| 68/68 [01:27<00:00, 1.28s/it]
[INFO|trainer.py:1995] 2021-10-10 00:06:57,382 >> Saving model checkpoint to finetuned
[INFO|configuration_utils.py:413] 2021-10-10 00:06:57,383 >> Configuration saved in finetuned/config.json
[INFO|modeling_utils.py:1041] 2021-10-10 00:07:12,168 >> Model weights saved in finetuned/pytorch_model.bin
[INFO|tokenization_utils_base.py:2034] 2021-10-10 00:07:12,169 >> tokenizer config file saved in finetuned/tokenizer_config.json
[INFO|tokenization_utils_base.py:2040] 2021-10-10 00:07:12,170 >> Special tokens file saved in finetuned/special_tokens_map.json
***** train metrics *****
epoch = 1.0
train_loss = 3.9336
train_runtime = 0:01:27.37
train_samples = 1081
train_samples_per_second = 12.372
train_steps_per_second = 0.778
10/10/2021 00:07:12 - INFO - main - *** Evaluate ***
[INFO|trainer.py:2243] 2021-10-10 00:07:12,291 >> ***** Running Evaluation *****
[INFO|trainer.py:2245] 2021-10-10 00:07:12,291 >> Num examples = 274
[INFO|trainer.py:2248] 2021-10-10 00:07:12,291 >> Batch size = 8
100%|█████████████████████████████████████████████| 5/5 [00:04<00:00, 1.18it/s]
***** eval metrics *****
epoch = 1.0
eval_loss = 2.9434
eval_runtime = 0:00:04.59
eval_samples = 274
eval_samples_per_second = 59.676
eval_steps_per_second = 1.089
perplexity = 18.9795
(gpt) user@user-X9DRG-HF:~/transformers/Finetune_GPTNEO_GPTJ6B/finetuning_repo$
pip install deepspeed
Requirement already satisfied: deepspeed in /home/user/gpt/lib/python3.8/site-packages (0.5.5+cd7967d)
Requirement already satisfied: triton in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.1.0)
Requirement already satisfied: psutil in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (5.8.0)
Requirement already satisfied: numpy in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.21.2)
Requirement already satisfied: tensorboardX==1.8 in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.8)
Requirement already satisfied: ninja in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.10.2.2)
Requirement already satisfied: tqdm in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (4.62.3)
Requirement already satisfied: packaging in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (21.0)
Requirement already satisfied: torch in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.9.1+cu111)
Requirement already satisfied: filelock in /home/user/gpt/lib/python3.8/site-packages (from triton->deepspeed) (3.3.0)
Requirement already satisfied: protobuf>=3.2.0 in /home/user/gpt/lib/python3.8/site-packages (from tensorboardX==1.8->deepspeed) (3.18.1)
Requirement already satisfied: six in /home/user/gpt/lib/python3.8/site-packages (from tensorboardX==1.8->deepspeed) (1.16.0)
Requirement already satisfied: pyparsing>=2.0.2 in /home/user/gpt/lib/python3.8/site-packages (from packaging->deepspeed) (2.4.7)
Requirement already satisfied: typing-extensions in /home/user/gpt/lib/python3.8/site-packages (from torch->deepspeed) (3.10.0.2)