Towards the end of training, I see the following exception thrown:
100%|██████████| 203/203 [08:00<00:00, 2.37s/it]
Traceback (most recent call last):
File "/home/khayam/notebooks/stanford_alpaca/train.py", line 222, in <module>
train()
File "/home/khayam/notebooks/stanford_alpaca/train.py", line 217, in train
trainer.save_state()
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 1045, in save_state
self.state.save_to_json(path)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable
{'loss': 1.0194, 'grad_norm': tensor(0.9940, device='cuda:0'), 'learning_rate': 1.0204081632653061e-07, 'epoch': 1.0}
{'train_runtime': 480.766, 'train_samples_per_second': 108.165, 'train_steps_per_second': 0.422, 'train_loss': 1.0709380231467374, 'epoch': 1.0}
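The log line above shows that `grad_norm` is being logged as a CUDA tensor (`tensor(0.9940, device='cuda:0')`) rather than a plain float, so it ends up in `TrainerState.log_history`, and `json.dumps` in `save_to_json` cannot serialize it. A minimal sketch of the failure mode, using a `FakeTensor` stand-in so it runs without torch:

```python
import json

class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: not a JSON-serializable
    type, but exposes .item() to extract the underlying Python number."""
    def __init__(self, value):
        self.value = value
    def item(self):
        return self.value

# Mirrors the logged training metrics from the traceback above.
log_entry = {"loss": 1.0194, "grad_norm": FakeTensor(0.9940)}

try:
    json.dumps(log_entry)
except TypeError as e:
    print(e)  # Object of type FakeTensor is not JSON serializable

# Converting the tensor-like value to a plain float first makes the dump succeed.
log_entry["grad_norm"] = log_entry["grad_norm"].item()
print(json.dumps(log_entry))
```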
I am running the trainer like this:
torchrun --nproc_per_node={PROCS} --master_port=8080 train.py \
--model_name_or_path {MODEL} \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir {OUTPUT} \
--num_train_epochs 1 \
--per_device_train_batch_size {BATCH} \
--per_device_eval_batch_size {BATCH} \
--gradient_accumulation_steps {GRADIENT} \
--evaluation_strategy 'no' \
--save_strategy 'steps' --save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 --weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type 'cosine' \
--logging_steps 1 \
--tf32 True \
--deepspeed {DEEPSPEED_CONFIG}
with the placeholders set to:
MODEL = "/mnt/dataset-storage/AI_MODELS/LLAMA-HF/llama-7b-hf/"
OUTPUT = "model_output"
DEEPSPEED_CONFIG="/mnt/dataset-storage/AI_MODELS/training/stanford_alpaca/configs/zero1.json"
PROCS=8
BATCH=4
GRADIENT=8
Environment:
deepspeed==0.13.4
torch==2.2.1
accelerate==0.27.2
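As a temporary workaround, the tensor-like values could be converted to plain numbers before `trainer.save_state()` is called in train.py. This is a hypothetical helper, not part of the transformers API; `tensors_to_floats` and the usage lines are my own naming:

```python
def tensors_to_floats(log_history):
    """Replace tensor-like values (anything exposing .item()) in each
    log entry with plain Python numbers, so TrainerState.save_to_json
    can serialize the state without the TypeError above."""
    for entry in log_history:
        for key, value in entry.items():
            # Plain ints/floats/strs/bools are already serializable; only
            # convert objects like torch.Tensor that carry an .item() method.
            if hasattr(value, "item") and not isinstance(value, (int, float, str, bool)):
                entry[key] = value.item()
    return log_history

# Usage sketch (names assumed from the train.py traceback above):
# tensors_to_floats(trainer.state.log_history)
# trainer.save_state()
```

Upgrading transformers may also resolve this, since the transformers version is not pinned in the list above; worth checking whether a newer release logs `grad_norm` as a float.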