
train.py fails with TypeError: Object of type Tensor is not JSON serializable #314

@khayamgondal

Description


Towards the end of training, I see the following exception thrown:

100%|██████████| 203/203 [08:00<00:00,  2.37s/it]
Traceback (most recent call last):
  File "/home/khayam/notebooks/stanford_alpaca/train.py", line 222, in <module>
    train()
  File "/home/khayam/notebooks/stanford_alpaca/train.py", line 217, in train
    trainer.save_state()
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 1045, in save_state
    self.state.save_to_json(path)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
    json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable
{'loss': 1.0194, 'grad_norm': tensor(0.9940, device='cuda:0'), 'learning_rate': 1.0204081632653061e-07, 'epoch': 1.0}
{'train_runtime': 480.766, 'train_samples_per_second': 108.165, 'train_steps_per_second': 0.422, 'train_loss': 1.0709380231467374, 'epoch': 1.0}
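
The failure appears to come from the 'grad_norm' value in the last log line: it is recorded as a CUDA tensor instead of a plain float, so it ends up in TrainerState.log_history, and json.dumps has no encoder for torch.Tensor when trainer.save_state() writes the state out. A minimal standalone reproduction sketch (values are hypothetical, copied from the log line above):

import json
import torch

# A log entry holding a Tensor fails exactly as in the traceback above.
entry = {"loss": 1.0194, "grad_norm": torch.tensor(0.9940)}
try:
    json.dumps(entry)
except TypeError as e:
    print(e)  # Object of type Tensor is not JSON serializable

# Casting to a plain Python float makes the entry serializable again.
entry["grad_norm"] = float(entry["grad_norm"])
print(json.dumps(entry))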

I am running the trainer like this:

torchrun --nproc_per_node={PROCS} --master_port=8080 train.py \
                --model_name_or_path {MODEL} \
                --data_path ./alpaca_data.json \
                --bf16 True \
                --output_dir {OUTPUT}  \
                --num_train_epochs 1  \
                --per_device_train_batch_size {BATCH} \
                --per_device_eval_batch_size {BATCH} \
                --gradient_accumulation_steps {GRADIENT} \
                --evaluation_strategy 'no'  \
                --save_strategy 'steps'  --save_steps 2000 \
                --save_total_limit 1  \
                --learning_rate 2e-5  --weight_decay 0. \
                --warmup_ratio 0.03  \
                --lr_scheduler_type 'cosine' \
                --logging_steps 1 \
                --tf32 True \
                --deepspeed {DEEPSPEED_CONFIG}
Placeholder values:

MODEL = "/mnt/dataset-storage/AI_MODELS/LLAMA-HF/llama-7b-hf/"
OUTPUT = "model_output"
DEEPSPEED_CONFIG = "/mnt/dataset-storage/AI_MODELS/training/stanford_alpaca/configs/zero1.json"
PROCS = 8
BATCH = 4
GRADIENT = 8

Versions:

deepspeed==0.13.4
torch==2.2.1
accelerate==0.27.2
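
A workaround sketch, assuming the offending Tensors live in trainer.state.log_history (sanitize_log_history is a hypothetical helper, not part of this repo): convert them to plain floats before trainer.save_state() serializes the state. Upgrading transformers may also help, since later releases appear to cast grad_norm to a float before logging it.

import torch

def sanitize_log_history(trainer):
    # Replace any Tensor the Trainer logged (e.g. grad_norm) with a plain
    # Python float so TrainerState can be serialized to JSON.
    for entry in trainer.state.log_history:
        for key, value in entry.items():
            if isinstance(value, torch.Tensor):
                entry[key] = value.item()

# In train(), just before trainer.save_state():
#     sanitize_log_history(trainer)
#     trainer.save_state()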
