Towards the end of training, I see the following exception thrown:
100%|██████████| 203/203 [08:00<00:00, 2.37s/it]
Traceback (most recent call last):
File "/home/khayam/notebooks/stanford_alpaca/train.py", line 222, in <module>
train()
File "/home/khayam/notebooks/stanford_alpaca/train.py", line 217, in train
trainer.save_state()
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 1045, in save_state
self.state.save_to_json(path)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable
{'loss': 1.0194, 'grad_norm': tensor(0.9940, device='cuda:0'), 'learning_rate': 1.0204081632653061e-07, 'epoch': 1.0}
{'train_runtime': 480.766, 'train_samples_per_second': 108.165, 'train_steps_per_second': 0.422, 'train_loss': 1.0709380231467374, 'epoch': 1.0}
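The log line above shows that `grad_norm` is being logged as a CUDA tensor (`tensor(0.9940, device='cuda:0')`) rather than a plain float, so it ends up in `TrainerState.log_history`, and `json.dumps` in `save_to_json` cannot serialize it. A minimal sketch of the failure mode, using a `FakeTensor` stand-in so it runs without torch:

```python
import json

class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: not a JSON-serializable
    type, but exposes .item() to extract the underlying Python number."""
    def __init__(self, value):
        self.value = value
    def item(self):
        return self.value

# Mirrors the logged training metrics from the traceback above.
log_entry = {"loss": 1.0194, "grad_norm": FakeTensor(0.9940)}

try:
    json.dumps(log_entry)
except TypeError as e:
    print(e)  # Object of type FakeTensor is not JSON serializable

# Converting the tensor-like value to a plain float first makes the dump succeed.
log_entry["grad_norm"] = log_entry["grad_norm"].item()
print(json.dumps(log_entry))
```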
I am running the trainer like this:
torchrun --nproc_per_node={PROCS} --master_port=8080 train.py \
--model_name_or_path {MODEL} \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir {OUTPUT} \
--num_train_epochs 1 \
--per_device_train_batch_size {BATCH} \
--per_device_eval_batch_size {BATCH} \
--gradient_accumulation_steps {GRADIENT} \
--evaluation_strategy 'no' \
--save_strategy 'steps' --save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 --weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type 'cosine' \
--logging_steps 1 \
--tf32 True \
--deepspeed {DEEPSPEED_CONFIG}
with the placeholders set to:
MODEL = "/mnt/dataset-storage/AI_MODELS/LLAMA-HF/llama-7b-hf/"
OUTPUT = "model_output"
DEEPSPEED_CONFIG="/mnt/dataset-storage/AI_MODELS/training/stanford_alpaca/configs/zero1.json"
PROCS=8
BATCH=4
GRADIENT=8
Environment:
deepspeed==0.13.4
torch==2.2.1
accelerate==0.27.2
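As a temporary workaround, the tensor-like values could be converted to plain numbers before `trainer.save_state()` is called in train.py. This is a hypothetical helper, not part of the transformers API; `tensors_to_floats` and the usage lines are my own naming:

```python
def tensors_to_floats(log_history):
    """Replace tensor-like values (anything exposing .item()) in each
    log entry with plain Python numbers, so TrainerState.save_to_json
    can serialize the state without the TypeError above."""
    for entry in log_history:
        for key, value in entry.items():
            # Plain ints/floats/strs/bools are already serializable; only
            # convert objects like torch.Tensor that carry an .item() method.
            if hasattr(value, "item") and not isinstance(value, (int, float, str, bool)):
                entry[key] = value.item()
    return log_history

# Usage sketch (names assumed from the train.py traceback above):
# tensors_to_floats(trainer.state.log_history)
# trainer.save_state()
```

Upgrading transformers may also resolve this, since the transformers version is not pinned in the list above; worth checking whether a newer release logs `grad_norm` as a float.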