[BUG] SOK AMP mode error #462

@Orca-bit

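Describe the bug
Running the SOK DLRM benchmark in AMP (mixed-precision) mode fails while constructing Trainer. trainer.py wraps the SOK embedding optimizer (sparse_operation_kit.optimizer.OptimizerWrapperV2) in tf.keras.mixed_precision.LossScaleOptimizer, which rejects it with a TypeError: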

[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 129, in <module>
[1,0]<stderr>:    trainer = Trainer(
[1,0]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 161, in __init__
[1,0]<stderr>:    self._embedding_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
[1,0]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/keras/mixed_precision/loss_scale_optimizer.py", line 343, in __call__
[1,0]<stderr>:    raise TypeError(msg)
[1,0]<stderr>:TypeError: "inner_optimizer" must be an instance of `tf.keras.optimizers.Optimizer` or `tf.keras.optimizers.experimental.Optimizer`, but got: <sparse_operation_kit.optimizer.OptimizerWrapperV2 object at 0x7f1b15b44910>.
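Judging from the error message, LossScaleOptimizer checks inner_optimizer with isinstance(), and OptimizerWrapperV2 wraps an optimizer instead of subclassing a Keras Optimizer, so the check fails. The pure-Python sketch below mimics that situation using hypothetical stand-in classes (Optimizer, OptimizerWrapper, make_loss_scale_optimizer); they are not the real Keras or SOK classes, and the real check may differ in detail.

```python
class Optimizer:
    """Hypothetical stand-in for tf.keras.optimizers.Optimizer."""


class OptimizerWrapper:
    """Hypothetical stand-in for a wrapper that holds an optimizer
    but does not inherit from Optimizer (composition, not subclassing)."""

    def __init__(self, inner):
        self.inner = inner


def make_loss_scale_optimizer(inner_optimizer):
    """Hypothetical stand-in for the LossScaleOptimizer constructor guard."""
    # A type guard like this rejects any object that is not an Optimizer
    # subclass, even if it wraps a real Optimizer internally.
    if not isinstance(inner_optimizer, Optimizer):
        raise TypeError(
            '"inner_optimizer" must be an instance of Optimizer, '
            f"but got: {inner_optimizer!r}."
        )
    return ("scaled", inner_optimizer)


if __name__ == "__main__":
    print(make_loss_scale_optimizer(Optimizer())[0])  # accepted
    try:
        make_loss_scale_optimizer(OptimizerWrapper(Optimizer()))
    except TypeError as err:
        print("rejected:", err)
```

Passing the wrapped inner optimizer succeeds while passing the wrapper fails, which matches the shape of the traceback above.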

To Reproduce
Steps to reproduce the behavior:

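  1. Pull the container: docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  2. Inside the container, run /ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py with AMP mode enabled. The failure occurs while constructing Trainer (trainer.py, line 161).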

Expected behavior
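With AMP enabled, Trainer should initialize and training should run. Wrapping the SOK embedding optimizer (sparse_operation_kit.optimizer.OptimizerWrapperV2) in tf.keras.mixed_precision.LossScaleOptimizer should not raise a TypeError.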
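Until LossScaleOptimizer accepts the SOK wrapper, one possible workaround is to apply dynamic loss scaling manually around the embedding optimizer: scale the loss before backprop, unscale the gradients, and skip the update when any gradient is inf/NaN. The sketch below shows only that state machine using plain Python floats. DynamicLossScale is a hypothetical helper, not a SOK or Keras API, and whether SOK's optimizer can be driven this way in the benchmark has not been confirmed. The default initial scale (2**15) and growth interval (2000 steps) mirror the documented LossScaleOptimizer defaults; the multiplier of 2 and the floor of 1.0 are assumptions of this sketch.

```python
import math


class DynamicLossScale:
    """Hypothetical dynamic loss-scaling helper (not a SOK or Keras API)."""

    def __init__(self, initial_scale=2.0 ** 15, growth_steps=2000, multiplier=2.0):
        self.scale = initial_scale
        self.growth_steps = growth_steps
        self.multiplier = multiplier
        self._good_steps = 0  # consecutive steps with all-finite gradients

    def scale_loss(self, loss):
        # Multiply the loss so small fp16 gradients do not underflow to zero.
        return loss * self.scale

    def unscale_and_check(self, scaled_grads):
        """Return (grads, apply_update); apply_update is False on inf/NaN."""
        grads = [g / self.scale for g in scaled_grads]
        finite = all(math.isfinite(g) for g in grads)
        if finite:
            self._good_steps += 1
            if self._good_steps >= self.growth_steps:
                # Enough stable steps: try a larger scale.
                self.scale *= self.multiplier
                self._good_steps = 0
        else:
            # Overflow: shrink the scale (floored at 1.0) and skip this update.
            self.scale = max(self.scale / self.multiplier, 1.0)
            self._good_steps = 0
        return grads, finite


if __name__ == "__main__":
    ls = DynamicLossScale(initial_scale=8.0, growth_steps=2)
    print(ls.scale_loss(1.5))
    print(ls.unscale_and_check([8.0, 16.0]))
    print(ls.unscale_and_check([float("inf")]), ls.scale)
```

In a real TensorFlow training step the same three operations would be applied to tensors, and the optimizer's apply step would only run when apply_update is True.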

Environment (please complete the following information):

  • OS: [e.g. Ubuntu xx.yy]
  • Graphic card: [e.g. a single NVIDIA V100 or NVIDIA DGX A100]
  • CUDA version: [e.g. CUDA 11.x]
  • Docker image: nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  • TensorFlow: 2.12.0+nv23.6
