PyTorch's implementations of the above are pretty straightforward: https://github.com/pytorch/pytorch/blob/master/torch/nn/utils/clip_grad.py

I would say clip_grad_norm needs to go through the TensorCollection API, so that:

- only the norm of the model's gradients is considered
- we can get access to each gradient's tensor

model.clip_grad_norm(&mut grads, 0.5);
model.clip_grad_value(&mut grads, 0.5);

Then we could implement clip_grad_norm with two passes using RecursiveWalker:

First pass: accumulate each gradient's norm. For each tensor & gradient:
1. Create a tensor out of the gradient using Gradients::get
2. Compute the norm of that tensor with g.square().sum().sqrt()
3. Append this 0d norm tensor to a Vec carried along by the walker

Between the passes:
- Call stack on the Vec of 0d norms
- Call stacked.square().sum().sqrt() to compute the total norm

Second pass: multiply each gradient by max_norm / total_norm, as done in the PyTorch code (which also clamps that coefficient to at most 1, so gradients are only ever scaled down).
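
To make the two passes concrete, here is a minimal sketch of the same arithmetic on plain buffers. It deliberately does not use the TensorCollection / RecursiveWalker machinery: the gradients are modeled as a slice of `Vec<f32>`, and the free function `clip_grad_norm` below is a hypothetical stand-in, not the proposed method on the model. The `1e-6` epsilon and the clamp to 1.0 mirror what the PyTorch code does.

```rust
/// Hypothetical stand-in: clips a set of gradient buffers so that their
/// combined L2 norm is at most `max_norm`. Returns the norm before clipping.
/// Plain `Vec<f32>` buffers are an assumption made for illustration only.
pub fn clip_grad_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // Pass 1: one L2 norm per gradient (the "Vec of 0d norms").
    let norms: Vec<f32> = grads
        .iter()
        .map(|g| g.iter().map(|x| x * x).sum::<f32>().sqrt())
        .collect();

    // Norm of the per-gradient norms gives the total norm.
    let total_norm = norms.iter().map(|n| n * n).sum::<f32>().sqrt();

    // Pass 2: scale every gradient by the clamped coefficient, so gradients
    // that are already within the bound are left unchanged.
    let clip_coef = (max_norm / (total_norm + 1e-6)).min(1.0);
    for g in grads.iter_mut() {
        for x in g.iter_mut() {
            *x *= clip_coef;
        }
    }
    total_norm
}
```

With gradients `[3.0]` and `[4.0]` the total norm is 5, so `max_norm = 1.0` scales both by roughly 0.2.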

If we wanted this all to be in-place:

For clip_grad_norm, we'd need a way to in-place multiply a D::Vec<E>.
For clip_grad_value, we'd need a way to in-place clamp a D::Vec<E>.
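
As a rough illustration of what those two in-place operations amount to on the CPU, here is a sketch that assumes the storage can be viewed as a mutable `f32` slice. The helper names `scale_in_place` and `clamp_in_place` are hypothetical, and other devices would need their own kernels.

```rust
/// Hypothetical helper: multiplies every element of `buf` by `factor`
/// without allocating a new buffer.
pub fn scale_in_place(buf: &mut [f32], factor: f32) {
    for x in buf.iter_mut() {
        *x *= factor;
    }
}

/// Hypothetical helper: clamps every element of `buf` to
/// `[-threshold, threshold]` without allocating a new buffer.
/// `threshold` is assumed to be non-negative (f32::clamp panics otherwise).
pub fn clamp_in_place(buf: &mut [f32], threshold: f32) {
    for x in buf.iter_mut() {
        *x = x.clamp(-threshold, threshold);
    }
}
```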

Separately, taking the norm with .square().sum().sqrt() may be expensive, since .square() will allocate another tensor with the same size as the gradient. I think this can be addressed in a follow-up though.
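
One possible way to avoid that intermediate allocation, sketched here on a plain `f32` slice rather than a tensor, is to fold the sum of squares directly. The function name `l2_norm` is hypothetical, and whether a fused reduction like this fits the existing kernels is an open question.

```rust
/// Hypothetical helper: L2 norm of a buffer computed in a single pass,
/// accumulating the sum of squares without materializing the squared values.
pub fn l2_norm(buf: &[f32]) -> f32 {
    buf.iter().fold(0.0f32, |acc, x| acc + x * x).sqrt()
}
```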

Gradient Clipping #596