Add a suggestion using the One Cycle Policy with Gradient clipping. #261


Closed · wants to merge 1 commit

Conversation

@Shuyib (Contributor) commented Jul 6, 2024

I have added a section to the notebook that walks through the One Cycle Policy using the PyTorch method, combined with gradient clipping. I have also added a short description of the method, which was edited by Claude 3.5 Sonnet. It may not be the best, but I welcome corrections, and I hope this method can become part of the appendix.

Note: I did not run the code in the cell on this pull request.
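Since the notebook cell wasn't run, here is a minimal pure-Python sketch of the two pieces this PR combines — the one-cycle learning-rate shape and global-norm gradient clipping. In PyTorch these correspond to `torch.optim.lr_scheduler.OneCycleLR` and `torch.nn.utils.clip_grad_norm_`; the hyperparameter values below are illustrative, not taken from the notebook:

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-3,
                 div_factor=25.0, final_div_factor=1e4, pct_start=0.3):
    """One-cycle shape: ramp up from max_lr/div_factor to max_lr over the
    first pct_start fraction of training, then anneal down to
    max_lr/final_div_factor. (A linear ramp is used here for simplicity;
    PyTorch's OneCycleLR defaults to cosine annealing in both phases.)"""
    initial_lr = max_lr / div_factor
    min_lr = max_lr / final_div_factor
    warmup_steps = max(1, int(pct_start * total_steps))
    if step < warmup_steps:
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (max_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """The math behind torch.nn.utils.clip_grad_norm_: if the combined
    L2 norm of all gradients exceeds max_norm, scale them all down so
    the combined norm equals max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```

In a training loop, the clipping step would run between `loss.backward()` and `optimizer.step()`, with the scheduler stepped once per batch.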


@rasbt (Owner) commented Jul 6, 2024

Thanks for the PR, but I currently can't accept additions to the chapters and the appendix because they have already been laid out by the publisher, and it would be confusing for readers if the code in the notebook differed from the code in the chapter. Thanks for contributing though!

Btw, in the code you mention the one-cycle policy, but the code in the appendix already implements this. So, may I ask how it's different and why it's needed?

@Shuyib (Contributor, Author) commented Jul 7, 2024

Thank you for your response to the pull request; I greatly appreciate it. The techniques indeed share similarities; however, I believe the core distinction lies in the balance between exploration and exploitation, reflected in the learning rate: it first increases from a low value to a peak and then decreases again.

@Shuyib Shuyib closed this Jul 7, 2024
@rasbt (Owner) commented Jul 10, 2024

Oh I see. At first I didn't see the difference because there wasn't a plot, and I assumed from the title that it was similar, but I think I understand the difference now. It's basically a full cycle, whereas in the appendix I have a half-cycle.

To be honest, I do prefer the current half-cycle implementation because that's how it is commonly done in LLM research (e.g., see https://arxiv.org/pdf/2403.08763). I haven't seen any LLM trained with a one-cycle (as opposed to half-cycle) policy, so I'd be a bit hesitant to recommend that. Thanks for the PR and the discussion though!
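For reference, the half-cycle schedule described here — a linear warmup followed by a single cosine decay, as is common in LLM training — can be sketched in plain Python (the hyperparameter values are illustrative, not taken from the appendix):

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=1e-3, min_lr=1e-5,
                     warmup_steps=20):
    """Half-cycle schedule: linear warmup to peak_lr, then one cosine
    decay down to min_lr. The learning rate never ramps back up, which
    is why it is only "half" of a cycle."""
    if step < warmup_steps:
        # linear warmup from peak_lr/warmup_steps to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # single cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The rise-then-fall shape looks similar to a full one-cycle policy, but there is no initial divisor phase and no final ramp below the starting rate: one warmup, one decay, done.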
