Optimize batch size and batch step size #28
base: dev
Conversation
Hi @wzzlcss: the comments in the PDF look promising. Can you say a bit more about what this PR is attempting to do?
Hi mentor, I will complete the benchmark for this PR right away. The formulas I am using are from https://github.com/gowerrobert/StochOpt.jl/blob/master/src/calculate_SAGA_rates_and_complexities.jl, and they are explained in Gazagnadou, N., Gower, R., & Salmon, J., "Optimal mini-batch and step sizes for SAGA", arXiv preprint arXiv:1902.00071 (https://arxiv.org/abs/1902.00071). I am using the practical approximation they propose.
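For concreteness, here is a rough sketch of the quantities involved, based on my reading of the paper and of StochOpt.jl; the function names and the d x n, one-column-per-sample layout are my own choices, not the paper's, so please double-check against the source:

```julia
using LinearAlgebra

# Sketch only. Ridge objective f_i(w) = (x_i' * w - y_i)^2 / 2 + (lambda / 2) * norm(w)^2,
# with the data X stored as a d x n matrix (one column per sample).
function smoothness_constants(X::AbstractMatrix, lambda::Real)
    n = size(X, 2)
    L    = eigmax(Symmetric(X * X' ./ n)) + lambda   # smoothness of the full objective
    Lmax = maximum(sum(abs2, X; dims=1)) + lambda    # max_i ||x_i||^2 + lambda
    return L, Lmax
end

# "Practical" estimate of the expected smoothness constant for b-nice sampling:
# it interpolates between L_max at b = 1 and L at b = n.
practical_L(b, n, L, Lmax) =
    (n * (b - 1)) / (b * (n - 1)) * L + (n - b) / (b * (n - 1)) * Lmax
```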
Very useful - thank you! I look forward to reading it before reviewing this PR. Let me know when you're ready for a look-over.
This PR aims to address the performance of mini-batching in ridge regression. In Figures 23-27 of the mini-batch paper (https://arxiv.org/abs/1902.00071), the authors study how the total number of gradients to be calculated depends on the mini-batch size. In Figures 16-21, they show that, with their optimized batch size and step size for ridge, mini-batching can use fewer epochs than SAGA to reach the same loss (with b_Defazio = 1 and gamma_Defazio = 1/(3(n*mu + L_max))). I used their scaled data covtype.binary (n = 581,012, d = 54) to reproduce Figure 16(a). They only plot passes 1 to 6, where neither method has converged yet. My result shows that mini-batching with their optimized batch size and step size does use fewer passes (npass) to converge (this version changes any batch size input > 1 to the optimal value). However, calculating their batch size and step size is not cheap, and it relies on a condition (p. 6): if one eigenvalue of XX^T is significantly larger than the rest, then L can be used to approximate L_bar. Without this modification the step size gets too small, and mini-batching is very sensitive to it. So I don't know whether I should build these formulas into our source code or leave the step size as an option for the user to set.
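The step-size rule I am implementing then looks roughly like the following. This is only a sketch: the constants in the second term of the max are my reading of the paper, and they should be verified against StochOpt.jl before merging.

```julia
# Sketch of the step size for b-nice SAGA as I read it from the paper and
# StochOpt.jl (constants to be double-checked). As a sanity check, at b = 1
# this reduces to gamma = 1 / (4 * Lmax + n * mu).
function gamma_nice(b, n, mu, L, Lmax)
    # practical expected smoothness, same interpolation as in the sketch above
    Lb  = (n * (b - 1)) / (b * (n - 1)) * L + (n - b) / (b * (n - 1)) * Lmax
    # second term of the max, which guards the strongly convex regime
    rhs = (n - b) / (b * (n - 1)) * Lmax + mu * n / (4 * b)
    return 1 / (4 * max(Lb, rhs))
end
```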
Although the reduction in the number of epochs to convergence from SAGA to mini-batching is not very large, mini-batching requires more epochs if the batch size and step size are not modified, or when lambda is very small. So for overall speed it is important to make a single mini-batch epoch faster, but I cannot find a way to vectorize the computation over a random index set.
Hi @wzzlcss, so that I understand the above: do we actually need to know lambda_1(XX^T) and lambda_2(XX^T) (the two largest eigenvalues of XX^T), or just the difference between them? There are fast ways to get approximate eigenvalues if we think it's super important; see, e.g., the
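A minimal power-iteration sketch for approximating lambda_1(XX^T) without ever forming XX^T; the function name and iteration count are my own choices:

```julia
using LinearAlgebra

# Sketch: approximate the largest eigenvalue of X X' by power iteration,
# touching X only through matrix-vector products (no d x d matrix is formed).
function approx_lambda1(X; iters::Int = 100)
    v = randn(size(X, 1))
    v ./= norm(v)
    for _ in 1:iters
        v = X * (X' * v)   # apply X X' implicitly
        v ./= norm(v)
    end
    return dot(v, X * (X' * v))   # Rayleigh quotient estimate of lambda_1
end
```

lambda_2 could then be estimated by deflating out the leading eigenvector, or both leading eigenvalues could be computed at once with a Lanczos solver such as eigs from Arpack.jl.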
Re: random indices. I'm pretty comfortable suggesting
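For instance, a single indexed slice over the mini-batch, along these lines (a sketch with names of my choosing, showing a plain mini-batch ridge step; a SAGA gradient table would be gathered and scattered with the same idx):

```julia
using Random

# Sketch: one vectorized mini-batch step for ridge, with X stored as d x n.
# Gathering the sampled columns in a single slice replaces the per-index loop.
function minibatch_step!(w, X, y, lambda, gamma, b)
    n   = size(X, 2)
    idx = randperm(n)[1:b]              # b-nice sampling, without replacement
    Xb  = view(X, :, idx)               # gather the mini-batch in one slice
    r   = Xb' * w .- view(y, idx)       # residuals on the sampled points
    g   = (Xb * r) ./ b .+ lambda .* w  # mini-batch ridge gradient
    w .-= gamma .* g                    # in-place parameter update
    return w
end
```

Note that randperm(n)[1:b] costs O(n) per draw; StatsBase.sample(1:n, b; replace=false) avoids shuffling all n indices.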

batch-opt.pdf