
Conversation

@wzzlcss
Collaborator

@wzzlcss wzzlcss commented Aug 15, 2019

@wzzlcss wzzlcss added the optimization Changes relating to optimization and performance label Aug 15, 2019
@michaelweylandt
Collaborator

Hi @wzzlcss: the comments in the PDF look promising. Can you say a bit more about what this PR is attempting to do?

@wzzlcss
Collaborator Author

wzzlcss commented Aug 21, 2019

Hi mentor, I will complete the benchmark for this PR right away. The formulas I am using are from https://github.com/gowerrobert/StochOpt.jl/blob/master/src/calculate_SAGA_rates_and_complexities.jl, and they are explained in Gazagnadou, N., Gower, R., & Salmon, J., "Optimal mini-batch and step sizes for SAGA", arXiv preprint arXiv:1902.00071 (https://arxiv.org/abs/1902.00071). I am using the practical approximation they propose.
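
Roughly, the practical estimate translates to R like this (a sketch under my reading of the paper, not the PR code; the function and variable names are mine, and I assume the ridge objective f(w) = 1/(2n) * ||Xw - y||^2 + (lambda/2) * ||w||^2):

```r
## Sketch of the "practical" expected-smoothness estimate L_practical(b)
## from Gazagnadou, Gower & Salmon (arXiv:1902.00071). Names are mine.
## For ridge: L_max = max_i ||x_i||^2 + lambda, L = lambda_max(X'X)/n + lambda.
practical_stepsize <- function(X, lambda, b) {
  n <- nrow(X)
  row_norms2 <- rowSums(X^2)
  L_max <- max(row_norms2) + lambda
  # largest eigenvalue of X'X; computed exactly here, approximations exist
  L <- max(eigen(crossprod(X), symmetric = TRUE,
                 only.values = TRUE)$values) / n + lambda
  # practical estimate of the expected smoothness constant for b-nice sampling
  L_prac <- (n * (b - 1)) / (b * (n - 1)) * L +
            (n - b) / (b * (n - 1)) * L_max
  # step size; the theorem in the paper also takes a max with a
  # mu-dependent term, which I omit in this sketch
  1 / (4 * L_prac)
}
```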

@michaelweylandt
Collaborator

Very useful, thank you! I look forward to reading it before reviewing this PR. Let me know when you're ready for a look-over.

@wzzlcss
Collaborator Author

wzzlcss commented Aug 23, 2019

This PR aims to address the performance problem of mini-batch SAGA in ridge regression.

In Figures 23-27 of the mini-batch paper (https://arxiv.org/abs/1902.00071), the authors study how the total number of gradient computations needed to reach a suboptimality threshold of thresh = 1e-4 depends on the mini-batch size. The plots show that this number explodes when the batch size is too large.

In Figures 16-21, they show that, with their optimized batch size and step size for ridge regression, mini-batch SAGA needs fewer epochs to reach the loss of plain SAGA (run with b_Defazio = 1 and gamma_Defazio = 1 / (3 * (n * mu + L_max))).

I used their scaled dataset covtype.binary (n = 581,012, d = 54) to reproduce Figure 16 (a). They only plot passes 1 to 6, over which neither method has converged yet. My result shows that mini-batch SAGA with their optimized batch size and step size does need fewer passes (npass) to converge (this version of the PR replaces any batch size input > 1 with the optimal value).

[Plot attachment: batch_test.pdf]

However, calculating their batch size and step size is not cheap, and it relies on a condition stated on p. 6 of the paper: if one eigenvalue of X X^T is significantly larger than the rest, then L can be used to approximate L_bar. Without this approximation, the step size becomes too small, and mini-batch SAGA is very sensitive to the step size.
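
Concretely, the condition could be checked with something like this (a sketch of my own, not from the paper; the 0.9 threshold is a placeholder I made up):

```r
## Check whether one eigenvalue of X X^T dominates the spectrum.
## lambda_1(X X^T) = lambda_1(X^T X), and trace = sum of all eigenvalues,
## so dominance means lambda_1 accounts for most of sum_i ||x_i||^2.
dominant_eigenvalue <- function(X, ratio = 0.9) {
  ev <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values
  ev[1] / sum(ev) > ratio
}
```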

So I am not sure whether I should build these formulas into our source code or leave an option for the user to set the step size.

@wzzlcss
Collaborator Author

wzzlcss commented Aug 23, 2019

Although the reduction in the number of epochs to convergence when going from SAGA to mini-batch is not very large, mini-batch needs more epochs if the batch size and step size are not adjusted, or when lambda is very small. So for this optimization it is important to speed up a single mini-batch epoch, but I cannot find a way to vectorize the computation over a random index set.
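
In plain R the gradient for one random batch can at least be written in vectorized form, e.g. (a sketch for the ridge objective; this does not solve vectorizing across batches within an epoch, which is where I am stuck):

```r
## Vectorized ridge mini-batch gradient over one random index set (sketch).
minibatch_gradient <- function(X, y, w, lambda, b) {
  idx <- sample(nrow(X), b)                 # b-nice sampling, no replacement
  Xb <- X[idx, , drop = FALSE]
  r <- drop(Xb %*% w) - y[idx]              # residuals on the mini-batch
  drop(crossprod(Xb, r)) / b + lambda * w   # stochastic gradient estimate
}
```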

@michaelweylandt
Collaborator

Hi @wzzlcss,

So that I understand the above: do we actually need to know lambda_1(XX^T) and lambda_2(XX^T) (the two largest eigenvalues of XX^T), or just the difference between them?

There are fast ways to get approximate eigenvalues if we think it's super important; see, e.g., the irlba package.
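
E.g., something like (a quick sketch; assumes X is already in scope):

```r
## Top-two eigenvalues of X X^T via truncated SVD of X:
## lambda_i(X X^T) = sigma_i(X)^2, so square the singular values.
library(irlba)
top2 <- irlba(X, nv = 2)$d^2
```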

@michaelweylandt
Collaborator

Re: random indices. I'm pretty comfortable suggesting cyclic = TRUE, minibatch = OPTIMIZED as defaults, so I wouldn't worry too much about making cyclic = FALSE minibatches super-optimized for now.

