Some questions about Routing Strategy: Soft vs Discrete #2

Open
pierowu opened this issue Jan 8, 2024 · 1 comment
Comments

@pierowu

pierowu commented Jan 8, 2024

Thank you for your enlightening work in the paper!

I have a question about the routing strategy. The paper says:

'Note that, although the computation is conditional to the top-k experts, the required memory depends on the total number of experts.'

This seems to imply that the discrete routing strategy has no advantage over soft merging in terms of memory cost.

But as far as I know, although the memory depends on the total number of experts, discrete routing can still save memory, because we don't need to store the gradients and activations of experts that are not activated.
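To make the distinction concrete, here is a rough back-of-the-envelope sketch, not the paper's accounting. It assumes (my assumptions, not from the paper) that weights, gradient buffers and optimizer state are allocated for every expert, that "soft" routing evaluates all experts for each token, and that activation memory scales with the number of experts each token passes through. The function name `estimate_layer_memory` and all the sizes are hypothetical, and the frozen base model is ignored.

```python
def estimate_layer_memory(num_experts, experts_per_token, params_per_expert,
                          num_tokens, act_elems_per_token,
                          bytes_per_elem=4, optimizer_slots=2):
    """Rough training-memory estimate (in bytes) for one MoE layer.

    Hypothetical model: parameter-side memory grows with the TOTAL number
    of experts, activation memory grows with the experts evaluated per token.
    """
    total_params = num_experts * params_per_expert
    # weights + gradients + optimizer slots (e.g. 2 for Adam), assumed
    # to be allocated for every expert regardless of routing
    parameter_side = total_params * bytes_per_elem * (2 + optimizer_slots)
    # activations saved for backward, only for experts a token goes through
    activations = (num_tokens * experts_per_token
                   * act_elems_per_token * bytes_per_elem)
    return {
        "parameter_side": parameter_side,
        "activations": activations,
        "total": parameter_side + activations,
    }


# Illustrative numbers only: 8 experts, top-2 discrete routing vs. all 8.
discrete = estimate_layer_memory(8, 2, 1_000_000, 4096, 64)
soft = estimate_layer_memory(8, 8, 1_000_000, 4096, 64)
print(discrete)
print(soft)
```

Under these assumptions the parameter-side term is identical for both strategies and only the activation term differs, which is why the trainable-parameter count alone cannot capture the gap. A soft-merging implementation that merges expert parameters before the forward pass would need a different activation term.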

If we take this into account, it seems unfair to compare only the number of trainable params across different PEFT methods, because the parameter count does not correspond exactly to memory usage.

Could you give some insight into how to calculate the memory cost in the MoE setting, and how to compare different methods fairly?

Thank you for your reply!

@wutaiqiang

I think the most straightforward way is to compare the GPU memory actually used. However, that figure varies with the platform and even the torch version.
