Thank you for your enlightening work in the paper!
I have a question about the routing strategy. The paper says:
'Note that, although the computation is conditional to the top-k experts, the required memory depends on the total number of experts.'
This seems to imply that a discrete routing strategy has no advantage over soft merging in terms of memory cost.
But as far as I know, even though the memory for the weights depends on the total number of experts, discrete routing can still save memory, because we don't need to store the gradients and activations of experts that are not activated.
If we take this into account, it seems unfair to compare only the number of trainable params across different PEFT methods, because the number of params does not map exactly to memory usage.
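To make the concern concrete, here is a rough back-of-the-envelope sketch of what I mean. It is not the paper's accounting: `MoEConfig`, `estimate_memory_bytes`, and all field names are hypothetical, and it assumes that (a) weights and weight gradients are held for every expert, (b) soft merging evaluates every expert for every token while discrete routing evaluates only the top-k, and (c) optimizer state and the frozen base model are ignored.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical description of one MoE layer (all names are illustrative)."""
    num_experts: int        # total number of experts E
    top_k: int              # experts evaluated per token under discrete routing
    params_per_expert: int  # trainable parameters in a single expert
    acts_per_token: int     # values one expert saves per token for backward
    num_tokens: int         # tokens in the batch
    bytes_per_value: int = 4  # fp32


def estimate_memory_bytes(cfg: MoEConfig, discrete_routing: bool) -> dict:
    """Rough per-layer training memory under the assumptions stated above."""
    # Weights are resident for all experts regardless of routing.
    weights = cfg.num_experts * cfg.params_per_expert
    # Assumption: over a batch every expert receives some token,
    # so a gradient buffer exists for every expert.
    grads = weights
    # Saved activations scale with the experts evaluated per token.
    active = cfg.top_k if discrete_routing else cfg.num_experts
    activations = cfg.num_tokens * active * cfg.acts_per_token
    b = cfg.bytes_per_value
    return {
        "weights": weights * b,
        "grads": grads * b,
        "activations": activations * b,
        "total": (weights + grads + activations) * b,
    }


cfg = MoEConfig(num_experts=8, top_k=2, params_per_expert=1000,
                acts_per_token=16, num_tokens=100)
soft = estimate_memory_bytes(cfg, discrete_routing=False)
topk = estimate_memory_bytes(cfg, discrete_routing=True)
print(soft["total"], topk["total"])
```

Under these assumptions both settings have the same trainable parameter count, yet the activation term differs by a factor of `num_experts / top_k`, which is why the parameter count alone may not reflect memory.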
Could you give some insight into how to calculate the memory cost in the MoE setting, and how to compare different methods fairly?
Thank you for your reply!