Deprecate convert_to_singleton #691
Comments
@andrewPoulton Is this true though? I was unable to convert 8 shards successfully to
@ayeeyecorp Can you share the stack trace? I suspect it might be related to the fact that the checkpoints available on the OPT page are flattened, which are not compatible with
@tangbinh let's add a flat param check to reshard_*, and raise an error unless the user specifically wants to unflatten. I'll create an issue to track in a bit. Happy to own as well.
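For illustration, a flat-parameter check along these lines might look like the minimal sketch below; the key substring "flat_param" and the helper name are assumptions for illustration, not the actual metaseq implementation:

```python
def assert_not_flattened(state_dict: dict) -> None:
    # Flattened FSDP checkpoints store each shard's parameters as concatenated
    # "flat_param" tensors rather than per-module weights, which the resharding
    # scripts can't split back apart without unflattening first.
    model_sd = state_dict.get("model", state_dict)
    flat_keys = [k for k in model_sd if "flat_param" in k]
    if flat_keys:
        raise ValueError(
            f"Found flattened FSDP parameters (e.g. {flat_keys[:3]}); "
            "run reshard_fsdp with unflattening enabled before resharding."
        )
```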
@andrewPoulton I was adding an option to split the KVQ weights in
Once we've fixed #625, I think we can safely remove
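(For context on the KVQ splitting mentioned above, the mechanical part is a single chunked split. A minimal sketch follows; the fused layout and the q/k/v chunk order are assumptions that must match how the weights were actually fused:)

```python
import torch

def split_fused_qkv(fused_weight: torch.Tensor) -> dict:
    # Fused projection assumed to have shape (3 * embed_dim, embed_dim);
    # the chunk order (q, k, v) is an assumption about the fusion layout.
    q, k, v = torch.chunk(fused_weight, 3, dim=0)
    return {"q_proj.weight": q, "k_proj.weight": k, "v_proj.weight": v}
```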
@andrewPoulton - I did not save the stack trace from that particular test, but I can redo it. However, here is the tail-end snippet of the stack trace after running
The 992 shards were first converted to
Should I have set
@ayeeyecorp Just so I'm clear - you first ran reshard_fsdp on the shards (with unflatten-weights=true), then tried running convert_to_singleton on the consolidated shards? If that's so, then can you try running reshard_mp on the consolidated shards instead?
> you first ran reshard_fsdp on the shards (with unflatten-weights=true), then tried running convert_to_singleton on the consolidated shards?

Correct, this resulted in the

> If that's so, then can you try running reshard_mp on the consolidated shards instead?

Will do that again shortly and post stack trace results.
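(Conceptually, consolidating MP parts the way reshard_mp does is a per-tensor merge: column-parallel weights concatenate along dim 0, row-parallel weights along dim 1, and replicated tensors are taken from a single part. The sketch below illustrates the idea only; the key-classification rule here is an assumption, not the actual reshard_mp logic.)

```python
import torch

def merge_mp_parts(paths: list) -> dict:
    """Merge model-parallel checkpoint parts into one state dict (sketch)."""
    parts = [torch.load(p, map_location="cpu")["model"] for p in paths]
    merged = {}
    for key, first in parts[0].items():
        tensors = [part[key] for part in parts]
        if not torch.is_tensor(first) or first.dim() == 0:
            merged[key] = first                       # non-tensor / scalar entries
        elif any(s in key for s in ("qkv_proj", "fc1", "embed_tokens")):
            merged[key] = torch.cat(tensors, dim=0)   # column/vocab-parallel
        elif any(s in key for s in ("out_proj", "fc2")) and key.endswith("weight"):
            merged[key] = torch.cat(tensors, dim=1)   # row-parallel
        else:
            merged[key] = first                       # replicated (norms, biases)
    return merged
```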
@ayeeyecorp May I ask why you want to convert the 8 MP parts of OPT 175B into a singleton? I don't think you would be able to load the singleton into any GPU considering its size, which is about 350GB.
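(The 350GB figure follows directly from the parameter count at fp16 precision:)

```python
n_params = 175e9                        # OPT 175B parameter count
bytes_total = n_params * 2              # 2 bytes per parameter in fp16
print(f"~{bytes_total / 1e9:.0f} GB")   # ~350 GB
```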
I started over earlier today from the 992 shards (resetting my environment per the instructions here, using Python 3.8) and verified that the 8 consolidated FSDP shards had the correct md5sum. Upon confirmation, I converted the 8 checkpoints into a single one (eliminating MP) with the
Not sure what the original problem was. The md5sum of the single checkpoint (325.2 GB) was:

The subsequent step to convert to Hugging Face using:
failed after 1+ hour with the following stack trace:
I followed @patrickvonplaten's conversion instructions found here and generated a config.json with the following:
Thoughts on what could be going wrong with the HF conversion? I will re-run the operation overnight and log the full failure stack trace. @tangbinh - thank you for the clarification. I am converting the 8 MP parts of OPT 175B into a singleton to run quantization experiments against
@ayeeyecorp For OPT 175B, we should have
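(The exact values from this reply are elided above. For reference, the OPT paper gives the 175B architecture as 96 layers, 96 attention heads, hidden size 12288, and FFN size 49152. A hedged sketch of generating a matching config.json with transformers.OPTConfig, assuming a transformers version that ships OPT; other fields keep the library defaults and may need adjusting:)

```python
from transformers import OPTConfig

# Hyperparameters from the OPT paper for the 175B model.
config = OPTConfig(
    num_hidden_layers=96,
    num_attention_heads=96,
    hidden_size=12288,
    ffn_dim=49152,
    word_embed_proj_dim=12288,
    max_position_embeddings=2048,
    vocab_size=50272,
)
config.save_pretrained("opt-175b-config")  # writes config.json (path is a placeholder)
```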
@tangbinh that was quick! Brilliant, I'll give that a go now. I blindly used the values from HF... thank you
After updating the instance to 1TB+ of RAM... I successfully generated a .bin file using

Thanks for the support.
As noted in #689, convert_to_singleton doesn't produce state dicts with compatible keys (for some as-yet-unknown reason).
Since reshard_mp can do the same job without the GPU-node requirement of convert_to_singleton, we should deprecate convert_to_singleton.
TODO: Work out dependencies on convert_to_singleton, and identify any special cases it can handle that reshard_mp can't (such as separating out QKV weights, as noted by @tangbinh).
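A quick way to surface the key mismatch from #689 is to diff the state-dict keys of the two outputs. A minimal sketch, with placeholder file names:

```python
import torch

# Placeholder file names: a convert_to_singleton output vs. a reshard_mp output.
singleton = torch.load("restored.pt", map_location="cpu")
resharded = torch.load("reshard-model_part-0.pt", map_location="cpu")

keys_a = set(singleton.get("model", singleton))
keys_b = set(resharded.get("model", resharded))
print("only in singleton:", sorted(keys_a - keys_b)[:10])
print("only in resharded:", sorted(keys_b - keys_a)[:10])
```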