IndexError: tuple index out of range #47

Open
zyushun opened this issue May 13, 2024 · 11 comments

@zyushun

zyushun commented May 13, 2024

Hi Jiawei,

I was trying GaLore on TinyLlama-1B using the codebase https://github.com/jzhang38/TinyLlama on 4x A800-80GB GPUs. I encountered the following error:

[rank1]:     optimizer.step()
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 74, in step
[rank1]:     output = self._strategy.optimizer_step(
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 207, in optimizer_step
[rank1]:     return self.precision.optimizer_step(optimizer, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/fsdp.py", line 142, in optimizer_step
[rank1]:     return super().optimizer_step(optimizer, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 124, in optimizer_step
[rank1]:     return optimizer.step(**kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/galore_torch/adamw.py", line 96, in step
[rank1]:     grad = state["projector"].project(grad, state["step"])
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 15, in project
[rank1]:     if full_rank_grad.shape[0] >= full_rank_grad.shape[1]:
[rank1]: IndexError: tuple index out of range

I use GaLore as you suggested in torchrun_main.py:

print('using galore')
galore_params = []
target_modules_list = [ "attn", "mlp"]
for module_name, module in model.named_modules():
    if not isinstance(module, nn.Linear):
        continue

    if not any(target_key in module_name for target_key in target_modules_list):
        continue
    
    print('enable GaLore for weights in module: ', module_name)
    galore_params.append(module.weight)

id_galore_params = [id(p) for p in galore_params]

# put parameters without "rank" into another group
regular_params = [p for p in model.parameters() if id(p) not in id_galore_params]
# then call galore_adamw
param_groups = [{'params': regular_params}, 
                {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]
    
optimizer = GaLoreAdamW(param_groups, lr=learning_rate)

Any idea why and how to fix it?

Thanks in advance!

@nicosouth

Did you solve it? I have the same error.

@zyushun

zyushun commented May 15, 2024

Did you solve it? I have the same error.

Not yet.

@Jackie0601zhou

I have the same problem as well.

@FabioDataGeek

Same problem here. The structure of my grouped parameters is:

List[Dict['params'], Dict['params', 'rank', 'update_proj_gap', 'scale', 'proj_type']]

where 'params' is a list of tensors. I'm trying it with pretrained models from Hugging Face.

@FabioDataGeek

I found the cause of the error. For the projection to a lower rank, you need tensors of dimension 2 (matrices). If 1-D parameters such as LayerNorm weights are added to galore_params, the error is raised when trying to access the second dimension.

Make sure that ALL the parameters sent to galore_params have a second dimension.
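
One way to guard against this when building the groups (just a sketch along the lines of the snippet in the first post; 'model' is assumed to be any torch.nn.Module, and the GaLore hyperparameters are copied from above):

from galore_torch import GaLoreAdamW

# keep only 2-D weight matrices for GaLore; you may additionally want to
# restrict this to the target modules ("attn", "mlp") as in the first post
galore_params = [p for n, p in model.named_parameters()
                 if p.requires_grad and p.dim() == 2]
id_galore_params = {id(p) for p in galore_params}

# everything else (biases, LayerNorm weights, ...) goes to the regular group
regular_params = [p for p in model.parameters() if id(p) not in id_galore_params]

param_groups = [{'params': regular_params},
                {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]
optimizer = GaLoreAdamW(param_groups, lr=1e-5)  # lr is just a placeholder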

@nicosouth

I found the cause of the error. For the projection to a lower rank, you need tensors of dimension 2 (matrices). If 1-D parameters such as LayerNorm weights are added to galore_params, the error is raised when trying to access the second dimension.

Make sure that ALL the parameters sent to galore_params have a second dimension.

Do you use DeepSpeed and train on multi-node? I just select the attn and mlp modules for GaLore.

@FabioDataGeek

I'm just testing GaLore locally with plain PyTorch; this is the function I've been using to group the parameters:

def galore_parameters(model):
    galore_params = []
    non_galore_params = []
    for name, param in model.named_parameters():

        # 2-D embedding matrices (skip the 1-D LayerNorm parameters)
        if 'embeddings' in name and not 'LayerNorm' in name:
            galore_params.append(param)
            continue

        # weight matrices inside the transformer layers
        if 'layer' in name and 'weight' in name and not 'LayerNorm' in name:
            galore_params.append(param)
            continue

        # classification head weight
        if 'classifier' in name and not 'bias' in name:
            galore_params.append(param)
            continue

        # everything else (biases, LayerNorm, ...) stays in the regular group
        non_galore_params.append(param)

    # 'proj_type': 'std', 'reverse_std', 'right', 'left', 'full'
    param_groups = [{'params': non_galore_params},
                    {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]

    # sanity check: every GaLore parameter must be a 2-D matrix
    for param in galore_params:
        if param.dim() != 2:
            raise ValueError('Galore only supports 2D parameters')

    return param_groups

Consider the 'model' variable to be a pre-trained model loaded from Hugging Face. I checked the different layer names with the VS Code debugger and set the 'if' statements accordingly, so you should change them for your specific model. For instance, 'classifier' applies only to classification heads on top of language models.

The test was done with RoBERTa.
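
For completeness, a minimal usage sketch (assuming the function above, galore_torch installed, and a stock 'roberta-base' checkpoint; the learning rate is only a placeholder):

from transformers import AutoModelForSequenceClassification
from galore_torch import GaLoreAdamW

model = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
param_groups = galore_parameters(model)          # the grouping function above
optimizer = GaLoreAdamW(param_groups, lr=1e-5)   # lr applies to both groups; rank/scale only to the GaLore group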

@dinhanhx

dinhanhx commented Jun 3, 2024

From my wacky understanding, GaLore only works with nn.Linear().weight.

@FabioDataGeek

Actually, it worked for me with the embedding layers, which come from nn.Embedding().
I only found the above-mentioned problem when the current layer's tensor has only one dimension, i.e., size [768] instead of [768, 768].

@FabioDataGeek

Specifically, the error comes from line 15 in galore_projector.py:

if full_rank_grad.shape[0] >= full_rank_grad.shape[1]:

For the standard projection type, it compares the first and second dimensions of each tensor passed in galore_params. You would get the same error with every projection type that compares both dimensions.
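
A minimal illustration of the failure mode (just a sketch, not the library code): a 1-D tensor has no shape[1], so the comparison above raises the IndexError.

import torch

grad_2d = torch.randn(768, 768)
grad_2d.shape[0] >= grad_2d.shape[1]   # fine, both dimensions exist

grad_1d = torch.randn(768)             # e.g. a LayerNorm weight, or a gradient flattened by FSDP
grad_1d.shape[1]                       # IndexError: tuple index out of range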

@Shinechaote

I am facing the same issue. I am training with FSDP and use_orig_params=True, but FSDP still flattens the parameters. Has anyone run into this before and managed to fix it?
