Question about reproducing the results on math_10k #58
Comments
Hi, can I ask whether you used multiple GPUs for training? If so, please try with a single GPU.
I use a single GPU.
Hi, did you solve this problem? My results are close to yours.
Unfortunately, I haven't solved it yet.
You can use transformers==4.35.0. With that version the results should be close to the authors'.
Thank you so much!!!
@Zhenyu001225 Any idea why this happens? An extreme case is transformers 4.40.0, which gave me gibberish output, as mentioned in this issue. Thanks.
I think it's because of the tokenizer version. CUDA_VISIBLE_DEVICES=1 python finetune.py For Commonsense: CUDA_VISIBLE_DEVICES=8 python finetune.py
Hi, could you kindly share your requirements.txt with pinned versions? I think that, besides the version of transformers, the versions of accelerate and tokenizers also affect the results. Thank you so much!
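For anyone comparing environments, a minimal sketch of how to list the installed versions of the packages mentioned in this thread. It uses only the Python standard library; the package list and the function name `installed_versions` are assumptions for illustration, not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages mentioned in this thread as possibly affecting the results.
PACKAGES = ["transformers", "tokenizers", "accelerate", "peft"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print in a requirements.txt-like form so it can be pasted into a comment.
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}" if ver else f"# {name} is not installed")
```

Posting this output alongside the reported accuracy makes it easier to see which version differences line up with which results.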
@Zhenyu001225 After switching to transformers 4.35.0, training is very unstable: the training loss goes to 0 and the validation loss goes to NaN. Do you have the same problem?
Hi, I have the same problem. Did you solve it?
@YYing0111 Try installing transformers with
Hi, I fine-tuned the Llama-7B model using LoRA with math_10k on a single A100 GPU with transformers==4.35.0, but still got a much lower accuracy (27.2%) on SVAMP than the reported number (52.1%). From a manual analysis of the generated responses, it seems that the model generates a lot of irrelevant code after finishing its reasoning steps. The final answer for the math datasets is taken to be the last float number in the response. When random code follows the reasoning, the evaluation picks up a number from that trailing text instead of the actual answer, which lowers the accuracy. Here's an example:
Here it treats 300 as the answer since that's the last number in the generated response, while the actual reasoning by Llama is correct in the first half of the generation. Does anyone know how to fix this? Thanks! Edit: Also here's my ft command:
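The failure mode described in this comment, where the last number in the response is taken as the answer, can be illustrated with a small sketch. This is not the repository's evaluation code: the regex, the function names, the stop markers, and the sample response are all made up for illustration, and cutting the response at a marker is only one possible workaround.

```python
import re

# Matches integers and simple decimals, with an optional leading minus sign.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

# Hypothetical markers for where unrelated trailing code might begin.
STOP_MARKERS = ("\ndef ", "\nimport ", "\nclass ")


def last_number(text):
    """Return the last number found in text as a float, or None."""
    # Commas are dropped so that "1,200" is read as 1200.
    matches = NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def truncate_at_marker(text, markers=STOP_MARKERS):
    """Cut text at the earliest marker that occurs, if any."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Invented response: correct reasoning followed by irrelevant code.
response = (
    "Each box holds 4 apples, so 3 boxes hold 3 * 4 = 12 apples. "
    "The answer is 12.\n\ndef solve():\n    return 300\n"
)

print(last_number(response))                      # picks up 300.0 from the trailing code
print(last_number(truncate_at_marker(response)))  # 12.0 once the tail is cut off
```

Whether truncation like this is appropriate depends on what the trailing text looks like in practice; limiting the number of generated tokens or stopping at the end-of-sequence token are other options worth checking.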
Hi, thank you for your awesome work!
I have one question about the training on the math_10k dataset.
python finetune.py --base_model 'yahma/llama-7b-hf' --data_path 'ft-training_set/math_10k.json' --output_dir './trained_models/llama-7b-lora-math/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64
However, I only get 16.14 on AQuA and 46.9 on SVAMP, whereas the table reports 18.9 on AQuA and 52.1 on SVAMP.
I'm using the peft library from the GitHub repo. Do you have any insights on this? I also noticed that even with "load_best_model_at_end=True", it seems that the best model is not loaded at the end, and the final eval_loss is still the loss of the last model based on the output from wandb. Is this correct?
Thank you so much in advance.
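One way to check the question about "load_best_model_at_end=True" is to inspect the `trainer_state.json` file that the Hugging Face Trainer writes into each checkpoint directory. The sketch below assumes that file contains the `best_model_checkpoint` and `log_history` fields; the function name and any path passed to it are hypothetical.

```python
import json
import math


def best_eval_entry(trainer_state_path):
    """Return (recorded best checkpoint, log entry with the lowest eval_loss)."""
    with open(trainer_state_path) as f:
        state = json.load(f)
    # Keep only evaluation log entries with a usable (non-NaN) eval_loss.
    evals = [
        entry
        for entry in state.get("log_history", [])
        if "eval_loss" in entry and not math.isnan(entry["eval_loss"])
    ]
    lowest = min(evals, key=lambda entry: entry["eval_loss"]) if evals else None
    return state.get("best_model_checkpoint"), lowest
```

If the step of the lowest-loss entry does not match the recorded best checkpoint, or if the final reported eval_loss differs from the lowest one, that would suggest the final evaluation was not run on the best model.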