We ran into a problem while fine-tuning your VideoLLM-online-8B-v1+ model. We merged the weights of llama3 and VideoLLM-online-8B-v1+ using the peft_model.merge_and_unload API from PEFT. However, when we used the merged model with our 52 videos, its output behaved unexpectedly. The model also does not generate the correct output when run with the demo.cli code on the cooking.mp4 file in the repo.
Here is an example of the output of your original model:
{
"role": "assistant",
"content": "(Video Time = 3.0s) Assistant: You pick up a wooden spoon.",
"time": 3.0,
"fps": 5.416413173116924,
"cost": 0.39011454582214355
},
And here is an example of the output of the fine-tuned model:
"conversation": [
{
"role": "user",
"content": "(Video Time = 0.0s) User: Please narrate the video in real time.",
"time": 0.0,
"fps": 0.23624239822388332,
"cost": 4.232940435409546
},
{
"role": "assistant",
"content": "(Video Time = 0.0s) Assistant: Sure thing! I'll start. (receiving video frame 1) Ah, it looks like we're starting with a shot of the user's face. (receiving video frame 2) Oh, nice! The user is smiling. (receiving video frame 3) Wait, what's this? The user is looking at something on their phone. (receiving video frame 4) Ah, it's a text message! The user is reading it. (receiving video frame",
"time": 0.0,
"fps": 0.23624239822388332,
"cost": 4.232940435409546
},
{
"time": 0.5,
"fps": 0.46619712807148594,
"cost": 0.05709028244018555
},
{
"time": 1.0,
"fps": 0.6906405449067414,
"cost": 0.05376291275024414
},
{
"time": 1.5,
"fps": 0.9073323156329285,
"cost": 0.06473445892333984
},
{
"time": 2.0,
"fps": 1.1198228832382782,
"cost": 0.05646371841430664
},
{
"time": 2.5,
"fps": 1.3254263648771671,
"cost": 0.06185340881347656
},
{
"time": 3.0,
"fps": 1.5249576989693647,
"cost": 0.063446044921875
},
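The symptom in the log above is that the assistant turn contains the literal text "(receiving video frame" instead of a short narration. A minimal sketch for flagging such turns in a saved conversation log follows. It assumes the log has been loaded as a list of dicts with the `role`, `content`, and `time` keys shown above; the function name `find_suspect_turns` and the marker string are our own choices for illustration, not part of the repo.

```python
def find_suspect_turns(conversation, marker="(receiving video frame"):
    """Return (index, time, marker_count) for assistant turns containing the marker.

    `conversation` is assumed to be a list of dicts shaped like the log above.
    Entries without a "content" key (frame-only entries) are skipped.
    """
    suspects = []
    for index, turn in enumerate(conversation):
        content = turn.get("content", "")
        if turn.get("role") == "assistant" and marker in content:
            suspects.append((index, turn.get("time"), content.count(marker)))
    return suspects


if __name__ == "__main__":
    # Shortened stand-ins for the two outputs quoted in this issue.
    log = [
        {"role": "assistant",
         "content": "(Video Time = 3.0s) Assistant: You pick up a wooden spoon.",
         "time": 3.0},
        {"role": "assistant",
         "content": "(Video Time = 0.0s) Assistant: Sure thing! "
                    "(receiving video frame 1) Ah. (receiving video frame 2) Oh.",
         "time": 0.0},
        {"time": 0.5},
    ]
    print(find_suspect_turns(log))
```

Running this over the logs from the original and the fine-tuned checkpoints should make it easy to see how many responses show the problem.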
Could you suggest the right way to fine-tune your model?