Question about finetuning VideoLLM #39

Open
steveice opened this issue Oct 2, 2024 · 0 comments

We ran into a problem while fine-tuning your VideoLLM-online-8B-v1+ model. We merged the weights of llama3 and VideoLLM-online-8B-v1+ using the peft_model.merge_and_unload API from PEFT. However, after we used the merged model with our 52 videos, its output was not what we expected: when run with the demo.cli code on the cooking.mp4 file in the repo, the model does not generate the correct context.
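For reference, this is our understanding of what the merge step does arithmetically. It is a minimal sketch, assuming the adapter is a standard LoRA adapter, so each adapted weight becomes W + (lora_alpha / r) * B A. The helper names `matmul` and `merge_lora` and the toy matrices are hypothetical and only illustrate the arithmetic; this is not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a).

    Shapes: weight is out x in, lora_b is out x r, lora_a is r x in.
    """
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter (made-up numbers).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # r x in
lora_b = [[0.5], [0.0]]  # out x r
merged = merge_lora(base, lora_a, lora_b, lora_alpha=2, r=1)
```

If this picture is right, merging a second time or merging into a different base checkpoint would change the resulting weights, which is one thing we would like to rule out.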

Here is an example of the output of your model:

    {
        "role": "assistant",
        "content": "(Video Time = 3.0s) Assistant: You pick up a wooden spoon.",
        "time": 3.0,
        "fps": 5.416413173116924,
        "cost": 0.39011454582214355
    },

And here is an example of the output of the fine-tuned model:
    "conversation": [
        {
            "role": "user",
            "content": "(Video Time = 0.0s) User: Please narrate the video in real time.",
            "time": 0.0,
            "fps": 0.23624239822388332,
            "cost": 4.232940435409546
        },
        {
            "role": "assistant",
            "content": "(Video Time = 0.0s) Assistant: Sure thing! I'll start. (receiving video frame 1) Ah, it looks like we're starting with a shot of the user's face. (receiving video frame 2) Oh, nice! The user is smiling. (receiving video frame 3) Wait, what's this? The user is looking at something on their phone. (receiving video frame 4) Ah, it's a text message! The user is reading it. (receiving video frame",
            "time": 0.0,
            "fps": 0.23624239822388332,
            "cost": 4.232940435409546
        },
        {
            "time": 0.5,
            "fps": 0.46619712807148594,
            "cost": 0.05709028244018555
        },
        {
            "time": 1.0,
            "fps": 0.6906405449067414,
            "cost": 0.05376291275024414
        },
        {
            "time": 1.5,
            "fps": 0.9073323156329285,
            "cost": 0.06473445892333984
        },
        {
            "time": 2.0,
            "fps": 1.1198228832382782,
            "cost": 0.05646371841430664
        },
        {
            "time": 2.5,
            "fps": 1.3254263648771671,
            "cost": 0.06185340881347656
        },
        {
            "time": 3.0,
            "fps": 1.5249576989693647,
            "cost": 0.063446044921875
        },
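To make the difference between the two logs easier to see, something like the following could summarize a demo log. This is only a sketch, assuming the log is loaded as a list of dicts with the keys shown above. `summarize_turns` is a hypothetical helper, not part of the repo, and the sample entries are abbreviated from the output above.

```python
def summarize_turns(conversation):
    """Count spoken turns, silent frames, and suspicious assistant turns."""
    summary = {"turns": 0, "silent_frames": 0, "suspicious": []}
    for entry in conversation:
        content = entry.get("content")
        if content is None:
            # Frame-only entry: the model produced no text at this timestamp.
            summary["silent_frames"] += 1
            continue
        summary["turns"] += 1
        # An assistant turn that narrates "(receiving video frame N)" itself
        # is the symptom we see with the fine-tuned model.
        if entry.get("role") == "assistant" and "(receiving video frame" in content:
            summary["suspicious"].append(entry["time"])
    return summary


# Abbreviated sample based on the fine-tuned model's log.
sample = [
    {"role": "user", "content": "(Video Time = 0.0s) User: Please narrate the video in real time.", "time": 0.0},
    {"role": "assistant", "content": "(Video Time = 0.0s) Assistant: Sure thing! I'll start. (receiving video frame 1)", "time": 0.0},
    {"time": 0.5},
    {"time": 1.0},
]
report = summarize_turns(sample)
```

On the fine-tuned log, every frame after the first assistant turn is silent, whereas your model produces a narration at 3.0s.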

Could you suggest the right way to fine-tune your model?
