[Reproduction Issue] Low Accuracy on MSVD/MSRVTT, No Results on NextQA, and Runtime Error on EgoSchema #5

@xiongyuaay

Description

Hi @authors,

First of all, thank you for open-sourcing this impressive work! I've tried to reproduce the results from your paper, but I encountered the following issues:

Environment

  • Python: 3.10.14
  • PyTorch: 2.2.0
  • CUDA: 12.1
  • Code version: main branch (commit bea0f73)

Issue 1: Low Accuracy on MSVD and MSRVTT Datasets

  • Expected (accuracy / GPT score):
    • MSVD: 79.1 / 4.1 (Table 1 in the paper)
    • MSRVTT: 65.8 / 3.6
  • Actual:
    • MSVD: 61.6 / 3.4
    • MSRVTT: 46.0 / 2.83
  • Config Used: cfgs/slowfast_llava_7b-resize-slow_10frms_spatial_1d_max_pool_fast_4x4-50_frms.yaml (same file as in the commands below)
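
For reference, I'm reading each result pair as accuracy / average GPT score from the Video-ChatGPT-style evaluation. Below is a minimal sketch of how I aggregate my per-sample judgment files to get these two numbers; the results.json layout and key names are my own assumptions, not necessarily the repo's evaluation output format:

import json

# Hypothetical per-sample judgment file; each entry is assumed to look
# like {"pred": "yes"|"no", "score": 1-5}. The repo's actual evaluation
# output may be structured differently.
with open("results.json") as f:
    judgments = json.load(f)

correct = sum(1 for j in judgments if j["pred"].lower() == "yes")
accuracy = 100.0 * correct / len(judgments)                       # first number, e.g. 61.6
mean_score = sum(j["score"] for j in judgments) / len(judgments)  # second number, e.g. 3.4
print(f"Accuracy: {accuracy:.1f}  Score: {mean_score:.2f}")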

Issue 2: No Results on NextQA Dataset

Steps:

python run_inference.py --exp_config ./cfgs/slowfast_llava_7b-resize-slow_10frms_spatial_1d_max_pool_fast_4x4-50_frms.yaml

Output:

:::: Start Inference ::::
evaluating nextqa ...
Loading checkpoint shards: 100%|██████████| 3/3 [01:28<00:00, 29.42s/it]
100%|██████████| 4996/4996 [00:00<00:00, 16708.75it/s]
0it [00:00, ?it/s]

Logs:

[2025-03-06 23:34:10,273] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
...

The checkpoint loads and all 4996 annotation entries are iterated almost instantly, but the inference loop itself runs for zero iterations (0it), so no predictions are produced.
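
To rule out a data-path mismatch on my side (which would explain every sample being filtered out before the inference loop starts), I ran a quick check that each annotated video resolves to a file on disk. The paths and key names below are placeholders for my local setup, not the repo's canonical layout:

import json
from pathlib import Path

# Placeholder paths and key names for my local setup; adjust to the
# actual NextQA annotation format used by the repo.
anno_path = Path("playground/data/nextqa/val.json")
video_dir = Path("playground/data/nextqa/videos")

with open(anno_path) as f:
    samples = json.load(f)

missing = [s for s in samples if not (video_dir / f"{s['video_id']}.mp4").exists()]
print(f"{len(missing)} / {len(samples)} annotated samples have no matching video file")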

Issue 3: Runtime Error on EgoSchema Dataset

Steps:

python run_inference.py --exp_config ./cfgs/slowfast_llava_7b-resize-slow_10frms_spatial_1d_max_pool_fast_4x4-50_frms.yaml

Error:

evaluating egoschema ...
Loading checkpoint shards: 100%|██████████| 3/3 [01:35<00:00, 31.73s/it]
  0%|          | 0/500 [00:00<?, ?it/s]
The `seq_len` argument is deprecated and unused. It will be removed in v4.39.
  0%|          | 2/500 [00:27<1:53:40, 13.70s/it]
Traceback (most recent call last):
  File "/root/private_data/ml-slowfast-llava/run_inference_multiple_choice_qa.py", line 182, in <module>
    run_inference(args)
  File "/root/private_data/ml-slowfast-llava/run_inference_multiple_choice_qa.py", line 133, in run_inference
    output = llava_inference(
  File "/root/private_data/ml-slowfast-llava/run_inference_multiple_choice_qa.py", line 54, in llava_inference
    output_ids = model.generate(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/private_data/ml-slowfast-llava/slowfast_llava/llava/model/language_model/llava_llama.py", line 138, in generate
    return super().generate(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2404, in greedy_search
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/private_data/ml-slowfast-llava/slowfast_llava/llava/model/language_model/llava_llama.py", line 91, in forward
    return super().forward(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1176, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 993, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1079, in _update_causal_mask
    padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
RuntimeError: The size of tensor a (4096) must match the size of tensor b (4097) at non-singleton dimension 3
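
From the shapes in the error (4096 vs 4097), this looks like the sequence hitting LLaMA's 4096-token context window mid-generation: with 50 fast frames the expanded visual tokens take up most of the budget, and the causal-mask update fails on the first decoding step past the limit. As a local workaround inside llava_inference (run_inference_multiple_choice_qa.py, which the traceback names), I tried bounding max_new_tokens so the sequence never crosses 4096. This is my own sketch, not the authors' fix, and the visual-token estimate is a placeholder:

# Workaround sketch (my assumption, not the authors' fix): keep
# prompt + generated tokens within the model's context window.
# `model` and `input_ids` are the locals already present in
# llava_inference before the generate() call.
context_limit = getattr(model.config, "max_position_embeddings", 4096)

# LLaVA expands the video placeholder into many visual tokens inside
# the model, so the effective prompt is much longer than input_ids;
# this count is a placeholder that depends on the slow/fast pooling.
est_visual_tokens = 3600
effective_prompt_len = input_ids.shape[1] + est_visual_tokens

budget = max(context_limit - effective_prompt_len - 1, 1)
output_ids = model.generate(
    input_ids,
    do_sample=False,                  # greedy, matching the traceback
    max_new_tokens=min(128, budget),  # never step past position 4096
)

With this cap the crash goes away for me, but answers for long prompts get truncated, which suggests the real issue is the input length rather than the generation settings.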

Questions

  • Are there any dataset-specific hyperparameters not mentioned in the repo?
  • Is any additional data preprocessing required for the NextQA or EgoSchema datasets?

Looking forward to your feedback!
