Thanks for open-sourcing this excellent work!
I noticed that HunyuanVideo handles prompt padding and the attention mask of the MMDiT's 3D full attention differently from some other implementations (e.g., CogVideo and Flux). In the HunyuanVideo pipeline:
def _get_llama_prompt_embeds(self, prompt, ...):
    text_inputs = self.tokenizer(
        prompt,
        max_length=max_sequence_length,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,  # explicitly request the padding mask
        return_tensors="pt",
    )
    text_input_ids = text_inputs.input_ids
    prompt_attention_mask = text_inputs.attention_mask

    prompt_embeds = self.text_encoder(
        input_ids=text_input_ids,
        attention_mask=prompt_attention_mask,  # pass the mask to the encoder
        output_hidden_states=True,
    )
    # Return both the embeddings and the padding mask.
    return prompt_embeds, prompt_attention_mask
So HunyuanVideo's LLaMA text encoder path returns prompt_attention_mask, and the attention processor then uses it:
hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask,  # attention_mask is not None here
    dropout_p=0.0, is_causal=False,
)
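For context on how that mask reaches the 3D full attention: the per-prompt padding mask has to be expanded over the concatenated video + text token sequence before it can serve as attn_mask. Below is a minimal sketch of that expansion; the shapes and variable names are my own assumptions for illustration, not the exact diffusers code.

import torch
import torch.nn.functional as F

# Toy shapes for illustration only.
batch, heads, head_dim = 2, 8, 64
num_video_tokens, num_text_tokens = 1024, 256

# prompt_attention_mask: (batch, num_text_tokens); True = real token, False = padding.
prompt_attention_mask = torch.ones(batch, num_text_tokens, dtype=torch.bool)
prompt_attention_mask[:, 200:] = False  # pretend the tail of the prompt is padding

# Video/latent tokens are never padded, so their mask is all True.
video_mask = torch.ones(batch, num_video_tokens, dtype=torch.bool)

# Joint key-padding mask over the concatenated [video, text] sequence,
# broadcast to the (batch, heads, query_len, key_len) shape SDPA accepts.
joint_mask = torch.cat([video_mask, prompt_attention_mask], dim=1)
attention_mask = joint_mask[:, None, None, :]  # (batch, 1, 1, seq_len)

seq_len = num_video_tokens + num_text_tokens
query = torch.randn(batch, heads, seq_len, head_dim)
key = torch.randn(batch, heads, seq_len, head_dim)
value = torch.randn(batch, heads, seq_len, head_dim)

hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
)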
But in CogVideo and Flux (both MMDiT architectures):
if prompt_embeds is None:
    prompt_embeds = self._get_t5_prompt_embeds(
        prompt=prompt,
        num_videos_per_prompt=num_videos_per_prompt,
        max_sequence_length=max_sequence_length,
        device=device,
        dtype=dtype,
    )
the T5 encoder path does not return the prompt's padding mask, and the attention call receives no mask:
hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask,  # attention_mask is None here
    dropout_p=0.0, is_causal=False,
)
I'm curious about what led to this difference:
- Is it to stay consistent with how each model was trained?
- Or did the Hunyuan team run experiments showing that passing the padding mask improves performance?