Thanks for open-sourcing this excellent work!
I noticed that HunyuanVideo handles prompt padding and the attention mask of the MMDiT's 3D full attention differently from some other implementations (e.g., CogVideo and Flux). In the HunyuanVideo pipeline:
def _get_llama_prompt_embeds(self, prompt, ...):
    text_inputs = self.tokenizer(
        prompt,
        max_length=max_sequence_length,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,  # explicitly request the padding mask
        return_tensors="pt",
    )
    text_input_ids = text_inputs.input_ids
    prompt_attention_mask = text_inputs.attention_mask

    prompt_embeds = self.text_encoder(
        input_ids=text_input_ids,
        attention_mask=prompt_attention_mask,  # pass the mask to the encoder
        output_hidden_states=True,
    )
    # Return both the embeddings and the padding mask.
    return prompt_embeds, prompt_attention_mask
So HunyuanVideo's LLaMA text encoder path returns prompt_attention_mask, and the attention processor then uses it:
hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask,  # attention_mask is not None here
    dropout_p=0.0, is_causal=False,
)
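For context on how that mask reaches the 3D full attention: the per-prompt padding mask has to be expanded over the concatenated video + text token sequence before it can serve as attn_mask. Below is a minimal sketch of that expansion; the shapes and variable names are my own assumptions for illustration, not the exact diffusers code.

import torch
import torch.nn.functional as F

# Toy shapes for illustration only.
batch, heads, head_dim = 2, 8, 64
num_video_tokens, num_text_tokens = 1024, 256

# prompt_attention_mask: (batch, num_text_tokens); True = real token, False = padding.
prompt_attention_mask = torch.ones(batch, num_text_tokens, dtype=torch.bool)
prompt_attention_mask[:, 200:] = False  # pretend the tail of the prompt is padding

# Video/latent tokens are never padded, so their mask is all True.
video_mask = torch.ones(batch, num_video_tokens, dtype=torch.bool)

# Joint key-padding mask over the concatenated [video, text] sequence,
# broadcast to the (batch, heads, query_len, key_len) shape SDPA accepts.
joint_mask = torch.cat([video_mask, prompt_attention_mask], dim=1)
attention_mask = joint_mask[:, None, None, :]  # (batch, 1, 1, seq_len)

seq_len = num_video_tokens + num_text_tokens
query = torch.randn(batch, heads, seq_len, head_dim)
key = torch.randn(batch, heads, seq_len, head_dim)
value = torch.randn(batch, heads, seq_len, head_dim)

hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
)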
But in CogVideo and Flux (both MMDiT architectures):
if prompt_embeds is None:
    prompt_embeds = self._get_t5_prompt_embeds(
        prompt=prompt,
        num_videos_per_prompt=num_videos_per_prompt,
        max_sequence_length=max_sequence_length,
        device=device,
        dtype=dtype,
    )
the T5 encoder path does not return the prompt's padding mask, and the attention call receives no mask:
hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask,  # attention_mask is None here
    dropout_p=0.0, is_causal=False,
)
I'm curious about what led to this difference:
- Is it to stay consistent with how each model was trained?
- Or did the Hunyuan team run experiments showing that passing the padding mask improves performance?