HunyuanImage-2.1 #1044
base: diffsynth-2.0
Conversation
Summary of Changes

Hello @yjy415, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces the full HunyuanImage 2.1 model architecture and its associated inference pipeline. The changes encompass a novel Diffusion Transformer (DiT) for image generation, a Variational Autoencoder (VAE) for efficient latent-space manipulation, and a dual text-encoding system that uses both Qwen and ByT5 models to improve prompt understanding, including support for glyphs and OCR-related features. The new pipeline provides a streamlined, modular approach to text-to-image synthesis, complete with custom scheduling and adaptive guidance mechanisms.
Code Review
This pull request introduces support for the HunyuanImage-2.1 model, adding new model implementations for the Diffusion Transformer (DiT), VAE, and text encoders (Qwen and T5), along with a new inference pipeline and corresponding state dictionary converters. The changes are extensive and well-structured. My review identified a few critical issues that could lead to runtime errors, such as improper handling of optional parameters and incorrect attribute access in the model implementations. I have also provided suggestions to improve code clarity, maintainability, and correctness by removing dead code, simplifying redundant logic, and correcting type hints.
diffsynth/models/hunyuanimage_dit.py
Outdated
```python
qkv_stacked = torch.stack([q, k, v], dim=2)
attn = flash_attn_no_pad(
    qkv_stacked,
    key_padding_mask=attn_mask.bool(),
```
The call attn_mask.bool() will raise an AttributeError if attn_mask is None. This can happen if use_attention_mask is set to False in HYImageDiffusionTransformer, as the None value for the mask is passed down to this function.
You should handle the case where attn_mask is None by providing a default mask. Also, the .bool() call is redundant because the mask is already converted to a boolean tensor in IndividualTokenRefiner.forward.
Suggested change:

```diff
- key_padding_mask=attn_mask.bool(),
+ key_padding_mask=attn_mask if attn_mask is not None else torch.ones(x.shape[:2], dtype=torch.bool, device=x.device),
```
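The pattern behind the suggestion can be shown with a pure-Python sketch (the helper name and list-based "mask" are illustrative, not the real model code): default to an all-valid mask when none is supplied, rather than calling a method on a possible `None`.

```python
def resolve_key_padding_mask(attn_mask, batch_size, seq_len):
    """Toy analogue of the suggested fix: fall back to an all-True mask
    when no attention mask is provided, instead of dereferencing None."""
    if attn_mask is None:
        # default: every token is valid, mirroring torch.ones(..., dtype=torch.bool)
        return [[True] * seq_len for _ in range(batch_size)]
    # an existing mask is coerced to booleans, mirroring .bool()
    return [[bool(v) for v in row] for row in attn_mask]

print(resolve_key_padding_mask(None, 2, 3))   # all-True default mask
print(resolve_key_padding_mask([[1, 0]], 1, 2))
```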
diffsynth/models/hunyuanimage_vae.py
Outdated
```python
return_dict : bool
    Whether to return a dict.
"""
posterior = self.encode(sample).latent_dist
```
The encode method returns a DiagonalGaussianDistribution object directly, which does not have a latent_dist attribute. Accessing .latent_dist will cause an AttributeError. You should use the returned posterior object directly.
Suggested change:

```diff
- posterior = self.encode(sample).latent_dist
+ posterior = self.encode(sample)
```
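A minimal stand-in illustrates the bug class: when `encode` already returns the distribution object, chaining `.latent_dist` raises `AttributeError`. The class below is a toy analogue of a diffusers-style `DiagonalGaussianDistribution` (an assumption for illustration only, not the PR's actual VAE code).

```python
import math
import random

class DiagonalGaussian:
    """Toy stand-in for a DiagonalGaussianDistribution (illustrative)."""
    def __init__(self, mean, logvar):
        self.mean, self.logvar = mean, logvar
    def sample(self):
        # reparameterized sample: mean + std * eps
        return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
                for m, lv in zip(self.mean, self.logvar)]

def encode(sample):
    # like the reviewed encode(): the distribution itself is returned,
    # so there is no .latent_dist attribute to unwrap
    return DiagonalGaussian(mean=[0.0, 1.0], logvar=[0.0, 0.0])

posterior = encode(sample=None)   # correct: use the returned object directly
z = posterior.sample()
print(len(z))
```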
diffsynth/models/hunyuanimage_dit.py
Outdated
```python
    use_gradient_checkpointing_offload: bool = False,
    *,
    timesteps_r: Optional[torch.LongTensor] = None,
) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
```
The return type hint for this function is Union[torch.Tensor, Dict[str, torch.Tensor]], but the function actually returns a tuple (img, features_list, shape). This inconsistency can be misleading. Please update the type hint to match the actual return type for better code clarity and correctness.
Suggested change:

```diff
- ) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], tuple]:
```
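As a small sketch of the point (all names here are illustrative, not the real model code), the hint should name the tuple the function actually builds, so tools like mypy and IDE autocompletion reflect reality:

```python
from typing import Optional, Tuple

def forward(img: list) -> Tuple[list, Optional[list], tuple]:
    """Signature whose hint matches the actual (img, features_list, shape) return."""
    features_list = None   # features are only collected when requested
    shape = (len(img),)
    return img, features_list, shape

out = forward([1, 2, 3])
print(type(out) is tuple, len(out))
```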
diffsynth/models/hunyuanimage_dit.py
Outdated
```python
else:
    raise ValueError(f"Unsupported input_shape: {input_shape}")

assert not return_dict, "return_dict is not supported."
```
Using assert for control flow, especially for parameter validation, is not recommended. Asserts can be disabled globally (e.g., with Python's -O flag), which would cause this check to be skipped. It's better to raise a NotImplementedError to make it clear that this functionality is not supported.
Suggested change:

```diff
- assert not return_dict, "return_dict is not supported."
+ if return_dict: raise NotImplementedError("return_dict is not supported.")
```
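The `-O` behavior is easy to demonstrate: running the same validation under `python -O` strips the `assert`, so the unsupported path silently succeeds, while an explicit `raise` would always fire. A self-contained check (using a subprocess to compare both modes):

```python
import subprocess
import sys

# The assert-based validation from the review, run with and without -O.
snippet = (
    "def f(return_dict):\n"
    "    assert not return_dict, 'return_dict is not supported.'\n"
    "    return 'ok'\n"
    "print(f(True))\n"
)

normal = subprocess.run([sys.executable, "-c", snippet],
                        capture_output=True, text=True)
optimized = subprocess.run([sys.executable, "-O", "-c", snippet],
                           capture_output=True, text=True)

print(normal.returncode)       # nonzero: AssertionError fired
print(optimized.stdout)        # prints 'ok': the check was silently skipped
```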
diffsynth/pipelines/hunyuanimage.py
Outdated
```python
noise_pred_text_true = noise_pred_text[true_idx] if len(true_idx) > 0 else \
    torch.empty((0, noise_pred_text.size(1)), dtype=noise_pred_text.dtype, device=noise_pred_text.device)
noise_pred_text_false = noise_pred_text[false_idx] if len(false_idx) > 0 else \
    torch.empty((0, noise_pred_text.size(1)), dtype=noise_pred_text.dtype, device=noise_pred_text.device)

noise_pred_uncond_true = noise_pred_uncond[true_idx] if len(true_idx) > 0 else \
    torch.empty((0, noise_pred_uncond.size(1)), dtype=noise_pred_uncond.dtype, device=noise_pred_uncond.device)
noise_pred_uncond_false = noise_pred_uncond[false_idx] if len(false_idx) > 0 else \
    torch.empty((0, noise_pred_uncond.size(1)), dtype=noise_pred_uncond.dtype, device=noise_pred_uncond.device)
```
The conditional logic to handle empty tensors is redundant. Indexing a tensor with an empty index tensor (like true_idx or false_idx when they are empty) will correctly return an empty tensor of the appropriate shape. You can simplify this code by removing the if len(...) > 0 checks and the explicit creation of empty tensors.
Suggested change:

```diff
- noise_pred_text_true = noise_pred_text[true_idx] if len(true_idx) > 0 else \
-     torch.empty((0, noise_pred_text.size(1)), dtype=noise_pred_text.dtype, device=noise_pred_text.device)
- noise_pred_text_false = noise_pred_text[false_idx] if len(false_idx) > 0 else \
-     torch.empty((0, noise_pred_text.size(1)), dtype=noise_pred_text.dtype, device=noise_pred_text.device)
- noise_pred_uncond_true = noise_pred_uncond[true_idx] if len(true_idx) > 0 else \
-     torch.empty((0, noise_pred_uncond.size(1)), dtype=noise_pred_uncond.dtype, device=noise_pred_uncond.device)
- noise_pred_uncond_false = noise_pred_uncond[false_idx] if len(false_idx) > 0 else \
-     torch.empty((0, noise_pred_uncond.size(1)), dtype=noise_pred_uncond.dtype, device=noise_pred_uncond.device)
+ noise_pred_text_true = noise_pred_text[true_idx]
+ noise_pred_text_false = noise_pred_text[false_idx]
+ noise_pred_uncond_true = noise_pred_uncond[true_idx]
+ noise_pred_uncond_false = noise_pred_uncond[false_idx]
```
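The principle the suggestion relies on is that fancy indexing with an empty index yields an empty result of the right shape, so no special-casing is needed. A pure-Python analogue (the `take` helper is illustrative; torch and NumPy fancy indexing behave the same way for empty integer indices):

```python
def take(rows, idx):
    """Toy analogue of tensor[idx] fancy indexing: an empty index list
    simply yields an empty selection."""
    return [rows[i] for i in idx]

noise_pred_text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
true_idx, false_idx = [0, 2], []

print(take(noise_pred_text, true_idx))    # [[0.1, 0.2], [0.5, 0.6]]
print(take(noise_pred_text, false_idx))   # []: empty in, empty out
```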
```python
hidden_size_map = {}
for k in state_dict:
    if k.startswith("single_blocks") and "linear2" in k:
        hidden_size_map[k.replace("linear2", "linear1")] = state_dict[k].shape[0]
```
Force-pushed from 9aa5c16 to 7b756a5 (compare)
No description provided.