@rockerBOO (Contributor) commented Oct 9, 2025

Add support for CDC-FM, a geometry-aware noise generation method that improves diffusion model training by adapting noise to the local geometry of the latent space. CDC-FM replaces standard Gaussian noise with geometry-informed noise that better preserves the structure of the data manifold.

Note: Only implemented for Flux network training so far. This can be expanded to other flow matching models such as SD3 and Lumina Image 2.

Deep generative models often face a fundamental tradeoff: high sample quality can come at the cost of memorisation, where the model reproduces training data rather than generalising across the underlying data geometry. We introduce Carré du champ flow matching (CDC-FM), a generalisation of flow matching (FM), that improves the quality-generalisation tradeoff by regularising the probability path with a geometry-aware noise. Our method replaces the homogeneous, isotropic noise in FM with a spatially varying, anisotropic Gaussian noise whose covariance captures the local geometry of the latent data manifold. We prove that this geometric noise can be optimally estimated from the data and is scalable to large data. Further, we provide an extensive experimental evaluation on diverse datasets (synthetic manifolds, point clouds, single-cell genomics, animal motion capture, and images) as well as various neural network architectures (MLPs, CNNs, and transformers). We demonstrate that CDC-FM consistently offers a better quality-generalisation tradeoff. We observe significant improvements over standard FM in data-scarce regimes and in highly non-uniformly sampled datasets, which are often encountered in AI for science applications. Our work provides a mathematical framework for studying the interplay between data geometry, generalisation and memorisation in generative models, as well as a robust and scalable algorithm that can be readily integrated into existing flow matching pipelines.

https://arxiv.org/abs/2510.05930
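
As a minimal sketch of the core idea (illustrative only, not this PR's implementation): standard FM perturbs with isotropic white noise z ~ N(0, I), while CDC-FM transforms that noise by a factor L(x) of a locally estimated covariance Sigma(x) = L(x) L(x)^T:

  import torch

  d = 16                         # latent dimensionality (illustrative)
  L = torch.randn(d, d) * 0.1    # stand-in for a geometry-derived factor of Sigma(x)
  z = torch.randn(d)             # white noise, z ~ N(0, I)
  eps = L @ z                    # anisotropic noise with covariance L @ L.T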

[Screenshots: figures from "Carré du champ flow matching: better quality-generalisation tradeoff in generative models" (arXiv 2510.05930v1)]

Note: Written with AI, but I guided how it was implemented.

Recommended Configurations:

Single Resolution (e.g., all 512×512):

  --use_cdc_fm \
  --cdc_k_neighbors 256 \
  --cdc_k_bandwidth 8 \
  --cdc_d_cdc 8 \
  --cdc_gamma 1.0

Multi-Resolution with Bucketing (FLUX/SDXL):

  --use_cdc_fm \
  --cdc_k_neighbors 256 \
  --cdc_adaptive_k \
  --cdc_min_bucket_size 16 \
  --cdc_k_bandwidth 8 \
  --cdc_d_cdc 8 \
  --cdc_gamma 0.5

Small Dataset (<1000 images):

  --use_cdc_fm \
  --cdc_k_neighbors 128 \
  --cdc_adaptive_k \
  --cdc_min_bucket_size 8 \
  --cdc_k_bandwidth 8 \
  --cdc_d_cdc 8 \
  --cdc_gamma 1.5

Parameter Guide:

--cdc_k_neighbors

  • Recommended: 256 (based on paper's CIFAR-10 experiments)
  • Small datasets (<1000): 128
  • Medium datasets (1000-10k): 256
  • Large datasets (>10k): 256-512
  • Rule: k = min(256, dataset_size / 4)
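
That rule as code (a heuristic restated from this guide, not a flag or API in the PR):

  def pick_k_neighbors(dataset_size: int) -> int:
      # cap k at 256, and at a quarter of the dataset size for small datasets
      return min(256, dataset_size // 4)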

--cdc_adaptive_k

  • Recommended: Enable for multi-resolution/bucketed training
  • Without flag (default): Strict paper methodology - skips buckets with < k_neighbors samples
  • With flag: Pragmatic approach - uses k = min(k_neighbors, bucket_size - 1) for buckets ≥ min_bucket_size (see the sketch after this list)
  • When to use:
    • Multi-resolution training (FLUX with various aspect ratios)
    • Training with bucketing enabled
    • Datasets where resolution distribution varies widely
  • When not to use:
    • Single resolution datasets (all images same size)
    • When you want strict adherence to paper's methodology
    • Academic/research settings requiring exact paper reproduction
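
A minimal sketch of the selection logic described above (function and parameter names hypothetical):

  def effective_k(bucket_size, k_neighbors=256, adaptive_k=False, min_bucket_size=16):
      """Return the k to use for a bucket, or None for the Gaussian fallback."""
      if adaptive_k:
          if bucket_size < min_bucket_size:
              return None                           # too small: Gaussian fallback
          return min(k_neighbors, bucket_size - 1)  # shrink k to fit the bucket
      # strict paper methodology: skip buckets with < k_neighbors samples
      return k_neighbors if bucket_size >= k_neighbors else None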

--cdc_min_bucket_size

  • Recommended: 16 (default)
  • Only relevant when --cdc_adaptive_k is enabled
  • Buckets below this threshold use Gaussian fallback (no CDC)
  • Range: 8-32 depending on dataset
  • Lower values (8-12): More buckets get CDC, but less stable for very small buckets
  • Higher values (24-32): More conservative, only well-populated buckets get CDC

--cdc_k_bandwidth

  • Recommended: 8 (paper uses this consistently)
  • Don't change unless you have specific reasons
  • This determines variable-bandwidth Gaussian kernels

--cdc_gamma

  • Small datasets (<1000): 1.0-2.0 (stronger regularization)
  • Medium datasets (1000-5000): 0.8-1.0
  • Large datasets (>5000): 0.5-0.8
  • Paper showed γ=2.0 optimal for 250 samples, γ=0.5-1.0 for 2000-5000 samples
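
The guidance above folded into one heuristic (illustrative only; the values are midpoints of the ranges listed):

  def pick_gamma(dataset_size: int) -> float:
      if dataset_size < 1000:
          return 1.5    # stronger geometric regularisation for scarce data
      if dataset_size <= 5000:
          return 0.9
      return 0.6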

--cdc_d_cdc

  • Recommended: 8-16 for high-dimensional image data
  • Paper tested 2, 4, 8, 16 - found trade-off between quality and generalization
  • Higher values capture more geometric structure but may include noise

Implements geometry-aware noise generation for FLUX training based on
arXiv:2510.05930v1.

- Cache all shapes during GammaBDataset initialization
- Eliminates file I/O on every training step (9.5M accesses/sec)
- Reduces get_shape() from file operation to dict lookup
- Memory overhead: ~126 bytes/sample (~12.6 MB per 100k images)
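
A sketch of the shape caching described above (load_latent is a placeholder for the actual file read; the real class does more):

  class GammaBDataset:
      def __init__(self, sample_paths):
          # read each latent file once at init; get_shape() becomes a dict lookup
          self._shape_cache = {key: tuple(load_latent(path).shape)
                               for key, path in sample_paths.items()}

      def get_shape(self, image_key):
          return self._shape_cache[image_key]  # no file I/O on the hot path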
- Create apply_cdc_noise_transformation() for better modularity
- Implement fast path for batch processing when all shapes match
- Implement slow path for per-sample processing on shape mismatch
- Clone noise tensors in fallback path for gradient consistency
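
The fast/slow dispatch, sketched (hypothetical names, not the PR's exact code):

  import torch

  def apply_cdc(noise_list, factor_list):
      # fast path: every sample shares one shape -> a single batched matmul
      if len({tuple(n.shape) for n in noise_list}) == 1:
          noise = torch.stack(noise_list)        # (B, D)
          factors = torch.stack(factor_list)     # (B, D, D)
          return torch.bmm(factors, noise.unsqueeze(-1)).squeeze(-1)
      # slow path: shapes differ, so transform each sample individually
      return [f @ n for f, n in zip(factor_list, noise_list)]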
- Remove @torch.no_grad() decorator from compute_sigma_t_x()
- Gradients now properly flow through CDC transformation during training
- Add comprehensive gradient flow tests for fast/slow paths and fallback
- All 25 CDC tests passing

- Track warned samples in global set to prevent log spam
- Each sample only warned once per training session
- Prevents thousands of duplicate warnings during training
- Add tests to verify throttling behavior
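
The throttling pattern, sketched (names hypothetical):

  import logging

  logger = logging.getLogger(__name__)
  _warned_samples = set()  # global for the whole training session

  def warn_once(image_key, message):
      if image_key not in _warned_samples:
          _warned_samples.add(image_key)  # each sample warns at most once
          logger.warning(message)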
- Check that noise and CDC matrices are on same device
- Automatically transfer noise if device mismatch detected
- Warn user when device transfer occurs
- Add tests to verify device handling

- Treat cuda and cuda:0 as compatible devices
- Only warn on actual device mismatches (cuda vs cpu)
- Eliminates warning spam during multi-subset training
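
A sketch of the relaxed device check (illustrative): "cuda" and "cuda:0" resolve to the same device, so compare device type plus resolved index rather than raw strings:

  import torch

  def same_device(a: torch.device, b: torch.device) -> bool:
      if a.type != b.type:
          return False                          # e.g. cuda vs cpu: a real mismatch
      return (a.index or 0) == (b.index or 0)   # treats cuda and cuda:0 as equal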
Fixes a shape-mismatch bug in multi-subset training where CDC preprocessing and training used different index calculations, causing the wrong CDC data to be loaded for samples.

Changes:
- CDC cache now stores/loads data using image_key strings instead of integer indices
- Training passes image_key list instead of computed integer indices
- All CDC lookups use stable image_key identifiers
- Improved device compatibility check (handles "cuda" vs "cuda:0")
- Updated all 30 CDC tests to use image_key-based access

Root cause: Preprocessing used cumulative dataset indices while training
used sorted keys, resulting in mismatched lookups during shuffled multi-subset
training.
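
A minimal sketch of the stable-key fix (hypothetical helper names): key the cache by the image_key string on both sides, so shuffled multi-subset ordering cannot desynchronise them:

  cdc_cache = {}

  def store_cdc(image_key, gamma_b):
      cdc_cache[image_key] = gamma_b       # preprocessing side

  def lookup_cdc(image_key):
      return cdc_cache.get(image_key)      # training side: same stable key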
- Add --cdc_debug flag to enable verbose bucket-by-bucket output
- When debug=False (default): Show tqdm progress bar, concise logging
- When debug=True: Show detailed bucket information, no progress bar
- Improves user experience during CDC cache generation

@kohya-ss (Owner) commented Oct 9, 2025

Thank you for this! This seems to be effective when the dataset is limited, so it looks very good.

I plan to merge the sd3 branch into main soon, so I'd like to merge this (and a few other PRs) before then.

- Add ss_use_cdc_fm, ss_cdc_k_neighbors, ss_cdc_k_bandwidth, ss_cdc_d_cdc, ss_cdc_gamma
- Ensures CDC-FM training parameters are tracked in model metadata
- Enables reproducibility and model provenance tracking

- Add --cdc_adaptive_k flag to enable adaptive k based on bucket size
- Add --cdc_min_bucket_size to set minimum bucket threshold (default: 16)
- Fixed mode (default): Skip buckets with < k_neighbors samples
- Adaptive mode: Use k=min(k_neighbors, bucket_size-1) for buckets >= min_bucket_size
- Update CDCPreprocessor to support adaptive k per bucket
- Add metadata tracking for adaptive_k and min_bucket_size
- Add comprehensive pytest tests for adaptive k behavior

This allows CDC-FM to work effectively with multi-resolution bucketing, where bucket sizes may vary widely. Users can choose between the strict paper methodology (fixed k) and the pragmatic approach (adaptive k).

@rockerBOO (Contributor, Author) commented

The issue right now is that we cache the neighbors into a file but save it into output_dir, which means each run creates a new file. We could:

  • Only save this in memory and not to a file.
  • Allow users to set the cache file location.

I'd usually set it with the dataset, but when multiple subsets are configured there isn't one single place for it.

- Make FAISS import optional with try/except
- CDCPreprocessor raises helpful ImportError if FAISS unavailable
- train_util.py catches ImportError and returns None
- train_network.py checks for None and warns user
- Training continues without CDC-FM if FAISS not installed
- Remove benchmark file (not needed in repo)

This allows users to run training without the FAISS dependency. CDC-FM will be automatically disabled with a warning if FAISS is missing.
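
A sketch of the optional-dependency pattern described above (illustrative; CDCPreprocessor is the class named in this PR, the rest is schematic):

  import logging

  logger = logging.getLogger(__name__)

  try:
      import faiss  # fast k-NN search over latents
  except ImportError:
      faiss = None

  class CDCPreprocessor:  # stub standing in for the PR's class
      def __init__(self, args): ...

  def build_cdc_preprocessor(args):
      if faiss is None:
          logger.warning("FAISS not installed; CDC-FM disabled, training continues.")
          return None
      return CDCPreprocessor(args)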
@FurkanGozukara commented

@rockerBOO I plan to test this.

Is this only for Flux LoRA?

- Add explicit warning and tracking for multiple unique latent shapes
- Simplify test imports by removing unused modules
- Minor formatting improvements in print statements
- Ensure log messages provide clear context about dimension mismatches

- Merged redundant test files
- Removed 'comprehensive' from file and docstring names
- Improved test organization and clarity
- Ensured all tests continue to pass
- Simplified test documentation

@rockerBOO (Contributor, Author) commented

> @rockerBOO I plan to test this.
>
> Is this only for Flux LoRA?

Yes, only Flux LoRA for the moment.
