
Releases: woct0rdho/triton-windows

v3.4.0-windows.post20

31 Jul 02:22

Note again that Triton 3.4 only works with PyTorch >= 2.8.

To install Triton 3.4 and avoid breaking your installed PyTorch when a new version of Triton is released in the future, pin the version to < 3.5:

pip install -U "triton-windows<3.5"

v3.3.1-windows.post19

30 May 15:17

This is identical to triton-windows 3.3.0.post19; I only bumped the version number to match the official one.

The only difference between the official triton 3.3.0 and 3.3.1 is triton-lang#6771, which affects RTX 50xx GPUs. I've included this patch since triton-windows 3.3.0.post14.

v3.3.0-windows.post19

24 Apr 07:04
  • Fix JIT compilation using Clang

Note again that Triton 3.3 only works with PyTorch >= 2.7, and Triton 3.2 only works with PyTorch >= 2.6.

To install Triton 3.3 and avoid breaking your installed PyTorch when a new version of Triton is released in the future, pin the version to < 3.4:

pip install -U "triton-windows<3.4"

v3.2.0-windows.post18

18 Apr 02:45
  • Find MSVC and Windows SDK from environment variables set by Launch-VsDevShell.ps1 or VsDevCmd.bat, see #106
  • Print cc_cmd for debugging when compilation fails

empty

25 Mar 13:41
b689470
Pre-release

Here are some empty wheels named triton. If your build system complains that some package requires triton rather than triton-windows, you can add one of these empty wheels to it, along with triton-windows.

You may use transient-package to create such packages.

v3.2.0-windows.post17

20 Mar 12:26

Fix a race when multiple processes create __triton_launcher.pyd in parallel, see intel/intel-xpu-backend-for-triton#3270 . Now torch.compile autotune should work in general.
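The general pattern behind this kind of fix (a sketch of the technique, not necessarily Triton's exact implementation) is to have each process write the compiled artifact to its own uniquely named temp file and then atomically rename it into place, so no process ever observes a half-written file:

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Write `data` to `path` so concurrent writers never leave a partial file.

    Each process writes to its own uniquely named temp file in the same
    directory, then os.replace() atomically moves it into place. The last
    writer wins, but every reader always sees a complete file.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic rename on both POSIX and Windows
    except BaseException:
        os.unlink(tmp)  # clean up the temp file if anything failed
        raise
```

Writing to the same directory as the target matters: os.replace() is only atomic when source and destination are on the same filesystem.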

Note that ComfyUI enables cudaMalloc by default, but cudaMalloc does not work with CUDA graphs. Also, many models and nodes in ComfyUI are not compatible with CUDA graphs. You may use mode='max-autotune-no-cudagraphs' and see whether it gives a speedup.

v3.2.0-windows.post16

20 Mar 05:55

Ensure temp files are closed when calling ptxas in parallel. I still need to investigate some bugs in PyTorch to make torch.compile autotune fully work, see unslothai/unsloth#1999 .
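On Windows, a temp file that is still open usually cannot be opened again by another process, so the usual pattern when handing a temp file to a tool like ptxas is: create it with delete=False, close it before launching the subprocess, and remove it afterwards. A minimal sketch of that pattern (the child command here is a stand-in that just reads the file back, not Triton's actual ptxas invocation):

```python
import os
import subprocess
import sys
import tempfile

def run_on_temp_file(data: bytes, cmd_for_path):
    """Write `data` to a temp file, close it, then let a subprocess read it.

    delete=False plus an explicit close() is required on Windows: a file
    that is still open there cannot be opened again by the child process.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".ptx", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # close BEFORE the subprocess opens it
        return subprocess.run(cmd_for_path(tmp.name),
                              capture_output=True, check=True)
    finally:
        os.unlink(tmp.name)  # always remove the temp file

# Stand-in child: a Python process that prints the file's contents.
result = run_on_temp_file(
    b"hello",
    lambda path: [sys.executable, "-c",
                  f"print(open({path!r}).read())"],
)
```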

v3.2.0-windows.post15

16 Mar 15:35

Define Py_LIMITED_API and exclude newer Python C APIs that TinyCC cannot compile, see #92

v3.3.0-windows.post14

15 Mar 12:38
Pre-release

Fix getMMAVersionSafe for RTX 50xx (sm120), see #83 (comment)

v3.2.0-windows.post13

12 Mar 13:07

TinyCC is bundled in the wheels, so you don't need to install MSVC to use Triton. Packages that directly call triton.jit, such as SageAttention, will just work.

You still need to install a C++ compiler if you use torch.compile targeting the CPU. This can happen when you use nodes like 'CompileModel' in ComfyUI. Triton does not affect how PyTorch configures the C++ compiler in this case.
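A quick way to check whether a C++ compiler is available for torch.compile's CPU path is to look for common compiler executables on PATH. This is a hypothetical diagnostic sketch, not a helper that Triton or PyTorch provides:

```python
import shutil

def find_cpp_compiler():
    """Return the path of the first C++ compiler found on PATH, or None.

    'cl' is MSVC (typically only visible inside a Visual Studio
    developer shell); clang++ and g++ are common alternatives.
    """
    for name in ("cl", "clang++", "g++"):
        path = shutil.which(name)
        if path:
            return path
    return None

compiler = find_cpp_compiler()
print("C++ compiler:", compiler or "not found -- install MSVC Build Tools")
```

Running this inside a shell launched by Launch-VsDevShell.ps1 or VsDevCmd.bat should find cl.exe, since those scripts put MSVC on PATH.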