forked from pytorch/pytorch
Xpu stage #18
Conversation
Instead of every shader defining it separately, move it to `c10/metal/common.h` Pull Request resolved: pytorch#157751 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: pytorch#157746
For pytorch#157018: doesn't totally fix the problem, but should help a lot. Pull Request resolved: pytorch#157745 Approved by: https://github.com/Chillee
…57692) Pull Request resolved: pytorch#157692 Approved by: https://github.com/ezyang
…pytorch#156032) See also: https://github.com/pytorch/pytorch/blob/54976bca103fcf2b5037cc0cd1b37c4639fcf779/.gitattributes#L1 Pull Request resolved: pytorch#156032 Approved by: https://github.com/seemethere, https://github.com/ezyang
# Motivation
pytorch#155451 decoupled `torch._C._storage_Use_Count` from CUDA and introduced a corresponding unit test: https://github.com/pytorch/pytorch/blob/815545f2dd6ade563cb1263f8bb7813f355edb2e/test/test_torch.py#L257-L262 However, this test fails when PyTorch is built with debug assertions enabled. @clee2000 disabled this UT in pytorch#156731. The root cause is that `_cdata` is obtained from an `intrusive_ptr`, not a `weak_intrusive_ptr`. As a result, calling `c10::weak_intrusive_ptr::use_count` on it triggers the internal assertion: https://github.com/pytorch/pytorch/blob/815545f2dd6ade563cb1263f8bb7813f355edb2e/c10/util/intrusive_ptr.h#L912-L917 For example:

```python
a = torch.randn(10, device=device)  # refcount=1, weakcount=1
prev_cf = torch._C._storage_Use_Count(a.untyped_storage()._cdata)  # violates the assertion
```

This violates the expected invariant inside `weak_intrusive_ptr::use_count`, which assumes the pointer was originally constructed from a valid `weak_intrusive_ptr`. In fact, `storage_impl` is obtained from an `intrusive_ptr`: https://github.com/pytorch/pytorch/blob/815545f2dd6ade563cb1263f8bb7813f355edb2e/torch/csrc/Module.cpp#L2105-L2109

# Solution
Use `c10::intrusive_ptr::use_count` instead. Pull Request resolved: pytorch#157694 Approved by: https://github.com/albanD
Target determination sorts the tests in a PR CI run based on heuristics about which tests are most relevant to the PR's changes. This can provide faster CI signal and help alleviate capacity concerns, since job durations should decrease by catching failures earlier. Pull Request resolved: pytorch#156545 Approved by: https://github.com/jeffdaily, https://github.com/clee2000
(not really fix these issues, but we should be able to close them. This also allows CI from the PR to test them) Fixes pytorch#156579 Fixes pytorch#156580 Fixes pytorch#126867 Pull Request resolved: pytorch#157756 Approved by: https://github.com/clee2000
Differential Revision: [D77861763](https://our.internmc.facebook.com/intern/diff/D77861763) Pull Request resolved: pytorch#157708 Approved by: https://github.com/wconstab ghstack dependencies: pytorch#157706
…rch#156017) Pull Request resolved: pytorch#156017 Approved by: https://github.com/ezyang
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. It introduces the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is now reserved for a potential future base class that could unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that the relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`). A new class `ConfigTokenizer` is introduced to help process the environment variable's key-value pairs.

## Naming Convention
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format `key:value`; use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`

## Environment Variables
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported, with lower priority.

Pull Request resolved: pytorch#149601 Approved by: https://github.com/albanD
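A minimal sketch of the configuration format described above, assuming only the generic `key:value` / `[list]` syntax; the key names are hypothetical placeholders, not actual allocator options:

```python
import os

# Hypothetical keys, shown only to illustrate the comma-separated key:value
# format with square brackets for list values.
os.environ["PYTORCH_ALLOC_CONF"] = "key1:123,key2:[val1,val2]"

import torch  # the accelerator allocator config is parsed when the backend initializes
```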
…ytorch#156581) In this PR, we enable `HPU` device-specific function calls for random operations. These calls manage setting and unsetting the `context of Random Number Generator`. While HPU devices typically utilize a `Mersenne-based RNG`, DTensor-specific random operations employ an `offset-based (Philox) RNG tracker` whose integration is currently CUDA-specific. To integrate a similar offset-based RNG tracker within the `HPU backend`, a backend-specific device handle function is necessary to identify the execution context of these random operations. Pull Request resolved: pytorch#156581 Approved by: https://github.com/jeromean, https://github.com/wanchaol
Update s390x test marks. test_logs_out from test/dynamo/test_logging.py is updated and no longer fails on s390x. test_qengine from test/test_torch.py doesn't work on s390x: no QEngine is available. Pull Request resolved: pytorch#157541 Approved by: https://github.com/huydhn
Following the suggestion in pytorch#131858 (review) to optimize the DataLoader code. Pull Request resolved: pytorch#146821 Approved by: https://github.com/divyanshk Co-authored-by: Divyansh Khanna <[email protected]>
Use CMake wholearchive group to simplify code. It may also support more OSes. Pull Request resolved: pytorch#156393 Approved by: https://github.com/ezyang
…157733) This is useful for vLLM, which runs AOTAutograd directly on graphs after they have been split. I created a new flag for this instead of reusing `keep_original_node_name` (please let me know if you think I should reuse it). The reasoning is:
- The names of the placeholder nodes are different from the targets of the placeholder nodes. The targets are the actual input names.
- Backwards compatibility: this API has been out for ~4 years, it looks public, and it has extensive public use. For example, this change would actually be BC-breaking for vLLM (they rely on the subgraph input names being different at the moment).

Test Plan:
- new tests

Pull Request resolved: pytorch#157733 Approved by: https://github.com/ezyang
The XPU CD docker image is built on `quay.io/pypa/manylinux_2_28_x86_64`, which is based on AlmaLinux 8.10. Pull Request resolved: pytorch#157356 Approved by: https://github.com/EikanWang, https://github.com/malfet
Pull Request resolved: pytorch#157562 Approved by: https://github.com/yf225
This reverts commit 214e295. Reverted pytorch#156898 on behalf of https://github.com/malfet due to Breaks TorchVision builds ([comment](pytorch#156898 (comment)))
When building/testing PyTorch on macOS. Should prevent some flakiness when the conda environment overtakes CI/CD. Pull Request resolved: pytorch#157749 Approved by: https://github.com/atalman, https://github.com/huydhn
As it gets included in auto-hyperlinked URLs, e.g. in GitHub logs, and then points to a non-existent location. For example, from https://github.com/pytorch/pytorch/actions/runs/16130448756/job/45517004735?pr=157749#step:18:27
> W0708 00:23:20.150000 67082 torch/_dynamo/convert_frame.py:1047] [0/8] To diagnose recompilation issues, see [https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.](https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.)

Pull Request resolved: pytorch#157753 Approved by: https://github.com/zou3519, https://github.com/jansel
Summary: I'm fairly sure the use of a custom metaclass is a holdover from pre-3.7 where Generic used a custom metaclass so we had to use multiple inheritance to avoid import-time failures. At this point, `type(Generic)` is just `type` so it isn't needed, and we will get the least metaclass from our base classes, which means the `type(torch._C.Future)` isn't needed either, it will happen automatically just by inheritance. Test Plan: I'm fairly confident from local testing that this should be a no-op. But also, Pytorch CI should give us pretty strong signal that this change doesn't break anything in case there's some edge case I missed. Pull Request resolved: pytorch#157757 Approved by: https://github.com/ezyang, https://github.com/Skylion007
Update NCCL to 2.27.5. Minor version bump; improves Blackwell and Symmem FP8 support, and fixes a bug with MNNVL. Pull Request resolved: pytorch#157108 Approved by: https://github.com/atalman
…rch#157216) This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan (pytorch/torchtitan#1324). It does two things:
1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely.
2. Lifts the `_validate_tp_mesh_dim` check for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular, it assumes a DeviceMesh must have `mesh_dim_names`, which is not always true. I'm also removing the file it lives in, `torch/distributed/tensor/parallel/_utils.py`, entirely, as the other check there, `_deprecate_warnings`, added two years ago, is no longer used.

Pull Request resolved: pytorch#157216 Approved by: https://github.com/wanchaol, https://github.com/weifengpy
…iple meshes (pytorch#157682) We are seeing more and more use cases where parameters in a model (under the same optimizer group) are put on different meshes. E.g.
- when FSDP and TP are both applied, some parameters are sharded only on the FSDP mesh but not the TP mesh (see pytorch#153268).
- in [dp2ep Expert Parallel](pytorch/torchtitan#1324), the routed experts are sharded on the (global FSDP \ EP) mesh for smaller FSDP and on the EP mesh for EP, whereas other params are sharded on the global FSDP mesh for FSDP.

This PR is, in some sense, a continuation of pytorch#147869 to tackle the problem when fused optimizers are used. In such cases, [`fused_adam`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L15786) / `fused_adamw` has a scalar tensor arg `state_steps` which gets automatically cast to DTensor on the default [`compute_mesh`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_dispatch.py#L350) (one of the multiple meshes), even though it could correspond to different meshes. To avoid hitting the cross-mesh propagation exception in `common_pointwise_strategy` and follow-up redistribute problems, we manually set the target mesh and placements to be the same as the input mesh and placements, so that no redistribute is triggered. This also helps bypass the situation where [`generate_redistribute_costs`](https://github.com/pytorch/pytorch/pull/157682/files#diff-eea32a36dd2d4e58307bc5229402e48048b2ecaef64a7c085495fba1ee10ac89R597) returns infinite cost due to cross-mesh redistribute. Moreover, this PR has minimal scope (restricted to the `fused_ops`) and doesn't need to modify other files such as `_sharding_prop.py`.

Pull Request resolved: pytorch#157682 Approved by: https://github.com/wanchaol
…h#156656) Differential Revision: [D77184232](https://our.internmc.facebook.com/intern/diff/D77184232/) Motivation: * This is the case we care about the most. * We are caching the kernels for this row x column layout, so testing on them can potentially make CI run faster. Pull Request resolved: pytorch#156656 Approved by: https://github.com/ColinPeppler
Because it's just a copy-and-paste of `as_strided_tensorimpl` with a call to `updateTensorBaseShape`, which is not called/used anywhere else. Fixes pytorch#152701 Pull Request resolved: pytorch#157772 Approved by: https://github.com/Skylion007
…#156920) We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device. This pull request preserves the original tolerance threshold for CUDA/CPU devices and introduces a new key, `higher_bf16_xpu`, which only affects the XPU device. Pull Request resolved: pytorch#156920 Approved by: https://github.com/soulitzer
Summary: Fix pytorch#157401 torch.equal cannot handle FakeScriptObject inputs. Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_aoti_torchbind_name_collision ``` Rollback Plan: Differential Revision: D77894081 Pull Request resolved: pytorch#157736 Approved by: https://github.com/angelayi
Pull Request resolved: pytorch#157589 Approved by: https://github.com/Skylion007
… improved readability (pytorch#157735) There are 31 places that I spotted which construct literal dictionaries. This PR refactors dictionary construction by replacing `dict(...)` calls with literal `{...}` syntax where applicable. Pull Request resolved: pytorch#157735 Approved by: https://github.com/ezyang, https://github.com/Skylion007
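For illustration, the kind of rewrite this refactor applies (a generic example, not taken from the PR's diff):

```python
# Before: keyword-argument construction goes through the dict() name lookup and a call
options = dict(mode="reduce-overhead", fullgraph=True)

# After: a dict literal builds the same mapping directly
options = {"mode": "reduce-overhead", "fullgraph": True}
```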
Summary: Previously, sharded tensors were being saved to the same directory as full tensors. But I'm realizing this doesn't make sense because on load(), you would be loading from a directory which contains both, with no way to distinguish them, so they should be in separate folders. Test Plan: ensure existing tests pass Rollback Plan: Differential Revision: D78108144 Pull Request resolved: pytorch#158069 Approved by: https://github.com/teja-rao
Fixes pytorch#157973 `THPUtils_unpackNumberAsBool` now recognises `numpy.bool_` scalars explicitly (using `torch::utils::is_numpy_bool`). If the object is a NumPy boolean, we retrieve its truth value via `PyObject_IsTrue` and return it, avoiding the previous failing path that attempted to treat it as an integer. Pull Request resolved: pytorch#158036 Approved by: https://github.com/jansel
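A hypothetical illustration of the kind of call this unblocks (not necessarily the original repro from pytorch#157973): a `numpy.bool_` scalar passed where a Python bool is expected is now unpacked via its truth value instead of going through the integer path.

```python
import numpy as np
import torch

x = torch.randn(2, 3)
# keepdim goes through the bool-unpacking path; the NumPy boolean scalar here
# is a hypothetical stand-in for the failing pattern in the linked issue.
y = x.sum(dim=0, keepdim=np.bool_(True))
print(y.shape)  # torch.Size([1, 3])
```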
Pull Request resolved: pytorch#156312 Approved by: https://github.com/albanD
It's deprecated since torch==2.7. Pull Request resolved: pytorch#158130 Approved by: https://github.com/justinchuby
)" This reverts commit 6c79530. Reverted pytorch#156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](pytorch#156628 (comment)))
…eparated (pytorch#157676) Written with Claude Code. Fixes pytorch#157569 Fixes pytorch#158134 NumPy and PyTorch handle advanced indexing differently when advanced indices are separated by slices (e.g., arr[:, [0], :, 0]). PyTorch uses "outer" indexing placing result dimensions in original positions, while NumPy uses "vectorized" indexing moving advanced index dimensions to the front. This adds _numpy_style_advanced_indexing() to detect separated advanced indices and transpose results to match NumPy's dimension ordering, ensuring torch._numpy maintains compatibility with NumPy's indexing behavior. Fixes cases like: - arr[:, [0], :, 0] now returns shape (1, 5, 7) instead of (5, 1, 7) - arr[:, [0, 1], :, 0] now returns shape (2, 5, 7) instead of (5, 2, 7) Pull Request resolved: pytorch#157676 Approved by: https://github.com/manuelcandales Co-authored-by: Claude <[email protected]>
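For reference, this is the NumPy behavior being matched (pure NumPy, same shapes as quoted above):

```python
import numpy as np

arr = np.zeros((5, 3, 7, 4))
# The advanced indices ([0] on axis 1 and the integer 0 on axis 3) are separated
# by a slice, so NumPy broadcasts them and places the resulting dimension first.
print(arr[:, [0], :, 0].shape)     # (1, 5, 7)
print(arr[:, [0, 1], :, 0].shape)  # (2, 5, 7)
```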
Now, instead of erroring out on an `empty_cache` call during graph capture or under a mempool context, we just silently do nothing. This used to be the behavior for mempools; cudagraphs used to error out, but it's fine to just ignore the call. Pull Request resolved: pytorch#158152 Approved by: https://github.com/zou3519, https://github.com/eqy
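A minimal sketch of the mempool case, assuming the `torch.cuda.MemPool` / `torch.cuda.use_mem_pool` APIs; with this change the call below is silently ignored rather than raising:

```python
import torch

pool = torch.cuda.MemPool()
with torch.cuda.use_mem_pool(pool):
    x = torch.randn(1024, device="cuda")
    torch.cuda.empty_cache()  # no-op inside the mempool context
```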
This reverts commit 7a92b51. Reverted pytorch#156312 on behalf of https://github.com/XuehaiPan due to landrace ([comment](pytorch#156312 (comment)))
…rch#158089) vLLM's RLHF integration https://github.com/vllm-project/vllm/blob/cf75cd2098f6a3f0bc38d92d1669810c084dab9b/examples/offline_inference/rlhf_utils.py#L93 depends on this hidden feature; this adds a test so that PyTorch will not break it in a backward-incompatible way. The goal is to create p2p shared tensors across devices, e.g. mapping process 0's memory on GPU 0 into process 1's memory space on GPU 1, when GPU 0 and GPU 1 can use GPU direct p2p access. Pull Request resolved: pytorch#158089 Approved by: https://github.com/houseroad, https://github.com/ngimel
…ch#157563)" This reverts commit 4b9a6f7. Reverted pytorch#157563 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it might contribute to a string of OOM error in trunk ([comment](pytorch#157563 (comment)))
Pull Request resolved: pytorch#156312 Approved by: https://github.com/albanD
Pull Request resolved: pytorch#157980 Approved by: https://github.com/anijain2305
From the previous PR pytorch#157608, I added `format_consts_to_cpp` to build the consts bytes. But it still raises a clang ASAN `stack allocation` error when building large consts. This PR: 1. adds `test_aot_inductor_consts_cpp_build` to the stack allocation skip list. 2. adds ATTRIBUTE_NO_SANITIZE_ADDRESS to skip the ASAN check, because the consts array is located in the global area. Pull Request resolved: pytorch#158175 Approved by: https://github.com/jansel
Summary: This diff makes changes to the USDT added by RihamSelim in D44636587. The "operator_start" USDT passes in the memory addresses of operator arguments and the argument types. This is so we can record argument values and types in the Strobelight GPUEvent Profiler. The previous diff records the ATEN operator, and this diff lays the groundwork to record ATEN op arguments. Test Plan: I ensured this code builds by running the example in this diff, and testing profiler changes in this diff. Reviewed By: RihamSelim Differential Revision: D75606556 Pull Request resolved: pytorch#155185 Approved by: https://github.com/malfet
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:
https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here's a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size: [graph omitted]

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case: [graph omitted]

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes: [graph omitted]

Finally, here's the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096: [graph omitted]

## How to review this PR
At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:
1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input's size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime we select the most suitable kernel for the current shape.
4. This PR does not add support for cpp wrapper codegen, to keep it scoped. That will be added in the next PR.

## Results
The following is a basic test that shows our basic multi-kernel working, where we no longer see significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime | 64     | 256    | 4096
---------------------------------------------------
64           | 0.0948 | 0.3124 | 4.9477
256          | 0.2243 | 0.2256 | 3.3880
4096         | 0.3384 | 0.3404 | 3.3010
```

After
```
Hint\Runtime | 64     | 256    | 4096
---------------------------------------------------
64           | 0.0951 | 0.2289 | 3.3013
256          | 0.0952 | 0.2258 | 3.4045
4096         | 0.0957 | 0.2231 | 3.3146
```

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

NB: This is just the beginning and I plan on doing more investigation to further improve on this initial result. For posterity, the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: pytorch#156628 Approved by: https://github.com/jansel
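A minimal usage sketch under the assumptions stated above (the `torch._inductor.config.multi_kernel_hints` knob and dynamic-shape compilation); illustrative rather than a definitive recipe:

```python
import torch

# One Triton matmul variant is generated per hint; the most suitable one is
# selected at runtime for each shape encountered.
torch._inductor.config.multi_kernel_hints = [64, 256, 4096]

@torch.compile(dynamic=True)
def mm(a, b):
    return a @ b

for n in (64, 256, 4096):
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    mm(a, b)
```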
…h.py (pytorch#157847) Pull Request resolved: pytorch#157847 Approved by: https://github.com/Skylion007, https://github.com/zou3519
…c/modules/linear_relu.py (pytorch#157848) Pull Request resolved: pytorch#157848 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#157847
Pull Request resolved: pytorch#158027 Approved by: https://github.com/Skylion007
Pull Request resolved: pytorch#158191 Approved by: https://github.com/Skylion007, https://github.com/malfet
Namely `index_get_offsets`, which, given a thread index, computes offsets into the input, output and indices tensors, and `index_apply_indices`, which applies the offsets to either the input or output tensor index. Pull Request resolved: pytorch#158178 Approved by: https://github.com/dcci, https://github.com/Skylion007 ghstack dependencies: pytorch#158064
# Motivation fix pytorch#110040 Pull Request resolved: pytorch#158189 Approved by: https://github.com/Skylion007, https://github.com/cyyever
As many models require GQA, we support it in flash attention for CPU path. Pull Request resolved: pytorch#157893 Approved by: https://github.com/mingfeima, https://github.com/jansel
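For context, a small sketch of a GQA-shaped call to SDPA (assuming the `enable_gqa` flag of `scaled_dot_product_attention`); this PR lets the CPU flash-attention backend serve such cases:

```python
import torch
import torch.nn.functional as F

# Grouped-query attention: 8 query heads share 2 key/value heads.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 2, 128, 64)
v = torch.randn(1, 2, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```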
…0762) Change the default value of `min_chunk_size` from 4096 to 512 to allow more `for` loops to be parallelized. I tested the Inductor benchmark with this PR on CPU and saw ~10% improvement in the torchbench geomean speedup, with no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy improve by more than 50%. Pull Request resolved: pytorch#150762 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
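For reference, a sketch of overriding the knob in question, assuming the existing `torch._inductor.config.cpp.min_chunk_size` option whose default this PR changes:

```python
import torch

# Minimum number of work items required before Inductor parallelizes a C++
# loop; this PR lowers the default from 4096 to 512.
torch._inductor.config.cpp.min_chunk_size = 512
```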
…rch#157810) Fixes pytorch#157720

### What's in this PR?
This PR improves the error handling in `torch.compile` for `ndarray.astype('O')` (or `object`). It now explicitly raises a `torch._dynamo.exc.Unsupported` exception with a clear explanation, instead of failing with a less intuitive error during fake tensor propagation. This is achieved by adding a check within `NumpyNdarrayVariable.call_method` for this specific `astype` pattern. A new test, `test_ndarray_astype_object_graph_break`, is also added to `test/test_numpy_interop.py` to verify this new behavior.

### Background
Previously, attempting to `torch.compile` a function containing `ndarray.astype('O')` would result in a `TorchRuntimeError` wrapping a `TypeError: data type 'O' not understood`. This error message, originating deep within the tensor mechanism, was not very user-friendly and didn't clearly state *why* it was unsupported. This change makes the failure more explicit and provides a better user experience by giving a direct, actionable error message.

**Old Behavior (Error Traceback):**
```
torch.dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: ... got TypeError("data type 'O' not understood")
```

**New Behavior (Error Message):**
```
torch.dynamo.exc.Unsupported: ndarray.astype(object)
Explanation: ndarray.astype('O') or ndarray.astype(object) is not supported by torch.compile, as there is no equivalent to object type in torch.
```

### Testing
A new test has been added to `test_numpy_interop.py` which decorates a function containing `ndarray.astype("O")` with `torch.compile`. The test asserts that a `torch._dynamo.exc.Unsupported` exception is raised, confirming the new error handling works as expected. The test can be run with: `pytest test/test_numpy_interop.py -k test_ndarray_astype_object_graph_break`

Pull Request resolved: pytorch#157810 Approved by: https://github.com/jansel
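A minimal sketch of the pattern the new test exercises, assuming `fullgraph=True` is used so the `Unsupported` error surfaces instead of a silent graph break:

```python
import torch

@torch.compile(fullgraph=True)
def to_object_dtype(t):
    arr = t.numpy()          # traced as a torch._numpy ndarray
    return arr.astype("O")   # object dtype has no torch equivalent

# Calling to_object_dtype(torch.arange(3)) now reports the explicit
# "ndarray.astype(object) ... is not supported by torch.compile" explanation.
```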
…#156903) Fixes pytorch#156012 This is a temporary solution that makes context parallelism work before the logsumexp behavior changes land in AOTriton. After discussion, we are not going to release AOTriton 0.10.1 to fix this, because: * Even if the interface is not changed, changing the behavior of the returned logsumexp tensor should still be considered an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to the next release. * AOTriton 0.11 is scheduled to be released before the end of July, which is less than five weeks away. Pull Request resolved: pytorch#156903 Approved by: https://github.com/jeffdaily, https://github.com/XilunWu
That fixes `index_put(..., accumulate=True)` for all dtypes. The int64 operation is not really atomic, but it is eventually consistent from the `index_put_accumulate` kernel's point of view: i.e., by the end of the operation, the results in global memory are indeed the accumulation of the operands at the given indices. Pull Request resolved: pytorch#158179 Approved by: https://github.com/dcci, https://github.com/Skylion007 ghstack dependencies: pytorch#158064, pytorch#158178
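For context, the accumulate semantics of the op being fixed (run on CPU here purely to illustrate the behavior; the PR targets the backend `index_put_accumulate` kernel):

```python
import torch

x = torch.zeros(5, dtype=torch.int64)
idx = torch.tensor([0, 0, 3])
vals = torch.tensor([1, 2, 5])

# accumulate=True sums values that hit the same index instead of overwriting.
x.index_put_((idx,), vals, accumulate=True)
print(x)  # tensor([3, 0, 0, 5, 0])
```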
…155112) Re-raising of pytorch#129959 as that was closed. Warning message before: ``` /home/admin/.local/share/hatch/env/virtual/toms-project-1/Qv9k_r_5/dev/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:120: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling. ``` Warning message after: ``` /path/to/my/code:91: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling. ``` Helps the user find where the issue stems from in their code. What do you think? (Looks like "skip_file_prefixes" is not available until Python 3.12 minimum...) Pull Request resolved: pytorch#155112 Approved by: https://github.com/Skylion007, https://github.com/cyyever
updated docs for torch.empty_like to reflect view and dense memory behavior Fixes pytorch#158022 Pull Request resolved: pytorch#158050 Approved by: https://github.com/ngimel, https://github.com/cyyever