[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating #9974

Merged

facebook-github-bot merged 5 commits into gh/SS-JIA/209/base from gh/SS-JIA/209/head

Apr 16, 2025

Contributor

SS-JIA commented Apr 8, 2025 •

Stack from ghstack (oldest at bottom):

Context

Currently, the groupwise quantized int4 linear op implementation forces the scales and zeros tensor to be a Texture3D. However, for some models, e.g. transformer models that have a logit linear layer, the image extents required may exceed the maximum image extents available on the device.

Changes

* Add support for the scales and zeros tensor being a Buffer instead of a Texture3D
* When allocating buffers or images for tensors, check that the requested resource fits within the physical device limits

Differential Revision: D72662176


[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating

39e52d6


SS-JIA added a commit that referenced this pull request


[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating

6c0c9d5

ghstack-source-id: 276858281
Pull Request resolved: #9974

pytorch-bot bot commented Apr 8, 2025 •

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9974

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ae71c7f with merge base 6d1caca:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label

Contributor

facebook-github-bot commented Apr 8, 2025

This pull request was exported from Phabricator. Differential Revision: D72662176

facebook-github-bot added the fb-exported label


Update on "[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating"

356277b


SS-JIA added a commit that referenced this pull request


[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating

53dd19d

Pull Request resolved: #9974

## Context

At a high level, this diff prevents the allocation of textures that exceed the physical device's texture limits, especially when running transformer models.

Currently, the groupwise quantized int4 linear op implementation sets the scales and zeros tensor to be a `Texture3D`. However, for some models, e.g. transformer models that have a logit linear layer, the image extents required may exceed the maximum image extents available on the device.

Exceeding the maximum image extents can lead to undefined behaviour, and therefore should be avoided.

Relatedly, the Vulkan delegate did not interpret the maximum image extents correctly. The physical device limits have three fields that indicate maximum image extents:

* `maxImageDimension1D`
* `maxImageDimension2D`
* `maxImageDimension3D`

Currently, the delegate interprets `maxImageDimension1D` as the maximum image extent in the width axis, `maxImageDimension2D` as the maximum image extent in the height axis, and `maxImageDimension3D` as the maximum image extent in the depth axis.

In reality, `maxImageDimension3D` represents "the largest dimension (`width`, `height`, or `depth`) that is guaranteed to be supported for all images created with an `imageType` of `VK_IMAGE_TYPE_3D`". To properly guard against exceeding device limits, this misconception must be rectified.
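The difference between the two interpretations can be sketched as follows. This is a minimal illustration, not the delegate's actual code: `DeviceLimits`, `Extents`, `fits_per_axis`, and `fits_3d` are hypothetical names, and `DeviceLimits` only mirrors the three relevant fields of `VkPhysicalDeviceLimits` so that the snippet compiles without the Vulkan headers.

```cpp
#include <cstdint>

// Hypothetical stand-in for the three relevant VkPhysicalDeviceLimits fields.
struct DeviceLimits {
  uint32_t maxImageDimension1D;
  uint32_t maxImageDimension2D;
  uint32_t maxImageDimension3D;
};

// Extents of a 3D image in texels (hypothetical helper type).
struct Extents {
  uint32_t width;
  uint32_t height;
  uint32_t depth;
};

// Old interpretation: each limit is treated as the maximum for one axis.
inline bool fits_per_axis(const Extents& e, const DeviceLimits& l) {
  return e.width <= l.maxImageDimension1D &&
         e.height <= l.maxImageDimension2D &&
         e.depth <= l.maxImageDimension3D;
}

// Corrected interpretation: for a VK_IMAGE_TYPE_3D image, every axis is
// bounded by maxImageDimension3D.
inline bool fits_3d(const Extents& e, const DeviceLimits& l) {
  return e.width <= l.maxImageDimension3D &&
         e.height <= l.maxImageDimension3D &&
         e.depth <= l.maxImageDimension3D;
}
```

With limits of 16384, 16384, and 2048 respectively, extents of 4096 x 1 x 1 pass `fits_per_axis` but fail `fits_3d`, which is exactly the case the old check let through.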

As an additional consequence, the maximum image extent allowed for 3D tensors is much smaller than previously thought. For example, the maximum extents reported for Adreno 740 are:

```
      maxImageDimension1D                 16384
      maxImageDimension2D                 16384
      maxImageDimension3D                 2048
```

Evidently, `maxImageDimension3D` is 8 times smaller than `maxImageDimension2D` or `maxImageDimension1D`. The exact ratio depends on the GPU (I believe on some GPUs the limits might even be the same), but in general this lowers the size threshold up to which tensors can be represented via `Texture3D`.

Anecdotally, I have also observed that on Adreno it is possible to allocate 3D images with extents that exceed `maxImageDimension3D` and accessing these textures within a compute shader works fine as well. But I will have to do some more research to determine if I am just getting lucky not being impacted by undefined behaviour, or if the reported `maxImageDimension3D` is not entirely accurate.

To use texture storage for larger tensors, the `Texture2D` storage type should be used instead of `Texture3D`.
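As a rough sketch of that reasoning, the function below picks the first storage type whose limits accommodate the requested extents. It is a generic illustration under stated assumptions, not how the delegate chooses storage: `StorageType`, `ImageLimits`, and `pick_storage` are hypothetical names, and falling back to a buffer when no texture type fits is an assumption made for the example.

```cpp
#include <cstdint>

// Hypothetical storage type tag for this example.
enum class StorageType { Texture3D, Texture2D, Buffer };

// Hypothetical stand-in for two VkPhysicalDeviceLimits fields.
struct ImageLimits {
  uint32_t maxImageDimension2D;
  uint32_t maxImageDimension3D;
};

// Returns the first storage type that can hold an image with the given
// extents (in texels). A 2D image requires depth == 1.
inline StorageType pick_storage(
    uint32_t width,
    uint32_t height,
    uint32_t depth,
    const ImageLimits& l) {
  const uint32_t max3d = l.maxImageDimension3D;
  if (width <= max3d && height <= max3d && depth <= max3d) {
    return StorageType::Texture3D;
  }
  const uint32_t max2d = l.maxImageDimension2D;
  if (depth == 1 && width <= max2d && height <= max2d) {
    return StorageType::Texture2D;
  }
  // Assumption: anything that fits neither texture type uses a buffer.
  return StorageType::Buffer;
}
```

With `maxImageDimension2D = 16384` and `maxImageDimension3D = 2048`, an 8192 x 4096 x 1 image no longer qualifies for `Texture3D` but still fits in a `Texture2D`.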

## Changes

Changed the int4 linear operator to use buffer storage type for scales and zeros. The storage type is not selected dynamically in the interest of reducing the number of shader variants that will need to be generated.

Changed the int4 linear operator to use `Texture2D` for quantized weights instead of `Texture3D`, which should be a perf boost and also raises the threshold up to which texture storage can still be used.

When checking if image extents are within physical limits, use `maxImageDimension3D` only instead of treating `{maxImageDimension1D, maxImageDimension2D, maxImageDimension3D}` as separate components.

Before allocating a buffer or texture resource for a tensor, check that the resource fits within physical device limits.
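A pre-allocation check of this kind might look like the sketch below. The names `AllocLimits`, `buffer_fits`, and `image_fits` are hypothetical, and comparing the buffer size against `maxStorageBufferRange` (a real `VkPhysicalDeviceLimits` field) is an assumption about which limit applies; the actual delegate code may check something different.

```cpp
#include <cstdint>

// Hypothetical stand-in for two VkPhysicalDeviceLimits fields.
struct AllocLimits {
  uint32_t maxImageDimension3D;
  uint32_t maxStorageBufferRange;  // in bytes
};

// True if a buffer of numel elements fits within the assumed buffer limit.
// Division is used instead of multiplication to avoid integer overflow.
inline bool buffer_fits(
    uint64_t numel,
    uint64_t bytes_per_element,
    const AllocLimits& l) {
  if (bytes_per_element == 0) {
    return true;
  }
  return numel <= l.maxStorageBufferRange / bytes_per_element;
}

// True if every extent of a 3D image is within maxImageDimension3D.
inline bool image_fits(
    uint32_t width,
    uint32_t height,
    uint32_t depth,
    const AllocLimits& l) {
  return width <= l.maxImageDimension3D &&
         height <= l.maxImageDimension3D &&
         depth <= l.maxImageDimension3D;
}
```

A caller would run the matching check before creating the resource and raise an error instead of allocating when it returns `false`.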

Differential Revision: [D72662176](https://our.internmc.facebook.com/intern/diff/D72662176/)
ghstack-source-id: 277047171

Contributor

facebook-github-bot commented Apr 9, 2025

This pull request was exported from Phabricator. Differential Revision: D72662176


Update on "[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating"


SS-JIA added a commit that referenced this pull request


[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating

824c9b8

Pull Request resolved: #9974

ghstack-source-id: 277106587

Differential Revision: [D72662176](https://our.internmc.facebook.com/intern/diff/D72662176/)

Contributor

facebook-github-bot commented Apr 9, 2025

This pull request was exported from Phabricator. Differential Revision: D72662176


Update on "[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating"

a5d888c


SS-JIA mentioned this pull request

[ET-VK] Allow int4 linear to execute without 8bit buffer support #10030

Merged

Contributor

facebook-github-bot commented Apr 9, 2025

This pull request was exported from Phabricator. Differential Revision: D72662176


Update on "[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating"

ae71c7f


This was referenced Apr 15, 2025

[ET-VK] Add co-op algorithm for 4 bit weight only quantized linear #10204

Merged

[ET-VK] Use performant tiled algorithm for 4 bit weight only quantized linear #10205

Merged

Contributor

facebook-github-bot commented Apr 15, 2025

This pull request was exported from Phabricator. Differential Revision: D72662176

trivedivivek approved these changes

trivedivivek added the topic: not user facing label

facebook-github-bot merged commit fa2b715 into gh/SS-JIA/209/base

84 of 85 checks passed

facebook-github-bot deleted the gh/SS-JIA/209/head branch

April 16, 2025 18:17

facebook-github-bot temporarily deployed to cherry-pick-bot with GitHub Actions (Inactive)

April 16, 2025 18:17

pytorchbot mentioned this pull request

[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating #10233

Merged

Labels

CLA Signed fb-exported topic: not user facing