
Add MPS support #190


Open · wants to merge 1 commit into main

Conversation


@12v 12v commented Mar 11, 2025

Hello, this is an attempt to add MPS support.

There are two issues preventing MPS from working with the transformer backend:

torch.compile doesn't support MPS (see: here)

The fix for this is straightforward: add another condition check before using torch.compile.
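A minimal sketch of that kind of guard, assuming compilation is simply skipped on MPS (the helper name and flags here are illustrative, not the PR's actual code):

```python
import torch

def maybe_compile(fn, device: torch.device):
    # torch.compile did not support MPS at the time of this PR, so only
    # compile on device types where it is known to work and fall back to
    # the uncompiled function on MPS.
    if device.type == "mps":
        return fn
    return torch.compile(fn, dynamic=True)
```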

Grouped Query Attention (GQA) only works with CUDA (see: here)

This is more complex. Falling back to Multi-Headed Attention (MHA) on MPS requires the same number of heads for KV as for Q, but the pre-trained weights expect a smaller number of KV heads than Q heads. My current solution for this is to duplicate the KV heads and weights to match the number of Q heads.
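A minimal sketch of that duplication, assuming the checkpoint stores K/V projection weights for n_kv_heads < n_q_heads (the tensor layout and names are assumptions, not the PR's actual code):

```python
import torch

def expand_kv_heads(w_kv: torch.Tensor, n_q_heads: int, n_kv_heads: int, head_dim: int) -> torch.Tensor:
    """Repeat each KV head so K/V have as many heads as Q (MHA fallback for GQA weights)."""
    assert n_q_heads % n_kv_heads == 0
    repeat = n_q_heads // n_kv_heads
    # (n_kv_heads * head_dim, d_model) -> (n_kv_heads, head_dim, d_model)
    w = w_kv.view(n_kv_heads, head_dim, -1)
    # Duplicate each KV head `repeat` times: (n_q_heads, head_dim, d_model)
    w = w.repeat_interleave(repeat, dim=0)
    return w.reshape(n_q_heads * head_dim, -1)
```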

Aside from the extra code, the main downside of this approach is that the weights saved from a model on MPS can't be loaded again (whether on MPS or any other backend). Possible paths forward:

  1. Continue with this approach, and just log a warning or error if the model was trained on MPS
  2. When on MPS, copy and compress the weights internally back to the shape corresponding to the number of KV heads used by GQA so they can be saved and loaded (see the sketch after this list)
  3. ...
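For option 2, a rough sketch of the inverse step before saving, under the assumption that the duplicated heads can simply be merged back (exact while they are still identical, i.e. inference only; after training on MPS the duplicates diverge and averaging becomes lossy):

```python
import torch

def compress_kv_heads(w_expanded: torch.Tensor, n_q_heads: int, n_kv_heads: int, head_dim: int) -> torch.Tensor:
    """Merge duplicated KV heads back to the GQA checkpoint shape before saving."""
    repeat = n_q_heads // n_kv_heads
    # (n_q_heads * head_dim, d_model) -> (n_kv_heads, repeat, head_dim, d_model)
    w = w_expanded.view(n_kv_heads, repeat, head_dim, -1)
    # Average across the duplicates of each original KV head.
    return w.mean(dim=1).reshape(n_kv_heads * head_dim, -1)
```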

Additionally, an alternative to transforming the weights within the model is to instead transform the pre-trained weights outside of the model before loading them.
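For that external route, one possible shape of the loading code, reusing the hypothetical expand_kv_heads helper above; the checkpoint path and parameter key names are placeholders, not the repository's actual names:

```python
import torch

def load_for_mps(model: torch.nn.Module, path: str, n_q_heads: int, n_kv_heads: int, head_dim: int):
    state_dict = torch.load(path, map_location="cpu")
    for name, tensor in list(state_dict.items()):
        # Expand K/V projection weights to the MHA shape before they ever reach the model.
        if name.endswith("k_proj.weight") or name.endswith("v_proj.weight"):
            state_dict[name] = expand_kv_heads(tensor, n_q_heads, n_kv_heads, head_dim)
    model.load_state_dict(state_dict)
    return model
```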

Notes

The recording generated on MPS doesn't sound as good as the recording generated on CPU.

@12v 12v marked this pull request as ready for review March 12, 2025 18:40
@tjameswilliams

@12v have you tested this branch on an Apple device? Does it speed things up considerably? This is awesome, because inference is insanely slow on Mac (I am using an M4 Max and a few seconds of audio takes minutes).

I tried solving this myself, but continued to run into issues with tokenization.

@tjameswilliams

Actually I just cloned your branch and tested it. Still no support for the hybrid model, but that's OK; the transformer is massively faster on MPS.

@ReadyPlayerEmma

How much is the quality affected? Is there a way to get the behavior/quality to match the CPU case? What exactly is causing the quality loss?

@tjameswilliams

@ReadyPlayerEmma to be fair (and complete), the quality loss happens on CUDA too. My guess is simply that the floating-point precision is lower.

@Aedelon

Aedelon commented Apr 14, 2025

I just saw your PR. I made mine without being aware of your work.

In model.py, you can compile with the backend "aot_eager". This is also indicated in the link you shared: LINK

decode_one_token = torch.compile(
    decode_one_token, dynamic=True, backend="aot_eager", disable=cg or disable_torch_compile
)
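If aot_eager works on MPS, it could presumably also be combined with the device check from the PR description, selecting the compile backend per device rather than skipping compilation entirely (illustrative, not the actual change):

```python
# Hypothetical variation: choose a compile backend per device instead of disabling compilation on MPS.
backend = "aot_eager" if device.type == "mps" else "inductor"
decode_one_token = torch.compile(decode_one_token, dynamic=True, backend=backend)
```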

@Aedelon Aedelon mentioned this pull request Apr 15, 2025