Replies: 2 comments 1 reply
-
Hey @stefan-it, can you point to the exact line where this fusion is happening?
-
Hi @cgarciae, thanks for your fast reply. The only difference I can see here is this projection:

```python
projection = functools.partial(
    DenseGeneral,
    axis=-1,
    features=(self.num_heads, self.head_dim),
    kernel_axes=('embed', 'joined_kv'),
    dtype=self.dtype)
```

whereas the scalable architecture has:

```python
projection = functools.partial(
    DenseGeneral,
    axis=-1,
    features=(self.num_heads, self.head_dim),
    kernel_axes=('embed', 'heads', 'kv'),
    dtype=self.dtype)
```

For converting the weights we need to manually fuse the `('heads', 'kv')` axes into a single `'joined_kv'` axis.
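Here is a rough sketch of what I mean by manually fusing (my own illustration, not code from the T5X repo), assuming the unfused kernel has shape `(d_model, num_heads, head_dim)` and the fusion is just a reshape of the last two axes:

```python
# Sketch only: fuse a kernel laid out as ('embed', 'heads', 'kv')
# into the ('embed', 'joined_kv') layout for weight conversion.
import jax.numpy as jnp

d_model, num_heads, head_dim = 512, 6, 64

# Kernel as produced with kernel_axes=('embed', 'heads', 'kv')
unfused_kernel = jnp.zeros((d_model, num_heads, head_dim))

# Fused layout matching kernel_axes=('embed', 'joined_kv')
fused_kernel = unfused_kernel.reshape(d_model, num_heads * head_dim)  # (512, 384)

# Un-fusing for the opposite conversion direction is the same reshape in reverse
restored = fused_kernel.reshape(d_model, num_heads, head_dim)
assert restored.shape == (d_model, num_heads, head_dim)
```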
-
Hi,
while integrating the umT5 architecture (which is called "Scalable T5" in the T5X repository) into the 🤗 Transformers library, we saw a "joined" or fused matrix representation with a shape of heads × kv. The T5X repo describes this operation as: (-> Source)
But how could this be done in Flax?
For example we have a tensor of (512, 6, 64) and we need a "fused" representation with a shape of (512, 384).
I used:
where `config.d_model = 512`, `config.num_heads = 6` and `config.d_kv = 64`, to get the desired fused representation. But is this the correct way to do that?
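For context, here is a quick numerical check I put together (my own sketch, not from the T5X repo), which assumes the fusion is just a row-major reshape of the (heads, kv) axes:

```python
# Sketch: verify that reshaping (d_model, num_heads, d_kv) -> (d_model, num_heads * d_kv)
# yields a fused kernel whose single matmul matches the per-head contraction.
import numpy as np

d_model, num_heads, d_kv = 512, 6, 64
rng = np.random.default_rng(0)

kernel = rng.normal(size=(d_model, num_heads, d_kv))   # unfused: (512, 6, 64)
fused = kernel.reshape(d_model, num_heads * d_kv)      # fused:   (512, 384)

x = rng.normal(size=(d_model,))                        # a single "embed" vector

per_head = np.einsum('d,dhk->hk', x, kernel)           # (6, 64)
joined = x @ fused                                     # (384,)

# Row-major reshape keeps the (heads, kv) ordering, so the results agree.
np.testing.assert_allclose(per_head.reshape(-1), joined)
```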