nnx.make_causal_mask() usage #4505

Open
windmaple opened this issue Jan 25, 2025 · 3 comments

@windmaple

This is a follow-up to #4290 (@cgarciae). To build a causal LM, I need causal masking. Here is my attempt (adding a single line to the code from #4290):

import jax.numpy as jnp
from flax import nnx

batch_size = 2
seqlen = 40
emb_size = 256

x = jnp.ones((batch_size, seqlen, emb_size))

mha = nnx.MultiHeadAttention(
  in_features=emb_size, num_heads=2, decode=True, rngs=nnx.Rngs(0)
)
shape = x.shape
mha.init_cache(shape)  # initialize the KV cache for decode mode

for i in range(seqlen):  # iterate over all tokens, one at a time
  y = mha(inputs_q=x[:, i : i + 1],
          mask=nnx.make_causal_mask(x[:, i : i + 1]))  # newly added

The error I got is:

AssertionError: masks must have same rank: (5, 4)

I cannot make sense of this error :(

@cgarciae
Collaborator

cgarciae commented Jan 28, 2025

Hi @windmaple, in decode mode MultiHeadAttention is always causal, meaning you don't have to provide a mask in this case. See:

  # causal mask for cached decoder self-attention:
  # our single query position should only attend to those key
  # positions that have already been generated and cached,
  # not the remaining zero elements.
  mask = combine_masks(
    mask,
    jnp.broadcast_to(
      jnp.arange(max_length) <= cur_index,
      tuple(batch_dims) + (1, 1, max_length),
    ),
  )
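In other words, the explicit mask argument can simply be dropped in the decode loop. A minimal sketch of that usage, assuming the same toy shapes as above and that the cache is set up with init_cache:

import jax.numpy as jnp
from flax import nnx

batch_size, seqlen, emb_size = 2, 40, 256
x = jnp.ones((batch_size, seqlen, emb_size))

mha = nnx.MultiHeadAttention(
  in_features=emb_size, num_heads=2, decode=True, rngs=nnx.Rngs(0)
)
mha.init_cache(x.shape)  # allocate the autoregressive KV cache for decode mode

outputs = []
for i in range(seqlen):
  # feed one token at a time; the causal constraint is applied internally,
  # so no mask argument is passed
  y = mha(inputs_q=x[:, i : i + 1])
  outputs.append(y)

y_full = jnp.concatenate(outputs, axis=1)  # (batch_size, seqlen, emb_size)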

@windmaple
Author

Yeah, I realized that, since we are feeding tokens in one by one.

However, for some reason it's not working as expected. I'll try to provide a repro.

@windmaple
Author

Here is the notebook:
https://colab.research.google.com/drive/1kk7xcFSA7KzVQnekfqmdd1Gq_Z4qsLvU#scrollTo=NIOXoY1xgiww

Turning on the KV cache makes it so much slower, which doesn't make any sense to me :(
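For reference, a rough (hypothetical) way to time the two paths side by side, assuming the same toy shapes; note that outside of jit the decode loop dispatches one call per token, so some Python-level overhead is expected:

import time
import jax.numpy as jnp
from flax import nnx

batch_size, seqlen, emb_size = 2, 40, 256
x = jnp.ones((batch_size, seqlen, emb_size))

# full-sequence attention with an explicit causal mask (no cache)
mha_full = nnx.MultiHeadAttention(
  in_features=emb_size, num_heads=2, decode=False, rngs=nnx.Rngs(0)
)
mask = nnx.make_causal_mask(jnp.ones((batch_size, seqlen)))  # (2, 1, 40, 40)
t0 = time.perf_counter()
y = mha_full(inputs_q=x, mask=mask)
y.block_until_ready()
print('full sequence:', time.perf_counter() - t0)

# token-by-token decoding with the KV cache
mha_dec = nnx.MultiHeadAttention(
  in_features=emb_size, num_heads=2, decode=True, rngs=nnx.Rngs(0)
)
mha_dec.init_cache(x.shape)
t0 = time.perf_counter()
for i in range(seqlen):
  y = mha_dec(inputs_q=x[:, i : i + 1])
y.block_until_ready()
print('per-token decode:', time.perf_counter() - t0)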
