question about the loss compute #203

jamesben6688 · 2025-03-04T01:02:36Z

Hi, the ar_loss is computed as follows:

loss, aux_losses = ar_loss(
    lengths=seq_features.past_lengths,  # [B]
    output_embeddings=seq_embeddings[:, :-1, :],  # [B, N-1, D]
    supervision_ids=supervision_ids[:, 1:],  # [B, N-1]
    supervision_embeddings=input_embeddings[:, 1:, :],  # [B, N-1, D]
    supervision_weights=ar_mask.float(),
    negatives_sampler=negatives_sampler,
    **seq_features.past_payloads,
)  # [B, N]

So the prediction is output_embeddings, and the supervision is supervision_ids[:, 1:] rather than target_id. However, output_embeddings is computed using MultiHeadAttention rather than MaskedMultiHeadAttention, which means that output_embeddings at time t can see data after t. Will this be a problem?
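To make the concern concrete, here is a minimal sketch of how one could check for this kind of leakage. It is not the repository's attention code: self_attention and leaks_future are hypothetical helpers, written in plain NumPy with a single head and no learned projections. The idea is to perturb the inputs after position t and see whether the outputs at positions up to t change.

```python
import numpy as np


def self_attention(x, causal):
    """Toy single-head dot-product self-attention over x of shape [N, D].

    Hypothetical helper for illustration only; no learned weights.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # [N, N]
    if causal:
        # Position t may attend only to positions <= t.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x  # [N, D]


def leaks_future(causal, t=2, seed=0):
    """Return True if changing inputs after position t changes outputs at positions <= t."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(6, 4))
    x_perturbed = x.copy()
    x_perturbed[t + 1:] += 1.0  # modify only the "future" positions
    out = self_attention(x, causal)
    out_perturbed = self_attention(x_perturbed, causal)
    return not np.allclose(out[: t + 1], out_perturbed[: t + 1])


print("unmasked attention leaks future:", leaks_future(causal=False))
print("causal attention leaks future:", leaks_future(causal=True))
```

If the attention is unmasked, position t of output_embeddings could already contain information about the item at t + 1 that it is trained to predict, so the loss would be easy to minimize without learning anything useful. The same perturbation test could be applied to the real model's forward pass to see which case holds.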
