Regarding ensemble of attention score #10

You mentioned in www.playtika-blog.com/playtika-ai/multi-horizon-forecasting-using-temporal-fusion-transformers-a-comprehensive-overview-part-2/ that "The different heads simply take care of the interactions between the Queries and the Keys, and the outputs of the heads are aggregated and averaged **before multiplying by the projected values**". However, in your implementation the values are not multiplied by the ensemble of attention scores $\tilde{A}(\boldsymbol{Q},\boldsymbol{K})$. Instead, the ensembling (the mean over heads) is applied **after** each head's attention scores have been multiplied by the values:
```
attention_scores = attn_scores_all_heads.mean(dim=1)
attention_outputs = attn_outputs_all_heads.mean(dim=1)
```
I have seen other implementations as well, and they do the same thing: they ensemble **after** multiplying by the values. I may be completely wrong, but ensembling after multiplying by the values doesn't seem intuitive to me. Could you please shed some light on this? Thank you.
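
For what it's worth, here is a minimal NumPy sketch I used to compare the two orderings. It is not the repository's code: the tensor sizes and the names `shared_values` and `per_head_values` are hypothetical, and it assumes the value projection is shared across all heads, which is my understanding of the TFT formulation. Under that assumption the two orderings appear to coincide, because the mean over heads is linear; with a separate value projection per head they generally do not.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, heads, q_len, k_len, d_v = 2, 4, 5, 5, 8  # hypothetical sizes

# Per-head attention weights: softmax over the key axis.
logits = rng.normal(size=(batch, heads, q_len, k_len))
attn_scores_all_heads = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn_scores_all_heads /= attn_scores_all_heads.sum(axis=-1, keepdims=True)

# Assumption: a single value projection shared by every head.
shared_values = rng.normal(size=(batch, k_len, d_v))

# (a) Multiply each head's scores by the values, then average over heads.
attn_outputs_all_heads = attn_scores_all_heads @ shared_values[:, None]
outputs_avg_after = attn_outputs_all_heads.mean(axis=1)

# (b) Average the scores over heads first, then multiply by the values.
attention_scores = attn_scores_all_heads.mean(axis=1)
outputs_avg_before = attention_scores @ shared_values

same_with_shared_values = np.allclose(outputs_avg_after, outputs_avg_before)

# Contrast: with a different value tensor per head, the order matters.
per_head_values = rng.normal(size=(batch, heads, k_len, d_v))
after = (attn_scores_all_heads @ per_head_values).mean(axis=1)
before = attention_scores @ per_head_values.mean(axis=1)
same_with_per_head_values = np.allclose(after, before)

print(same_with_shared_values, same_with_per_head_values)
```

If the values in the implementation are indeed shared across heads, then `attention_outputs` would equal $\tilde{A}(\boldsymbol{Q},\boldsymbol{K})$ times the values up to floating-point error, and the difference from the blog post would only be in the order of operations.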


