🚀 The feature, motivation and pitch
Researchers at Microsoft Research and Tsinghua University have introduced the Differential Transformer (Diff Transformer), a new LLM architecture that improves performance by amplifying attention to relevant context while filtering out attention noise. Their findings, published in a research paper, show that Diff Transformer outperforms the classic Transformer architecture across a variety of settings. Differential attention can be applied both when training from scratch and when adapting pretrained models; in the latter case it can improve robustness and accuracy in practical applications such as in-context learning and text summarization (sources below). The feature request here is to examine the potential for applying this architecture at vLLM runtime.
paper: arXiv
press coverage (October 16th): VentureBeat
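For reference, the core operation proposed in the paper computes attention as the difference between two softmax attention maps, which cancels the attention weight both maps assign to irrelevant context:

$$\mathrm{DiffAttn}(X) = \left(\mathrm{softmax}\!\left(\frac{Q_1 K_1^{\top}}{\sqrt{d}}\right) - \lambda\,\mathrm{softmax}\!\left(\frac{Q_2 K_2^{\top}}{\sqrt{d}}\right)\right) V$$

where $Q_1, Q_2$ and $K_1, K_2$ are two sets of query/key projections, $V$ is shared between the two maps, and $\lambda$ is a learnable scalar.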
Alternatives
N/A
Additional context
github: Diff-Transformer
"
multihead_diffattn.py contains naive implementation of multi-head differential attention.
multihead_flashdiff_1.py contains multi-head differential attention implemented with FlashAttention, for packages that support different qk/v dimensions (e.g., our customized-flash-attention and xformers).
multihead_flashdiff_2.py contains multi-head differential attention implemented with FlashAttention, for packages that do not support different qk/v dimensions (e.g., flash-attention).
Also refer to microsoft/unilm#1633 for another implementation.
"
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.