🚀 The feature, motivation and pitch
Researchers at Microsoft Research and Tsinghua University have introduced the Differential Transformer (Diff Transformer), a new LLM architecture that improves performance by amplifying attention to relevant context while filtering out attention noise. Their findings, published in a research paper, show that Diff Transformer outperforms the classic Transformer architecture across a variety of settings. Differential attention can be applied both when training from scratch and when adapting pretrained models; in the latter case it can improve robustness and accuracy in practical applications such as in-context learning and text summarization (sources below). The feature request here is to examine the potential for applying this architecture at vLLM runtime.
paper: arXiv
press coverage (October 16th): VentureBeat
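For reference, the core operation proposed in the paper computes attention as the difference between two softmax attention maps, which cancels the attention weight both maps assign to irrelevant context:

$$\mathrm{DiffAttn}(X) = \left(\mathrm{softmax}\!\left(\frac{Q_1 K_1^{\top}}{\sqrt{d}}\right) - \lambda\,\mathrm{softmax}\!\left(\frac{Q_2 K_2^{\top}}{\sqrt{d}}\right)\right) V$$

where $Q_1, Q_2$ and $K_1, K_2$ are two sets of query/key projections, $V$ is shared between the two maps, and $\lambda$ is a learnable scalar.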
Alternatives
N/A
Additional context
github: Diff-Transformer
"
multihead_diffattn.py contains naive implementation of multi-head differential attention.
multihead_flashdiff_1.py contains multi-head differential attention implemented with FlashAttention, for packages that support different qk/v dimensions (e.g., our customized-flash-attention and xformers).
multihead_flashdiff_2.py contains multi-head differential attention implemented with FlashAttention, for packages that do not support different qk/v dimensions (e.g., flash-attention).
Also refer to microsoft/unilm#1633 for another implementation.
"
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.