Pull request #191 (Open): Single head attention, decoupled LR, autoregressive auxiliary loss, and gradient accumulation
lucidrains wants to merge 63 commits into nanoporetech:master from lucidrains:sha-attn
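For context on the "decoupled LR" part of the title: the commit log below lowers the SHA learning rate to 1e-4 while the rest of the model keeps its usual rate. Here is a minimal sketch of how that kind of split might be wired up with PyTorch parameter groups; the name-based grouping and the learning-rate values are illustrative assumptions, not the PR's actual API.

```python
import torch

def build_optimizer(model, base_lr=2e-3, attn_lr=1e-4):
    # Hypothetical split: attention parameters get a lower learning rate than
    # the rest of the network. How attention modules are identified here
    # (substring match on the parameter name) is an assumption for the sketch.
    attn_params, other_params = [], []
    for name, param in model.named_parameters():
        (attn_params if "attn" in name else other_params).append(param)
    return torch.optim.AdamW([
        {"params": other_params, "lr": base_lr},
        {"params": attn_params, "lr": attn_lr},
    ])
```

The point of the split is that the attention blocks can be trained more conservatively than the convolutional encoder without changing the schedule for everything else.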
Commits (63):
d73a38d (lucidrains): add single-head attention https://arxiv.org/abs/1911.11423
52aa3d5 (lucidrains): add all stability related logic from public SHA code
e0ec562 (lucidrains): allow for removal of ff, given conclusions of https://arxiv.org/abs/2…
59096ba (lucidrains): update to faithful SHA with feedforward instead of Boom
c31866f (lucidrains): bidirectional gives best results
e8d4d2c (lucidrains): cleanup
47dfbae (lucidrains): Merge branch 'nanoporetech:master' into master
45f5205 (lucidrains): add SHA with differential learning rates
7bb19db (lucidrains): lint
f75cc14 (lucidrains): import SHABlock in trainer file
b2ab746 (lucidrains): lower SHA learning rate to 1e-4
f762f7b (lucidrains): add head scale from normformer paper
e891d92 (lucidrains): modify single_head_layers hparams to indicate the layer number after …
57d281a (lucidrains): add sandwich norm to SHA
36c903a (lucidrains): add ability to adjust grad clip max norm, in light of attention exper…
c03785a (lucidrains): make feedforward sandwich norm as well, for extra stability
79a9336 (lucidrains): first commit for decoder autoregressive auxiliary loss
4f8d6c2 (lucidrains): fix alibi
9a60831 (lucidrains): decouple gradient clipping of attention parameters from non-attention…
4b59e68 (lucidrains): add ability to do gradient accumulation, with effective batch size be…
a3be656 (lucidrains): clean up
92d6f10 (lucidrains): Decoder module parameters must be designated as attention parameters …
73a44ec (lucidrains): make default AR loss weight a bit higher
b6cff9d (lucidrains): handle losses being a dictionary already
07f3e7e (lucidrains): make sure falling back to not using an AR decoder actually works
decfd2b (lucidrains): bug fix for grad accumulation
f18f5d2 (lucidrains): fix all issues
78cd4ad (lucidrains): move alibi to rotary positional embeddings
ce1fa04 (lucidrains): fix error with casting to float32
e4a7df9 (lucidrains): make sure decoder AR gets 0.1 dropout loss for attention and feedforward
ab87687 (lucidrains): make sure relu squared from primer paper is used (accidentally had th…
2b888c3 (lucidrains): fix rotary positional embedding
39a4825 (lucidrains): address problem with absolute positional embedding and rotary embeddi…
e28fac3 (lucidrains): cleanup
8c8a64e (iiSeymour): Merge branch 'master' into sha-attn
0f2fc65 (lucidrains): add ability to specify more aggressive gradient clipping for attentio…
2626685 (lucidrains): make sure attention uses stable softmax
dba58f2 (lucidrains): use --attn-clip instead
771f274 (lucidrains): one more stability measure for final layernorm in decoder
9c7111f (lucidrains): add yet another stability measure, from cogview paper
d34dd2f (lucidrains): cross entropy for auxiliary decoder loss should be done in float32
1b70e1b (lucidrains): use amp.autocast to disable mixed precision for cross entropy calc
b6ba887 (lucidrains): make sure gradients do not go through numerical stability measures
5c6e200 (lucidrains): remove head scaling
9ab7fc4 (lucidrains): add pb relax stable softmax technique from cogview paper
b2c0447 (lucidrains): remove layerscale
2a2205b (lucidrains): add ability to turn off AR auxiliary loss at a certain epoch, or with…
5ac0bac (lucidrains): use ff-geglu over relu squared for now
498c79d (lucidrains): use stable layernorm from cogview paper for norming the encoder embed…
79d803b (lucidrains): fix bug with stable softmax
0ae286c (lucidrains): add ability to have decoder attend to all encoder layers by means of …
ad2828b (lucidrains): add ability to turn on scaled cosine sim attention
ee5c880 (lucidrains): fix bug
48135c6 (lucidrains): better init for cosine sim attention learned temp
4231882 (lucidrains): prepare for fitting in induced set attention block
0ff4aa1 (lucidrains): make learned initial temperature for cosine sim attention customizable
68a6f0d (lucidrains): add induced-set attention blocks, which can be turned on with use_isa…
5cf5be0 (lucidrains): ISAB block needs to be included as a module containing attention para…
1520e73 (lucidrains): add weight tying feature across transformer blocks, and also set ISAB…
324c9d5 (lucidrains): make sure attention head dimension is configurable through toml
e1f7a7c (lucidrains): add comments and docs
0a3cbcd (lucidrains): docstrings for SHA and MHA
541f0c3 (lucidrains): set some guardrails
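Two other items from the title appear throughout the log: gradient accumulation with a larger effective batch size (4b59e68, decfd2b) and more aggressive gradient clipping for attention parameters than for the rest of the model (9a60831, 36c903a, and the --attn-clip flag in dba58f2). The following is a rough sketch of how those pieces could fit together in a training loop, assuming attention parameters can be picked out by name and that the model exposes a loss helper; none of this is the PR's actual code.

```python
import torch

def train_epoch(model, loader, optimizer, accumulate=4, clip=2.0, attn_clip=0.5):
    # Gradient accumulation: gradients are summed over `accumulate` micro-batches,
    # so the effective batch size is accumulate * loader batch size.
    # Attention parameters are clipped with a tighter max norm (attn_clip) than
    # everything else (clip), mirroring the --attn-clip idea in the commit log.
    attn_params = [p for n, p in model.named_parameters() if "attn" in n]
    other_params = [p for n, p in model.named_parameters() if "attn" not in n]

    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = model.compute_loss(x, y)   # assumed helper returning a scalar loss
        (loss / accumulate).backward()    # scale so accumulated grads match one big batch

        if (step + 1) % accumulate == 0:
            torch.nn.utils.clip_grad_norm_(other_params, clip)
            torch.nn.utils.clip_grad_norm_(attn_params, attn_clip)
            optimizer.step()
            optimizer.zero_grad()
```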
Review comment: set this to 0 to turn off the auxiliary AR loss.
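The setting being discussed is the weight on the decoder's autoregressive auxiliary loss (see commits 73a44ec and 2a2205b). A minimal sketch of how such a weight might combine with the main loss, where a weight of 0 drops the auxiliary term entirely; the function and argument names are hypothetical, not the PR's actual interface.

```python
import torch.nn.functional as F

def total_loss(main_loss, decoder_logits, targets, ar_loss_weight=0.25):
    # Hypothetical combination: with ar_loss_weight == 0 the auxiliary
    # autoregressive term is skipped, which is what "set this to 0" means here.
    if ar_loss_weight == 0:
        return main_loss
    ar_loss = F.cross_entropy(
        decoder_logits.transpose(1, 2),  # assumed (batch, time, classes) -> (batch, classes, time)
        targets,
    )
    return main_loss + ar_loss_weight * ar_loss
```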
Review comment: the protocol should be to start off with 0.25 and search for higher values, up to 1.0, if you see continued improvement.
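A tiny, purely illustrative sketch of that search protocol, reusing the hypothetical ar_loss_weight from the snippet above:

```python
# Try progressively larger weights, keeping the largest one that still improves
# validation performance; stop increasing once improvement stalls.
for ar_loss_weight in (0.25, 0.5, 0.75, 1.0):
    ...  # train and validate with this weight, compare against the previous run
```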