Single head attention, decoupled LR, autoregressive auxiliary loss, and gradient accumulation #191
base: master
Conversation
…which to place an SHA block
…iments being more stable with more aggressive clipping at 0.5 on 20 million chunks
sha_sandwich_norm = true

[aux_decoder]
loss_weight = 0.25
Set this to 0 to turn off the auxiliary AR loss.
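For reference, a minimal config sketch (using only the [aux_decoder] keys shown in the diff above) that disables it:

[aux_decoder]
loss_weight = 0   # 0 turns the auxiliary AR loss off entirely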
The protocol should be to start at 0.25 and search for higher values up to 1 if you see continued improvement.
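A sketch of that protocol in config form; only 0.25 and 1 come from this thread, the intermediate values are illustrative assumptions:

[aux_decoder]
loss_weight = 0.25   # start here; if validation keeps improving, try higher values (e.g. 0.5, 0.75) up to 1.0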
…ing --batch * --accum
… a command line flag. also add ability to turn off self attention in the AR decoder
…concatting feature dimension across layers
…b_attn flag in configs
… default head dimension to 64
@@ -27,6 +27,9 @@ attn_dropout = 0.1
ff_dropout = 0.1
num_attn_heads = 1

use_isab_attn = true
When using ISAB attention, num_attn_heads above should be set to at least 4.
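In config terms (a fuller starting example appears later in this thread):

num_attn_heads = 4     # at least 4 when ISAB is on; 1 is only for plain single-head attention
use_isab_attn = true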
@@ -30,6 +30,8 @@ num_attn_heads = 1
use_isab_attn = true
isab_num_latents = 6

weight_tie_attn_blocks = false
For parameter saving when using ISAB blocks, which have twice the number of attention parameters as S(M)HA blocks.
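A sketch of the combination this flag is meant for, assuming the keys behave as described in this thread:

use_isab_attn = true
isab_num_latents = 6
weight_tie_attn_blocks = true   # share attention parameters across layers to offset ISAB's roughly 2x parameter count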
num_attn_heads = 1 # number of attention heads; keep at 1 for single-head attention, or increase to > 1 to turn on multi-head attention
dim_attn_head = 64 # dimension per attention head; generally keep at 64, but can be lowered to 32 for a further efficiency / performance tradeoff

use_isab_attn = false # whether to use ISAB attention (induced set attention block from the Set Transformer paper)
If you were to set this to true, the number of attention heads needs to be increased to 4 or above. A good starting config would be:

num_attn_heads = 4
dim_attn_head = 64
use_isab_attn = true
isab_num_latents = 6
clean PR