
Single head attention, decoupled LR, autoregressive auxiliary loss, and gradient accumulation #191

Open · lucidrains wants to merge 63 commits into base: master
Conversation

lucidrains

clean PR

@lucidrains changed the title from "Sha attn" to "Single head attention with differential LR" on Oct 7, 2021
@lucidrains changed the title from "Single head attention with differential LR" to "Single head attention with decoupled LR" on Oct 8, 2021
sha_sandwich_norm = true

[aux_decoder]
loss_weight = 0.25
lucidrains (Author):

set this to 0 to turn off auxiliary AR loss
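
To make the weighting concrete, below is a minimal sketch (an assumption on my part, not this PR's actual code) of how a weighted auxiliary autoregressive loss is typically folded into the main training loss; main_loss, aux_ar_loss, and combine_losses are illustrative names only.

import torch

def combine_losses(main_loss: torch.Tensor,
                   aux_ar_loss: torch.Tensor,
                   loss_weight: float = 0.25) -> torch.Tensor:
    # hypothetical sketch: loss_weight scales the auxiliary autoregressive
    # decoder's loss before it is added to the main loss; 0 disables it entirely
    if loss_weight == 0:
        return main_loss
    return main_loss + loss_weight * aux_ar_loss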

lucidrains (Author):

protocol should be to start off with 0.25 and search for higher values, up to 1.0, if you see continued improvement

Two review threads on bonito/crf/model.py (outdated, resolved)
@@ -27,6 +27,9 @@ attn_dropout = 0.1
ff_dropout = 0.1
num_attn_heads = 1

use_isab_attn = true
lucidrains (Author):

when using ISAB attention, num_attn_heads above should be set to at least 4
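
For readers unfamiliar with ISAB, here is a minimal sketch of an induced-set attention block in the spirit of the Set Transformer paper (layer norms and feedforwards omitted for brevity); it illustrates the mechanism only and is not the implementation in this PR.

import torch
from torch import nn

class ISAB(nn.Module):
    # illustrative induced-set attention block: a small set of learned latents
    # attends to the full sequence, then the sequence attends back to the latents
    def __init__(self, dim, num_heads=4, num_latents=6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn_in = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_out = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                           # x: (batch, seq, dim)
        latents = self.latents.expand(x.shape[0], -1, -1)
        induced, _ = self.attn_in(latents, x, x)    # (batch, num_latents, dim)
        out, _ = self.attn_out(x, induced, induced)
        return out

The two attention passes per block are also why ISAB carries roughly twice as many attention parameters as a plain S(M)HA block, which is what the weight tying option below addresses.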

@@ -30,6 +30,8 @@ num_attn_heads = 1
use_isab_attn = true
isab_num_latents = 6

weight_tie_attn_blocks = false
lucidrains (Author):

for parameter saving when using ISAB blocks, which have twice as many attention parameters as S(M)HA blocks
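
As an illustration of what weight tying across blocks means here (again a sketch under assumptions, not this PR's code): with tying, the same block instance is reused at every depth, so only one set of attention parameters is stored.

import copy
from torch import nn

def build_attn_layers(block: nn.Module, depth: int, weight_tie: bool) -> nn.ModuleList:
    # hypothetical helper: with weight tying, the exact same module object is
    # repeated at every depth, so its parameters are shared across all layers
    if weight_tie:
        return nn.ModuleList([block] * depth)
    # otherwise each layer gets an independent copy with its own parameters
    return nn.ModuleList([copy.deepcopy(block) for _ in range(depth)])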

num_attn_heads = 1 # number of attention heads, which should be kept at 1 for single-head attention, but can be increased to > 1 to turn on multi-head attention
dim_attn_head = 64 # dimension per attention head; should be kept at 64, but can be lowered to 32 for a further efficiency / performance tradeoff

use_isab_attn = false # whether to use ISAB attention (induced-set attention block from the Set Transformers paper)
lucidrains (Author):

if you were to set this to true, the number of attention heads needs to be increased to 4 or above. a good starting config would be:

num_attn_heads = 4
dim_attn_head = 64
use_isab_attn = true
isab_num_latents = 6
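
To make the head settings above concrete, here is a rough sketch (assumed for illustration, not the module added by this PR) of how num_attn_heads and dim_attn_head typically determine the attention projection sizes; with num_attn_heads = 1 this reduces to single-head attention with a 64-dimensional head.

import torch
from torch import nn

class Attention(nn.Module):
    def __init__(self, dim, num_attn_heads=1, dim_attn_head=64):
        super().__init__()
        inner_dim = num_attn_heads * dim_attn_head   # 1 * 64 for single-head
        self.heads = num_attn_heads
        self.scale = dim_attn_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Linear(inner_dim, dim)

    def forward(self, x):                            # x: (batch, seq, dim)
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, seq, dim_head)
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)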
