Attention

AI to predict a masked word in a text sequence using Google BERT Masked Language Model from Hugging Face Transformers

  • AI to predict a masked word in a text sequence using the Google Bidirectional Encoder Representations from Transformers (BERT) Masked Language Model from Hugging Face Transformers and generate diagrams visualizing the attention scores of each of BERT's 144 self-attention heads (12 layers × 12 heads) for a given sentence.

  • Built with Python, TensorFlow, Transformers (Hugging Face), Pillow (PIL), BERT, and more

  • Used TensorFlow to get the top-k predicted tokens for the [MASK] position in the input sequence from the model's vocabulary logits (see the sketch below)

image
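A minimal sketch of this prediction step, assuming the bert-base-uncased checkpoint and k = 3 (assumptions for illustration, not necessarily what this repository uses):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFBertForMaskedLM

MODEL = "bert-base-uncased"  # assumed checkpoint
K = 3                        # assumed number of candidate tokens to show

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = TFBertForMaskedLM.from_pretrained(MODEL)

text = "Then I picked up a [MASK] from the table."
inputs = tokenizer(text, return_tensors="tf")

# Locate the [MASK] token in the tokenized input.
mask_index = int(tf.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0, 0])

# Logits over the whole vocabulary at the [MASK] position.
logits = model(**inputs).logits[0, mask_index]

# Keep the k highest-scoring vocabulary entries and print each completion.
top_k = tf.math.top_k(logits, k=K)
for token_id in top_k.indices.numpy():
    print(text.replace(tokenizer.mask_token, tokenizer.decode([int(token_id)])))
```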

These diagrams can give us some insight into what BERT has learned to pay attention to when trying to make sense of language. Below is the attention diagram for Layer 3, Head 10 when processing the sentence “Then I picked up a [MASK] from the table.”

Layer 3, Head 10

image

Lighter colors represent higher attention weight and darker colors represent lower attention weight. In this case, this attention head appears to have learned a very clear pattern: each word is paying attention to the word that immediately follows it. The word “then”, for example, is represented by the second row of the diagram, and in that row the brightest cell is the cell corresponding to the “i” column, suggesting that the word “then” is attending strongly to the word “i”. The same holds true for the other tokens in the sentence.
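Under the hood, each diagram is the head's attention matrix drawn as a grid of cells, one row and one column per token, with each weight mapped to a grayscale shade between black (weight 0) and white (weight 1). A rough sketch of that idea, assuming the output_attentions=True output from Hugging Face Transformers and Pillow for drawing (an illustration, not necessarily the repository's exact code):

```python
from PIL import Image, ImageDraw
from transformers import AutoTokenizer, TFBertForMaskedLM

CELL = 40  # pixel size of one attention cell (arbitrary choice)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("Then I picked up a [MASK] from the table.", return_tensors="tf")
outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len). Layer/head numbers in the text are
# 1-indexed, so Layer 3, Head 10 lives at indices [2] and [9].
weights = outputs.attentions[2][0, 9]
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].numpy().tolist())

def shade(score):
    # Map an attention weight in [0, 1] to a grayscale color:
    # lighter means more attention.
    value = round(float(score) * 255)
    return (value, value, value)

n = len(tokens)  # the diagrams above also label rows and columns with these tokens
img = Image.new("RGB", (n * CELL, n * CELL), "black")
draw = ImageDraw.Draw(img)
for i in range(n):      # row i: the token doing the attending
    for j in range(n):  # column j: the token being attended to
        draw.rectangle(
            (j * CELL, i * CELL, (j + 1) * CELL, (i + 1) * CELL),
            fill=shade(weights[i][j]),
        )
img.save("layer3_head10.png")
```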

I was curious to know if BERT pays attention to the role of adverbs. I gave the model a sentence like “The turtle moved slowly across the [MASK].” and then looked at the resulting attention heads to see if the language model seems to notice that “slowly” is an adverb modifying the word “moved”. Looking at the resulting attention diagrams, one that caught my eye was Layer 4, Head 11.

Layer 4, Head 11

image

This attention head is definitely noisier: it’s not immediately obvious exactly what it is doing. But notice that, for the adverb “slowly”, it attends most strongly to the verb it modifies: “moved”. The same is true if we swap the order of the verb and the adverb.

image

And it even appears to be true for a sentence where the adverb and the verb it modifies aren’t directly next to each other.

image

Layer 8, Head 5

This head shows a diagonal pattern in which each token pays attention to the token that immediately precedes it in the input sequence.
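One illustrative way to check this “previous token” pattern numerically (a rough sketch, not the repository's code) is to test, for each row of the head's attention matrix, whether the highest-weighted column is the index just before it:

```python
import numpy as np

def follows_previous_token(weights):
    # weights[i][j] = how much token i attends to token j, shape (seq_len, seq_len).
    weights = np.asarray(weights)
    best = weights.argmax(axis=-1)           # most-attended column in each row
    expected = np.arange(len(weights)) - 1   # index of the previous token
    # Fraction of rows whose strongest attention is the preceding token
    # (row 0 is skipped because it has no predecessor).
    return (best[1:] == expected[1:]).mean()
```

A value near 1.0 means the diagonal pattern holds for nearly every token in the sentence.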

Example Sentences:

  • I threw a small rock and it fell in the [MASK].
image
  • I was walking with my dog [MASK] it started barking.
image

Layer 9, Head 11

This head focuses primarily on the [SEP] token, with the pronoun "it" paying attention to the object it refers to, i.e. "rock" in the first sentence and "dog" in the second.

Example Sentences:

  • I threw a small rock and it fell in the [MASK].
image
  • I was walking with my dog [MASK] it started barking.
image
