AI to predict a masked word in a text sequence using the Google BERT Masked Language Model from Hugging Face Transformers
-
AI to predict a masked word in a text sequence using the Google Bidirectional Encoder Representations from Transformers (BERT) Masked Language Model from Hugging Face, and to generate diagrams visualizing the attention scores for each of the model's 144 self-attention heads for a given sentence.
-
Built with Python, TensorFlow, Transformers (Hugging Face), Pillow (PIL), BERT, and more
-
Used TensorFlow to get the top k predicted tokens from the vocabulary logits for the mask token in the input sequence (see the sketch below)
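A minimal sketch of how that top-k lookup could work, assuming the bert-base-uncased checkpoint and k = 3 (both illustrative choices, not necessarily the ones used in this project):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFBertForMaskedLM

MODEL = "bert-base-uncased"  # assumed checkpoint
K = 3                        # illustrative number of predictions

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = TFBertForMaskedLM.from_pretrained(MODEL)

text = "Then I picked up a [MASK] from the table."
inputs = tokenizer(text, return_tensors="tf")

# Locate the position of the [MASK] token in the input sequence
mask_index = tf.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0][0]

# Forward pass; logits has shape (batch, sequence_length, vocab_size)
outputs = model(**inputs)
mask_logits = outputs.logits[0, mask_index]

# Take the k highest-scoring vocabulary ids and decode them back to words
top_k = tf.math.top_k(mask_logits, k=K)
for token_id in top_k.indices.numpy():
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token_id])))
```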

These diagrams can give us some insight into what BERT has learned to pay attention to when trying to make sense of language. Below is the attention diagram for Layer 3, Head 10 when processing the sentence “Then I picked up a [MASK] from the table.”

Lighter colors represent higher attention weight and darker colors represent lower attention weight. In this case, this attention head appears to have learned a very clear pattern: each word is paying attention to the word that immediately follows it. The word “then”, for example, is represented by the second row of the diagram, and in that row the brightest cell is the cell corresponding to the “i” column, suggesting that the word “then” is attending strongly to the word “i”. The same holds true for the other tokens in the sentence.
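As a rough illustration of how such a diagram could be drawn, the sketch below maps each attention weight to a grayscale cell with Pillow. The cell size and function names are my own assumptions, and the row/column token labels present in the actual diagrams are omitted for brevity.

```python
from PIL import Image, ImageDraw

GRID_SIZE = 40  # assumed pixel size of each cell


def attention_color(score):
    """Map an attention weight in [0, 1] to a grayscale color:
    lighter for higher attention, darker for lower attention."""
    value = round(score * 255)
    return (value, value, value)


def draw_attention_diagram(tokens, weights, filename):
    """Draw one head's attention matrix as a grid of shaded cells.

    `tokens` is the list of input tokens and `weights[i][j]` is how much
    token i attends to token j (one row per attending token)."""
    size = GRID_SIZE * len(tokens)
    image = Image.new("RGB", (size, size), "black")
    draw = ImageDraw.Draw(image)

    for i in range(len(tokens)):
        for j in range(len(tokens)):
            x, y = j * GRID_SIZE, i * GRID_SIZE
            draw.rectangle(
                (x, y, x + GRID_SIZE, y + GRID_SIZE),
                fill=attention_color(float(weights[i][j])),
            )

    image.save(filename)
```

When the model is called with `output_attentions=True`, the weights for a given layer and head can be taken from `outputs.attentions[layer][0][head]`, a square matrix with one row and column per input token.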
I was curious to know whether BERT pays attention to the role of adverbs. I gave the model a sentence like “The turtle moved slowly across the [MASK].” and then examined the attention heads to see whether the model notices that “slowly” is an adverb modifying “moved”. Looking at the resulting attention diagrams, one that caught my eye was Layer 4, Head 11.

This attention head is definitely noisier: it’s not immediately obvious exactly what it is doing. But notice that, for the adverb “slowly”, it attends most strongly to the verb it modifies: “moved”. The same is true if we swap the order of the verb and the adverb.

And it even appears to be true for a sentence where the adverb and the verb it modifies aren’t directly next to each other.

This head shows a diagonal pattern in which each token pays attention to the token that immediately precedes it in the input sequence.
Example Sentences:
- I threw a small rock and it fell in the [MASK].

- I was walking with my dog [MASK] it started barking.

This head focuses primarily on the [SEP] token, with the pronoun "it" paying attention to the object it refers to, i.e. "rock" in the first sentence and "dog" in the second.
Example Sentences:
- I threw a small rock and it fell in the [MASK].

- I was walking with my dog [MASK] it started barking.
