Introduction to BERT

NICD

2022-01-25

Content

  • Part 1: BERT Basics and Applications
  • Part 2: BERT Architecture
  • Part 3: Pre-Training Tasks
  • Part 4: BERT Variants

Part 1: BERT Basics and Applications

Introduction

Bidirectional Encoder Representations from Transformers (BERT) is one of the most important natural language processing models to date.

It can be used for a variety of tasks including:

  • token classification (e.g. named entity recognition or question answering)
  • text classification (e.g. sentiment analysis)
  • text-pair classification (e.g. sentence similarity)

What BERT cannot do?

BERT is not suitable for the following tasks:

  • text generation
  • machine translation
  • text summarisation

These are tasks which require decoder or encoder-decoder architectures. BERT is an encoder-only architecture.

Why is BERT useful?

All downstream natural language processing tasks benefit from improved word embeddings

BERT-style architectures produce state-of-the-art word embeddings since they can be trained on large volumes of data and yield contextual word representations.

BERT Illustrated

At a high level, BERT takes the following structure:

Part 2: BERT Architecture

Self-Attention

Self-attention is the primary mechanism used to enhance word embeddings with context. To understand it, we will explore a more complex sentence.

Bert or Ernie?

This sentence includes the word “he”. Question: Does “he” refer to Bert or Ernie? The answer should inform how the word embedding for “he” is enhanced.

Weighted-Average

Producing an enhanced word embedding for “he” with self-attention involves taking a weighted average of the word embeddings. The weights indicate how important the context words are to the enhancement. For example, we would expect the word “Bert” to have a larger weight than “Ernie” in the previous example.

Enhanced Word Embedding

  • \(n\) is the number of words in the sentence
  • \(i\) is the position of the word in the sentence
  • \(\mathbf{x}_i\in\mathbb{R}^{768}\) is an embedding for the word in position \(i\)
  • \(w_i\) is a weight for the word in position \(i\)

Enhanced word embedding:

\[ \mathbf{x}=\sum_{i=1}^nw_i\mathbf{x}_i \]
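As a concrete illustration, here is a minimal Python sketch of this weighted average, with random vectors standing in for real word embeddings and hand-picked weights standing in for learned attention weights:

```python
import numpy as np

# Toy sketch: an enhanced word embedding as a weighted average of the
# sentence's word embeddings. Random vectors stand in for real embeddings.
n, dim = 5, 768
rng = np.random.default_rng(0)
X = rng.standard_normal((n, dim))             # rows are x_1, ..., x_n
w = np.array([0.05, 0.65, 0.05, 0.20, 0.05])  # weights sum to 1
x_enhanced = w @ X                            # sum_i w_i * x_i, shape (768,)
```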

Values, Keys and Queries

Before we go into any more detail we first have to introduce some matrices:

  • \(X\in\mathbb{R}^{n\times 768}\) rows are input word embeddings
  • \(V=X\Theta_V\in\mathbb{R}^{n\times d}\) rows are value word embeddings
  • \(K=X\Theta_K\in\mathbb{R}^{n\times d}\) rows are key word embeddings
  • \(Q=X\Theta_Q\in\mathbb{R}^{n\times d}\) rows are query word embeddings

where \(\Theta_V\), \(\Theta_K\) and \(\Theta_Q\) are \(768\times d\) parameter matrices. These are often referred to as projection matrices! Note: for now we assume \(d=768\).

Value Projection

Key Projection

Query Projection

Self-Attention

Self-Attention Equation

Let \(\mathbf{q}\) be a query embedding (row vector) for the word embedding \(\mathbf{x}\). The corresponding enhanced word embedding is:

\[ \text{softmax}\left(\frac{\mathbf{q}K^\top}{\sqrt{d}}\right)V \]
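A minimal Python sketch of this equation for a single word, with random toy values in place of BERT's learned parameters:

```python
import numpy as np

# Self-attention for one query: softmax(q K^T / sqrt(d)) V with toy values.
n, d = 5, 768
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))                       # input word embeddings
Theta_V, Theta_K, Theta_Q = (rng.standard_normal((d, d)) for _ in range(3))
V, K, Q = X @ Theta_V, X @ Theta_K, X @ Theta_Q       # value, key, query rows

def softmax(z):
    e = np.exp(z - z.max())                           # subtract max for stability
    return e / e.sum()

q = Q[0]                                              # query for the first word
weights = softmax(q @ K.T / np.sqrt(d))               # one weight per word, sums to 1
enhanced = weights @ V                                # weighted average of value rows
```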

Why \(\sqrt{d}\)?

The Square Root

If \(X=[X_1,\dots,X_d]\) and \(Y=[Y_1,\dots,Y_d]\) are constructed from independent random variables where \(\mathbb{E}[X_i]=\mathbb{E}[Y_i]=0\) and \(\mathbb{V}[X_i]=\mathbb{V}[Y_i]=1\) for all \(i\) then:

\[ \mathbb{V}[XY^\top]=\sum_{i=1}^d\mathbb{V}[X_iY_i]= \sum_{i=1}^d\mathbb{V}[X_i]\mathbb{V}[Y_i]=d \]

Consequently \(\mathbb{V}[XY^\top/\sqrt{d}]=1\). This is the trick used in the self-attention equation to stabilise the variance and gradients.
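A quick empirical check of this argument in Python (sample variances, so the results are only approximately \(d\) and \(1\)):

```python
import numpy as np

# Dot products of independent zero-mean, unit-variance vectors have variance d;
# dividing by sqrt(d) rescales the variance back to 1.
d, trials = 768, 10_000
rng = np.random.default_rng(0)
X = rng.standard_normal((trials, d))
Y = rng.standard_normal((trials, d))
dots = (X * Y).sum(axis=1)                 # one dot product per trial
print(dots.var())                          # approximately d = 768
print((dots / np.sqrt(d)).var())           # approximately 1
```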

Feed-Forward Neural Network

BERT has 12 layers and each layer is a combination of self-attention and a feed-forward neural network.

Feed-Forward Neural Network

The feed-forward neural networks have a single hidden layer of dimension 3,072 (four times the embedding dimension)
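A minimal Python sketch of one such feed-forward block, using the GELU non-linearity that BERT uses and omitting bias terms for brevity (the weights here are random placeholders):

```python
import numpy as np

def gelu(z):
    # tanh approximation of the GELU activation used in BERT
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((768, 3072)) * 0.02   # hidden-layer weights
W2 = rng.standard_normal((3072, 768)) * 0.02   # output-layer weights

x = rng.standard_normal(768)                   # one contextual word embedding
out = gelu(x @ W1) @ W2                        # 768 -> 3,072 -> 768
```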

Number of Parameters per Layer

  • The self-attention component consists of three projection matrices, each \(768\times768\), for a total of \(3\times768\times768=1,769,472\) parameters.
  • The feed-forward neural network consists of the hidden-layer weights (\(768\times3,072\)) and the output-layer weights (\(3,072\times768\)), for a total of \(2\times768\times3,072=4,718,592\) weights.
  • This means that the feed-forward neural network is roughly 2.7 times larger than the self-attention component (see the quick check below)!
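These counts are easy to verify in Python:

```python
# Per-layer weight counts quoted above (bias terms excluded).
attention = 3 * 768 * 768          # value, key and query projections
ffn = 768 * 3072 + 3072 * 768      # hidden-layer and output-layer weights
print(attention)                   # 1769472
print(ffn)                         # 4718592
print(round(ffn / attention, 2))   # 2.67
```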

Multi-Headed Attention

In our first example of self-attention we discussed how the word embedding for “he” could be enhanced with the word embedding for “Bert”, to which it refers. Self-attention that captures references is just one type of self-attention; another head might, for example, capture the adjectives that describe a word.

Multi-Headed Attention

  • In BERT the self-attention in each layer has 12 heads. That is, each layer has 12 self-attention components that capture 12 different aspects of attention (e.g., adjectives, references).
  • However, with 12 self-attention components we have a \(12\times768=9,216\) dimensional output rather than a \(768\) dimensional output. This gets reduced with another parameter matrix.

Multi-Headed Attention

Recombining Self-Attention

Number of Parameters

  • With 12 self-attention components in each layer we have increased the number of parameters by a factor of 12
  • Additionally we have added a recombination component with \(9,216\times768=7,077,888\) parameters
  • To help with this explosion of parameters BERT reduces the per-head embedding size to \(d=64\)
  • This means the 12 concatenated heads already form a \(12\times64=768\) dimensional output, so no recombination matrix is required (see the sketch below)!

Full-Size Attention Head

Reduced Attention Head
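A minimal Python sketch of this reduced setup: 12 heads, each with per-head dimension \(d=64\), whose outputs are concatenated back into a 768-dimensional embedding (random toy weights throughout):

```python
import numpy as np

n, model_dim, n_heads, d = 5, 768, 12, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, model_dim))                       # input embeddings

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(n_heads):
    # each head has its own 768 x 64 value, key and query projections
    Theta_V, Theta_K, Theta_Q = (rng.standard_normal((model_dim, d)) for _ in range(3))
    V, K, Q = X @ Theta_V, X @ Theta_K, X @ Theta_Q           # each n x 64
    heads.append(softmax(Q @ K.T / np.sqrt(d)) @ V)           # n x 64
output = np.concatenate(heads, axis=1)                        # n x (12 * 64) = n x 768
```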

Other Details

Other than the main architectural components that we have covered there are other details that we have not discussed:

  1. Layer Normalisation
  2. Residual Connections
  3. Dropout

These are located throughout the model to improve training.

Part 3: Pre-Training Tasks

BERT

  • Published in 2018, Bidirectional Encoder Representations from Transformers (BERT) is considered to be the first deeply bidirectional, unsupervised (contextual) language representation model pre-trained using only a plain text corpus (BookCorpus and Wikipedia)

  • BERT’s novelty lies in the way it was pre-trained:

    • masked language model (MLM) – randomly mask some of the tokens from the input and predict the original vocabulary id of the masked token
    • next sentence prediction (NSP) – predict whether or not two sentences followed each other in the original text

Illustration of MLM and NSP

Consider the following passage of text:

BERT is conceptually [MASK] and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks

  • MLM: predict the masked word (“simple”)
  • NSP: was sentence B found immediately after sentence A, or was it taken from somewhere else?
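Masked language modelling is easy to try interactively. As an illustrative sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available, a fill-mask pipeline predicts candidates for the masked word:

```python
from transformers import pipeline

# Predict candidates for the [MASK] token in the passage above.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("BERT is conceptually [MASK] and empirically powerful.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```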

Pre-trained representations

  • Pre-trained representations in language models can either be context-free or contextual

  • Context-free models, such as word2vec and GloVe, generate a single word embedding for each word in the vocabulary

  • In contrast, contextual models generate a representation of each word based on the other words in the sentence

  • Contextual representations can be either:

    • unidirectional – context is conditional upon preceding words
    • bidirectional – context is conditional on both preceding and following words

Masked Language Modelling

In BERT, 15% of the input tokens are selected for prediction; 80% of these (around 12% of all tokens) are replaced with the [MASK] token.
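The selection-and-masking procedure can be sketched in a few lines of Python; this is an illustrative toy implementation, not BERT's actual tokeniser or data pipeline:

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Select ~15% of tokens for prediction: 80% -> [MASK], 10% -> random, 10% unchanged."""
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < select_prob:
            targets.append(token)                 # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")           # most selected tokens are masked
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # some are swapped for a random token
            else:
                masked.append(token)              # some are left unchanged
        else:
            targets.append(None)                  # not selected for prediction
            masked.append(token)
    return masked, targets

tokens = "bert is conceptually simple and empirically powerful".split()
print(mask_tokens(tokens, vocab=tokens))
```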

Other Predictions

Full Example

Next Sentence Prediction

Training

In BERT, MLM and NSP are trained concurrently.

“The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood”
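A sketch of this combined loss, assuming a PyTorch model that produces MLM and NSP logits; the tensors below are random placeholders purely to make the snippet self-contained:

```python
import torch
import torch.nn.functional as F

mlm_logits = torch.randn(8, 30522)           # logits for 8 masked positions over the vocabulary
mlm_labels = torch.randint(0, 30522, (8,))   # true token ids of the masked positions
nsp_logits = torch.randn(4, 2)               # IsNext / NotNext logits for 4 sentence pairs
nsp_labels = torch.randint(0, 2, (4,))       # true IsNext / NotNext labels

mlm_loss = F.cross_entropy(mlm_logits, mlm_labels)   # mean over masked tokens
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)   # mean over sentence pairs
loss = mlm_loss + nsp_loss                           # the two tasks are trained concurrently
```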

Part 4: BERT Variants

BERT and Derivatives

BERT has a number of children models, including:

  • ALBERT, which performs n-gram MLM and uses Sentence Order Prediction instead of NSP
  • RoBERTa and XLNet, which remove NSP
  • ELECTRA, which performs MLM with plausible, generated tokens
  • DistilBERT, which uses knowledge distillation to reduce the size of BERT

BERT and Sentence Length

The self-attention equation involves calculating \(QK^\top\) which requires \(\mathcal{O}(n^2)\) computation and memory. This limits the lengths of sentences that can be contextually embedded with BERT. There are a number of ways to reduce these costs and they often exploit sparsity.
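To make the quadratic cost concrete, the following snippet prints the memory needed to store a single \(n\times n\) attention matrix for one head in 32-bit floats:

```python
# Memory for one n x n attention matrix (float32), in MiB.
for n in (512, 1024, 2048, 4096):
    print(n, n * n * 4 / 2**20)   # 1.0, 4.0, 16.0, 64.0 MiB
```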

BERT and Sentence Length

Models such as Longformer and BigBird are examples of BERT-based architectures that exploit sparsity to reduce computational and memory costs. Whereas BERT can process sequences up to 512 tokens long, these models can process sequences up to 4,096 tokens long.