Content
- Part 1: BERT Basics and Applications
- Part 2: BERT Architecture
- Part 3: Pre-Training Tasks
- Part 4: BERT Variants
NICD
2022-01-25
Bidirectional Encoder Representations from Transformers (BERT) is the most important natural language processing model to date
It can be used for a variety of tasks including:
BERT is not suitable for the following tasks:
These are tasks which require decoder or encoder-decoder architectures. BERT is an encoder-only architecture.
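To make the kind of task BERT is suited to concrete, here is a minimal sketch of masked-word prediction with a pre-trained BERT. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by these notes; the example sentence is invented.

```python
# A minimal sketch (assumptions: Hugging Face `transformers` is installed and
# can download the `bert-base-uncased` checkpoint).
from transformers import pipeline

# The "fill-mask" pipeline uses BERT's masked language modelling head directly.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT scores candidate tokens for the [MASK] position using both the left
# and right context.
for prediction in unmasker("Bert handed the cookie to Ernie because [MASK] was hungry."):
    print(prediction["token_str"], round(prediction["score"], 3))
```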
All downstream natural language processing tasks benefit from improved word embeddings
BERT-style architectures produce state-of-the-art word embeddings since they can be trained on large volumes of data and yield contextual word representations
At a high level, BERT has the following structure:
Self-attention is the primary mechanism used to enhance word embeddings with context. To understand it, we will explore a more complex sentence.
This sentence includes the word “he”. Question: Does “he” refer to Bert or Ernie? The answer should inform how the word embedding for “he” is enhanced.
Producing an enhanced word embedding for “he” with self-attention involves taking a weighted average of the word embeddings. The weights indicate how important the context words are to the enhancement. For example, we would expect the word “Bert” to have a larger weight than “Ernie” in the previous example.
Enhanced word embedding:
\[ \tilde{\mathbf{x}}=\sum_{i=1}^nw_i\mathbf{x}_i \]
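To make the weighted average concrete, here is a small numerical sketch. The words, embeddings and weights below are all invented for illustration; in BERT the weights come from the self-attention mechanism described next.

```python
# Toy weighted-average enhancement: all numbers are made up for illustration.
import numpy as np

np.random.seed(0)
d = 4                                  # tiny embedding dimension for readability
words = ["Bert", "handed", "Ernie", "he"]
X = np.random.randn(len(words), d)     # one embedding (row) per word

# Hypothetical weights for enhancing "he": "Bert" dominates because "he"
# refers to Bert.
w = np.array([0.7, 0.1, 0.1, 0.1])

enhanced_he = w @ X                    # sum_i w_i * x_i
print(enhanced_he)
```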
Before we go into any more detail, we first have to introduce some matrices. If \(X\) is the \(n\times d\) matrix whose rows are the word embeddings, the value, key and query matrices are
\[ V=X\Theta_V,\quad K=X\Theta_K,\quad Q=X\Theta_Q \]
where \(\Theta_V\), \(\Theta_K\) and \(\Theta_Q\) are \(d\times d\) parameter matrices. These are often referred to as projection matrices! Note: for now we assume \(d=768\).
Let \(\mathbf{q}\) be a query embedding (row vector) for the word embedding \(\mathbf{x}\). The corresponding enhanced word embedding is:
\[ \text{softmax}\left(\frac{\mathbf{q}K^\top}{\sqrt{d}}\right)V \]
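The following NumPy sketch evaluates this equation end to end for one word. The embeddings and projection matrices are random placeholders rather than trained BERT parameters, and the sentence length is arbitrary.

```python
# Scaled dot-product self-attention for a single query, following the
# equation above; all parameters are random placeholders.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
n, d = 5, 768                                   # n words, embedding dimension d
X = np.random.randn(n, d)                       # word embeddings, one row per word

Theta_V = np.random.randn(d, d) / np.sqrt(d)    # projection matrices
Theta_K = np.random.randn(d, d) / np.sqrt(d)
Theta_Q = np.random.randn(d, d) / np.sqrt(d)

V, K, Q = X @ Theta_V, X @ Theta_K, X @ Theta_Q

q = Q[3]                                        # query (row vector) for one word
weights = softmax(q @ K.T / np.sqrt(d))         # attention weights, sum to 1
enhanced = weights @ V                          # enhanced embedding for that word
print(weights.round(3), enhanced.shape)
```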
Why \(\sqrt{d}\)?
If \(X=[X_1,\dots,X_d]\) and \(Y=[Y_1,\dots,Y_d]\) are constructed from independent random variables where \(\mathbb{E}[X_i]=\mathbb{E}[Y_i]=0\) and \(\mathbb{V}[X_i]=\mathbb{V}[Y_i]=1\) for all \(i\) then:
\[ \mathbb{V}[XY^\top]=\sum_{i=1}^d\mathbb{V}[X_iY_i]= \sum_{i=1}^d\mathbb{V}[X_i]\mathbb{V}[Y_i]=d \]
Consequently \(\mathbb{V}[XY^\top/\sqrt{d}]=1\). This is the trick used in the self-attention equation to stabilise the variance and gradients.
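The variance argument is easy to check numerically; the sketch below simulates many random pairs of vectors and confirms that the scaled dot products have variance close to 1 (the number of trials is arbitrary).

```python
# Numerical check of the 1/sqrt(d) scaling argument.
import numpy as np

np.random.seed(0)
d, trials = 768, 10_000

X = np.random.randn(trials, d)      # entries with zero mean and unit variance
Y = np.random.randn(trials, d)

dots = (X * Y).sum(axis=1)          # one dot product per trial
print(dots.var())                   # approximately d = 768
print((dots / np.sqrt(d)).var())    # approximately 1 after scaling
```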
BERT has 12 layers and each layer is a combination of self-attention and a feed-forward neural network.
The feed-forward neural networks have a single hidden layer of dimension 3,072 (four times the embedding dimension)
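As a sketch of those dimensions, here is one feed-forward sub-layer with BERT-base sizes (768 → 3,072 → 768). The weights are random placeholders, and the non-linearity shown is the GELU activation used in BERT (via its common tanh approximation).

```python
# One feed-forward sub-layer with BERT-base dimensions; weights are random
# placeholders rather than trained parameters.
import numpy as np

d_model, d_hidden = 768, 3072

def gelu(z):
    # Tanh approximation of the GELU activation used in BERT.
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

np.random.seed(0)
W1 = np.random.randn(d_model, d_hidden) * 0.02
b1 = np.zeros(d_hidden)
W2 = np.random.randn(d_hidden, d_model) * 0.02
b2 = np.zeros(d_model)

x = np.random.randn(1, d_model)          # one token embedding entering the sub-layer
out = gelu(x @ W1 + b1) @ W2 + b2        # expand to 3,072 units, project back to 768
print(out.shape)
```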
In our first example of self-attention we discussed how the word embedding for “he” could be enhanced with the word embedding for “Bert”, to which it refers. Self-attention that captures references is just one type of self-attention; another self-attention mechanism might, for example, capture the relationship between adjectives and the nouns they describe.
Besides the main architectural components that we have covered, there are other details that we have not discussed. These are located throughout the model to improve training.
Published in 2018, Bidirectional Encoder Representations from Transformers (BERT) is considered to be the first deeply bidirectional, unsupervised (contextual) language representation model pre-trained using only a plain text corpus (BookCorpus and Wikipedia)
BERT’s novelty lies in the way it was pre-trained, using two tasks:
- Masked language modelling (MLM)
- Next sentence prediction (NSP)
Consider the following passage of text:
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks
Pre-trained representations in language models can either be context-free or contextual
Context-free models, such as word2vec and GloVe, generate a single word embedding representation for each word in the vocabulary
In contrast, contextual models generate a representation of each word based on the other words in the sentence
Contextual representations can be either unidirectional or bidirectional
15% of tokens are selected for prediction; of these, 80% are replaced with the [MASK] token (12% of all tokens), 10% with a random token and 10% are left unchanged.
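A toy sketch of this masking scheme is below; the sentence and the random choices are only illustrative, and the “random token” is drawn from the same toy sentence rather than from BERT’s full vocabulary.

```python
# Toy illustration of the 15% / 80-10-10 masking scheme described above.
import random

random.seed(0)
tokens = "bert is conceptually simple and empirically powerful".split()

masked = []
for token in tokens:
    if random.random() < 0.15:            # 15% of tokens are selected for prediction
        r = random.random()
        if r < 0.8:                       # 80% of those: replace with [MASK]
            masked.append("[MASK]")
        elif r < 0.9:                     # 10%: replace with a random token
            masked.append(random.choice(tokens))
        else:                             # 10%: keep the original token
            masked.append(token)
    else:
        masked.append(token)

print(" ".join(masked))
```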
In BERT, MLM and NSP are trained concurrently.
“The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood”
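The sketch below mirrors that quote with toy numbers: a cross-entropy averaged over the masked positions plus a cross-entropy for the binary next-sentence label. The probabilities and vocabulary size are invented purely for illustration.

```python
# Toy combined pre-training loss: mean MLM cross-entropy + NSP cross-entropy.
import numpy as np

def cross_entropy(probs, target_index):
    return -np.log(probs[target_index])

# Hypothetical model outputs for one training example (tiny 3-word vocabulary).
mlm_probs = [np.array([0.7, 0.2, 0.1]),   # distribution over the vocabulary
             np.array([0.1, 0.6, 0.3])]   # for each masked token
mlm_targets = [0, 1]                      # true token indices

nsp_probs = np.array([0.9, 0.1])          # [IsNext, NotNext]
nsp_target = 0

mlm_loss = np.mean([cross_entropy(p, t) for p, t in zip(mlm_probs, mlm_targets)])
nsp_loss = cross_entropy(nsp_probs, nsp_target)

total_loss = mlm_loss + nsp_loss          # sum of the two (mean) losses
print(total_loss)
```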
BERT has a number of child models, including:
The self-attention equation involves calculating \(QK^\top\) which requires \(\mathcal{O}(n^2)\) computation and memory. This limits the lengths of sentences that can be contextually embedded with BERT. There are a number of ways to reduce these costs and they often exploit sparsity.
Models such as Longformer and BigBird are BERT-based architectures that exploit combinations of sparse attention patterns to reduce computational and memory costs. Whereas BERT can process sentences that are 512 tokens long, Longformer and BigBird can process sentences that are 4,096 tokens long.
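A back-of-the-envelope sketch of why the quadratic cost bites: the dense \(n\times n\) attention matrix for a single head, stored in float32, grows from roughly 1 MB at 512 tokens to roughly 67 MB at 4,096 tokens, before multiplying by the number of heads and layers.

```python
# Size of a dense n x n attention matrix (float32, one head, one layer).
for n in (512, 4096):
    entries = n * n
    megabytes = entries * 4 / 1e6        # 4 bytes per float32 entry
    print(f"n = {n}: {entries:,} entries, ~{megabytes:.0f} MB")
```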