So, Let’s BERT!

Gal Hever
4 min read · Aug 22, 2020


Introduction

GPT is a fine-tunable pre-trained model based on the Transformer, but its language model was trained only in the forward direction. ELMo's language model was bidirectional, but it was built on LSTMs. BERT, proposed by Devlin et al. in 2018, combines the best of both: it is a Transformer-based model whose language model looks both forward and backward, so it uses Transformers instead of RNNs to process text while combining context from both directions.

What is BERT?

Bidirectional Encoder Representations from Transformers (BERT) is a self-supervised approach for pre-training a deep Transformer encoder. Compared to the previous models, BERT proposes a Transformer-based model whose language model looks both forward and backward, while also benefiting from the Transformer architecture. Each layer in BERT applies self-attention, passes its results through a feed-forward network, and then hands them off to the next encoder layer, learning a representation for each token.

So, How does it work?

Model’s input

Given a sequence of tokens X = (x1, x2, . . . , xn), BERT first wraps the input sentence with [CLS] and [SEP] tokens. The input representation of each token is then constructed by summing the corresponding token, segment, and position embeddings. Finally, an encoder is trained to produce a contextualized vector representation for each token: encoder(x1, x2, . . . , xn) = h1, h2, . . . , hn.

The [SEP] token marks where the next sentence starts for the NSP task. The [CLS] token is prepended to the pair of sequence A and sequence B that forms the input, and the target for [CLS] is whether sequence B indeed follows sequence A in the corpus. [CLS] stands for classification: it is meant to summarize the whole sequence, and the output of the last layer at this token is used as the input to the classification step.
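To make this concrete, here is a minimal sketch of the input construction, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary (neither is part of the original post):

```python
# A minimal sketch, assuming the Hugging Face `transformers` library is installed
# and the `bert-base-uncased` checkpoint can be downloaded.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair: sequence A and sequence B (the pairing used for NSP).
encoding = tokenizer("The cat sat on the mat.", "It fell asleep there.")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# -> ['[CLS]', ...tokens of sequence A..., '[SEP]', ...tokens of sequence B..., '[SEP]']

print(encoding["token_type_ids"])
# -> segment IDs: 0 for [CLS], sequence A and the first [SEP]; 1 for sequence B and its [SEP]
```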

Image source: Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 (2018).

At the end of this pre-processing step, the input representation is built from three embeddings, summed per token (see the sketch after this list):

  • Token Embedding — the embedding of the original sequence after wrapping it with [CLS] and [SEP].
  • Segment Embedding — indicates which sequence (A or B) the token belongs to.
  • Position Embedding — indicates the position of the token in the sequence.
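
Here is a rough PyTorch sketch of how these three embeddings could be summed per token. The sizes follow BERT BASE and the variable names are my own; the actual model also applies layer normalization and dropout on top of this sum.

```python
import torch
import torch.nn as nn

# Illustrative only: BERT BASE-like sizes (an assumption, not the reference code).
vocab_size, max_len, hidden = 30522, 512, 768

token_emb    = nn.Embedding(vocab_size, hidden)  # one vector per vocabulary id
segment_emb  = nn.Embedding(2, hidden)           # sequence A vs. sequence B
position_emb = nn.Embedding(max_len, hidden)     # learned position for each slot

input_ids      = torch.tensor([[101, 1996, 4937, 102]])        # toy ids: [CLS] ... [SEP]
token_type_ids = torch.zeros_like(input_ids)                   # everything belongs to sequence A
positions      = torch.arange(input_ids.size(1)).unsqueeze(0)  # 0, 1, 2, 3

# The per-token input representation is simply the element-wise sum of the three.
embeddings = token_emb(input_ids) + segment_emb(token_type_ids) + position_emb(positions)
print(embeddings.shape)  # torch.Size([1, 4, 768])
```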

Training the Model

BERT pre-trains the model parameters with two tasks: masked language modeling (MLM) and next sentence prediction (NSP).

MLM — In this task, 15% of the input tokens are selected at random. Of those, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are kept unchanged. The task is to predict the original tokens of the modified input from their context.
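
Below is a hedged sketch of that 80/10/10 corruption rule. The function and its names are mine, not BERT's reference code, and a real implementation would also skip special tokens such as [CLS] and [SEP].

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt a batch of token ids for MLM: select 15% of positions, then
    replace 80% of them with [MASK], 10% with a random token, keep 10%."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # positions we do not predict are ignored by the loss

    # 80% of the selected positions become [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # half of the remainder (i.e. 10% overall) become a random vocabulary token
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # the last 10% stay unchanged but are still predicted
    return input_ids, labels

# Example usage with toy ids (103 is [MASK] in the bert-base-uncased vocabulary):
ids = torch.tensor([[101, 1996, 4937, 2938, 102]])
corrupted, labels = mask_tokens(ids.clone(), mask_token_id=103, vocab_size=30522)
```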

NSP — In this task, the objective is to predict whether the second sequence actually follows the first one in the corpus or not.

The loss combines the two tasks, trained at the same time: masked LM (MLM) + next sentence prediction (NSP). The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed.
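
A tiny sketch of that learning-rate schedule. The total number of steps is an assumption (roughly the 1M pre-training steps used in the paper), and the function is mine, not the authors' code:

```python
def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr over the first warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(bert_lr(5_000))    # halfway through warmup -> 5e-05
print(bert_lr(10_000))   # peak                   -> 0.0001
print(bert_lr(505_000))  # halfway through decay  -> 5e-05
```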

Different ways to use BERT

There are two ways to use the model (see the sketch after this list):

  1. Feature-based approach — use the pre-trained model to produce contextualized word embeddings and feed these embeddings into an existing model.
  2. Fine-tuning approach — add a task-specific layer on top of the pre-trained model and fine-tune the whole network on the downstream task.
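
A short sketch of both approaches, assuming the Hugging Face transformers library (the post itself is library-agnostic):

```python
import torch
from transformers import BertModel, BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("So, let's BERT!", return_tensors="pt")

# 1. Feature-based approach: take contextualized embeddings from the (frozen)
#    encoder and feed them to your own downstream model.
encoder = BertModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    hidden_states = encoder(**inputs).last_hidden_state  # (batch, seq_len, 768)

# 2. Fine-tuning approach: put a task head on top (here a 2-way classifier whose
#    weights start untrained) and train the whole network on the downstream task.
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
logits = classifier(**inputs).logits  # (batch, 2)
```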

There are two sizes for the model:

  • BERT BASE — 12 encoder layers, hidden size 768, 12 attention heads (~110M parameters)
  • BERT LARGE — 24 encoder layers, hidden size 1024, 16 attention heads (~340M parameters)

Image source: Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 (2018).

Disadvantages of the Model

  • No relationship between masked words — the masked tokens are predicted independently, so a token that is masked is not available as context when predicting the other masked tokens.
  • MLM is not a real task — the [MASK] token never appears in real downstream data, so the pre-training objective does not match what the model sees at fine-tuning time.
  • Maximum sequence length is limited to 512 tokens — it cannot handle really long sequences (e.g., a full book).
  • Not an auto-regressive model — only 15% of the tokens in each batch are actually predicted, so the training signal is sparse and pre-training needs more steps.
  • Can't deal with spans — BERT achieves high performance on supervised datasets where individual tokens are masked, but tasks that require reasoning about relationships between spans of text, such as question answering, are a more challenging target.
  • Can't deal with sentence generation — BERT is not trained as a left-to-right language model, so it cannot generate text directly.

End Notes

Language-model pre-training has been found to be effective for solving many NLP tasks. In particular, the pre-trained BERT has recently gained huge traction and obtains new state-of-the-art results on eleven tasks. The tasks that BERT has been applied to are typically modeled as classification or sequence-labeling problems, with the exception of the SQuAD question-answering task (Rajpurkar et al., 2016), in which the objective is to find the starting point and ending point of an answer span.

References

Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 (2018). https://arxiv.org/abs/1810.04805
