From Transfer Learning for Natural Language Processing by Paul Azunre

In this article, we cover some representative deep transfer learning modeling architectures for NLP that rely on a recently popularized neural architecture – the transformer – for key functions.

This is arguably the most important architecture for natural language processing (NLP) today. Specifically, we look at modeling frameworks such as the generative pretrained transformer (GPT), bidirectional encoder representations from transformers (BERT) and multilingual BERT (mBERT). These methods employ neural networks with more parameters than most deep convolutional and recurrent neural network models. Despite the larger size, they’ve exploded in popularity because they scale comparatively more effectively on parallel computing architecture. This enables even larger and more sophisticated models to be developed in practice.

Until the arrival of the transformer, the dominant NLP models relied on recurrent and convolutional components. Additionally, the best-performing models for sequence modeling and transduction problems, such as machine translation, relied on an encoder-decoder architecture with an attention mechanism to detect which parts of the input influence each part of the output. The transformer aims to replace the recurrent and convolutional components entirely with attention.

The goal of this article is to provide you with a working understanding of this important class of models, and to help you develop a good sense of where some of its beneficial properties come from. The article also introduces an important library – aptly named transformers – that makes the analysis, training, and application of these types of models in NLP particularly user-friendly. Additionally, we use the tensor2tensor TensorFlow package to help visualize attention functionality. The presentation of each transformer-based model architecture – GPT, BERT and mBERT – is followed by representative code applying it to a relevant task.

GPT, which stands for the “Generative Pretrained Transformer”, is a transformer-based model which is trained with a causal modeling objective, i.e., to predict the next word in a sequence. Therefore, this model is particularly suited for text-generation. It was developed by the OpenAI organization. We show how to employ pretrained GPT weights for this purpose in this article with the transformers library.

BERT, which stands for “Bidirectional Encoder Representations from Transformers” is a transformer-based model. It was trained with the masked modeling objective, i.e., to “fill-in-the-blanks”. Additionally, it was trained with the next sentence prediction task, i.e., to determine whether a given sentence is a plausible sentence to follow after a target sentence. Although unsuited for text-generation, this model performs well on other general language tasks such as classification and question answering. Because we already explored classification at some length, we can use the question answering task to explore this model architecture in detail.

mBERT, which stands for “Multilingual BERT”, is effectively BERT pretrained on over one hundred languages simultaneously. Naturally, this model is particularly well-suited for cross-lingual transfer learning. We show how the multilingual pretrained weights checkpoint can facilitate creating BERT embeddings for languages that weren’t originally included in the multilingual training corpus. Both BERT and mBERT were created at Google.

We begin the article with a section that delves into the core transformer architecture, then review fundamental architectural components and visualize them in some detail with the tensor2tensor package. We follow that up with a section overviewing the GPT architecture, with text generation as a representative application of pretrained weights. A section on BERT follows, which we apply to the important question answering application as a representative example in a standalone section. The chapter concludes with an experiment showing the transfer of pretrained knowledge from mBERT pretrained weights to a BERT embedding for a new language. This new language wasn’t initially included in the multilingual corpus used to generate the pretrained mBERT weights. We use the Ghanaian language Twi as the illustrative language in this case. This application example also provides an opportunity to explore fine-tuning pretrained BERT weights on a new corpus further. Note that Twi is an example of a low resource language – one for which high quality training data is scarce, if available at all.

The Transformer

In this section, we look closer at the fundamental transformer architecture behind the neural model family covered by this article. This architecture was developed at Google and was motivated by the observation that the best performing translation models up to that point employed convolutional and recurrent components in conjunction with a mechanism called attention.

More specifically, such models employ an encoder-decoder architecture, where the encoder converts the input text into some intermediate numerical vector representation, typically called the context vector, and a decoder that converts this vector into output text. Attention allows for better performance in these models, by modeling dependencies between parts of the output and various parts of the input. Typically, attention had been coupled with recurrent components. Because such components are inherently sequential – the internal hidden state at any given position t depends on the hidden state at the previous position t-1 – parallelization of the processing of a long input sequence isn’t an option. Parallelization across such input sequences, on the other hand, quickly runs into GPU memory limitations.

The transformer discards recurrence and replaces all functionality with attention. More specifically, it uses a flavor of attention called self-attention. Self-attention is attention as previously described but applied to the same sequence as both input and output. This allows it to learn the dependencies between every part of the sequence and every other part of the same sequence. Figure 3 revisits and illustrates this idea in more detail; don’t worry if you can’t visualize that fully yet. These models are more parallelizable than the aforementioned recurrent models. At various points of this section, we’ll use the example sentence “He didn’t want to talk about cells on the cell phone, a subject he considered very boring” to study how various aspects of the architecture work.

Now that we understand the basics of the motivation behind this architecture, let’s take a look at a simplified “bird’s-eye-view” level representation of the various building blocks. These are shown in Figure 1.

Figure 1. A “bird’s-eye-view” high-level representation of the transformer architecture, showing stacked encoders, decoders, input/output embeddings and positional encodings.

We see from the figure that identical encoders are stacked on the left-hand, encoding side of the architecture. The number of stacked encoders is a tunable hyperparameter, with the original paper working with six. Similarly, on the right-hand, decoding side of the architecture, six identical decoders are stacked. We also see that both the input and output are converted into vectors using an embedding algorithm of choice. This could be a word embedding algorithm such as word2vec, or even a CNN applied to one-hot encoded character vectors. Additionally, we encode the sequential nature of the inputs and outputs using positional encodings. These allow us to discard recurrent components and maintain sequential awareness.

Each encoder can be roughly decomposed into a self-attention layer followed by a feed-forward neural network. This is illustrated in Figure 2.

Figure 2. Simplified decomposition of the encoder and decoder into self-attention, encoder-decoder attention and feed-forward neural networks.

As can be seen from the figure, each decoder can be similarly decomposed, with the addition of an encoder-decoder attention layer between the self-attention layer and the feed-forward neural network. Note that in the self-attention of the decoder, future tokens are “masked” when computing attention for a given token, so that each position can attend only to itself and to earlier positions. The decomposed decoder is also shown in Figure 2. Whereas self-attention learns the dependencies between every part of its input sequence and every other part of the same sequence, encoder-decoder attention learns similar dependencies between the inputs to the encoder and decoder. This is similar to the way attention was initially used in the sequence-to-sequence recurrent translation models.
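To make the masking idea concrete, here is a minimal sketch in plain Python. It is an illustration under our own assumptions rather than the library's implementation: the helper names `causal_mask` and `masked_softmax` are hypothetical, and the raw scores are made up.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; masked-out positions get weight 0."""
    kept = [s for s, keep in zip(scores, mask_row) if keep]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0
            for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
scores = [0.5, 1.0, -0.3, 2.0]  # made-up raw attention scores for position 1
weights = masked_softmax(scores, mask[1])
print(weights)  # positions 2 and 3 lie in the "future" and receive zero weight
```

Practical implementations typically achieve the same effect by adding a large negative number to the masked scores before the softmax, which is easier to vectorize.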

The self-attention layer in Figure 2 can further be refined into multi-head attention – a multi-dimensional analog of self-attention that leads to improved performance. We analyze self-attention in further detail in following subsections and build on the insights gained to cover multi-head attention subsequently. The bertviz package is used for visualization purposes to provide further insights.

An Introduction to the Transformers Library and Attention Visualization

Before we discuss in detail how the various components of multi-head attention work, let’s visualize it for the example sentence “He didn’t want to talk about cells on the cell phone, a subject he considered very boring.” This exercise also allows us to introduce the transformers Python library from Hugging Face. The first step towards doing this is to obtain required dependencies using the following commands.

 !pip install tensor2tensor
 !git clone

Note: The exclamation sign ! is only required when executing in a Jupyter environment, such as the Kaggle environment we recommend for these exercises. When executing via a terminal, it should be dropped.

The package tensor2tensor contains the original implementation of the transformers architecture by its original authors, together with some visualization utilities. The bertviz library is an extension of these visualization utilities to a large set of the models within the transformers library. The transformers library can be installed with

 !pip install transformers

Note that it’s already installed on new notebooks on Kaggle.

For our visualization purposes, we look at the self-attention of a BERT encoder. BERT is arguably the most popular flavor of transformer-based architecture, and all you need to note for now is that its encoder is essentially identical to the encoder of the original encoder-decoder architecture in Figure 1.

For any pretrained model that you want to load in the transformers library, you need to load a tokenizer as well as the model. We do this with the following commands.

 from transformers import BertTokenizer, BertModel #A
 model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True) #B
 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True) #C

# A transformers BERT tokenizer and model

# B load uncased BERT model, making sure to output attention

# C load uncased BERT tokenizer

You can tokenize our running example sentence, encode each token as its index in the vocabulary, and display the outcome using the following code.

 sentence = "He didnt want to talk about cells on the cell phone because he considered it boring"
 inputs = tokenizer.encode(sentence, return_tensors='tf', add_special_tokens=True) #A
 print(inputs)

# A changing return_tensors to “pt” returns PyTorch tensors

This yields the following output.

 tf.Tensor(
 [[  101  2002  2134  2102  2215  2000  2831  2055  4442  2006  1996  3526
    3042  2138  2002  2641  2009 11771   102]], shape=(1, 19), dtype=int32)

We could have easily returned a PyTorch tensor by setting return_tensors='pt'. To see which tokens these indices correspond to, we can execute the following code on the inputs variable.

 tokens = tokenizer.convert_ids_to_tokens(list(inputs[0])) #A
 print(tokens)

# A Extract sample of batch index 0 from inputs list of lists

This produces the following output.

 ['[CLS]', 'he', 'didn', '##t', 'want', 'to', 'talk', 'about', 'cells', 'on', 'the', 'cell', 'phone', 'because', 'he', 'considered', 'it', 'boring', '[SEP]']

We notice immediately that the “special tokens” we requested via the add_special_tokens argument when encoding the inputs variable refer to the ‘[CLS]’ and ‘[SEP]’ tokens in this case. The former indicates the beginning of a sentence/sequence, and the latter indicates the separation point between multiple sequences or the end of a sequence (as in this case). Note that these are BERT-dependent, and you should check the documentation of each new architecture you try for which special tokens it uses. The other thing we notice from this tokenization exercise is that the tokenization is sub-word: “didnt” was split into “didn” and “##t”, even without the apostrophe, which we deliberately omitted.
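To build some intuition for how a split such as “didn” plus “##t” can arise, the following is a toy sketch of greedy longest-match-first sub-word splitting, the general strategy WordPiece-style tokenizers apply to each word. The function `wordpiece_tokenize` and the tiny vocabulary are hypothetical and exist only for illustration; the real BERT vocabulary has roughly thirty thousand entries and the library's tokenizer handles many more details.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest sub-words found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no sub-word matched at this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"he", "didn", "##t", "want", "cell", "##s"}
print(wordpiece_tokenize("didnt", toy_vocab))
```

With this toy vocabulary the word “didnt” is not present as a whole, so the longest matching prefix “didn” is taken first and the remainder is matched as the continuation piece “##t”.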

Let us proceed to visualizing the self-attention layer of the BERT model we loaded. To achieve this, we define the following function.

 from bertviz.bertviz import head_view #A
 def show_head_view(model, tokenizer, sentence): #B
     input_ids = tokenizer.encode(sentence, return_tensors='pt', add_special_tokens=True) #C
     attention = model(input_ids)[-1] #D
     tokens = tokenizer.convert_ids_to_tokens(list(input_ids[0]))   
     head_view(attention, tokens) #E

# A bertviz attention head visualization method

# B Function for displaying the multiheaded attention

# C Be sure to use PyTorch with bertviz

# D Get attention layer

# E Call the internal bertviz method to display self-attention

Figure 3 shows the resulting self-attention visualization of the final BERT layer (index 11) for our running example sentence. You should play with the visualization and scroll through the visualizations of the various words for the various layers. Note that not all the attention visualizations may be as easy to interpret as this example, and it may take some practice to build intuition for it.

Figure 3. Self-attention visualization in the final encoding layer of the pretrained uncased BERT model for our running example sentence. It reveals that “cells” is associated with “it” and “boring”. Note that this is a multi-head view, with the shadings in each single column representing each head.

That was it! Now that we have a sense of what self-attention does, having visualized it in Figure 3, let us get a bit into the mathematical details of how it works. We start with self-attention in the next subsection, and then extend our knowledge to the full multi-headed context afterwards.


Consider again the running example sentence “He didn’t want to talk about cells on the cell phone, a subject he considered very boring.” Suppose we wanted to figure out which noun the adjective “boring” was describing. Being able to answer a question like this is an important ability a machine needs to have in order to understand context. We know it refers to “it”, which refers to “cells”, naturally. This was confirmed by our visualization in Figure 3. A machine needs to be first taught this sense of context. Self-attention is the method which accomplishes this in the transformer. As every token in the input is processed, self-attention looks at all other tokens to detect possible dependencies.

How does self-attention work to accomplish this? We visualize the essential ideas of this in Figure 4. In the figure, we compute the self-attention weight for the word “boring”. Before delving into further detail, please observe that once the various q, k and v vectors for the various words are obtained, they can be processed independently.

Figure 4. A visualization of the calculation of the self-attention weight of the word “boring” in our running example sentence. Observe that the computations of these weights for different words can be carried out independently once key, value and query vectors have been created. This is the root of the increased parallelizability of transformers over recurrent models. The attention coefficients are what is visualized as intensity of shading in any given column of the multi-head attention in Figure 3.

Each word is associated with a query vector q, a key vector k and a value vector v. These are obtained by multiplying the input embedding vectors by three matrices that are learned during training. These matrices are fixed across all input tokens. As shown in the figure, the query vector for the current word “boring” is used in a dot product with each word’s key vector. The results are scaled by a fixed constant – they are divided by the square root of the dimension of the key vectors – and fed to a softmax. The output vector yields the attention coefficients indicating the strength of the relationship between the current token “boring” and every other token in the sequence. These coefficients are then used to weight the value vectors, and the weighted sum is the output of the self-attention layer for the current token. Observe that the entries of the coefficient vector indicate the strength of the shadings in any given single column of the multi-head attention we visualized in Figure 3. We duplicate Figure 3 next for your convenience, and you can inspect the variability in shadings between the various lines.

Figure 3. (Duplicated) Self-attention visualization in the final encoding layer of the pretrained uncased BERT model for our running example sentence. It reveals that “cells” is associated with “it” and “boring”. Note that this is a multi-head view, with the shadings in each single column representing each head.
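The per-token calculation illustrated in Figure 4 can be sketched in a few lines of plain Python. This is a minimal sketch under our own assumptions: the function names are hypothetical, and the two-dimensional query, key and value vectors are made up rather than produced by learned matrices.

```python
import math

def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Attention coefficients and output vector for a single query."""
    d_k = len(query)
    scores = [dot(query, k) / math.sqrt(d_k) for k in keys]
    coeffs = softmax(scores)
    dim = len(values[0])
    output = [sum(c * v[i] for c, v in zip(coeffs, values)) for i in range(dim)]
    return coeffs, output

# Toy vectors for a three-token sequence (values are made up).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
query = [1.0, 1.0]  # query vector of the "current" token
coeffs, output = attend(query, keys, values)
print(coeffs, output)
```

Here the third key has the largest dot product with the query, so the third token receives the largest attention coefficient and its value vector dominates the output.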

We are now in a good position to understand why transformers are more parallelizable than recurrent models. Recall from our presentation that the computations of self-attention weights for different words can be carried out independently, once the key, value and query vectors have been created. This means that for long input sequences, one can parallelize these computations. Recall that recurrent models are inherently sequential – the internal hidden state at any given position t depends on the hidden state at the previous position t-1. This means that parallelization of the processing of a long input sequence isn’t possible in recurrent models – the steps have to be executed one after the other. Parallelization across such input sequences, on the other hand, quickly runs into GPU memory limitations. An additional advantage of transformers over recurrent models is the increased interpretability afforded by attention visualizations, such as the one in Figure 3.

Note that the computation of the weight for every token in the sequence can be carried out independently, although some dependence between computations exists through the key and value vectors. This means that we can vectorize the overall computation using matrices, as shown in Figure 5. The matrices Q, K and V in that figure are made up of the query, key and value vectors stacked together.

Figure 5. Vectorized self-attention calculation for the whole input sequence using matrices
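As a rough illustration of the vectorized form, the sketch below computes softmax(QKᵀ / √d_k)V for a whole sequence at once with NumPy. The dimensions, the random stand-ins for the learned projection matrices, and the function name `self_attention` are all assumptions made for this example.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for all tokens of one sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # stack query/key/value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4               # toy sizes chosen for illustration
X = rng.normal(size=(seq_len, d_model))       # stand-in for input embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)
```

Each row of `weights` holds the attention coefficients of one token over the whole sequence, so every row sums to one.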

Now, what exactly is the deal with multi-head attention? Because we presented self-attention, we’re at a good point to address that. We’ve implicitly been presenting multi-head attention as a generalization of self-attention from a single column, in the sense of the shadings in Figure 3, to multiple columns. Let us think about what we were doing when we looked for the noun which “boring” refers to. Technically, we were looking for a noun-adjective relationship. Assume we had one self-attention mechanism that tracked that kind of relationship. Now, what if we also needed to track subject-verb relationships? What about all other possible relationship kinds? Multi-head attention addresses that by providing multiple representation dimensions, not only one.
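A minimal NumPy sketch of this idea follows. Each head has its own query, key and value projections and can therefore track a different kind of relationship; the head outputs are concatenated and passed through a final output projection, as in the original transformer design. The function name, the toy sizes and the random matrices are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    # Concatenate the per-head outputs and mix them with an output projection.
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2        # toy sizes chosen for illustration
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))    # stand-in for input embeddings
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, heads, W_o)
print(out.shape)
```

The output has the same shape as the input, which is what allows identical encoder layers to be stacked on top of one another.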

Residual Connections, Encoder-Decoder Attention and Positional Encoding

The transformer is a complex architecture and there are various other features which we don’t cover in as much detail as self-attention. We feel that mastery of these details isn’t critical for you to begin applying transformers to your own problems. Therefore, we only briefly summarize them here and encourage you to delve into the original source material to deepen your knowledge over time as you gain more experience and intuition.

As a first such feature, we note that the simplified encoder representation in Figure 2 doesn’t show an additional residual connection between each self-attention layer in the encoder and the normalization layer that follows it. This is illustrated in Figure 6.

Figure 6. A more detailed and accurate (than Figure 2) breakdown of each transformer encoder, now incorporating residual connections and normalization layers.

As shown in the figure, each feed-forward layer has a residual connection and a normalization layer after it. Analogous statements are true for the decoder. These residual connections allow gradients to skip the nonlinear activation functions within the layers, alleviating the problem of vanishing and/or exploding gradients. Normalization ensures that the scale of the input features to all layers is roughly the same.
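The “add and normalize” pattern can be sketched as follows. This is a simplified illustration: the learned gain and bias parameters that layer normalization usually carries are omitted, and `toy_sublayer` is a hypothetical stand-in for a self-attention or feed-forward sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Apply the sublayer, add its input back (residual), then normalize."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    # Hypothetical stand-in for self-attention or the feed-forward network.
    return [2.0 * v + 1.0 for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

Because the input is added back before normalization, the sublayer only needs to learn a correction to its input, and gradients have a direct path around it.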

On the decoder side, recall from Figure 2 the existence of the encoder-decoder attention layer which we haven’t yet addressed. We duplicate Figure 2 next and highlight this layer for your reading convenience.

Figure 2. (Duplicated, Encoder-Decoder Attention highlighted) Simplified decomposition of the encoder and decoder into self-attention, encoder-decoder attention and feed-forward neural networks.

It works analogously to the self-attention layer as described. The important distinction is that the input vectors to each decoder that represent keys and values come from the top of the encoder stack, and the query vectors come from the layer immediately below it. If you go through Figure 4 again with this updated information in mind, you should see that the effect of this change is to compute attention between every output token and every input token, rather than between all tokens in the input sequence as was the case with the self-attention layer. We duplicate Figure 4 next, adjusted slightly for the encoder-decoder attention case, so that you can convince yourself of this.

Figure 4. (Duplicated, slightly adjusted for encoder-decoder attention calculation) A visualization of the calculation of the encoder-decoder attention weight between the word “boring” in our running example sentence and the output at position n. Observe that the computations of these weights for different words can be carried out independently once key, value and query vectors have been created. This is the root of the increased parallelizability of transformers over recurrent models.

On both encoder and decoder sides, recall from Figure 1 the existence of the positional encoding, which we address finally. Because we’re dealing with sequences, it’s important to model and retain the relative positions of each token in each sequence. Our description of the transformer operation thus far has not touched on “positional encoding” and has been agnostic to the order in which the tokens are consumed by each self-attention layer. The positional encodings address this by adding to each token’s input embedding a vector of equal size which is a special function of the token’s position in the sequence. The authors used sine and cosine functions of the position, with a different frequency for each embedding dimension, to generate these positional encodings.
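A minimal sketch of the sinusoidal scheme from the original transformer paper is shown below, where even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow geometrically across dimensions. The function name and the toy embedding are assumptions for illustration.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding vector for one position in the sequence."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even dimension
        pe.append(math.cos(angle))  # odd dimension
    return pe[:d_model]

token_embedding = [0.1, -0.2, 0.3, 0.4]  # made-up embedding of one token
encoded = [e + p for e, p in zip(token_embedding, positional_encoding(3, 4))]
print(positional_encoding(0, 4))
print(encoded)
```

The encoding is simply added to the token embedding, so the same word at two different positions enters the first self-attention layer as two different vectors.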

This brings us to the end of the transformers architecture exposition. To make things concrete, we conclude this section by translating a couple of English sentences to a low resource language using a pretrained encoder-decoder model.

Application of Pretrained Encoder-Decoder to Translation

The goal of this subsection is to expose you to a large set of translation models available at your fingertips within the transformers library. Over one thousand pretrained models have recently been made available by the Language Technology Research Group at the University of Helsinki. At the time of writing of this article, these are the only available open-sourced models for many low-resource languages. They were trained on the JW300 corpus, which contains the only existing parallel translated datasets for many low-resource languages. We use the popular Ghanaian language Twi as an example here.

Unfortunately, JW300 is extremely biased data, being religious text translated by the Jehovah’s Witnesses organization. Nevertheless, our investigation revealed that the models are of decent quality as an initial baseline for further transfer learning and refinement. We don’t explicitly refine the baseline model on better data here, due to data collection challenges and lack of existing appropriate datasets.

Without further ado, let us load the pretrained English to Twi translation model and tokenizer using the following code.

 from transformers import MarianMTModel, MarianTokenizer
 model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-tw")
 tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-tw")

The MarianMTModel class is a port of encoder-decoder transformer architecture from the C++ library MarianNMT. Note that you can change the source and target languages by changing the language codes “en” and “tw” to representative codes, if made available by the research group. For instance, loading a French to English model changes the input configuration string to “Helsinki-NLP/opus-mt-fr-en”.

If we were chatting with a friend in Ghana online, and wanted to know how to write “My name is Paul” by way of introduction, we could use the following code to compute and display the translation.

 text = "My name is Paul" #A
 inputs = tokenizer.encode(text, return_tensors="pt") #B
 outputs = model.generate(inputs) #C
 decoded_output = [tokenizer.convert_ids_to_tokens(int(outputs[0][i])) for i in range(len(outputs[0]))] #D
 print("Translation:")
 print(decoded_output) #E

# A Input English Sentence to be translated

# B Encode to input token ids

# C Generate output token ids

# D Decode output token ids to actual output tokens

# E Display translation

The resulting output from running the code is shown next.

 ['<pad>', 'Me', 'din', 'de', 'Paul']

The first thing we immediately notice is the presence of a special token <pad> in the output that we haven’t seen before, as well as underscores before each word. The technical reason for this is that BERT uses a tokenizer called “WordPiece” and our encoder-decoder model here uses “SentencePiece”. Although we don’t get into the detailed differences between these tokenizer types here, we use this opportunity to warn you once again to review documentation about any new tokenizer you try.
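To turn such a token list back into a readable string, one can drop the special tokens and convert the word-boundary markers into spaces. The sketch below assumes that the underscores mentioned above are the SentencePiece word-boundary character “▁” (U+2581); the helper `detokenize` and the example token list are hypothetical illustrations, not library code.

```python
def detokenize(tokens, special_tokens=("<pad>", "</s>")):
    """Join SentencePiece-style tokens into a plain string."""
    kept = [t for t in tokens if t not in special_tokens]
    # "\u2581" marks the start of a new word in SentencePiece output.
    return "".join(kept).replace("\u2581", " ").strip()

# Assumed form of the model output, with word-boundary markers included.
tokens = ["<pad>", "\u2581Me", "\u2581din", "\u2581de", "\u2581Paul"]
print(detokenize(tokens))
```

In practice, the tokenizer's own decode method, called with skip_special_tokens=True on the generated ids, performs this cleanup for you.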

The translation “Me din de Paul” happens to be exactly right. Amazing! That wasn’t too hard, was it? Repeating the exercise for the input sentence “How are things?” yields the translation “Ɔkwan bɛn so na nneɛma te saa?” which back-translates literally into “In which way are things like this?” We can see that although the semantics of this translation appear close, the translation is wrong. The semantic similarity is a sign that this model is a good baseline which could be improved further via transfer learning, if good parallel English-Twi data were available. Moreover, rephrasing the input sentence to “How are you?” yields the correct translation “Wo ho te dɛn?” from this model. Overall, this is an encouraging result, and we hope that some readers are inspired to work to extend these baseline models to some excellent open source transformer models for some previously unaddressed low resource languages of choice.

That’s all for this article.