From Transfer Learning for Natural Language Processing by Paul Azunre

This article discusses getting started with baselines and generalized linear models.

Neural Network Models

Neural networks are the most important class of machine learning algorithms for handling perceptual problems such as computer vision and NLP. Thus, they are the most important class of models for the subject covered by this book.

In this post, we will train two representative pretrained neural network language models on the two illustrative example problems we have been baselining in this chapter. The two models we will consider here are

  • ELMo – Embeddings from Language Models, and
  • BERT – Bidirectional Encoder Representations from Transformers.

ELMo combines convolutional and recurrent (specifically LSTM) elements, while the appropriately named BERT is transformer-based. The simplest form of transfer learning fine-tuning will be employed here, where a single dense classification layer is trained on top of the corresponding pretrained embedding over our dataset of labels from the previous sections.

Embeddings from Language Models (ELMo)

The Embeddings from Language Models (ELMo) model, named after the popular Sesame Street character, was among the first models to demonstrate the effectiveness of transferring pretrained language model knowledge to general NLP tasks. The model was trained to predict the next word in a sequence of words, which can be done in an unsupervised manner on very large corpora, and showed that the weights obtained as a result could generalize to a variety of other NLP tasks. We will not discuss the architecture of this model in detail in this section – this will be done in the appropriate subsequent chapter. It will suffice to mention here that the model employs character-level convolutions to build up preliminary embeddings of each word token, followed by bidirectional LSTM layers that introduce the context of surrounding words into the final embeddings produced by the model.

Having briefly introduced ELMo, let’s proceed to training it for each of the two running example datasets. The ELMo model is available through TensorFlow Hub, which provides an easy platform for sharing TensorFlow models. We will use Keras with a TensorFlow backend to build our model. In order to make the TensorFlow Hub model usable by Keras, we will need to define a custom Keras layer that instantiates it in the right format. This is achieved by the function shown in Listing 1.

Listing 1. Function to instantiate Tensorflow Hub ELMo as a custom Keras layer.

 import tensorflow as tf   # A
 import tensorflow_hub as hub
 from keras import backend as K
 import keras.layers as layers
 from keras.models import Model, load_model
 from keras.engine import Layer
 import numpy as np
 sess = tf.Session()   # B
 class ElmoEmbeddingLayer(Layer):   # C
     def __init__(self, **kwargs):
         self.dimensions = 1024
         super(ElmoEmbeddingLayer, self).__init__(**kwargs)
     def build(self, input_shape):
         # Pass the TensorFlow Hub handle of the pretrained ELMo module as the first argument
         self.elmo = hub.Module('', trainable=self.trainable,
                                name="{}_module".format(self.name))   # D
         self.trainable_weights += tf.trainable_variables(
                scope="^{}_module/.*".format(self.name)) # E
         super(ElmoEmbeddingLayer, self).build(input_shape)
     def call(self, x, mask=None):
         # The 'default' output is the fixed-size (1024-dimensional) sentence embedding
         result = self.elmo(K.squeeze(K.cast(x, tf.string), axis=1),
                            as_dict=True, signature='default')['default']
         return result
     def compute_output_shape(self, input_shape): # F
         return (input_shape[0], self.dimensions)

# A Import required dependencies

# B Initialize session

# C Create a custom layer that allows us to update weights

# D Download pretrained ELMo model from Tensorflow Hub

# E Extract trainable parameters – these are just 4 weights in the weighted average of ELMo model layers, see tf hub link above for more details

# F Specify shape of output

Assume the availability of a data variable raw_data – a list containing a concatenated string of word tokens per email. We can use the code in Listing 2 to build and train the Keras ELMo TensorFlow hub model.

Listing 2. Function and calling script to build ELMo TensorFlow hub model for Keras using the custom layer defined in Listing 1.

 def build_model():
   input_text = layers.Input(shape=(1,), dtype="string")
   embedding = ElmoEmbeddingLayer()(input_text)
   dense = layers.Dense(256, activation='relu')(embedding) # A
   pred = layers.Dense(1, activation='sigmoid')(dense) # B
   model = Model(inputs=[input_text], outputs=pred)
   model.compile(loss='binary_crossentropy', optimizer='adam',
                                  metrics=['accuracy'])  # C
   model.summary() # D
   return model
 # Build and fit
 model = build_model()
 model.fit(train_x, train_y,     # E
           validation_data=(test_x, test_y),
           epochs=5, batch_size=32)

# A new layer outputting 256-dimensional feature vectors

# B Classification layer

# C loss, metric and optimizer choices

# D Show model architecture for inspection

# E Fit the model for 5 epochs

A few things should be noted here. First of all, notice that we have added an additional layer on top of the pretrained ELMo embedding, producing 256-dimensional feature vectors. We have also added a classification layer of output dimension 1. The activation function ‘sigmoid’ transforms its input into the interval between 0 and 1, and is essentially the logistic curve. Its output can be interpreted as the probability of the positive class, and when it exceeds some prespecified threshold (usually 0.5) the corresponding input to the network can be classified as belonging to that positive class.
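The thresholding behavior just described can be sketched in a few lines of plain Python. This is only an illustration: the function names `sigmoid` and `classify` and the sample input values are assumptions made for this example, not part of the listings.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 (positive class) when the sigmoid output exceeds the threshold."""
    return 1 if sigmoid(z) > threshold else 0

# Hypothetical pre-activation values, chosen only for illustration
for z in (-2.0, 0.0, 3.0):
    print(z, round(sigmoid(z), 4), classify(z))
```

Raising or lowering `threshold` trades false positives against false negatives, which is why 0.5 is a common default rather than a fixed rule.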

The model is fitted for 5 “major steps” or epochs over the whole dataset. The Keras code statement ‘model.summary()’ in Listing 2 prints the model details, and produces the following output:

 Layer (type)                 Output Shape              Param #  
 input_2 (InputLayer)         (None, 1)                 0         
 elmo_embedding_layer_2 (Elmo (None, 1024)              4        
 dense_3 (Dense)              (None, 256)               262400   
 dense_4 (Dense)              (None, 2)                 514      
 Total params: 262,918
 Trainable params: 262,918
 Non-trainable params: 0

We note, without delving into too much further detail as this will be addressed by Chapter 4 of the book, that most of the trainable parameters in this case (approximately 260 thousand of them) come from the layers we added on top of the custom ELMo model. In other words, this is our first instance of transfer learning: learning a pair of new layers on top of the pretrained model shared by ELMo’s creators. We also note that it is important to use a powerful GPU for most neural network experiments, and the value of the batch_size parameter, which specifies how much data is fed to the GPU at each step, can be extremely important to the speed of convergence. The best value will vary with the GPU being used, or the lack thereof. In practice, one can increase the value of this parameter until the speed of convergence of a typical problem instance no longer benefits from the increase, or until the GPU memory is no longer large enough for a single data batch to fit on it during an iteration of the algorithm, whichever happens first. Additionally, for the multi-GPU scenario, some evidence has been presented that the optimal scaling-up schedule of the batch size is linear in the number of GPUs[1].
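The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit. The sketch below uses a helper name, `dense_params`, that is an assumption for this example. It also shows that the 514 parameters printed for the last layer correspond to a two-unit output, whereas a single sigmoid unit, as written in `build_model()`, would contribute 257.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

elmo_mix = 4                          # weighted-average parameters reported for the ELMo layer
hidden = dense_params(1024, 256)      # 1024-dimensional ELMo embedding -> 256 units
two_unit_out = dense_params(256, 2)   # matches the 514 shown in the printed summary
one_unit_out = dense_params(256, 1)   # a single sigmoid unit, as in build_model()

print(hidden, two_unit_out, one_unit_out)
print(elmo_mix + hidden + two_unit_out)
```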

On a free NVIDIA Tesla K80 GPU via a Kaggle Kernel (see our companion GitHub repo[2] for Kaggle notebook links), we achieve the performance shown in Figure 1 on our email dataset for the first 5 epochs of a typical run. We found a batch_size of 32 to work well for us in that context.

Figure 1. Convergence of the validation and training accuracy scores for the first five epochs of training the ELMo model on the email classification example.

Each epoch takes approximately 10 seconds to complete – this information is printed by our code. We see that a validation accuracy of approximately 97.3% is attained at the 4th epoch, i.e., in under a minute. This performance is comparable to that of the logistic regression approach, which is only slightly better at 97.7% (see Chapter 2 of the book). We note that the behavior of the algorithm is stochastic, i.e., it behaves differently from run to run. Thus, your own convergence will vary somewhat, even on an architecture similar to the one we used. It is typical in practice to run the algorithm a few times and pick the best set of parameters among the varying results attained. Finally, we note that the divergence of training and validation accuracies, as indicated in the figure, is suggestive of the beginning of overfitting.

For the IMDB example, the ELMo model code yields the convergence output shown in Figure 2.

Figure 2. Convergence of the validation and training accuracy scores for the first five epochs of training the ELMo model on the IMDB movie review classification example.

Each epoch again takes approximately 10 seconds, and a validation accuracy of approximately 70% is achieved in under a minute at the 2nd epoch. We will see how to improve this performance at the end of this article (see also Table 2). Note that some evidence of overfitting can be observed at the 3rd and later epochs, as the training accuracy continues to improve, i.e., the fit to the data improves, while the validation accuracy remains lower.
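One simple way to read convergence curves like these is to locate the epoch with the best validation accuracy and watch the gap between training and validation accuracy grow. The following is a minimal sketch; the accuracy values are invented for illustration and are not the numbers behind the figures. With Keras, the per-epoch lists would typically come from the `history` attribute of the object returned by `model.fit`, under key names that depend on the Keras version.

```python
def best_epoch(val_acc):
    """Return the 1-based epoch with the highest validation accuracy."""
    return max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1

def accuracy_gaps(train_acc, val_acc):
    """Per-epoch difference between training and validation accuracy."""
    return [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

# Hypothetical per-epoch accuracies, invented for illustration only
train_acc = [0.62, 0.71, 0.78, 0.84, 0.89]
val_acc = [0.64, 0.70, 0.69, 0.68, 0.68]

print("best epoch:", best_epoch(val_acc))
print("train minus validation:", accuracy_gaps(train_acc, val_acc))
```

A gap that keeps widening after the best epoch is the pattern the text describes as the onset of overfitting.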

Bidirectional Encoder Representations from Transformers (BERT)

The Bidirectional Encoder Representations from Transformers (BERT) model was also named after a popular Sesame Street character, as a nod to the trend started by ELMo. At the time of writing this book, its variants achieve some of the best performance in transferring pretrained language model knowledge to downstream NLP tasks. The model was similarly trained to predict words in a sequence of words, although the exact masking procedure is somewhat different. This training can also be done in an unsupervised manner on very large corpora, and the resulting weights similarly generalize to a variety of other NLP tasks. Arguably, familiarity with BERT is indispensable to anyone learning transfer learning in NLP.

Just as we did with ELMo, we will again not discuss the architecture of this deep learning model in complete detail in this section – this will be done in an appropriate chapter of the book. It will suffice to mention here that the model builds preliminary embeddings of subword (WordPiece) tokens, followed by transformer-based encoders with self-attention layers that provide the model with the context of surrounding words. The transformer functionally replaces the bidirectional LSTMs employed by ELMo. Recalling from the previous chapter that transformers have some advantages over LSTMs with respect to training scalability, we see some of the motivation behind this model. Again, we will use Keras with a TensorFlow backend to build our model.

Having briefly introduced BERT, let’s proceed to training it for each of the two running example datasets. The BERT model is also available through TensorFlow Hub. In order to make the hub model usable by Keras, we similarly define a custom Keras layer that instantiates it in the right format, as shown in Listing 3.

Listing 3. Function to instantiate Tensorflow Hub BERT as a custom Keras layer.

 import tensorflow as tf
 import tensorflow_hub as hub
 from bert.tokenization import FullTokenizer
 from tensorflow.keras import backend as K
 # Initialize session
 sess = tf.Session()
 class BertLayer(tf.keras.layers.Layer):
     def __init__(
         self,
         n_fine_tune_layers=10, # A
         pooling="mean", # B
         bert_path="", # C
         **kwargs
     ):
         self.n_fine_tune_layers = n_fine_tune_layers
         self.trainable = True
         self.output_size = 768 # D
         self.pooling = pooling
         self.bert_path = bert_path
         super(BertLayer, self).__init__(**kwargs)
     def build(self, input_shape):
         self.bert = hub.Module(
             self.bert_path, trainable=self.trainable, name=f"{self.name}_module"
         )
         trainable_vars = self.bert.variables # E
         if self.pooling == "first":
             trainable_vars = [var for var in trainable_vars if "/cls/" not in var.name]
             trainable_layers = ["pooler/dense"]
         elif self.pooling == "mean":
             trainable_vars = [
                 var
                 for var in trainable_vars
                 if "/cls/" not in var.name and "/pooler/" not in var.name
             ]
             trainable_layers = []
         else:
             raise NameError("Undefined pooling type")
         for i in range(self.n_fine_tune_layers): # F
             trainable_layers.append(f"encoder/layer_{str(11 - i)}")
         trainable_vars = [
             var
             for var in trainable_vars
             if any([l in var.name for l in trainable_layers])
         ]
         for var in trainable_vars: # G
             self._trainable_weights.append(var)
         for var in self.bert.variables:
             if var not in self._trainable_weights:
                 self._non_trainable_weights.append(var)
         super(BertLayer, self).build(input_shape)
     def call(self, inputs):
         inputs = [K.cast(x, dtype="int32") for x in inputs]
         input_ids, input_mask, segment_ids = inputs
         bert_inputs = dict(
             input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids # H
         )
         if self.pooling == "first":
             pooled = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
                 "pooled_output"
             ]
         elif self.pooling == "mean":
             result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
                 "sequence_output"
             ]
             mul_mask = lambda x, m: x * tf.expand_dims(m, axis=-1) # I
             masked_reduce_mean = lambda x, m: tf.reduce_sum(mul_mask(x, m), axis=1) / (
                     tf.reduce_sum(m, axis=1, keepdims=True) + 1e-10)
             input_mask = tf.cast(input_mask, tf.float32)
             pooled = masked_reduce_mean(result, input_mask)
         else:
             raise NameError("Undefined pooling type")
         return pooled
     def compute_output_shape(self, input_shape):
         return (input_shape[0], self.output_size)

# A Default number of top layers to unfreeze for training

# B Choice of pooling type ("first" or "mean") used to reduce the token outputs to a single vector

# C Pretrained model to use; the 768-dimensional output and the 12 encoder layers assumed by this layer correspond to the base uncased version of the model

# D BERT embedding dimension, i.e., size of resulting output semantic vectors

# E Remove unused layers

# F Enforce number of unfrozen layers to fine-tune

# G Trainable weights

# H Inputs to BERT take a very specific triplet form, we will show how to generate it in the next Listing

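# I Apply the input mask so that padding positions do not contribute to the mean-pooled embedding (this is distinct from the “masking” of words that BERT attempts to predict as its learning target)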
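The "mean" pooling branch of Listing 3 averages only the token vectors that the input mask marks as real. A plain-Python sketch of the same computation on a toy example is given below; the function name `masked_mean` and the toy vectors are assumptions made for illustration, and the listing itself performs this with TensorFlow ops over whole batches.

```python
def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, m in zip(token_vectors, mask):
        for j in range(dim):
            total[j] += vec[j] * m          # padded positions are multiplied by 0
    count = sum(mask) + 1e-10               # small constant avoids division by zero
    return [t / count for t in total]

# Toy sequence: 2 real tokens followed by 2 padding positions, embedding size 3
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0, 0]
pooled = masked_mean(tokens, mask)
print([round(p, 4) for p in pooled])
```

Without the mask, the padding vectors would be averaged in and distort the sentence representation.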

Unlike ELMo, we need to convert the input list of strings into 3 arrays – input ids, input masks and segment ids – prior to feeding them to the BERT model. The code for doing this is shown in Listing 4. Having converted the data into the right format, we use the remaining code in the same Listing 4 to build and train the Keras BERT Tensorflow hub model.

Listing 4. Code for converting data to form expected by BERT hub model, additionally for building and training it.

 def build_model(max_seq_length): # A
     in_id = tf.keras.layers.Input(shape=(max_seq_length,), name="input_ids")
     in_mask = tf.keras.layers.Input(shape=(max_seq_length,), name="input_masks")
     in_segment = tf.keras.layers.Input(shape=(max_seq_length,), name="segment_ids")
     bert_inputs = [in_id, in_mask, in_segment]
     bert_output = BertLayer(n_fine_tune_layers=0)(bert_inputs) # B
     dense = tf.keras.layers.Dense(256, activation="relu")(bert_output)
     pred = tf.keras.layers.Dense(1, activation="sigmoid")(dense)
     model = tf.keras.models.Model(inputs=bert_inputs, outputs=pred)
     model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
     return model
 def initialize_vars(sess): # C
     sess.run(tf.local_variables_initializer())
     sess.run(tf.global_variables_initializer())
     sess.run(tf.tables_initializer())
     K.set_session(sess)
 bert_path = ""
 tokenizer = create_tokenizer_from_hub_module(bert_path) # D
 train_examples = convert_text_to_examples(train_x, train_y) # E
 test_examples = convert_text_to_examples(test_x, test_y)
 # Convert to features
 (train_input_ids, train_input_masks, train_segment_ids, train_labels) = \
     convert_examples_to_features(tokenizer, train_examples, max_seq_length=maxtokens) # F
 (test_input_ids, test_input_masks, test_segment_ids, test_labels) = \
     convert_examples_to_features(tokenizer, test_examples, max_seq_length=maxtokens)
 model = build_model(maxtokens) # G
 initialize_vars(sess) # H
 history = model.fit([train_input_ids, train_input_masks, train_segment_ids], # I
                     train_labels,
                     validation_data=([test_input_ids, test_input_masks,
                                       test_segment_ids], test_labels),
                     epochs=5, batch_size=32)

# A Function for building model

# B We do not retrain any BERT layers, but rather use the pretrained model as an embedding and retrain some new layers on top of it

# C Vanilla tensorflow initialization calls

# D Create compatible tokenizer using function in BERT source  repo

# E Convert data to “InputExample” format using function in BERT source  repo

# F Convert  InputExample format into triplet final BERT input format, using function in BERT source  repo

# G Build the model

# H Instantiate variables

# I Train model
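To make the triplet input format more concrete, the sketch below builds the three arrays for a single sentence by hand. It is an illustration under stated assumptions: the function name `to_bert_inputs` and all token ids are hypothetical, and the actual conversion functions in the BERT source repo also handle tokenization, labels and sentence pairs.

```python
def to_bert_inputs(token_ids, max_seq_length, cls_id, sep_id, pad_id=0):
    """Build (input_ids, input_mask, segment_ids) for a single-sentence example."""
    # Reserve two positions for the special start and separator markers
    token_ids = token_ids[: max_seq_length - 2]
    input_ids = [cls_id] + token_ids + [sep_id]
    input_mask = [1] * len(input_ids)       # 1 marks a real token
    segment_ids = [0] * len(input_ids)      # a single segment, so all zeros
    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding
    input_mask += [0] * padding             # 0 marks a padding position
    segment_ids += [0] * padding
    return input_ids, input_mask, segment_ids

# Hypothetical ids; a real tokenizer would look these up in the BERT vocabulary
ids, mask, segments = to_bert_inputs([7, 8, 9], max_seq_length=8, cls_id=1, sep_id=2)
print(ids)
print(mask)
print(segments)
```

All three arrays have length `max_seq_length`, which is why the Keras inputs in `build_model` share the same shape.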

Similarly to the ELMo model we built in the previous subsection, we put a pair of layers on top of the pretrained model and only train those, which amounts to about 200 thousand parameters. With hyperparameters set at comparable values, validation accuracies of approximately 77% and 71% were achieved for the email and movie review classification problems, respectively (within 5 epochs).

Optimizing Performance

Taking a look at the performance results of the various algorithms from the previous sections, we might be tempted to draw conclusions right away about which algorithm is the best-performing for each problem we looked at.

We must remember that we only know this to be true for sure at the hyperparameter settings at which we initially evaluated the algorithms, i.e., Nsamp = 1000, maxtokens = 50, maxtokenlen = 20, in addition to any algorithm-specific default parameter values. These are the number of samples per class in training data, maximum number of tokens per sample and maximum token length, respectively. In order to be able to make general statements, we need to explore the space of hyperparameters more thoroughly, by evaluating the performance of all algorithms at many hyperparameter settings, a process typically referred to as hyperparameter tuning or optimization. It may be that the best performance found through this process for each algorithm changes their performance ranking, and in general this helps us achieve better accuracies on our problems of interest.
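As a rough illustration of what the two token-related hyperparameters control, the sketch below caps the number of tokens per sample and the length of each token. The function name `preprocess`, the whitespace tokenization and the choice to truncate (rather than drop) long tokens are assumptions made for this example, not necessarily the preprocessing used in the book.

```python
def preprocess(text, maxtokens=50, maxtokenlen=20):
    """Keep at most maxtokens tokens, each cut to at most maxtokenlen characters."""
    tokens = text.split()[:maxtokens]
    return [tok[:maxtokenlen] for tok in tokens]

sample = "transferlearning makes pretrained language models reusable"
print(preprocess(sample, maxtokens=3, maxtokenlen=8))
```

Larger values of either parameter let more of each document through to the classifier, at the cost of larger inputs.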

Manual Hyperparameter Tuning

Hyperparameter tuning is often initially performed in a manual way driven by intuition. We describe such an approach here, for the hyperparameters Nsamp, maxtokens and maxtokenlen which are general across all the algorithms we considered.

Let’s first assume that the initial amount of data trained with, i.e., with Nsamp=1000, is all the data we have. We hypothesize that if we increase the number of tokens in the data for each document, i.e., maxtokens, and increase the maximum length of any such token, i.e., maxtokenlen, we can increase the amount of signal for making the classification decision and thereby the resulting accuracy.

For the email classification problem, we first increase both of these, from values of 50 and 20 respectively, to 100 each. Accuracy results for logistic regression (LR), support vector machines (SVMs), random forests (RFs), gradient boosting machines (GBMs), ELMo and BERT under these settings are shown in the second data row of Table 1. We then increase maxtokens further to 200 to yield the results in the third data row of Table 1.

We see based on this that, although the SVM is clearly the worst-performing classifier for this problem, logistic regression, ELMo, and BERT can achieve nearly perfect performance. Note that ELMo is the clear winner in the presence of more signal, something we would have missed without the optimization step. Even so, the simplicity and speed of logistic regression would likely result in it being picked as the classifier of choice for this email classification problem.

Table 1. Comparison of algorithm accuracies at different general hyperparameter settings explored during the manual tuning process for the email classification example.

General Hyperparameter Settings

Nsamp = 1000, maxtokens = 50, maxtokenlen = 20

Nsamp = 1000, maxtokens = 100, maxtokenlen = 100

Nsamp = 1000, maxtokens = 200, maxtokenlen = 100

We now repeat a similar sequence of hyperparameter testing steps for the IMDB movie review classification problem, i.e., we first increase maxtokens and maxtokenlen to 100 each, and then increase maxtokens further to 200. The resulting algorithm performances are listed in Table 2, along with the performances at the initial hyperparameter settings.

BERT proves to be the best model for this problem across the board, followed by ELMo and logistic regression. Observe that this problem has more headroom for improvement, consistent with our earlier observation that this is a harder problem than email classification. This leads us to hypothesize that pretrained knowledge transfer is more impactful for harder problems, which makes intuitive sense. This is also consistent with the general advice that neural network models are likely to be preferable to other approaches when significant labelled data is available, assuming the problem to be solved is complex enough for the additional data to be needed in the first place.

Table 2. Comparison of algorithm accuracies at different general hyperparameter settings explored during the manual tuning process for the IMDB movie review classification example.

General Hyperparameter Settings

Nsamp = 1000, maxtokens = 50, maxtokenlen = 20

Nsamp = 1000, maxtokens = 100, maxtokenlen = 100

Nsamp = 1000, maxtokens = 200, maxtokenlen = 100

Systematic Hyperparameter Tuning

A number of tools exist for more systematic and exhaustive hyperparameter searches over ranges of hyperparameters. These include the scikit-learn class GridSearchCV, which performs an exhaustive search over a specified parameter grid, and the Python library HyperOpt, which samples parameter ranges using random search or the Tree-structured Parzen Estimator algorithm. Here, we present code for using GridSearchCV to tune an algorithm of choice as an illustrative example. Note that we tune only some internal algorithm-specific hyperparameters in this exercise, with the general ones we tuned in the last subsection fixed, for simplicity of illustration.

We pick email classification with RF at the initial general hyperparameter settings as our illustrative example. The reason for this choice is that it takes about a second for each fit of this algorithm on this problem, and because the grid search performs a lot of fits, this is an example that can be executed quickly for the most learning value for the reader.

 We first import the required method and check which RF hyperparameters are available for tuning:

 from sklearn.model_selection import GridSearchCV # A
 print("Available hyper-parameters for systematic tuning available with RF:")
 print(clf.get_params()) # B

# A GridSearchCV scikit-learn import statement

# B clf is the RF classifier from Listing 2.13 of the book

This yields the output:

 {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 10, 'n_jobs': 1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}

We pick three of these hyperparameters to search over and specify three values for each of them:

 param_grid = {
     'min_samples_leaf': [1, 2, 3],
     'min_samples_split': [2, 6, 10],
     'n_estimators': [10, 100, 1000]
 }

We then carry out the Grid Search, using the following code, making sure to print out final test accuracy and best hyperparameter values:

 from sklearn.metrics import accuracy_score
 grid_search = GridSearchCV(estimator = clf, param_grid = param_grid,
                            cv = 3, n_jobs = -1, verbose = 2) # A
 grid_search.fit(train_x, train_y) # B
 print("Best parameters found:") # C
 print(grid_search.best_params_)
 print("Estimated accuracy is:")
 acc_score = accuracy_score(test_y, grid_search.best_estimator_.predict(test_x))
 print(acc_score)

# A Define grid search object with specified hyperparameter grid

# B Fit the grid search to the data

# C Display results

This experiment required evaluating the classifier at 3*3*3=27 grid points, because each of the three hyperparameter grids has three requested points on it; with cv = 3, each point is fitted once per cross-validation fold. The overall experiment took under five minutes to complete, and yielded an accuracy of 95.7%. This is an improvement of more than one percentage point over the original score of 94.5%. The raw output from the code is shown below, specifying the best hyperparameter values:

 Best parameters found:
 {'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 1000}
 Estimated accuracy is:

Indeed, when we performed the tuning across the board on all classifiers, we found that we could boost the performance of each by 1-2%, without affecting the conclusions about the best classifier for each problem that were reached in the previous subsection.

That’s all for this article. If you want to learn more about the book, you can check it out on our browser-based liveBook reader here.


[1] P. Goyal et al., “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, 2018