From Deep Learning for Natural Language Processing by Stephan Raaijmakers

This article introduces you to working with BERT.

Take 40% off Deep Learning for Natural Language Processing by entering fccraaijmakers into the discount code box at checkout.

The financial costs of pretraining BERT and related models like XLNET from scratch on large amounts of data can be prohibitive. The original BERT paper (Devlin, 2018) mentions that:

  • “[The] training of BERT-Large was performed on 16 Cloud TPUs (64 TPU chips total) [with several pretraining phases]. Each pretraining [phase] took 4 days to complete.”

If the authors used Google Cloud TPUs (specialized processors optimized for Tensorflow computations, Tensorflow being Google's native deep learning formalism), with the price per hour of such a TPU currently ranging from US$4.50 to US$8.00, this amounts to a total pretraining price of US$6,912 to US$12,288:

  • 16 TPU devices * 96 hours * US$4.50 to US$8.00 per hour.
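The arithmetic behind this estimate is simple enough to check in a few lines of Python, using only the figures quoted above:

```python
# Reproduce the back-of-the-envelope pretraining cost estimate.
tpu_devices = 16                      # Cloud TPU devices used for BERT-Large
hours = 4 * 24                        # each pretraining phase took 4 days
price_low, price_high = 4.50, 8.00    # current US$ price per TPU hour

cost_low = tpu_devices * hours * price_low
cost_high = tpu_devices * hours * price_high
print(cost_low, cost_high)  # 6912.0 12288.0
```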

For XLNet, with its more complex Permutation Language Modeling approach, unofficial estimates amount to a rather steep $61,440.

Likewise, the cost of training the Transformer behind the GPT-3 language model, which was trained on billions of words, has been estimated at a whopping $4.6 million, and such models are only available through commercial licenses.

Luckily, smaller pretrained BERT or XLNET models are becoming increasingly available for free, and they may well serve as stepping stones for fine-tuning; overviews of pretrained Transformer models for a variety of languages can be found online.

This means that, in practice, you start by downloading a pre-trained BERT or XLNET model, incorporate it into your network, and fine-tune it with much more manageable, smaller datasets. In this article, we’ll see how that works. First, let’s start with incorporating existing BERT models into our own models. For this to work, we need a dedicated BERT layer: a landing hub for BERT models.

A BERT layer.

In deep learning networks, BERT layers, like any other embedding layers, are usually positioned right on top of an input layer:

Figure 1. The position of a BERT layer in a deep learning network.

They serve a purpose similar to any other embedding layer: they encode words in the input layer to embedding vectors. In order to be able to work with BERT, we need two things:

  • A pre-trained BERT model
  • A facility for importing such a model and exposing it to the rest of our code.

Google has made available to the general audience a valuable platform for obtaining pre-trained BERT models: Tensorflow Hub. This is a platform for downloading not only BERT models and the like, but, in general, functional parts of pre-constructed deep learning networks, which, in Google’s idiom, are sub-graphs of Tensorflow graphs. Recall that Tensorflow is Google’s native deep learning formalism; as of 2020, Keras uses Tensorflow exclusively ‘under the hood’, as a back-end (it has given up on the Theano backend).

This means we can download from Tensorflow Hub both models and other useful code. First, let’s take a look at Tensorflow Hub. When you visit the BERT collection on Tensorflow Hub, you find a large list of all kinds of BERT models. One set of models is based on the official implementation of (Devlin, 2018); although it has been superseded by other implementations in the course of time, we use one of those models.

In order to make use of all that, let’s first define a special-purpose layer for attaching downloaded models to. We can define our own layers in Keras as classes. Such class definitions need only three obligatory methods: __init__(), build() and call().
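Before we do, the contract between these three methods can be illustrated with a plain-Python mock (this is not a working Keras layer, just a sketch of the lifecycle): configuration happens in __init__(), weights are created lazily in build(), and build() is triggered automatically the first time the layer is applied to data.

```python
class MockLayer:
    """Plain-Python sketch of the Keras custom-layer lifecycle (no Tensorflow)."""

    def __init__(self, output_size):
        # Configuration only; no weights are created yet.
        self.output_size = output_size
        self.built = False

    def build(self, input_shape):
        # Weight creation is deferred until the input shape is known.
        self.weights = [[0.0] * self.output_size for _ in range(input_shape[-1])]
        self.built = True

    def call(self, inputs):
        # A real layer would compute its output from the weights here.
        return inputs

    def __call__(self, inputs):
        # Keras triggers build() automatically on first use, then call().
        if not self.built:
            self.build((len(inputs),))
        return self.call(inputs)

layer = MockLayer(output_size=4)
out = layer([1, 2, 3])  # build() runs here first, then call()
```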

You want to use existing BERT models in your applications. How do you attach such models to your code? You decide to implement a dedicated Keras layer for harboring BERT models.  

— Scenario: Working with BERT.

Listing 1 shows how to implement such a Keras BERT layer. Calling this layer entails downloading a BERT model and, if desired, optimizing it for fine-tuning. The fine-tuning consists of specifying the number of BERT attention layers that undergo fine-tuning. This is a crucial parameter that determines both the time complexity of the operation and the quality of the final model: fine-tuning more layers generally leads to higher quality, but it comes with a time trade-off.

Starting with version 2.3 (2019), Keras stopped supporting Theano as a backend and now uses Tensorflow exclusively. It’s advisable to use the Tensorflow-embedded version of Keras. The BERT libraries we use in this article depend on Tensorflow 2.0 and above. You can read up on this in Keras’ Tensorflow documentation.

Listing 1: A dedicated Keras layer for BERT models[1].

 import tensorflow as tf
 import tensorflow_hub as hub
 from tensorflow.keras import backend as K

 class BertLayer(tf.keras.layers.Layer):
     def __init__(
         self,
         n_fine_tune_layers=10, ②
         bert_path="https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1", ③
         **kwargs
     ): ①
         self.n_fine_tune_layers = n_fine_tune_layers
         self.trainable = True ④
         self.output_size = 768 ⑤
         self.bert_path = bert_path
         super(BertLayer, self).__init__(**kwargs)

     def build(self, input_shape): ⑥
         self.bert = hub.Module(
             self.bert_path, trainable=self.trainable, name="{}_module".format(self.name)
         )
         trainable_vars = self.bert.variables
         trainable_vars = [var for var in trainable_vars if not "/cls/" in var.name] ⑦
         trainable_vars = trainable_vars[-self.n_fine_tune_layers:] ⑧
         for var in trainable_vars:
             self._trainable_weights.append(var) ⑨
         for var in self.bert.variables:
             if var not in self._trainable_weights:
                 self._non_trainable_weights.append(var) ⑩
         super(BertLayer, self).build(input_shape)

     def call(self, inputs): ⑪
         inputs = [K.cast(x, dtype="int32") for x in inputs] ⑫
         input_ids, input_mask, segment_ids = inputs ⑬
         bert_inputs = dict(
             input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
         ) ⑭
         result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
             "sequence_output"
         ] ⑮
         return result

     def compute_output_shape(self, input_shape): ⑯
         return (input_shape[0], self.output_size)

The first obligatory method to implement when defining a Python class: the initialization of the class object.

Setting the number of layers that need to be fine-tuned should we choose to fine-tune an imported BERT model.

The path to a downloadable (or, if you like, locally installed) BERT model. This particular url leads to an uncased (lowercase) ready-made BERT model, with twelve hidden layers, and a standard output dimension of 768 (see Chapter 9).

We switch the ‘trainable’ flag to True, meaning that the standard setting is to fine-tune the imported BERT model.

Setting the output size (again, standard 768 for BERT).

The build method does administrative work on the weights of the layer. It is called automatically the first time the layer is called.

The trainable variables refer to layers in the BERT model we’re using. Because the BERT models are complex with many layers, we can opt to limit training (fine-tuning) to a subset of these layers. The BERT model has been loaded into self.bert; it has a folder structure, where paths that lack the string “/cls/” lead to trainable layers (for idiosyncratic reasons).

The subset of layers to be fine-tuned is created.

The trainable layer variables are appended to the (initially empty) list of trainable weights.

Likewise, the untrainable variables (the remaining variables) are stored.

The call method defines what happens if we call the layer, i.e. apply it as a function to input data.

The input data is cast to 32-bit integers, to avoid unexpected numerical values.

The inputs are decomposed into token IDs, an input mask (defining which tokens the model should attend to), and a list of segment IDs. See Listing 7 for details.

These ingredients are stored as key/value pairs in a dictionary.

The result is defined as the application of a self.bert function (referring back to the Tensorflow Hub model) to the input.

An obligatory function that computes the shapes of input and output.

As mentioned, the trade-off we make between trainable and untrainable variables allows us to directly balance time complexity against quality. Fine-tuning fewer attention layers leads to lower-quality embeddings, but it also guarantees shorter training times. It’s up to the developer to balance this trade-off. Before we dive into fine-tuning an existing BERT model, let’s take a quick look at the alternative: building your own BERT model and training it on your own data from scratch.

Training BERT on your own data

Given enough resources (data, GPU quota, and patience), it’s possible to bootstrap a BERT model from your own data.

You choose to build a BERT model entirely from scratch. Maybe you want to have full control over the data you base your BERT model on. This is entirely possible, at the cost of lots of GPU cycles if your dataset is large. 

— Scenario: Training BERT on your own data.

Several Python libraries based on Keras currently allow us to work smoothly with BERT, like fast-bert and keras-bert. keras-bert offers simple methods for directly creating a proprietary BERT model, and we use it in our examples below. Needless to say, deploying BERT starts with data. Our data comes in the form of paired sentences, as BERT is also trained on the next sentence prediction task.

Let’s assume we process a set of documents beforehand into a newline-separated list of sentences, like this fragment of Edgar Allan Poe’s The Cask of Amontillado:

 He had a weak point — this Fortunato — although in other regards he was a man to be respected and even feared. He prided himself on his connoisseurship in wine. Few Italians have the true virtuoso spirit. For the most part their enthusiasm is adopted to suit the time and opportunity—to practice imposture upon the British and Austrian millionaires.

Using a simple sentence splitter gets you to this point. Optionally, one can split not only on closing punctuation like full stops, but also on commas, dashes, etc.; this creates additional (pseudo-)sentences like

  • He had a weak point
  • this Fortunato
  • although in other regards he was a man to be respected and even feared

and leads to more sentence pairs.
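Such a splitter can be sketched with Python's standard re module. The exact punctuation set to split on is a design choice; the set below is an assumption, not the book's splitter:

```python
import re

def split_pseudo_sentences(text):
    # Split on closing punctuation as well as commas, semicolons, and
    # dashes, then drop empty fragments and surrounding whitespace.
    fragments = re.split(r"[.!?,;]|—|--", text)
    return [f.strip() for f in fragments if f.strip()]

text = ("He had a weak point — this Fortunato — although in other regards "
        "he was a man to be respected and even feared.")
print(split_pseudo_sentences(text))
# ['He had a weak point', 'this Fortunato',
#  'although in other regards he was a man to be respected and even feared']
```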

The following function turns such a list into the paired data we need for training a BERT model:

Listing 2: Processing input data for BERT.

 from keras_bert import get_base_dict

 def readSentencePairs(fn): ①
     with open(fn) as f:
         lines = f.readlines() ②
     pairs = zip(lines, lines[1:]) ③
     paired_sentences = [[a.rstrip().split(), b.rstrip().split()] for (a, b) in pairs] ④
     tokenD = get_base_dict() ⑤
     for pair in paired_sentences:
         for token in pair[0] + pair[1]:
             if token not in tokenD:
                 tokenD[token] = len(tokenD) ⑥
     tokenL = list(tokenD.keys()) ⑦
     return (paired_sentences, tokenD, tokenL) ⑧

We invoke the function with a filename.

All lines in the files are read into one list.

From this list, pairs are created with the Python built-in zip().

All sentence pairs in this list are split into words, and newlines are removed.

keras_bert has a base dictionary containing a few special symbols, like [UNK], [CLS], and [SEP]. This dictionary is expanded with the words in our data.

For every pair of sentences, words in those sentences which aren’t already in the dictionary are added with a new index number.

All tokens in the dictionary are gathered in a list.

The paired sentences, the token dictionary and the token list are returned.

This produces, for our example, a nested list of paired sentences, split into words, like

     [[['He', 'had', 'a', 'weak', 'point', '—', 'this', 'Fortunato', ...],
       ['He', 'prided', 'himself', 'on', 'his', 'connoisseurship', 'in', 'wine']],
      ...]
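On a toy input, the pairing and dictionary-building logic is easy to verify. The stub get_base_dict() below only imitates keras_bert's real function (special tokens get the first indices); in practice you would import the real one:

```python
def get_base_dict():
    # Stub imitating keras_bert.get_base_dict(); the exact contents
    # shown here are an assumption for illustration purposes.
    return {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4}

lines = ["he had a weak point", "he prided himself on his wine"]
pairs = list(zip(lines, lines[1:]))               # consecutive-sentence pairs
paired = [[a.split(), b.split()] for a, b in pairs]

tokenD = get_base_dict()
for a, b in paired:
    for token in a + b:
        if token not in tokenD:
            tokenD[token] = len(tokenD)           # fresh index per new word

print(tokenD['he'], tokenD['wine'])  # 5 14
```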

Next, leaning on the pre-cooked methods of keras_bert, we can build and train a BERT model quite swiftly. We start by defining a generator function that produces an iterable object with pointers to the next batch of data. Instead of generating all BERT data in one go (which can become prohibitive for large datasets), this generator creates an object for working effectively and memory-efficiently through large amounts of data.

Listing 3: Generating batch data for BERT.

 from keras_bert import gen_batch_inputs

 def BertGenerator(paired_sentences, tokenD, tokenL): ①
     while True: ②
         yield gen_batch_inputs( ③
             paired_sentences,
             tokenD,
             tokenL,
             mask_rate=0.3,
             swap_sentence_rate=0.5,
         )

The generator uses paired sentence data generated by readSentencePairs().

It enters a perpetual loop (ended by an external control facility that doesn’t bother us here).

Using the keras_bert routine gen_batch_inputs(), we specify the probability of masking out words (mask_rate) and a sentence-swapping parameter (swap_sentence_rate) that controls how often one sentence is presented as the continuation of the other, or vice versa; the model has to determine the right order.

Here’s how this generator is used.

Listing 4: Training a proprietary BERT model on data.

 from tensorflow import keras
 from keras_bert import get_model, compile_model

 def buildBertModel(paired_sentences, tokenD, tokenL, model_path): ①
     model = get_model( ②
         token_num=len(tokenD),
         head_num=5,
         transformer_num=12,
         embed_dim=25,
         feed_forward_dim=100,
         seq_len=20,
         pos_num=20,
         dropout_rate=0.05,
     )
     compile_model(model)
     model.fit( ③
         BertGenerator(paired_sentences, tokenD, tokenL),
         epochs=10,
         steps_per_epoch=100,
     )
     model.save(model_path) ④

 sentences = "./my-sentences.txt" ⑤
 (paired_sentences, tokenD, tokenL) = readSentencePairs(sentences)
 model_path = "./bert.model"
 buildBertModel(paired_sentences, tokenD, tokenL, model_path)

The model building function takes the paired sentence data and a model path as input parameters.

The keras_bert method get_model instantiates a model structure, with values for the number of attention heads per layer, the number of transformer layers, the embedding size, the size of the feed forward layers, the length of the token sequences, the corresponding number of positions (for positional encoding), and a dropout rate.

The model is fitted on the data produced by the generator, for the number of epochs specified, and for a specified number of steps within every epoch.

The model is saved after training.

This is how everything comes together.

Under the hood, keras_bert inserts the CLS and SEP delimiters in the paired sentence data, and tokenizes the words in the inputs into subwords using the WordPiece algorithm.
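WordPiece segmentation itself can be sketched as a greedy longest-match-first procedure: each word is chopped into the longest subwords found in the vocabulary, with continuation pieces prefixed by '##'. The tiny vocabulary below is hypothetical; real BERT vocabularies hold around 30,000 entries:

```python
def wordpiece(word, vocab):
    # Greedy longest-match-first subword segmentation, as in WordPiece.
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces are marked
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]              # word cannot be segmented
        pieces.append(cur)
        start = end
    return pieces

vocab = {"connoisseur", "##ship", "im", "##post", "##ure"}
print(wordpiece("connoisseurship", vocab))  # ['connoisseur', '##ship']
print(wordpiece("imposture", vocab))        # ['im', '##post', '##ure']
```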

Let’s take a look at a manual approach to this, to make clear what is happening here.

We first define a simple class called InputExample:

Listing 5: InputExample class.

 class InputExample(object):
     def __init__(self, text, label=None):
         self.text = text
         self.label = label

Instances of the class are just containers holding labeled text items. We need those for storing our labeled BERT sentences. Next, we need a tokenizer to tokenize our input text. We use another handy BERT python library for this: bert-for-tf2 (BERT for Tensorflow version 2 and above). We install this library under python3 as follows:

 sudo pip3 install bert-for-tf2

After this, it can be loaded with

 import bert

Listing 6: Obtaining a tokenizer from Tensorflow Hub.

 import tensorflow_hub as hub
 import tensorflow as tf
 from bert import bert_tokenization

 def create_tokenizer_from_hub_module(bert_hub_path):
     with tf.Graph().as_default(): ①
         bert_module = hub.Module(bert_hub_path) ②
         tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
         with tf.compat.v1.Session() as sess:
             vocab_file, do_lower_case = sess.run(
                 [tokenization_info["vocab_file"],
                  tokenization_info["do_lower_case"]]
             ) ③
     return bert_tokenization.FullTokenizer(
         vocab_file=vocab_file, do_lower_case=do_lower_case) ④

We are operating on a Tensorflow graph (tf.Graph).

The path to our BERT model on Tensorflow Hub.

We obtain vocabulary and case information from the BERT model. The case information expresses whether the model uses lowercase for representing words.

A fresh tokenizer is created that stores the vocabulary and case information.

We invoke this like:

 bert_path = ""  # the Tensorflow Hub path of the BERT model goes here
 tokenizer = create_tokenizer_from_hub_module(bert_path)

This returns, for a specified BERT model from Tensorflow Hub, a tokenizer that contains a token dictionary mapping words to integers. Given this tokenizer, and for a given InputExample instance, we can now generate the feature representation a BERT model wants: a tokenized text, an input mask that selects the tokens the model should pay attention to, and a set of labels.

Remember that we teach the model the labeling task and, on the fly, fine-tune the BERT model.

Listing 7: From InputExample to features.

 def convert_single_example(tokenizer, example, max_seq_length=256): ①
     tokens_a = tokenizer.tokenize(example.text) ②
     if len(tokens_a) > max_seq_length - 2:
         tokens_a = tokens_a[0 : (max_seq_length - 2)]
     tokens = []
     segment_ids = []
     tokens.append("[CLS]") ③
     segment_ids.append(0)
     for token in tokens_a:
         tokens.append(token)
         segment_ids.append(0)
     tokens.append("[SEP]")
     segment_ids.append(0)
     input_ids = tokenizer.convert_tokens_to_ids(tokens) ④
     input_mask = [1] * len(input_ids) ⑤
     while len(input_ids) < max_seq_length: ⑥
         input_ids.append(0)
         input_mask.append(0)
         segment_ids.append(0)
     return input_ids, input_mask, segment_ids, example.label ⑦

We invoke the method with a Tensorflow Hub tokenizer, an instance of InputExample, and the maximum sequence length we allow, set to a standard value of 256.

We tokenize the input text with the tokenizer, obtaining a list of tokens.

We start populating the tokens array. It starts with the pseudo-token [CLS], indicating the start of a sequence, and it ends with [SEP], which is why we reserve two extra positions and check that we don’t exceed max_seq_length-2. The segment_ids array is a list of zeroes with the same length as the tokens array. Its first and last positions implicitly (and redundantly) encode the start and end position of the current text.

We convert tokens into token IDs with the tokenizer.

We specify an input mask: a list of 1s that correspond to our tokens, prior to padding the input text with zeroes. Only non-zero tokens are attended to by BERT.

We pad all arrays with zeroes.

We return the token array, the input mask, the segment array, and the label of the input example.
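The effect of this conversion is easy to check with a stub tokenizer: a plain word-to-integer lookup standing in for the real WordPiece tokenizer, with hypothetical token IDs:

```python
class StubTokenizer:
    # Stand-in for the Hub tokenizer: whitespace split, tiny fixed vocabulary.
    vocab = {"[CLS]": 101, "[SEP]": 102, "i": 1, "like": 2, "icecream": 3}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

def featurize(tokenizer, text, max_seq_length=8):
    # Mirrors convert_single_example: [CLS] tokens [SEP], mask, zero-padding.
    tokens = ["[CLS]"] + tokenizer.tokenize(text)[: max_seq_length - 2] + ["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)          # attend only to real tokens
    while len(input_ids) < max_seq_length:     # zero-pad up to max length
        input_ids.append(0)
        input_mask.append(0)
    segment_ids = [0] * max_seq_length         # single-segment input
    return input_ids, input_mask, segment_ids

ids, mask, segs = featurize(StubTokenizer(), "I like icecream")
print(ids)   # [101, 1, 2, 3, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```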

Doing this for a bulk of examples is handled by the following function:

Listing 8: Converting examples to features.

 import numpy as np

 def convert_examples_to_features(tokenizer, examples, max_seq_length=256):
     input_ids, input_masks, segment_ids, labels = [], [], [], []
     for example in examples:
         input_id, input_mask, segment_id, label = convert_single_example(
             tokenizer, example, max_seq_length
         ) ①
         input_ids.append(input_id) ②
         input_masks.append(input_mask)
         segment_ids.append(segment_id)
         labels.append(label)
     return ( ③
         np.array(input_ids),
         np.array(input_masks),
         np.array(segment_ids),
         np.array(labels).reshape(-1, 1),
     )

Convert a single example.

Add to collective array.

Return results.

The following diagram describes the process.

Figure 2. Processing data for BERT.

Now, suppose we have labeled data stored in CSV format, like

 text,label
 I hate pizza,negative
 I like icecream,positive

Such pairs of texts and labels, once extracted from the CSV data, can be converted into InputExamples with:

Listing 9: From pairs of texts and labels to InputExamples.

 def convert_text_to_examples(texts, labels):
     InputExamples = []
     for text, label in zip(texts, labels):
         InputExamples.append(InputExample(text=text, label=label))
     return InputExamples

Let’s process this CSV data, using the previously defined conversion methods and our tokenizer. We generate a number of arrays holding the conversion results:

Listing 10: Processing CSV data

 import pandas as pd
 from sklearn.preprocessing import LabelEncoder

 def loadData(trainCSV, testCSV, valCSV, tokenizer): ①
     max_seq_length = 256
     train = pd.read_csv(trainCSV) ②
     test = pd.read_csv(testCSV)
     val = pd.read_csv(valCSV)
     label_encoder = LabelEncoder().fit(pd.concat([train['label'], val['label']])) ③
     y_train = label_encoder.transform(train['label'])
     y_test = label_encoder.transform(test['label'])
     y_val = label_encoder.transform(val['label'])
     train_examples = convert_text_to_examples(train['text'], y_train) ④
     test_examples = convert_text_to_examples(test['text'], y_test)
     val_examples = convert_text_to_examples(val['text'], y_val)
     (train_input_ids, train_input_masks, train_segment_ids, train_labels) = \
         convert_examples_to_features(tokenizer, train_examples, max_seq_length=max_seq_length) ⑤
     (test_input_ids, test_input_masks, test_segment_ids, test_labels) = \
         convert_examples_to_features(tokenizer, test_examples, max_seq_length=max_seq_length)
     (val_input_ids, val_input_masks, val_segment_ids, val_labels) = \
         convert_examples_to_features(tokenizer, val_examples, max_seq_length=max_seq_length)
     return [(train_input_ids, train_input_masks, train_segment_ids, train_labels), ⑥
             (test_input_ids, test_input_masks, test_segment_ids, test_labels),
             (val_input_ids, val_input_masks, val_segment_ids, val_labels)]

We invoke the method with CSV file names for training, testing and validation data, plus a tokenizer.

We use pandas to read the CSV data into dataframe structures. Our CSV data has two fields: text and label.

We use the sklearn LabelEncoder for converting the labels to numerical values.

The texts in our training, test and validation data are converted to InputExamples.

The various InputExamples are converted to array tuples (‘features’).

Results are returned.
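Conceptually, the LabelEncoder step above just maps the sorted unique labels to integers. A stdlib sketch of what fit and transform do (not sklearn's actual implementation):

```python
def fit_labels(labels):
    # LabelEncoder assigns integers to the unique labels in sorted order.
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def transform_labels(mapping, labels):
    # Replace each label by its integer code.
    return [mapping[label] for label in labels]

mapping = fit_labels(["positive", "negative", "positive", "neutral"])
print(mapping)                                              # {'negative': 0, 'neutral': 1, 'positive': 2}
print(transform_labels(mapping, ["positive", "negative"]))  # [2, 0]
```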

The following figure illustrates this flow.

Figure 3. Processing CSV data for fine-tuning BERT.

Now we’re ready to feed our data to a BERT model that fine-tunes in the course of learning an additional labeling task.

That’s all for this article. If you want to learn more, you can check out the book on Manning’s liveBook platform here.


[1] This layer is currently found in many implementations; its origins are unclear.