By François Chollet

In this article, we’ll learn about deep learning models that can process text (understood as sequences of words or sequences of characters), timeseries, and sequence data in general.


Working with text data

Text is one of the most widespread forms of sequence data. It can be understood either as a sequence of characters or as a sequence of words, although it is most common to work at the level of words. The deep learning sequence processing models that we’ll introduce can use text to produce a basic form of natural language understanding, sufficient for applications such as document classification, sentiment analysis, author identification, and even question answering (in a constrained context). Keep in mind throughout this article that none of the deep learning models you see truly “understands” text in a human sense; rather, these models are able to map the statistical structure of written language, which is sufficient to solve many simple textual tasks. Deep learning for natural language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.

Like all other neural networks, deep learning models don’t take raw text as input: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. This can be done in multiple ways:

  • By segmenting text into words, and transforming each word into a vector.

  • By segmenting text into characters, and transforming each character into a vector.

  • By extracting “N-grams” of words or characters, and transforming each N-gram into a vector. “N-grams” are overlapping groups of multiple consecutive words or characters.

Collectively, the different units into which you can break down text (words, characters or N-grams) are called “tokens”, and breaking down text into such tokens is called “tokenization”. All text vectorization processes consist of applying some tokenization scheme, then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are what get fed into deep neural networks. There are multiple ways to associate a vector with a token. In this section we will present two major ones: one-hot encoding of tokens, and token embeddings (typically used exclusively for words, and called “word embeddings”). In the remainder of this section, we will explain these techniques and show concretely how to use them to go from raw text to a Numpy tensor that you can send to a Keras network.
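As a rough illustration of tokenization at the word and character levels, here is a minimal sketch in plain Python. The regular expression and the variable names are assumptions made for this example; they are not the rules that the Keras utilities shown later apply.

```python
import re

text = "The cat sat on the mat."

# Word-level tokens: lowercase the text, then keep runs of
# letters, digits and apostrophes (punctuation is dropped).
word_tokens = re.findall(r"[a-z0-9']+", text.lower())

# Character-level tokens: every character, including spaces
# and punctuation, becomes its own token.
char_tokens = list(text)

# Associate each distinct word token with an integer index,
# starting at 1 (index 0 is left unused here).
vocab = {tok: i for i, tok in enumerate(sorted(set(word_tokens)), start=1)}
word_ids = [vocab[tok] for tok in word_tokens]

print(word_tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(word_ids)     # [5, 1, 4, 3, 5, 2]
print(len(char_tokens))  # 23
```

The integer indices are only an intermediate step: the vectors discussed in the rest of this section are built from them.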


Figure 1 From text to tokens to vectors


 

Understanding N-grams and “bag-of-words”

Word N-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words. Here’s a simple example. Consider the sentence: “The cat sat on the mat”. It may be decomposed as the following set of 2-grams:

  
 {"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}
  

It may also be decomposed as the following set of 3-grams:

  
 {"The", "The cat", "cat", "cat sat", "The cat sat", "sat", "sat on", "on", "cat sat on", "on the", "the", "sat on the", "the mat", "mat", "on the mat"}
  

Such sets are called a “bag-of-2-grams” and a “bag-of-3-grams”, respectively. The term “bag” here refers to the fact that we’re dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called “bag-of-words.”
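The two sets above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `bag_of_ngrams` is made up for this example, and tokenization is a plain whitespace split.

```python
def bag_of_ngrams(sentence, n):
    """Return the set of all groups of 1 to n consecutive words."""
    words = sentence.split()
    bag = set()
    for size in range(1, n + 1):
        # Slide a window of `size` words across the sentence.
        for start in range(len(words) - size + 1):
            bag.add(" ".join(words[start:start + size]))
    return bag

sentence = "The cat sat on the mat"
bag_2 = bag_of_ngrams(sentence, 2)
bag_3 = bag_of_ngrams(sentence, 3)

print(len(bag_2), len(bag_3))  # 11 15
print("cat sat on" in bag_3)   # True
```

Because the result is a Python set, word order is discarded, which is exactly the “bag” property described above.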

Because bag-of-words isn’t an order-preserving tokenization method (the tokens generated are understood as a set, not a sequence, and the general structure of the sentences is lost), it tends to be used in shallow language processing models rather than in deep learning models. Extracting N-grams is a form of feature engineering, and deep learning does away with this kind of rigid and brittle feature engineering, replacing it with hierarchical feature learning. One-dimensional convnets and recurrent neural networks are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups, by looking at continuous word or character sequences. For this reason, we won’t cover N-grams any further in this article. But keep in mind that they’re a powerful, unavoidable feature engineering tool when using lightweight shallow text processing models such as logistic regression and random forests.

 

One-hot encoding of words or characters

One-hot encoding is the most common, most basic way to turn a token into a vector. It consists of associating a unique integer index with every word, then turning this integer index i into a binary vector of size N (the size of the vocabulary) that is all zeros except for the i-th entry, which is one.

One-hot encoding can be done at the character level as well. To unambiguously drive home what one-hot encoding is and how to implement it, here are two toy examples of one-hot encoding: one for words, the other for characters.

Listing 1 Word level one-hot encoding (toy example)

  
 import numpy as np
  
 # This is our initial data; one entry per "sample"
 # (in this toy example, a "sample" is just a sentence, but
 # it could be an entire document).
 samples = ['The cat sat on the mat.', 'The dog ate my homework.']
  
 # First, build an index of all tokens in the data.
 token_index = {}
 for sample in samples:
     # We simply tokenize the samples via the `split` method.
     # in real life, we would also strip punctuation and special characters
     # from the samples.
     for word in sample.split():
         if word not in token_index:
             # Assign a unique index to each unique word
             token_index[word] = len(token_index) + 1
             # Note that we don't attribute index 0 to anything.
  
 # Next, we vectorize our samples.
 # We will only consider the first `max_length` words in each sample.
 max_length = 10
  
 # This is where we store our results:
 results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
 for i, sample in enumerate(samples):
     for j, word in list(enumerate(sample.split()))[:max_length]:
         index = token_index.get(word)
         results[i, j, index] = 1.
  

Listing 2 Character level one-hot encoding (toy example)

  
 import string
 import numpy as np
  
 samples = ['The cat sat on the mat.', 'The dog ate my homework.']
 characters = string.printable  # All printable ASCII characters.
 # Map each character to a unique index, starting at 1
 # (as in Listing 1, index 0 isn't attributed to anything).
 token_index = dict(zip(characters, range(1, len(characters) + 1)))
  
 max_length = 50
 results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
 for i, sample in enumerate(samples):
     # Only consider the first `max_length` characters in each sample.
     for j, character in enumerate(sample[:max_length]):
         index = token_index.get(character)
         results[i, j, index] = 1.
  

Note that Keras has built-in utilities for one-hot encoding text at the word level or character level, starting from raw text data. This is what you should be using, as it takes care of a number of important features, such as stripping special characters from strings, or only taking into account the top N most common words in your dataset (a common restriction to avoid dealing with huge input vector spaces).

Listing 3 Using Keras for word-level one-hot encoding

  
 from keras.preprocessing.text import Tokenizer
  
 samples = ['The cat sat on the mat.', 'The dog ate my homework.']
  
 # We create a tokenizer, configured to only take
 # into account the top-1000 most common words
 tokenizer = Tokenizer(num_words=1000)
 # This builds the word index
 tokenizer.fit_on_texts(samples)
  
 # This turns strings into lists of integer indices.
 sequences = tokenizer.texts_to_sequences(samples)
  
 # You could also directly get the one-hot binary representations.
 # Note that other vectorization modes than one-hot encoding are supported!
 one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
  
 # This is how you can recover the word index that was computed
 word_index = tokenizer.word_index
 print('Found %s unique tokens.' % len(word_index))
  

A variant of one-hot encoding is the “one-hot hashing trick”, which can be used when the number of unique tokens in your vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary, one may hash words into vectors of fixed size. This is typically done with a lightweight hashing function. The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (starting to generate token vectors right away, before having seen all of the available data). The one drawback of this method is that it’s susceptible to “hash collisions”: two different words may end up with the same hash, and subsequently any machine learning model looking at these hashes won’t be able to tell the difference between these words. The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.
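To make the collision behavior concrete, here is a minimal sketch that hashes words into a fixed number of buckets. It uses `hashlib.md5` as the lightweight hashing function purely as an illustrative choice, because Python 3 salts the built-in `hash()` for strings per process by default, which makes its output differ between runs. The helper names `hashed_index` and `count_collisions` are made up for this example.

```python
import hashlib

def hashed_index(word, dimensionality):
    """Map a word to a stable integer index in [0, dimensionality)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dimensionality

def count_collisions(words, dimensionality):
    """Count how many distinct words are lost to shared buckets."""
    unique_words = set(words)
    used_buckets = {hashed_index(w, dimensionality) for w in unique_words}
    return len(unique_words) - len(used_buckets)

words = "The cat sat on the mat.".split()

# The same word always lands in the same bucket, in every run.
print(hashed_index("cat", 1000) == hashed_index("cat", 1000))  # True

# With a single bucket, all 6 distinct words collide: 5 are lost.
print(count_collisions(words, 1))  # 5

# With 3 buckets and 6 distinct words, at least 3 must collide.
print(count_collisions(words, 3) >= 3)  # True
```

The fewer buckets there are relative to the number of distinct words, the more collisions you should expect, which is why the hashing space is usually chosen much larger than the vocabulary.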

Listing 4 Word-level one-hot encoding with hashing trick (toy example)

  
 import numpy as np
  
 samples = ['The cat sat on the mat.', 'The dog ate my homework.']
  
 # We will store our words as vectors of size 1000.
 # Note that if you have close to 1000 words (or more)
 # you will start seeing many hash collisions, which
 # will decrease the accuracy of this encoding method.
 dimensionality = 1000
 max_length = 10
  
 results = np.zeros((len(samples), max_length, dimensionality))
 for i, sample in enumerate(samples):
     for j, word in list(enumerate(sample.split()))[:max_length]:
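         # Hash the word into a "random" integer index
         # that is between 0 and 999.
         # Note: Python 3 salts `hash` for strings per process, so the
         # indices change between runs unless PYTHONHASHSEED is set.
         index = abs(hash(word)) % dimensionality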
         results[i, j, index] = 1.
  

Using word embeddings

Another popular and powerful way to associate a vector with a word is the use of dense “word vectors”, also called “word embeddings”. Whereas the vectors obtained through one-hot encoding are binary, sparse (mostly made of zeros) and high-dimensional (same dimensionality as the number of words in the vocabulary), “word embeddings” are low-dimensional floating-point vectors (i.e. “dense” vectors, as opposed to sparse vectors). Unlike word vectors obtained via one-hot encoding, word embeddings are learned from data. It’s common to see word embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when dealing with massive vocabularies. On the other hand, one-hot encoding words generally leads to vectors that are 20,000-dimensional or higher (capturing a vocabulary of 20,000 tokens in this case). Word embeddings pack more information into far fewer dimensions.


Figure 2 Whereas word representations obtained from one-hot encoding or hashing are sparse, high-dimensional, and hard-coded, word embeddings are dense, relatively low-dimensional, and learned from data.


There are two ways to obtain word embeddings:

  • Learn word embeddings jointly with the main task you care about (e.g. document classification or sentiment prediction). In this setup, you start with random word vectors, then learn your word vectors in the same way that you learn the weights of a neural network.

  • Load into your model word embeddings that were pre-computed using a different machine learning task than the one you are trying to solve. These are called “pre-trained word embeddings”.

Let’s take a look at both.

LEARNING WORD EMBEDDINGS WITH THE Embedding LAYER

The simplest way to associate a dense vector to a word would be to pick the vector at random. The problem with this approach is that the resulting embedding space would have no structure: for instance, the words “accurate” and “exact” may end up with completely different embeddings, even though they are interchangeable in most sentences. It would be very difficult for a deep neural network to make sense of such a noisy, unstructured embedding space.

To get a bit more abstract: the geometric relationships between word vectors should reflect the semantic relationships between these words. Word embeddings are meant to map human language into a geometric space. For instance, in a reasonable embedding space, we would expect synonyms to be embedded into similar word vectors, and in general we would expect the geometric distance (e.g. L2 distance) between any two word vectors to relate to the semantic distance of the associated words (words meaning very different things would be embedded to points far away from each other, while related words would be closer). Even beyond mere distance, we may want specific directions in the embedding space to be meaningful. To make this clearer, let’s look at a concrete example.

In figure 3, we embedded four words on a 2D plane: “cat”, “dog”, “wolf” and “tiger”. With the vector representations we chose here, some semantic relationships between these words can be encoded as geometric transformations. For instance, the same vector allows us to go from “cat” to “tiger” and from “dog” to “wolf”: this vector could be interpreted as the “from pet to wild animal” vector. Similarly, another vector allows us to go from “dog” to “cat” and from “wolf” to “tiger”, which could be interpreted as a “from canine to feline” vector.


Figure 3 A toy example of a word embedding space
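The idea of meaningful directions can be checked with a few lines of arithmetic. The 2D coordinates below are hypothetical values chosen for this sketch; they are not taken from the figure or from any trained embedding.

```python
import math

# Hypothetical 2D embeddings, picked so that the relationships
# described in the text hold exactly.
embeddings = {
    "dog": (0.0, 0.0),
    "cat": (1.0, 0.0),
    "wolf": (0.0, 1.0),
    "tiger": (1.0, 1.0),
}

def offset(source, target):
    """Vector that goes from the source word to the target word."""
    return tuple(t - s for s, t in zip(embeddings[source], embeddings[target]))

def l2_distance(a, b):
    """Euclidean (L2) distance between two word vectors."""
    dx, dy = offset(a, b)
    return math.hypot(dx, dy)

# The "from pet to wild animal" vector is shared by both pairs.
print(offset("cat", "tiger") == offset("dog", "wolf"))  # True

# So is the "from canine to feline" vector.
print(offset("dog", "cat") == offset("wolf", "tiger"))  # True

# Words that differ on both axes end up further apart.
print(l2_distance("cat", "dog") < l2_distance("cat", "wolf"))  # True
```

In a real embedding space the vectors have hundreds of dimensions and such relationships hold only approximately.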


In real-world word embedding spaces, common examples of meaningful geometric transformations are “gender vectors” and “plural vectors”. For instance, by adding a “female vector” to the vector “king”, one obtains the vector “queen”. By adding a “plural vector”, one obtains “kings”. Word embedding spaces typically feature thousands of such interpretable and potentially useful vectors.

Is there some “ideal” word embedding space that’d perfectly map human language and could be used for any natural language processing task? Possibly, but in any case, we’ve yet to compute anything of the sort. Also, there isn’t such a thing as “human language”: there are many different languages and they aren’t isomorphic, as a language is the reflection of a specific culture and a specific context. But more pragmatically, what makes a good word embedding space depends heavily on your task: the perfect word embedding space for an English-language movie review sentiment analysis model may look different from the perfect embedding space for an English-language legal document classification model, because the importance of certain semantic relationships varies from task to task.

It’s therefore reasonable to learn a new embedding space with every new task. Thankfully, backpropagation makes this easy, and Keras makes it even easier. It’s just a matter of learning the weights of a layer: the Embedding layer.

Listing 5 Instantiating an Embedding layer.

  
 from keras.layers import Embedding
  
 # The Embedding layer takes at least two arguments:
 # the number of possible tokens, here 1000 (1 + maximum word index),
 # and the dimensionality of the embeddings, here 64.
 embedding_layer = Embedding(1000, 64)
  

The Embedding layer is best understood as a dictionary mapping integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It’s effectively a dictionary lookup.

  
 word index -> Embedding layer -> corresponding word vector
  

The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths; for instance, we could feed into our embedding layer (above) batches that could have shapes (32, 10) (batch of 32 sequences of length 10) or (64, 15) (batch of 64 sequences of length 15). All sequences in a batch must have the same length, though (because we need to pack them into a single tensor), so sequences that are shorter than others should be padded with zeros, and sequences that are longer should be truncated.
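Padding and truncation can be sketched in plain Python as follows. The helper `pad_or_truncate` is a made-up name for this example. It pads and truncates at the start of each sequence, which is my understanding of the default behavior of the Keras `pad_sequences` utility (`padding='pre'`, `truncating='pre'`); treat that correspondence as an assumption and check it against your Keras version.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Bring every sequence to exactly `maxlen` items (maxlen > 0).

    Longer sequences keep their LAST `maxlen` items; shorter ones
    are padded with `value` on the left.
    """
    result = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]
        result.append([value] * (maxlen - len(kept)) + kept)
    return result

batch = pad_or_truncate([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(batch)  # [[0, 1, 2, 3], [6, 7, 8, 9]]
```

Note that with this convention a long sequence keeps its end rather than its beginning. In Keras, `padding='post'` and `truncating='post'` select the opposite behavior.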

This layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). Such a 3D tensor can then be processed by an RNN layer or a 1D convolution layer.
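The lookup and the resulting shapes can be imitated with NumPy indexing. This is only a sketch of the idea, not the Keras implementation; the random `weights` matrix stands in for the layer's internal table of token vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 1000, 64
# Stand-in for the layer's weights: one row per possible token index.
weights = rng.normal(size=(vocab_size, embedding_dim)).astype("float32")

# A batch of 2 integer sequences of length 3: shape (samples, sequence_length).
batch = np.array([[4, 20, 7],
                  [1, 0, 0]])

# Indexing the weight matrix with the integer tensor performs the lookup.
embedded = weights[batch]

print(embedded.shape)  # (2, 3, 64)
# The vector at position (0, 1) is simply row 20 of the weight matrix.
print(np.array_equal(embedded[0, 1], weights[20]))  # True
```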

When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, like with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something that the downstream model can exploit. Once fully trained, your embedding space shows a lot of structure—a kind of structure specialized for the specific problem you were training your model for.

Let’s apply this idea to the IMDB movie review sentiment prediction task that you’re already familiar with. First, let’s quickly prepare the data. We’ll restrict the movie reviews to the top 10,000 most common words (like we did the first time we worked with this dataset), and cut the reviews after only 20 words. Our network learns 8-dimensional embeddings for each of the 10,000 words, turns the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor), flattens the tensor to 2D, and trains a single Dense layer on top for classification.

Listing 6 Loading the IMDB data for use with an Embedding layer.

  
 from keras.datasets import imdb
 from keras import preprocessing
  
 # Number of words to consider as features
 max_features = 10000
 # Cut texts after this number of words
 # (among top max_features most common words)
 maxlen = 20
  
 # Load the data as lists of integers.
 (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
  
 # This turns our lists of integers
 # into a 2D integer tensor of shape `(samples, maxlen)`
 x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
 x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
  

Listing 7 Using an Embedding layer and classifier on the IMDB data.

  
 from keras.models import Sequential
 from keras.layers import Flatten, Dense
  
 model = Sequential()
 # We specify the maximum input length to our Embedding layer
 # so we can later flatten the embedded inputs
 model.add(Embedding(10000, 8, input_length=maxlen))
 # After the Embedding layer,
 # our activations have shape `(samples, maxlen, 8)`.
  
 # We flatten the 3D tensor of embeddings
 # into a 2D tensor of shape `(samples, maxlen * 8)`
 model.add(Flatten())
  
 # We add the classifier on top
 model.add(Dense(1, activation='sigmoid'))
 model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
 model.summary()
  
 history = model.fit(x_train, y_train,
                     epochs=10,
                     batch_size=32,
                     validation_split=0.2)
  

We get to a validation accuracy of ~76%, which is pretty good considering that we’re only looking at 20 words in every review. But note that merely flattening the embedded sequences and training a single Dense layer on top leads to a model that treats each word in the input sequence separately, without considering inter-word relationships and sentence structure (it’d likely treat both “this movie is shit” and “this movie is the shit” as being negative reviews). It’d be much better to add recurrent layers or 1D convolutional layers on top of the embedded sequences to learn features that take into account each sequence as a whole.
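The claim that a flattened embedding followed by a single Dense unit treats each word separately can be checked numerically. The sketch below uses random NumPy arrays as stand-ins for one embedded review and for the Dense layer's weights (all values are made up); it shows that the pre-sigmoid output is a plain sum of one independent contribution per word position.

```python
import numpy as np

rng = np.random.default_rng(0)

maxlen, embedding_dim = 20, 8
embedded = rng.normal(size=(maxlen, embedding_dim))  # one embedded review
kernel = rng.normal(size=(maxlen * embedding_dim,))  # stand-in Dense(1) weights
bias = 0.1

# What Flatten + Dense(1) computes before the sigmoid.
logit_flat = embedded.reshape(-1) @ kernel + bias

# The same value, written as a sum of per-position contributions:
# each word's vector is scored only against its own slice of the kernel.
kernel_by_position = kernel.reshape(maxlen, embedding_dim)
per_word = (embedded * kernel_by_position).sum(axis=1)
logit_sum = per_word.sum() + bias

print(per_word.shape)                     # (20,)
print(np.isclose(logit_flat, logit_sum))  # True
```

Since no term involves two positions at once, the model has no way to represent how neighboring words modify each other.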

USING PRE-TRAINED WORD EMBEDDINGS

Sometimes, you have too little training data available to learn an appropriate task-specific embedding of your vocabulary. What to do then?

Instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a pre-computed embedding space that is known to be highly structured and to exhibit useful properties, and that captures generic aspects of language structure. The rationale behind using pre-trained word embeddings in natural language processing is much the same as for using pre-trained convnets in image classification: we don’t have enough data available to learn truly powerful features on our own, but we expect the features that we need to be fairly generic, i.e. common visual features or semantic features. In this case it makes sense to reuse features learned on a different problem.

Such word embeddings are generally computed using word occurrence statistics (observations about what words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started taking off in research and industry applications after the release of one of the most famous and successful word embedding schemes: the Word2Vec algorithm, developed by Tomas Mikolov at Google in 2013. Word2Vec dimensions capture specific semantic properties, e.g. gender.
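To give a feel for what “word occurrence statistics” means, here is a minimal sketch that counts how often two words appear within a small window of each other. It is only the raw counting step on a made-up two-sentence corpus; it is neither Word2Vec nor GloVe, and the window size of 2 is an arbitrary choice.

```python
from collections import Counter

# Tiny made-up corpus, already tokenized.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
window = 2  # how many neighbors to look at on each side

cooccurrence = Counter()
for tokens in sentences:
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooccurrence[(word, tokens[j])] += 1

print(cooccurrence[("sat", "on")])   # 2
print(cooccurrence[("on", "sat")])   # 2
print(cooccurrence[("cat", "dog")])  # 0
```

“cat” and “dog” never co-occur here, yet they share the same neighbors (“the”, “sat”, “on”), which is the kind of regularity that embedding algorithms exploit to place similar words close together.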

Various pre-computed databases of word embeddings can be downloaded and used in a Keras Embedding layer. Word2Vec is one of them. Another popular one is called “GloVe”, developed by Stanford researchers in 2014. It stands for “Global Vectors for Word Representation”, and it’s an embedding technique based on factorizing a matrix of word co-occurrence statistics. Its developers have made available pre-computed embeddings for millions of English tokens, obtained from Wikipedia data or from Common Crawl data.

Let’s take a look at how you can get started using GloVe embeddings in a Keras model. The same method is valid for Word2Vec embeddings or any other word embedding database that you can download. We’ll use this example to refresh the text tokenization techniques we introduced a few paragraphs ago: we’ll start from raw text, and work our way up.

Putting it all together: from raw text to word embeddings

We’ll use a model similar to the one we went over—embedding sentences in sequences of vectors, flattening them and training a Dense layer on top. But we’ll do it using pre-trained word embeddings, and instead of using the pre-tokenized IMDB data packaged in Keras, we’ll start from scratch, by downloading the original text data.

DOWNLOAD THE IMDB DATA AS RAW TEXT

First, head to ai.stanford.edu/~amaas/data/sentiment/ and download the raw IMDB dataset (if the URL isn’t working anymore, Google “IMDB dataset”). Uncompress it.

Now let’s collect the individual training reviews into a list of strings, one string per review, and let’s also collect the review labels (positive / negative) into a labels list:

Listing 8 Processing the labels of the raw IMDB data

  
 import os
  
 imdb_dir = '/Users/fchollet/Downloads/aclImdb'
 train_dir = os.path.join(imdb_dir, 'train')
  
 labels = []
 texts = []
  
 for label_type in ['neg', 'pos']:
     dir_name = os.path.join(train_dir, label_type)
     for fname in os.listdir(dir_name):
         if fname[-4:] == '.txt':
             f = open(os.path.join(dir_name, fname))
             texts.append(f.read())
             f.close()
             if label_type == 'neg':
                 labels.append(0)
             else:
                 labels.append(1)
  

TOKENIZE THE DATA

Let’s vectorize the texts we collected, and prepare a training and validation split. We’ll use the concepts we introduced earlier in this article.

Because pre-trained word embeddings are meant to be particularly useful on problems where little training data is available (otherwise, task-specific embeddings are likely to outperform them), we’ll add the following twist: we restrict the training data to its first 200 samples. We’ll learn to classify movie reviews after looking at 200 examples…

Listing 9 Tokenizing the text of the raw IMDB data

  
 from keras.preprocessing.text import Tokenizer
 from keras.preprocessing.sequence import pad_sequences
 import numpy as np
  
 maxlen = 100  # We will cut reviews after 100 words
 training_samples = 200  # We will be training on 200 samples
 validation_samples = 10000  # We will be validating on 10000 samples
 max_words = 10000  # We will only consider the top 10,000 words in the dataset
  
 tokenizer = Tokenizer(num_words=max_words)
 tokenizer.fit_on_texts(texts)
 sequences = tokenizer.texts_to_sequences(texts)
  
 word_index = tokenizer.word_index
 print('Found %s unique tokens.' % len(word_index))
  
 data = pad_sequences(sequences, maxlen=maxlen)
  
 labels = np.asarray(labels)
 print('Shape of data tensor:', data.shape)
 print('Shape of label tensor:', labels.shape)
  
 # Split the data into a training set and a validation set
 # But first, shuffle the data, since we started from data
 # where samples are ordered (all negative first, then all positive).
 indices = np.arange(data.shape[0])
 np.random.shuffle(indices)
 data = data[indices]
 labels = labels[indices]
  
 x_train = data[:training_samples]
 y_train = labels[:training_samples]
 x_val = data[training_samples: training_samples + validation_samples]
 y_val = labels[training_samples: training_samples + validation_samples]
  

DOWNLOAD THE GLOVE WORD EMBEDDINGS

Head to nlp.stanford.edu/projects/glove (where you can learn more about the GloVe algorithm), and download the pre-computed embeddings from 2014 English Wikipedia. It’s an 822MB zip file named glove.6B.zip, containing 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Un-zip it.

PRE-PROCESS THE EMBEDDINGS

Let’s parse the un-zipped file (it’s a txt file) to build an index mapping words (as strings) to their vector representation (as number vectors).

Listing 10 Parsing the GloVe word embeddings file

  
 glove_dir = '/Users/fchollet/Downloads/glove.6B'
  
 embeddings_index = {}
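 # The GloVe files are UTF-8 encoded; being explicit avoids
 # decoding errors on platforms with a different default encoding.
 with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
     for line in f:
         values = line.split()
         word = values[0]
         coefs = np.asarray(values[1:], dtype='float32')
         embeddings_index[word] = coefs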
  
 print('Found %s word vectors.' % len(embeddings_index))
  

Now let’s build an embedding matrix that we can load into an Embedding layer. It must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in our reference word index (built during tokenization). Note that the index 0 isn’t supposed to stand for any word or token—it’s a placeholder.
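Before filling the matrix, it can help to know how many of your words actually have a pre-trained vector, since the others end up as all-zero rows. The sketch below uses tiny made-up dictionaries as stand-ins for `word_index` and `embeddings_index`; the words, vectors and the value of `max_words` are hypothetical.

```python
# Hypothetical stand-in for the tokenizer's word index (word -> index, from 1).
word_index = {"the": 1, "movie": 2, "was": 3, "grrreat": 4, "awful": 5}

# Hypothetical stand-in for the parsed pre-trained embeddings (word -> vector).
embeddings_index = {
    "the": [0.1, 0.2],
    "movie": [0.3, 0.1],
    "was": [0.0, 0.5],
    "awful": [-0.4, 0.2],
}

max_words = 5  # rows 0..4 exist; row 0 is the placeholder

# Only words whose index fits in the matrix get a row at all.
kept = [word for word, i in word_index.items() if i < max_words]
# Of those, some have no pre-trained vector and would stay all-zeros.
missing = [word for word in kept if word not in embeddings_index]
coverage = 1 - len(missing) / len(kept)

print(kept)      # ['the', 'movie', 'was', 'grrreat']
print(missing)   # ['grrreat']
print(coverage)  # 0.75
```

A low coverage usually points to a mismatch between your tokenization and that of the embedding file, for example in casing or punctuation handling.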

Listing 11 Preparing the GloVe word embeddings matrix

  
 embedding_dim = 100
  
 embedding_matrix = np.zeros((max_words, embedding_dim))
 for word, i in word_index.items():
     embedding_vector = embeddings_index.get(word)
     if i < max_words:
         if embedding_vector is not None:
             # Words not found in embedding index will be all-zeros.
             embedding_matrix[i] = embedding_vector
  

DEFINE A MODEL

We’ll use the same model architecture as before:

Listing 12 Model definition

  
 from keras.models import Sequential
 from keras.layers import Embedding, Flatten, Dense
  
 model = Sequential()
 model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
 model.add(Flatten())
 model.add(Dense(32, activation='relu'))
 model.add(Dense(1, activation='sigmoid'))
 model.summary()
  

LOAD THE GLOVE EMBEDDINGS IN THE MODEL

The Embedding layer has a single weight matrix: a 2D float matrix where each entry i is the word vector meant to be associated with index i. Simple enough. Let’s load the GloVe matrix we prepared into our Embedding layer, the first layer in our model:

Listing 13 Loading the matrix of pre-trained word embeddings into the Embedding layer

  
 model.layers[0].set_weights([embedding_matrix])
 model.layers[0].trainable = False
  

Additionally, we freeze the embedding layer (we set its trainable attribute to False), following the same rationale that you’re already familiar with in the context of pre-trained convnet features: when parts of a model are pre-trained (like our Embedding layer), and parts are randomly initialized (like our classifier), the pre-trained parts shouldn’t be updated during training to avoid forgetting what they already know. The large gradient updates triggered by the randomly initialized layers would be disruptive to the already learned features.

TRAIN AND EVALUATE

Let’s compile our model and train it:

Listing 14 Training and evaluation

  
 model.compile(optimizer='rmsprop',
               loss='binary_crossentropy',
               metrics=['acc'])
 history = model.fit(x_train, y_train,
                     epochs=10,
                     batch_size=32,
                     validation_data=(x_val, y_val))
 model.save_weights('pre_trained_glove_model.h5')
  

Let’s plot its performance over time:

Listing 15 Plotting results

  
 import matplotlib.pyplot as plt
  
 acc = history.history['acc']
 val_acc = history.history['val_acc']
 loss = history.history['loss']
 val_loss = history.history['val_loss']
  
 epochs = range(1, len(acc) + 1)
  
 plt.plot(epochs, acc, 'bo', label='Training acc')
 plt.plot(epochs, val_acc, 'b', label='Validation acc')
 plt.title('Training and validation accuracy')
 plt.legend()
  
 plt.figure()
  
 plt.plot(epochs, loss, 'bo', label='Training loss')
 plt.plot(epochs, val_loss, 'b', label='Validation loss')
 plt.title('Training and validation loss')
 plt.legend()
  
 plt.show()
  


Figure 4 Training and validation loss when using pre-trained word embeddings


Figure 5 Training and validation accuracy when using pre-trained word embeddings


The model quickly starts overfitting, unsurprisingly given the small number of training samples. Validation accuracy has high variance for the same reason, but seems to reach the high 50s.

Note that your mileage may vary: because we have few training samples, performance is heavily dependent on which exact 200 samples we picked, and we picked them at random. If it worked poorly for you, try picking a different random set of 200 samples, for the sake of the exercise (in real life you don’t get to pick your training data).
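One way to try different random sets of 200 samples in a repeatable manner is to derive the shuffle from an explicit seed. The helper `pick_split` below is a made-up name for this sketch; it returns index arrays that you could use to slice `data` and `labels`, and changing `seed` gives a different but reproducible selection.

```python
import numpy as np

def pick_split(num_samples, training_samples, validation_samples, seed):
    """Return (train_indices, val_indices) from a seeded shuffle."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(num_samples)
    train_idx = indices[:training_samples]
    val_idx = indices[training_samples:training_samples + validation_samples]
    return train_idx, val_idx

# 25,000 is the number of training reviews in the IMDB dataset.
train_a, val_a = pick_split(25000, 200, 10000, seed=0)
train_b, val_b = pick_split(25000, 200, 10000, seed=0)
train_c, _ = pick_split(25000, 200, 10000, seed=1)

print(len(train_a), len(val_a))           # 200 10000
print(np.array_equal(train_a, train_b))   # True: same seed, same samples
print(set(train_a).isdisjoint(val_a))     # True: no overlap
```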

We can also try to train the same model without loading the pre-trained word embeddings and without freezing the embedding layer. In that case, we’d learn a task-specific embedding of our input tokens, which is generally more powerful than pre-trained word embeddings when lots of data is available. In our case, we have only 200 training samples. Let’s try it:

Listing 16 Defining and training the same model without pre-trained word embeddings

  
 from keras.models import Sequential
 from keras.layers import Embedding, Flatten, Dense
  
 model = Sequential()
 model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
 model.add(Flatten())
 model.add(Dense(32, activation='relu'))
 model.add(Dense(1, activation='sigmoid'))
 model.summary()
  
 model.compile(optimizer='rmsprop',
               loss='binary_crossentropy',
               metrics=['acc'])
 history = model.fit(x_train, y_train,
                     epochs=10,
                     batch_size=32,
                     validation_data=(x_val, y_val))
  


Figure 6 Training and validation loss without using pre-trained word embeddings


Figure 7 Training and validation accuracy without using pre-trained word embeddings


Validation accuracy stalls in the low 50s. In our case, pre-trained word embeddings outperform jointly learned embeddings. If you increase the number of training samples, this quickly stops being the case; try it as an exercise.

Finally, let’s evaluate the model on the test data. First, we’ll need to tokenize the test data:

Listing 17 Tokenizing the data of the test set

  
 test_dir = os.path.join(imdb_dir, 'test')
  
 labels = []
 texts = []
  
 for label_type in ['neg', 'pos']:
     dir_name = os.path.join(test_dir, label_type)
     for fname in sorted(os.listdir(dir_name)):
         if fname[-4:] == '.txt':
             f = open(os.path.join(dir_name, fname))
             texts.append(f.read())
             f.close()
             if label_type == 'neg':
                 labels.append(0)
             else:
                 labels.append(1)
  
 sequences = tokenizer.texts_to_sequences(texts)
 x_test = pad_sequences(sequences, maxlen=maxlen)
 y_test = np.asarray(labels)
  

And let’s load and evaluate the first model:

Listing 18 Evaluating the model on the test set

  
 model.load_weights('pre_trained_glove_model.h5')
 model.evaluate(x_test, y_test)
  

We get an appalling test accuracy of 56%. Working with only a handful of training samples is hard!

That’s all for this article.