From Machine Learning for Business by Doug Hudgeon and Richard Nichol
This article explores the amazing potential of natural language processing and some of its associated difficulties.
Natural Language Processing
The goal of NLP is to enable computers to work with language as effectively as they work with numbers. This is a hard problem because of the richness of language. The previous sentence is a good example of the difficulty: the term rich means something slightly different when referring to language than it does when referring to a person, and the sentence Well, that's rich! may mean the opposite of how rich is used in other contexts.
Scientists have worked on NLP since the advent of computing, but only recently have they made significant strides. Until then, NLP focused on getting computers to understand the structure of each language. In English, a typical sentence has a subject, verb, and object, as in Sam throws the ball, whereas Japanese typically follows a subject, object, verb pattern. But the success of this approach was hampered by the mind-boggling number and variety of exceptions, and slowed by the necessity to describe each language individually: the same code you use for English NLP won't work for Japanese NLP.
The big breakthrough in NLP occurred in 2013 when Tomas Mikolov from Google published a paper on word vectors. In this approach, you don't look at the parts of a language at all. Instead, you apply mathematical algorithms and text annotation tools to a large body of text and work with the output. This has two advantages:
- It naturally handles exceptions and inconsistencies in language.
- It’s language-agnostic and can work with Japanese text as easily as it can work with English text.
In SageMaker, working with word vectors is as easy as working with any other data, but there are a few decisions you need to make when configuring SageMaker that require some appreciation of what's happening under the hood.
Creating word vectors
Just as the pandas function get_dummies converts categorical data such as Desk, Keyboard, and Mouse into a wide dataset, the first step in creating a word vector is to convert all of the words in your text into a wide dataset. As an example, the word queen is represented by the dataset 0,1,0,0,0 shown in figure 1. The column for queen has a 1 in it, while every other column in the row has a 0. This can be described as a single-dimensional vector. Using a single-dimensional vector, you can test for equality and nothing else. You can determine whether the vector is equal to the word Queen, and in figure 1 you can see that it is.
Figure 1. One-hot encoding
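The one-hot idea is easy to reproduce. Here's a minimal sketch using the pandas get_dummies function with an illustrative five-word vocabulary (the words and column order are my own, not the book's exact figure):

```python
import pandas as pd

# A five-word vocabulary, one word per row (illustrative values).
words = pd.Series(["king", "queen", "man", "woman", "princess"])

# get_dummies widens the data: one column per word, with a single 1
# in each row under that row's word -- one-hot encoding.
one_hot = pd.get_dummies(words, dtype=int)

# Columns come out alphabetically: king, man, princess, queen, woman.
queen_row = one_hot.iloc[1]          # the row for "queen"
print(queen_row.tolist())            # → [0, 0, 0, 1, 0]
```

As the article notes, the only question such a vector can answer is equality: the row either is the queen row or it isn't.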
Mikolov’s breakthrough was the realization that meaning can be captured by a multidimensional vector with the representation of each word distributed across each dimension. Figure 2 shows conceptually how dimensions look in a vector. Each dimension can be thought of as a group of related words. In Mikolovs’ algorithm, these groups of related words don’t have labels but, to show how meaning can emerge from multidimensional vectors, I’ve provided four labels on the left side of the figure: Royalty, Masculinity, Femininity, and Elderliness.
Looking at the first dimension, Royalty, you can see that the values in the King, Queen, and Princess columns are higher than the values in the Man and Woman columns, whereas for Masculinity, the values in the King and Man columns are higher than in the others. From this you start to get the picture that a King is masculine Royalty whereas a Queen is non-masculine Royalty. If you imagine working your way through hundreds of dimensions, you can see how meaning can emerge.
Figure 2. Multidimensional word vectors
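To see how meaning emerges from these dimensions, here's a toy sketch in Python. The four-dimensional vectors are invented for illustration (loosely following the Royalty, Masculinity, Femininity, and Elderliness labels); a real model learns hundreds of unlabelled dimensions. Even with these made-up values, the well-known analogy king - man + woman lands nearest queen:

```python
import numpy as np

# Toy 4-dimensional vectors; dimensions loosely follow the labels
# [Royalty, Masculinity, Femininity, Elderliness]. All values are
# invented for illustration, not taken from any trained model.
vectors = {
    "king":     np.array([0.95, 0.90, 0.10, 0.60]),
    "queen":    np.array([0.95, 0.10, 0.90, 0.60]),
    "man":      np.array([0.10, 0.90, 0.10, 0.40]),
    "woman":    np.array([0.10, 0.10, 0.90, 0.40]),
    "princess": np.array([0.90, 0.10, 0.90, 0.10]),
}

def closest(target, exclude):
    """Return the word whose vector is nearest (by cosine) to target."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# king - man + woman lands closest to queen.
result = closest(vectors["king"] - vectors["man"] + vectors["woman"],
                 exclude={"king", "man", "woman"})
print(result)  # → queen
```

Subtracting man strips out the Masculinity component and adding woman supplies Femininity, leaving a vector that still scores high on Royalty, which is exactly where queen sits.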
The magic of the mathematics behind word vectors is that they organize words into related groups. Each of these groups is a dimension in the vector. For example, in a tweet where the tweeter says no one is responding as usual, the words as usual might be grouped into a dimension with other word pairs such as of course, yeah obviously, and a doy that indicate frustration.
The King / Queen, Man / Woman example is used regularly in explanations of word vectors. Adrian Colyer's excellent blog, the morning paper, discusses word vectors in more detail at https://blog.acolyer.org/2016/04/21/the-amazing-power-of-wordvectors/. Figures 1 and 2 are based on figures from the first part of that article. If you're interested in exploring this topic further, the rest of Adrian's article is a good place to start.
Deciding how many words to include in each group
To work with vectors in SageMaker, the only decision you need to make is whether SageMaker should use single words, pairs of words, or word triplets when creating the groups. For example, if SageMaker uses the word pair as usual, it may get better results than if it uses the single word as and the single word usual, because the word pair as usual expresses a different concept than the individual words do.
In my work, I normally use word pairs, but I've occasionally gotten better results from triplets. In one project where we were extracting and categorizing marketing terms, using triplets resulted in much higher accuracy, probably because marketing fluff is often expressed in word triplets such as world class results, high powered engine, and fat burning exercise.
NLP uses the terms unigram, bigram, and trigram for single-, double-, and triple-word groups. Figures 3, 4, and 5 show examples of each.
Unigrams are single words. Unigrams work well when word order isn't important. For example, if you're creating word vectors for medical research, unigrams do a good job of identifying similar concepts.
Figure 3. Unigrams
Bigrams are pairs of words. Bigrams work well when word order is important, such as in sentiment analysis. The bigram as usual conveys frustration, but the unigrams as and usual do not.
Figure 4. Bigrams
Trigrams are groups of three words. In practice, I don't see much improvement in moving from bigrams to trigrams, but on occasion there is. One project I worked on, identifying marketing terms, delivered significantly better results using trigrams, probably because the trigrams better captured the common patterns hyperbole noun noun, as in greatest coffee maker, and hyperbole adjective noun, as in fastest diesel car.
Figure 5. Trigrams
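Producing these word groups is mechanical. Here's a minimal sketch (the ngrams helper is my own, not a SageMaker function) that generates unigrams, bigrams, and trigrams from the example tweet text:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "no one is responding as usual".split()

print(ngrams(tokens, 1))  # unigrams: ['no', 'one', 'is', 'responding', 'as', 'usual']
print(ngrams(tokens, 2))  # bigrams include 'as usual'
print(ngrams(tokens, 3))  # trigrams include 'responding as usual'
```

Note how the bigram list keeps as usual together, which is exactly the frustration signal the unigrams lose.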
The machine learning application uses an algorithm called BlazingText to predict whether a tweet should be escalated.
What is BlazingText and how does it work?
The machine learning algorithm I'm using in this article is called BlazingText. It's a version of an algorithm called fastText, developed by researchers at Facebook in 2017, and fastText is in turn a version of the word2vec algorithm developed by Mikolov.
Figure 6 shows the workflow with BlazingText in place. In step 1, a tweet is sent by a person requiring support. In step 2, BlazingText decides whether the tweet should be escalated to a person. In step 3, the tweet is either escalated to a person (step 3a) or handled by a bot (step 3b).
Figure 6. BlazingText workflow
For BlazingText to decide whether a tweet should be escalated, it needs to determine whether the person sending the tweet is frustrated. To do this, BlazingText doesn't need to understand frustration or even what the tweet is about. It only needs to determine how similar the tweet is to other tweets that have been labelled frustrated or not frustrated.
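That similarity idea can be sketched in a few lines. The toy example below uses made-up two-dimensional word vectors and a hypothetical should_escalate function; it is not the real BlazingText algorithm, which learns its own vectors during training, but it illustrates classifying a tweet by comparing its average word vector with the averages of previously labelled tweets:

```python
import numpy as np

# Made-up 2-D word vectors; a real model learns hundreds of dimensions.
word_vectors = {
    "as": np.array([0.9, 0.1]), "usual": np.array([0.8, 0.2]),
    "no": np.array([0.7, 0.2]), "help": np.array([0.5, 0.5]),
    "thanks": np.array([0.1, 0.9]), "great": np.array([0.1, 0.8]),
}

def tweet_vector(tweet):
    """Average the vectors of known words: a crude sentence vector."""
    vecs = [word_vectors[w] for w in tweet.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Averages of tweets previously labelled frustrated / not frustrated.
frustrated = tweet_vector("no help as usual")
not_frustrated = tweet_vector("great thanks")

def should_escalate(tweet):
    """Escalate when the tweet sits closer to the frustrated examples."""
    v = tweet_vector(tweet)
    return cosine(v, frustrated) > cosine(v, not_frustrated)

print(should_escalate("no one is responding as usual"))  # → True
print(should_escalate("thanks for the great help"))      # → False
```

The classifier never "understands" the tweets; it only measures which pile of labelled examples the new tweet most resembles, which is the essence of the approach described above.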
With that as background, you’re ready to start building the model.
A note regarding SageMaker: When you first set up SageMaker, you create a notebook instance. This is a server that AWS configures to run your notebooks. We recommend selecting a medium-sized server instance because it has enough grunt to do anything we covered in this article; more on this can be found in appendix C. In your own work, as you work with larger datasets, you may need a larger server for your notebook instance.