From Machine Learning with TensorFlow, Second Edition by Chris Mattmann

This article covers:

• Building a sentiment classifier using logistic regression and softmax
• Measuring classification accuracy
• Computing the ROC curve and measuring classifier effectiveness
• Submitting your results to the Kaggle challenge for Movie Reviews


Check out part 1 here to learn about using text and word frequency (Bag of Words) to represent sentiment

Building a sentiment classifier using logistic regression

When dealing with logistic regression you identify the dependent and independent variables. In sentiment analysis, your independent variables are the 5,000 dimensions of the Bag of Words feature vector for each review, and you have 25,000 reviews to train on. Your dependent variable is the sentiment value: a one corresponding to a positive review from IMDB, or a zero corresponding to a user’s negative sentiment about the movie.

Have you noticed that the IMDB data that you use contains the review and the sentiment, but no title? Where are the title words? Those words could factor into the sentiment if they contain trigger words that map to the words movie-goers use in their reviews, but overall you don’t need the titles; you only need a sentiment (something to learn) and a review.

Try to conceptualize the space of solutions that your classifier is exploring given the training data and feature space. You can imagine a ground plane, and call the vertical axis (the elevation above the ground, as if you were standing on it and looking up to the sky) the sentiment. On the ground plane you have axes beginning at the origin where you’re standing and proceeding in every direction, each corresponding to a particular word from your vocabulary: 5,000 axes, where the distance along an axis corresponds to the count of that particular word. Data points in this plane are the specific counts on each of the word axes, and the y value is either 1 or 0, depending on whether the collection of counts for a particular point implies a positive or a negative sentiment. You can imagine this looks similar to what you see in figure 1.

Figure 1. Visualizing the construction of the classifier using logistic regression. Your feature space is the count of the words arranged as the plane three-dimensionally where the value is the occurrence count. The y-axis corresponds to the sentiment result, 0 or 1.

Given this construction, we can represent the logistic regression classifier using the following equations. Recall that the goal is to have a linear function of all of the independent variables and their associated weights, 1 through 5000, as the argument to the sigmoid (sig) function, which results in a smooth curve that ranges between 0 and 1 and corresponds to the sentiment, the dependent variable:

$$M(x, w) = \mathrm{sig}(w x + w_0)$$

$$\mathrm{sentiment} = \mathrm{sig}(w_1 x_1 + w_2 x_2 + \dots + w_{5000} x_{5000} + w_0)$$
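As a rough illustration of the equation above, here is a minimal NumPy sketch. The `model` function name, the three-word toy vocabulary, and the weight values are made up for this example; they are not the article’s trained model.

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def model(x, w):
    """Logistic regression: w[0] is the constant w0, w[1:] are the per-word weights."""
    return sigmoid(np.dot(w[1:], x) + w[0])

# Hypothetical 3-word vocabulary instead of 5,000 words.
w = np.array([0.0, 1.5, -2.0, 0.5])   # [w0, w1, w2, w3]
x_pos = np.array([2.0, 0.0, 1.0])     # word counts for one review
x_neg = np.array([0.0, 3.0, 0.0])     # word counts for another review

p_pos = model(x_pos, w)   # linear part is 3.5, so the result is above 0.5
p_neg = model(x_neg, w)   # linear part is -6.0, so the result is below 0.5
print(p_pos, p_neg)
```

A result above 0.5 would be read as positive sentiment and one below 0.5 as negative, which is the thresholding used later for predictions.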

Setting up the training for your model

It’s time to set up your TensorFlow logistic regression classifier. Begin with an arbitrary learning rate of 0.1 and train for up to 2000 epochs (which worked well on my laptop); this is an upper bound, because you’ll perform early stopping. Early stopping is a technique that measures the difference in loss (or error rate) between the prior epoch and the current epoch. If the error rate changes between epochs by less than some small threshold epsilon, the model is said to be stable and you can break early in your training.
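The early-stopping rule can be sketched independently of TensorFlow. In the sketch below, `train_with_early_stopping`, `step_fn`, and the toy quadratic problem are hypothetical names and data chosen only to show the stopping test; the article’s own training loop appears later in listing 2.

```python
def train_with_early_stopping(step_fn, max_epochs=2000, epsilon=1e-4):
    """Run step_fn once per epoch; stop when the loss changes by less than epsilon.

    step_fn is a hypothetical callable that performs one training step
    and returns the current loss as a float.
    """
    prev_loss = float("inf")
    loss = prev_loss
    for epoch in range(max_epochs):
        loss = step_fn()
        if abs(prev_loss - loss) < epsilon:
            return epoch + 1, loss   # number of epochs actually run
        prev_loss = loss
    return max_epochs, loss

# Toy problem: minimize (v - 3)^2 with plain gradient descent.
state = {"v": 0.0}
learning_rate = 0.1

def toy_step():
    grad = 2.0 * (state["v"] - 3.0)
    state["v"] -= learning_rate * grad
    return (state["v"] - 3.0) ** 2

epochs_run, final_loss = train_with_early_stopping(toy_step)
print(epochs_run, final_loss)
```

On this toy problem the loop stops long before the 2000-epoch limit, which is the saving early stopping is meant to provide.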

You’ll set up your sigmoid function, which is needed for the model as well. The sigmoid squashes the output of the linear function into a value between 0 and 1, and because it is smooth and differentiable, the back-propagation process used to learn the model weights has a well-behaved gradient to follow at each training step after applying the cost function.

Create the placeholders in TensorFlow for the Y values that you’ll learn (the sentiment labels) and for the X input: 25,000 movie reviews, each represented by one 5,000-dimensional Bag of Words feature vector. In listing 1 you use a Python dictionary to store one placeholder per vocabulary word, indexed X0-X4999. The w variable (the weights) holds one weight for each independent variable X, plus one constant, w0, added at the end of the linear equation.

The cost function is a convex cross-entropy loss function and you’ll use gradient descent as your optimizer. The complete code to set up the model is shown in listing 1.

Listing 1. Setting up the training for the logistic regression sentiment classifier

```
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)

learning_rate = 0.1 #A
training_epochs = 2000 #A
def sigmoid(x): #B
    return 1. / (1. + np.exp(-x)) #B

Y = tf.placeholder(tf.float32, shape=(None,), name="y") #C
# one weight per vocabulary word, plus the constant w0 stored at index 0
w = tf.Variable([0.] * (train_data_features.shape[1]+1), name="w", trainable=True) #C

ys = train['sentiment'].values #D
Xs = {}
for i in range(train_data_features.shape[1]):
    Xs["X"+str(i)] = tf.placeholder(tf.float32, shape=(None,), name="x"+str(i))

linear = w[0]  # start from the constant term w0
for i in range(0, train_data_features.shape[1]):
    linear = linear + (w[i+1] * Xs["X"+str(i)])
y_model = tf.sigmoid(linear) #E

cost = tf.reduce_mean(-tf.log(y_model * Y + (1 - y_model) * (1 - Y))) #F
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost) #F

```

#A Sets up the initial model hyperparameters for learning rate and number of epochs

#B Defines a NumPy sigmoid function, used later to make predictions without TensorFlow

#C Defines TensorFlow placeholders to inject the actual input and label values

#D Extracts the labels to learn from the Pandas dataframe

#E Constructs the logistic regression model to learn

#F Defines the cross-entropy cost function and train operation for each learning step
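The cost in listing 1 is written as `-log(y_model * Y + (1 - y_model) * (1 - Y))`, which may look different from the usual cross-entropy formula. For labels that are exactly 0 or 1 the two forms give the same value. The small NumPy check below uses made-up probabilities and labels to show this; it is an illustration, not part of the training code.

```python
import numpy as np

y_model = np.array([0.9, 0.2, 0.6, 0.1])  # hypothetical predicted probabilities
Y = np.array([1.0, 0.0, 0.0, 1.0])        # hypothetical true labels

# Form used in the listing: selects y_model when Y == 1 and (1 - y_model) when Y == 0.
cost_listing = np.mean(-np.log(y_model * Y + (1 - y_model) * (1 - Y)))

# Textbook binary cross-entropy.
cost_textbook = np.mean(-(Y * np.log(y_model) + (1 - Y) * np.log(1 - y_model)))

print(cost_listing, cost_textbook)
```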

After setting up your model you can perform the training using TensorFlow. You’ll perform early stopping to save yourself useless epochs once the loss settles down.

Performing the training for your model

Create a `tf.train.Saver` to save the model graph and the trained weights so you can later reload them to make classification predictions with your trained model. The training steps look similar to what you’ve seen before: you initialize TensorFlow, and this time you use TQDM to keep track of and incrementally print the training progress as an indicator. Note that training takes 30 to 45 minutes and consumes gigabytes (GB) of memory, at least on my fairly beefy Mac laptop. Having TQDM lets you know how the training process is going, and it’s a must-have.

The training step injects the 5000-dimensional feature vectors into the X placeholder dictionary that you created and the associated sentiment labels into the Y placeholder. Use your convex loss function as the model cost and compare the cost from the previous epoch with the current one to determine if your code should perform early stopping in the training and save precious cycles. The threshold value of `0.0001` was arbitrarily chosen but could be considered a hyperparameter to explore, given additional cycles and time. The full training process for the logistic regression sentiment classifier is shown in listing 2.

Listing 2. Perform the training step for the logistic regression sentiment classifier

```
from tqdm import tqdm

saver = tf.train.Saver() #A
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    prev_err = 0. #B
    for epoch in tqdm(range(training_epochs)):
        feed_dict = {}
        for i in range(train_data_features.shape[1]):
            # column i holds the count of vocabulary word i for every review
            feed_dict[Xs["X"+str(i)]] = train_data_features[:, i, None].reshape(len(train_data_features)) #C
        feed_dict[Y] = ys #C
        err, _ = sess.run([cost, train_op], feed_dict=feed_dict)
        print(epoch, err)
        if abs(prev_err - err) < 0.0001: #D
            break #D
        prev_err = err

    # still inside the session, so the trained variables are available
    w_val = sess.run(w, feed_dict) #E
    save_path = saver.save(sess, "./en-netflix-binary-sentiment.ckpt") #F

print(w_val)
print(np.max(w_val))

```

#A Creates the saver to capture your model graph and associated trained weights

#B Used to capture the previous loss function value to test for early stopping

#C Provide the 25,000 review 5,000 dimensional feature vectors and the sentiment labels

#D Test if the previous loss value varies from the current loss value by some small threshold and it breaks if this happens

#E Obtain the trained weights as the model graph is still loaded

#F Save out the model graph and associated trained weights
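The listings above build the linear part of the model one feature at a time, which means one placeholder per vocabulary word. The same quantity can be computed with a single matrix-vector product. The sketch below checks that equivalence in NumPy on random toy data; the shapes, seed, and variable names are illustrative assumptions, not the article’s data or code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reviews, n_words = 6, 4                       # toy sizes instead of 25,000 x 5,000
X = rng.integers(0, 3, size=(n_reviews, n_words)).astype(float)
w = rng.normal(size=n_words + 1)                # w[0] is the constant term

# Per-feature loop, mirroring the structure of the listings.
linear_loop = np.full(n_reviews, w[0])
for i in range(n_words):
    linear_loop = linear_loop + w[i + 1] * X[:, i]

# Equivalent single matrix-vector product.
linear_vec = X @ w[1:] + w[0]

print(np.allclose(linear_loop, linear_vec))
```

A vectorized form like this is one possible way to reduce the number of graph nodes, though whether it shortens the 30 to 45 minute training time reported here is not something this sketch measures.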

You trained your first text sentiment classifier using logistic regression! Next, I’ll show you how to use it to make predictions against new unseen data, and you’ll also learn how to evaluate its accuracy and precision and get an overall feel for how well the classifier is performing by running it against the test data from the Kaggle competition and submitting your results to Kaggle!

Making predictions using your sentiment classifier

You built your classifier, but how do you use it to make predictions? Two key pieces of information are stored when you make that trusty call to `tf.train.Saver` which saves your checkpoint file.

• First, the checkpoint contains the model weights that you arrived at, in this case the weights of the `sigmoid` linear portion corresponding to each of the vocabulary words in your Bag of Words model.
• Second, the checkpoint contains the model graph and its current state in case you want to pick up where you left off and continue training its next epochs.

Making predictions is as simple as loading the checkpoint file and applying those weights to the model. You don’t have to reuse your TensorFlow version of the `y_model` function from listing 1 (the `tf.sigmoid` function), because doing this loads the model graph and takes up additional resources to prepare TensorFlow to continue training. Instead you can apply the learned weights to a NumPy version of the model (the inline `sigmoid` function from listing 1), because you won’t be doing any further training on it.

Seems pretty simple, right? It is, but I left out one major thing that we need to cover first: how you prepare new input so the trained model can make predictions that aid your automated decisions. Consider the steps you performed for your model training, enumerated again for simplicity:

1. Data cleaning of 25,000 movie reviews
   1. Strip HTML
   2. Remove punctuation and only consider `a-zA-Z`
   3. Remove stop words
2. Apply the Bag of Words model and limit the vocabulary to a 5,000-word feature vector
3. Use the 25,000 vectors of size 5,000 and the associated 25,000 sentiment labels (1, 0) with logistic regression to build the classification model
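The cleaning steps in item 1 can be sketched in a few lines of plain Python. This is a simplified stand-in, not the `review_to_words` function used in this article: the `clean_review` name, the regular expression used to strip HTML tags, and the tiny stop-word set are all assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "was", "i", "that", "and"}

def clean_review(raw_review):
    """Strip HTML tags, keep letters only, lowercase, and drop stop words."""
    no_html = re.sub(r"<[^>]+>", " ", raw_review)       # crude tag removal
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_html)   # drop digits and punctuation
    words = letters_only.lower().split()
    kept = [word for word in words if word not in STOP_WORDS]
    return " ".join(kept)

cleaned = clean_review("<br />This movie was GREAT, I loved it!")
print(cleaned)  # prints "movie great loved"
```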

Now, say you want to use your generated model to perform predictions on new text, such as the following two sentences. The first sentence is clearly a negative review, and the second is a positive review.

```
new_neg_review = "Man, this movie really sucked. It was terrible. I could not possibly watch this movie again!"
new_pos_review = "I think that this is a fantastic movie, it really "

```

How do you provide these sentences to your model to make sentiment predictions? As it turns out, you need to apply at least the same data-preprocessing steps that you did during training to the prediction process; that way you’re considering the text the same way that you trained. You were training on 5,000-dimensional feature vectors, and you need to do the same thing to prepare the input text to make predictions. Additionally, you need to take heed of one more step. The weights generated during training were under the auspices of a common shared vocabulary of 5,000 words generated by the `CountVectorizer`. The unseen input text you’re making predictions with may have a different vocabulary than your training text. It may use other words, perhaps more or fewer, than you trained on. And yet, you spent nearly forty-five minutes training your logistic-regression sentiment classifier and perhaps even longer preparing the input and labels for that training. Is that work now invalidated? Do you have to perform training all over again?

Remember I mentioned earlier that choosing a value of 5,000 for the vocabulary size in `CountVectorizer` allows for sufficient richness, but it’s something that you may have to tune or explore to get the best fit. Vocabulary size matters and what you predict with your trained model does as well. 5,000 words in a vocabulary left after preprocessing steps and data cleaning can achieve high accuracy during training and on unseen data – as high as 87% in my training, which you’ll reproduce later using Receiver Operating Characteristic (ROC) curves. But who’s to say ten thousand words wouldn’t have achieved even higher accuracy? Not me!

Indeed, using more words in your vocabulary may achieve a higher accuracy, but it depends on what data you intend to make predictions on and the generality of that data. It’s also a big influence on your overall memory, CPU, and GPU requirements for training, because using more features per input vector takes more resources. Note that if your unseen data overlaps sufficiently in terms of its vocabulary with the representative vocabulary from your training data, then there’s no need to increase its size. Figuring out the optimal vocabulary size is an exercise best left for a semester-long statistics or NLP graduate course, but suffice to say this is a hyperparameter that you may want to explore further. For example, this posting on Stack Overflow has some nice pointers to considerations for vocabulary size: https://stackoverflow.com/questions/46118910/scikit-learn-vectorizer-max-features.

Either way, to move forward and implement the sentiment-prediction function, you need to figure out the overlap between the vocabulary of your new text and the existing vocabulary from training. Then, for prediction, you consider only the Bag of Words counts for those overlapping terms in the new text, placed at the positions those terms occupy in the training vocabulary. The entire prediction pipeline and its relationship with the training pipeline are illustrated in figure 2.

Figure 2. Making predictions using machine learning. During training (top) you preprocess the input data by cleaning the text, and converting it to a 5,000-dimensional feature vector and using that to learn sentiment labels (1 or 0) using 25,000 movie reviews. To make predictions with the learned model (right side), you need to perform the same data cleaning steps and in addition figure out the overlap of the new text and its vocabulary with your trained one.
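The overlap step described above can be sketched in plain Python. The `to_feature_vector` helper and the four-word training vocabulary below are hypothetical and only illustrate the idea: count the words of the new text, keep the counts for words the training vocabulary knows, and leave every other position at zero.

```python
from collections import Counter

def to_feature_vector(cleaned_text, train_vocab):
    """Count only the words that also appear in the training vocabulary."""
    index_of = {word: i for i, word in enumerate(train_vocab)}
    counts = Counter(cleaned_text.split())
    vec = [0] * len(train_vocab)
    for word, count in counts.items():
        if word in index_of:            # words outside the vocabulary are ignored
            vec[index_of[word]] = count
    return vec

train_vocab = ["bad", "fantastic", "movie", "terrible"]   # hypothetical 4-word vocabulary
vec = to_feature_vector("movie really sucked terrible watch movie", train_vocab)
print(vec)  # prints [0, 0, 2, 1]
```

Words such as "sucked" that never appeared in the training vocabulary contribute nothing to the prediction, which is why vocabulary coverage matters.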

Let’s start writing the `predict` function. It should take the unmodified review text as input along with the training vocabulary and learned weights from the training process. As I mentioned you need to apply the same data cleaning process, and you clean the text by doing the following:

1. tokenizing it
2. removing punctuation and non-letter characters
3. removing stop words
4. rejoining the tokens back together.

Afterward, apply the Bag of Words model again to turn the unseen input text into a feature vector that you can make sentiment predictions on. The function should focus on figuring out the overlap of the vocabulary words from the new input with those of the training vocabulary. For each overlapping word you use its word count from the new text, and all other elements in the feature vector are zero. The resultant feature vector should then be fed to the sigmoid function for your logistic regression model, using the learned weights. Then the result (a probability between 0 and 1 of its sentiment) is compared with a threshold value of 0.5 to determine if the sentiment is 1 or 0. The full code for the predict function appears in listing 3.

Listing 3. Making predictions using the logistic regression sentiment classifier

```
from sklearn.feature_extraction.text import CountVectorizer

def predict(test_review, vocab, weights, threshold=0.5): #A

    test_review_c = review_to_words(test_review) #B

    n_vectorizer = CountVectorizer(analyzer = "word", #C
                                   tokenizer = None,
                                   preprocessor = None,
                                   stop_words = None,
                                   max_features = 5000)
    ex_data_features = n_vectorizer.fit_transform([test_review_c])
    ex_data_features = ex_data_features.toarray() #D
    # get_feature_names() was removed in scikit-learn 1.2;
    # use get_feature_names_out() on newer versions
    test_vocab = n_vectorizer.get_feature_names() #D
    test_vocab_counts = ex_data_features.reshape(ex_data_features.shape[1]) #D

    ind_dict = dict((k, i) for i, k in enumerate(vocab)) #E
    test_ind_dict = dict((k, i) for i, k in enumerate(test_vocab)) #E
    inter = set(ind_dict).intersection(test_vocab) #E
    indices = [ ind_dict[x] for x in inter ] #E
    test_indices = [test_ind_dict[x] for x in inter] #E

    test_feature_vec = np.zeros(train_data_features.shape[1]) #F
    for i in range(len(indices)): #F
        test_feature_vec[indices[i]] = test_vocab_counts[test_indices[i]] #F

    test_linear = weights[0] #G
    for i in range(0, train_data_features.shape[1]): #G
        test_linear = test_linear + (weights[i+1] * test_feature_vec[i])
    y_test = sigmoid(test_linear) #G

    return np.greater(y_test, threshold).astype(float) #H

```

#A predict takes review text to test, the train vocabulary, learned weights and the threshold cutoff for a positive or negative prediction as parameters

#B clean the review using the same function you used for training

#C create the test vocabulary and counts

#D convert to a NumPy array of vocabulary and counts

#E figure out the intersection of the test vocabulary from the review with the full vocabulary

#F all zeros for the 5000-feature vector except for the overlap indices that we have counts for

#G apply your logistic regression model with the learned weights

#H if the predicted probability is greater than the threshold (0.5 by default) the sentiment is 1, otherwise it is 0

Go ahead and try the function on the two test reviews `new_neg_review` and `new_pos_review`! I’ve copied them again below. You can see it properly predicts the negative review as a 0 and the positive review as a 1. Cool, right?

```
new_neg_review = "Man, this movie really sucked. It was terrible. I could not possibly watch this movie again!"
new_pos_review = "I think that this is a fantastic movie, it really "
predict(new_neg_review, vocab, w_val)
predict(new_pos_review, vocab, w_val)

```

Now that you have a predict function you can use it to compute a confusion matrix. Creating a confusion matrix of true-positives, false-positives, true-negatives and false-negatives allows you to measure the classifier’s ability to predict each class and compute precision and recall. Additionally, you can generate an ROC curve and test how much better your classifier is than the baseline.
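As a rough sketch of that next step, the counts in a confusion matrix and the metrics derived from them can be computed as follows. The `confusion_counts` helper and the label lists are made up for illustration; they are not results from the classifier trained in this article.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Made-up labels and predictions, purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)          # of the predicted positives, how many were right
recall = tp / (tp + fn)             # of the actual positives, how many were found
accuracy = (tp + tn) / len(y_true)
print(tp, fp, tn, fn, precision, recall, accuracy)
```

In practice `y_pred` would come from calling `predict` on each held-out review and `y_true` from the known sentiment labels.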