From Machine Learning with TensorFlow, Second Edition by Chris Mattmann
This article covers using text and word frequency (Bag of Words) to represent sentiment.
One of the magic uses of machine learning that impresses everyone nowadays is teaching the computer to learn from text. With social media, SMS text, Facebook messenger, WhatsApp, Twitter and other sources generating hundreds of billions of text messages a day, there’s no shortage of text to learn from.
SEE FOR YOURSELF Check out this famous infographic demonstrating the abundance of textual data arriving each day from various media platforms: https://www.textrequest.com/blog/how-many-texts-people-send-per-day/.
Social media companies, phone providers, and app makers try to use the messages you send to make decisions and classify you. Have you ever sent your significant other an SMS text message about the Thai food you ate for lunch and then later saw ads on your social media pop up recommending new Thai restaurants to visit? Scary as it seems that big brother is trying to identify and understand your food habits, there are also practical applications used by online streaming service companies trying to determine if you enjoyed their films or not.
After watching a film, have you ever taken the time to issue a simple, “Wow that was a great movie! Loved Bill’s performance!” or, “That movie was grossly inappropriate, was well over three hours and as such after first being disgusted by the gore, I fell asleep because there was no plot!” (Ok, admittedly, I may have authored that last comment on some online platform.) YouTube is famous for users who come not only to watch the videos and viral content, but also to read the comments and written reviews of movies, videos, and other digital media. These reviews are simple in the sense that you can fire and forget a quick sentence or two, get your feelings out, and move on with your life. Sometimes these comments are hilarious, sometimes angry, sometimes extremely positive, and ultimately they run the gamut of emotions that online participants can have after viewing the content.
Those emotions and sentiments are quite useful to online media service companies. Given an easy way to classify sentiment, the companies could determine whether a particular video of a celebrity generated extreme sadness or caused users to respond extremely positively. In turn, the companies could associate those emotions with what you did next. For example, if you wrote a few positive sentences about a movie and then clicked a link to buy more movies starring the same actor, they’d have the whole cause-and-effect pipeline. The media company could then either generate more of that content or show you more of the types of content that you are interested in. Doing this may generate increased revenue, for example, if your positive reaction led you to purchase something about that celebrity afterwards.
As it turns out, machine learning offers a methodology for exactly this: classification, which takes input data and generates a label for it. You can frame sentiment classification in two ways: binary sentiment (for example, a positive or negative reaction) and multi-class sentiment (for example, hate, sad, neutral, like, love). You’ll use two techniques to handle those cases:
- logistic regression for the binary sentiment
- softmax regression for multi-class
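To make the difference between the two techniques concrete, here is a minimal sketch of the functions that sit at the heart of each one, written with only the standard library. The scores passed in are made-up numbers, not the output of a trained model, and the function names are illustrative rather than taken from the book.

```python
import math

def sigmoid(z):
    """Squash one real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score keeps exp() from overflowing.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores a model might assign to a single review.
p_positive = sigmoid(2.0)  # binary: probability that the review is positive
p_classes = softmax([0.1, 0.3, 0.5, 2.0, 1.0])  # hate, sad, neutral, like, love

print(p_positive)
print(p_classes)
```

Logistic regression produces a single probability for the positive class, whereas softmax regression produces one probability per class and picks the largest.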
The challenge with the input in this case is that it’s text and not some nice input vector of numbers, like the randomly generated data points from the trusty NumPy library. Lucky for you, the text and information retrieval community has developed a technique for mapping text to a numerical feature vector that’s perfect for machine learning, called the Bag of Words model. Let’s learn about it next.
The Bag of Words model
The Bag of Words model is a method from natural language processing (NLP) that takes as input text in the form of a sentence and turns it into a feature vector by considering the extracted vocabulary words and the frequency of their occurrence. Named such because each word frequency count is like a “bag,” with each occurrence of a word as an item in that bag, the Bag of Words model is a simple, widely used method to take, for example, a review of a movie and convert it into a feature vector, which you need to classify its sentiment. Consider the following review snippet text written about a recent Michael Jackson movie:
With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again.
The first step in applying the Bag of Words model to processing this review is to preprocess the text and extract only the words with meaning. This usually means that you remove any non-letter character such as numbers or additional annotations, including HTML tags, or apostrophes, and generally strip the text down to its bare words. After that, the approach reduces the remaining words in the subset to those that are nouns or verbs or adjectives, and takes out articles, conjunctions, and other stop-words or words that aren’t distinguishing features of the text itself.
NOTE You can find many canned stop-word lists, for example, those used by Python’s Natural Language Toolkit (NLTK) are a good starting point: https://gist.github.com/sebleier/554280. Stop words are usually language-specific, and you want to make sure whichever list you use suits the language you are processing. Lucky for you, NLTK presently handles stop-words from twenty-one languages, and you can read more about it at https://stackoverflow.com/questions/54573853/nltk-available-languages-for-stopwords.
Once that step is complete, the Bag of Words model generates a count histogram of the remaining vocabulary words and that histogram becomes the fingerprint for the input text. Oftentimes the fingerprint is normalized by dividing the counts by the max count, resulting in a feature vector of values between 0 and 1. The whole process is shown in figure 1.
Figure 1. A visual depiction of the Bag of Words model. Text is analyzed, cleaned, and words are counted to form a histogram, which is then normalized to obtain a feature vector representation of the input text.
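The steps in figure 1 can be sketched on the Michael Jackson snippet with nothing but the standard library. This is an illustration under stated assumptions, not the pipeline used later in the article: the stop-word set is a tiny hand-picked stand-in for a real list, and the function name bag_of_words is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption); real lists are much longer.
STOP_WORDS = {"with", "all", "this", "at", "the", "i", "ve", "to",
              "his", "here", "and", "there", "again"}

def bag_of_words(text, stop_words):
    """Return a dict mapping each remaining word to its count divided by the max count."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)  # strip non-letters
    words = [w for w in letters_only.lower().split() if w not in stop_words]
    counts = Counter(words)  # the count histogram
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

review = ("With all this stuff going down at the moment with MJ i've started "
          "listening to his music, watching the odd documentary here and there, "
          "watched The Wiz and watched Moonwalker again.")

features = bag_of_words(review, STOP_WORDS)
print(features["watched"], features["moonwalker"])
```

Because "watched" is the most frequent remaining word, it gets the value 1.0, and every word that appears once gets 0.5.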
Applying the Bag of Words model to Movie Reviews
To get started with the Bag of Words model you’ll need some review text. The Kaggle Bag of Words Meets Bags of Popcorn challenge is an excellent already-completed competition that looked at 50,000 movie reviews from the Internet Movie Database (IMDB) and aimed to generate a sentiment classification from those movie reviews. You can read more about the challenge here: https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words. You’ll use those reviews in this article to build your sentiment classifiers.
To get started, grab the labeledTrainData.tsv file from https://www.dropbox.com/s/oom8kp7c3mbvyhv/labeledTrainData.tsv?dl=0 and save it to your local drive. You’ll also want to download the testData.tsv file from https://www.dropbox.com/s/cjhix4njcjkehb1/testData.tsv?dl=0, which you’ll use later. The files are formatted as tab-separated values (TSV) with the columns corresponding to a unique identifier (id), the sentiment (1 for positive or 0 for negative), and the review itself in HTML format, per row.
Now, let’s try out the Bag of Words model and create a function that produces machine-learning ready input features from the input labeledTrainData.tsv file. Open a new notebook called sentiment_classifier.ipynb and then create a review_to_words function. The first thing that the function does is convert an HTML review from IMDB into review text by calling the Tika Python library. Tika Python is a content analysis library whose main functionalities include file type identification, text and metadata extraction from over 1400 formats, and language identification.
ADDITIONAL READING A full explanation of Tika is the subject of another Manning book written by me. Seriously, check out Tika in Action.
Use Tika to take HTML and strip out all the tags into text using the parser interface and its from_buffer method, which takes a string buffer as input and outputs the associated extracted text from the HTML parser.
With the extracted review text in hand, use Python’s re module (for regular expressions) with the common pattern [^a-zA-Z]. Inside square brackets, a leading ^ negates the character class, so the pattern matches every character that is not an upper- or lowercase letter a through z, and re.sub replaces each match with a whitespace character.
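A quick way to check what the letters-only pattern does is to run it on a short string. The sample text below is made up. The second substitution shows why the final Z must be uppercase: the range A-z also covers the ASCII characters that sit between Z and a (square brackets, backslash, caret, underscore, and backtick), so those characters would survive the cleaning.

```python
import re

sample = "Loved it!!! 10/10 [best_movie] <br />"

# A leading ^ inside [...] negates the class: match anything that is NOT a letter.
letters_only = re.sub("[^a-zA-Z]", " ", sample)

# With a lowercase z, the range A-z also lets [ ] and _ through.
too_permissive = re.sub("[^a-zA-z]", " ", sample)

print(letters_only.split())    # ['Loved', 'it', 'best', 'movie', 'br']
print(too_permissive.split())  # ['Loved', 'it', '[best_movie]', 'br']
```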
The next step is to convert the text to lowercase, because word casing has meaning when interpreting a sentence or language but little meaning when you count word occurrences independent of structure. Stop words, including conjunctions and articles, are removed next using Python’s NLTK library. You’ll recall that it supports stop-words from twenty-one languages, and you’ll use the ones for English because these are all from IMDB’s English reviews. The final step is to take the remaining words and join them as a string. The output of listing 1 is a thinned-down version of the original review with only the meaningful words and no HTML: clean text. That clean text is the input to the Bag of Words model.
Listing 1. Creating features from the input text of the reviews
from tika import parser
from nltk.corpus import stopwords
import re

def review_to_words(raw_review):
    review_text = parser.from_buffer("<html>" + raw_review + "</html>")["content"]  #A
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)  #B
    words = letters_only.lower().split()  #C
    stops = set(stopwords.words("english"))  #D
    meaningful_words = [w for w in words if w not in stops]  #E
    return " ".join(meaningful_words)  #F
#A Function to convert a raw review to a string of words using Apache Tika
#B Removes non-letters
#C Convert to lower case, split into individual words
#D Convert stop words to a set which is much faster than searching list
#E Remove stop words
#F Join the words back into one string separated by space
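If you want to experiment with the cleaning steps before setting up Tika and NLTK, the following is a rough stand-in that uses only the standard library. It swaps Tika for Python’s html.parser module and NLTK’s list for a tiny hand-picked stop-word set, so its output will differ from listing 1 on real reviews. The names review_to_words_stdlib and DEMO_STOP_WORDS are hypothetical.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Small illustrative stop-word set (an assumption); NLTK's English list is far longer.
DEMO_STOP_WORDS = {"a", "an", "and", "the", "was", "it", "i", "that", "this"}

def review_to_words_stdlib(raw_review, stop_words=DEMO_STOP_WORDS):
    extractor = _TextExtractor()
    extractor.feed(raw_review)
    extractor.close()  # flush any buffered text
    review_text = " ".join(extractor.chunks)
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)  # remove non-letters
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stop_words)

print(review_to_words_stdlib("Wow, that was a <b>great</b> movie!<br /><br />Loved it."))
# wow great movie loved
```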
Armed with your function to generate clean review text, you can get started running the function over the 25,000 reviews in labeledTrainData.tsv. But first you need to load those reviews into Python.
Cleaning all the movie reviews
A handy library to take a TSV and to load it into Python efficiently is the Pandas library for creating, manipulating, and saving data frames. You can think of a data frame as a table which is machine-learning ready. Each column in the table is one of the features you can use in machine learning, and the rows are input for training or testing. Pandas provides functions for adding and dropping feature columns, and for augmenting and replacing row values in sophisticated ways. Pandas is the subject of many books (no I didn’t write those, other authors did!) and Google provides tens of thousands of hits on the subject, but for your purposes here you can use Pandas to create a machine-learning ready data frame from the input TSV file. Pandas can then help to inspect the number of features and rows and columns in your input.
With that data frame, you run your review-text cleaning code to generate clean reviews, which are then fed to the Bag of Words model. First, call the Pandas read_csv function, and tell it that the first row of the TSV file is the header (header=0), that the tab character (\t) is the delimiter, and that it should not treat quote characters specially (quoting=3). Once the training data is loaded, print its shape and column values, demonstrating how easily you can use Pandas to inspect your data frame.
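The quoting argument takes the same constants as Python’s built-in csv module, where the value 3 is csv.QUOTE_NONE. The sketch below uses the csv module rather than Pandas to show the effect on a made-up two-line stand-in for the TSV file: quote characters inside a review are kept as ordinary text instead of being interpreted as field delimiters.

```python
import csv
import io

# A made-up, two-line stand-in for labeledTrainData.tsv.
tsv_text = (
    'id\tsentiment\treview\n'
    '7\t1\tHe said "wow" twice.<br />Great film.\n'
)

print(csv.QUOTE_NONE)  # 3, the value passed as quoting=3

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
header = next(reader)  # first row holds the column names
rows = list(reader)

print(header)
print(rows[0][2])
```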
Because cleaning 25,000 movie reviews can take time, use Python’s TQDM helper library to keep track of your progress. TQDM is an extensible progress bar library that prints status either to the command line or to a Jupyter notebook. You wrap your iteration step (the range function in listing 2) as a tqdm object, and then every iteration step causes a progress bar increment to be visible to the user, either via the command line or in a notebook. TQDM is a great way to fire and forget a long-running machine-learning operation, and still know that something’s going on when you come back and check on it.
Listing 2 prints the training shape (25000, 3), corresponding to 25,000 reviews and 3 columns (id, sentiment, and review), and the output array(['id', 'sentiment', 'review'], dtype=object), corresponding to those column values. Add the code in listing 2 to your sentiment_classifier.ipynb notebook to generate 25,000 clean text reviews and keep track of the progress.
Listing 2. Use Pandas to read the movie reviews and apply your cleaning function
import pandas as pd
from tqdm import tqdm_notebook as tqdm

train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)  #A
print(train.shape)  #B
print(train.columns.values)  #B
num_reviews = train["review"].size  #C
clean_train_reviews = []  #D
for i in tqdm(range(0, num_reviews)):  #E
    clean_train_reviews.append(review_to_words(train["review"][i]))
#A Read the 25,000 reviews from the input TSV file
#B Prints the shape of the training data and the column names
#C Get the number of reviews based on the dataframe column size
#D Initialize an empty list to hold the clean reviews
#E Loop over each review and clean it using your function
Now that the reviews are clean, it’s time to apply the Bag of Words model. Python’s SK-learn library (http://scikit-learn.org/) is an extensible machine-learning library that provides a lot of complementary features to TensorFlow. Even though some of the features overlap, I often use SK-learn’s data cleaning functions alongside TensorFlow. You don’t have to be a purist, and mixing the two is something I recommend. For example, SK-learn comes with a fantastic implementation of Bag of Words called CountVectorizer, which you’ll use in listing 3 to apply the Bag of Words model.
First create the CountVectorizer with some initial hyperparameters. These tell SK-learn whether you want it to do any text analysis such as tokenization, preprocessing, or removal of stop words. I omit that here because you’ve already written your own text-cleaning function in listing 1 and applied it to the input text in listing 2.
One parameter of note is max_features, which controls the size of the vocabulary learned from the text. Choosing a size of five thousand ensures that the TensorFlow model you build has sufficient richness and that the resulting Bag of Words fingerprints for each review can be learned, without exhausting the RAM on your machine. Obviously, this is an example of parameter tuning that you can play around with later given larger machines and more time. A general rule of thumb is that a vocabulary on the order of thousands of words should provide sufficient learnability for English movie reviews, but for news, scientific literature, and other domains you may need to experiment to find an optimal value.
Call fit_transform with the clean reviews you generated in listing 2 to get back the vectorized Bag of Words, one row per review, with the row contents being the count per vocabulary word in that review. Then convert the result into a NumPy array, print its shape, and ensure that you see (25000,5000), corresponding to 25,000 input rows with 5,000 features per row. Add the code from listing 3 to your notebook.
Listing 3. Apply the Bag of Words model to obtain your training data
from sklearn.feature_extraction.text import CountVectorizer  #A

vectorizer = CountVectorizer(analyzer="word",  #A
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)  #B
train_data_features = train_data_features.toarray()  #C
print(train_data_features.shape)  #D
#A imports the CountVectorizer and instantiates the Bag of Words model
#B Fits the model and learns the vocabulary and transforms training data into vectors
#C Converts the results to a NumPy array
#D Prints the resultant input feature shape (25000,5000)
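To build intuition for what fitting and transforming does, here is a rough standard-library sketch of the idea: keep the most frequent words across the corpus as the vocabulary, then count them per review. It is not SK-learn’s implementation, and it simplifies details such as tokenization and tie-breaking among equally frequent words. The function name fit_transform_sketch and the three toy reviews are made up for illustration.

```python
from collections import Counter

def fit_transform_sketch(clean_reviews, max_features):
    """Illustrative stand-in for a count vectorizer; not scikit-learn's code."""
    tokenized = [review.split() for review in clean_reviews]
    corpus_counts = Counter(word for words in tokenized for word in words)
    # Keep the max_features most frequent words, with columns in alphabetical order.
    vocab = sorted(word for word, _ in corpus_counts.most_common(max_features))
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        matrix.append([counts[word] for word in vocab])  # one row per review
    return vocab, matrix

reviews = ["great movie loved movie", "boring movie plot", "loved plot great acting"]
vocab, features = fit_transform_sketch(reviews, max_features=4)

print(vocab)     # ['great', 'loved', 'movie', 'plot']
print(features)  # [[1, 1, 2, 0], [0, 0, 1, 1], [1, 1, 0, 1]]
```

With three reviews and max_features=4 the result has shape 3 by 4, in the same way that listing 3 yields 25,000 by 5,000. Words that fall outside the vocabulary, such as "boring" and "acting" here, are simply dropped.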
Exploratory Data Analysis on your Bag of Words
Doing some exploratory data analysis is always a good thing, and you may want to inspect the values of the vocabulary returned from CountVectorizer to get a feel for what words are present across all the reviews. You’ll want to convince yourself that there’s something to learn here, and what you’re looking for is some statistical distribution across the words that the classifier you’re going to build can pick up on. If all the counts are the same in every review and you can’t eyeball a difference between them, the machine-learning algorithm has the same difficulty!
The great part about SK-learn and CountVectorizer is that not only does it provide a one- or two-line API call to create the Bag of Words output, but it allows for easy inspection of the result. For example, you can get the learned vocabulary words and print them, total the counts for each word across all reviews with NumPy’s sum function, and then take a look at the first one hundred words and their sums. The code to perform this is in listing 4.
Listing 4. Exploratory data analysis regarding the returned Bag of Words
import numpy as np
import matplotlib.pyplot as plt

vocab = vectorizer.get_feature_names_out()  #A (scikit-learn 1.0+; older releases use get_feature_names())
print("size %d %s " % (len(vocab), vocab))  #A
dist = np.sum(train_data_features, axis=0)  #B
for tag, count in zip(vocab, dist):  #C
    print("%d, %s" % (count, tag))  #C
plt.scatter(vocab[0:100], dist[0:100])  #D
plt.xticks(rotation='vertical')  #D
plt.show()
#A Gets the learned vocabulary and prints out its size and the learned words
#B Sums up the counts of each vocabulary word
#C For each, print the vocabulary word and the number of times it appears in the training set
#D Plot the word count for first 100 words
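If you would rather see the outliers as text than read them off a plot, one option is to sort the words by their summed counts. The sketch below uses made-up toy values in place of vocab and train_data_features and plain Python in place of NumPy, so it runs on its own. The list comprehension over zip(*features) computes the same per-column totals as summing along axis 0.

```python
# Toy stand-ins for vocab and train_data_features (values are made up).
vocab = ["acting", "great", "movie", "plot"]
features = [
    [0, 1, 2, 0],
    [1, 0, 1, 1],
    [0, 1, 3, 1],
]

# Add up each column across all reviews (what summing along axis 0 does).
dist = [sum(column) for column in zip(*features)]

# Sort words by total count so the most frequent ones stand out.
ranked = sorted(zip(vocab, dist), key=lambda pair: pair[1], reverse=True)
for word, count in ranked:
    print("%d, %s" % (count, word))
```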
The output set of words is visualized in figure 2 for the first one hundred words across all 25,000 reviews. I could have picked any random set of one hundred words from the vocabulary, but to keep it simple I used the first one hundred. Even among these first one hundred words, the counts vary noticeably: they aren’t uniform, some words are used far more often than others, and there are some obvious outliers, so it looks like there’s signal for a classifier to learn.
Figure 2. Vocabulary counts summed across all 25,000 reviews of the first one hundred words in the extracted 5000-word vocabulary.
In part 2, we will get started building our logistic regression classifier. Stay tuned!