From Natural Language Processing in Action by Hobson Lane, Cole Howard, and Hannes Hapke

In this article, you will learn about tokenization in Natural Language Processing.

Check out this GitHub repository and PyPI package where we put the code and data from the examples in Natural Language Processing in Action. It’s at http://github.com/totalgood/nlpia and it can be installed with pip install nlpia on any computer with Python.


Save 37% on Natural Language Processing in Action. Just enter code fcclane into the discount code box at checkout at manning.com.


Build Your Vocabulary

So you’re ready to save the world with the power of Natural Language Processing? Well, the first thing you need is a powerful vocabulary. This article will help you split a document, or any string, into discrete tokens of meaning.

Our tokens are going to be limited to words, punctuation marks, and numbers, but the techniques we use are easily extended to any other units of meaning contained in a sequence of characters, like ASCII emoticons, Unicode emojis, mathematical symbols, etc. Retrieving tokens from a document will require some string manipulation beyond just the str.split() method. We’ll need to separate punctuation from words, and we’ll need to split contractions like “we’ll” into the words that were combined to form them.
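
As a quick preview of where this is headed, a short regular expression can pull punctuation marks out as separate tokens. The pattern below is just something we cooked up for illustration, not the tokenizer we’ll ultimately settle on, and it still leaves contractions like “Don’t” in one piece:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Don't panic, just tokenize.")  # words (with apostrophes) or single punctuation marks
["Don't", 'panic', ',', 'just', 'tokenize', '.']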

Think for a moment about what a word or token represents to you in your mind. Does it represent a single concept, or some blurry cloud of concepts? Could you be sure you could always recognize a word? Are natural language words like programming language keywords, which have precise definitions and grammatical usage rules? Could you write software that could recognize a word? Is “ice cream” one word or two to you? Don’t both words have entries in your mental dictionary? What about the contraction “don’t”? Should that string of characters be split into one or two “packets of meaning?” Can you think of additional words that are implied by the single-word command “Don’t!?”

Let’s take a look at some approaches to feature extraction. You’ll find that the approaches we show are not perfect. Feature extraction can rarely retain all of the information content of the input data in any machine learning pipeline, but, in NLP, composing a numerical vector from text is a particularly “lossy” feature extraction process. Nonetheless, bag-of-words vectors retain enough of the information content of the text to produce useful and interesting machine learning models.

As an example of why feature extraction from text is hard, consider stemming, i.e. trying to identify words with similar meanings based on the characters they contain. Very smart people have spent their careers developing algorithms for grouping similar words together based only on their spelling. Imagine how difficult that is. Imagine trying to remove verb endings like “ing” from “ending” but not from “sing.” Or imagine trying to discriminate between a pluralizing “s” at the end of words like “words” and a normal “s” at the end of words like “bus” and “lens.” Do isolated individual letters in a word or parts of a word provide any information at all about that word’s meaning? Can the letters be misleading? Yes and yes. We’ll show you how to make your NLP pipeline a bit smarter by dealing with these word spelling challenges using conventional stemming approaches for now.
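
To make those pitfalls concrete, here is a deliberately naive stemmer we made up for illustration. It blindly chops off “ing” and “s” endings, which is exactly the mistake described above; real stemmers, like the Porter stemmer, use much more careful rules:

>>> def naive_stem(word):
...     """Blindly strip common suffixes (for illustration only)."""
...     for suffix in ('ing', 's'):
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
>>> [naive_stem(w) for w in ('ending', 'sing', 'words', 'bus', 'lens')]
['end', 's', 'word', 'bu', 'len']

Only “ending” and “words” come out right; “sing,” “bus,” and “lens” get mangled.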


Building your vocabulary through tokenization

In NLP, tokenization is a particular kind of document segmentation. Segmentation breaks up text into smaller chunks or segments, with more focused information content. Segmentation can include breaking a document into paragraphs, paragraphs into sentences, sentences into phrases, or phrases into words and punctuation. Let’s focus on segmenting text into tokens. This is called “tokenization.”

Tokenization is a powerful tool. It breaks unstructured data, in this case text, into chunks of information which can be counted as discrete elements. These counts of token occurrences in a document can be used directly as a vector representing that document. This immediately turns an unstructured string (text document) into a structured, numerical data structure suitable for machine learning. These counts can be used directly by a computer to trigger useful actions and responses, or they may be used in a machine learning pipeline as features driving more complex behavior. The most common use for bag-of-words vectors created this way is for document retrieval, or search.

The simplest way to tokenize a sentence is to use whitespace within a string as the “delimiter” between words. In Python, this can be accomplished with the split method, which is available on all str objects as well as on the str built-in type itself:

  
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> sentence.split()
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26.']
>>> str.split(sentence)
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26.']
 

Figure 1. Jefferson


As you can see, this built-in Python method already does a decent job tokenizing a simple sentence. Its only “mistake” was on the last word, where it included the sentence-ending punctuation with the token “26.” For now let’s forge ahead with our 90% tokenization solution.

With a bit more Python you can create a numerical vector representation for each word called a one-hot word vector. A sequence of these one-hot word vectors fully captures the original document text in a numerical data structure!

  
>>> import numpy as np
>>> token_sequence = sentence.split()
>>> vocab = sorted(set(token_sequence))
>>> wordvecs = np.zeros((len(token_sequence), len(vocab)), int)  # one row per token, one column per vocabulary word
>>> for i, word in enumerate(token_sequence):
...     wordvecs[i, vocab.index(word)] = 1
>>> ' '.join(vocab)
'26. Jefferson Monticello Thomas age at began building of the'
>>> wordvecs
array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
 

In this representation of the sentence, each row is the one-hot vector for a single word. The 1 in each row marks the column for the vocabulary word that row represents. Because these are “one-hot” vectors, all the other positions (for the other words in our vocabulary) are set to zero. A one (1) means “on” or “hot.” A zero (0) means “off” or absent.

One nice feature of this particular vector representation of words is that no information is lost. As long as we keep track of which words are indicated by which column, we can reconstruct the original document from the sequence of one-hot word vectors generated this way. This detokenization process is 100% accurate even though our tokenizer was only 90% accurate at generating the tokens we thought would be useful! As a result, one-hot word vectors like this are typically used in neural nets, sequence-to-sequence language models, and generative language models – any model that needs to retain all of the meaning inherent to the order of words in the original text.
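
Here’s a quick check of that claim, reusing the wordvecs matrix and vocab list from the listing above. Each row’s argmax tells us which vocabulary word that row represents, so stitching the rows back together in order recovers the original sentence:

>>> ' '.join(vocab[i] for i in wordvecs.argmax(axis=1))  # look up the "hot" column for each row, in order
'Thomas Jefferson began building Monticello at the age of 26.'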

If you squint hard enough you might be able to imagine that the matrix of ones and zeros above looks a bit like a player piano paper roll. The vocabulary key at the top tells the machine which “note” (word) to play at each line. Our mechanical pianist is only allowed to use one “finger” at a time and can only play one note (word) melodies with a consistent beat. We can play back the original sentence to provide input for a neural net or any other machine learning pipeline. Or we could just play a sequence of word vecs back to generate text for a chat bot, just like a player piano might play a song for a less-artificial intelligence. Now all we need to do is figure out how to build a player piano that can “understand” and combine those word vectors in new ways to play us a song (say something) we haven’t heard before.

This representation of a sentence in one-hot word vectors retains all of the detail, grammar, and order of the original sentence and puts it into a form that a computer can understand. But, if we created a matrix like this, maintaining a consistently ordered vocabulary key for all the sentences in a single book, it would not be very practical. In most cases, the vocabulary of tokens that is useful for an NLP pipeline is well over ten thousand tokens long, and sometimes hundreds of thousands. So, if there are a million tokens in your vocabulary, and you want to process, say, 3000 books with 3500 sentences each (a typical number of sentences for a book), you’re talking well over a hundred trillion bits, tens of terabytes. Storing all those zeros, and trying to remember the order of the words, doesn’t make much sense.
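
The back-of-the-envelope arithmetic looks like this; the 15 tokens per sentence is just a rough figure we’re assuming for illustration:

>>> num_books = 3000
>>> sentences_per_book = 3500
>>> tokens_per_sentence = 15     # a rough assumption for illustration, not a measured average
>>> vocab_size = 10 ** 6
>>> num_bits = num_books * sentences_per_book * tokens_per_sentence * vocab_size  # one bit per matrix cell
>>> num_bits / (8 * 10 ** 12)    # convert bits to terabytes
19.6875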

What if our documents are much shorter than an entire book, like a sentence typed into a chat application or a response generated by a chatbot? Furthermore, what if we assumed that most of the meaning of a sentence can be gleaned from just the words themselves, jumbled up in a “bag” where we no longer keep track of their order? That turns out to be a reasonable assumption. Even for documents several pages long, a bag-of-words representation is still useful for summarizing and compressing the information content into a reasonably sized data structure.

If we ORed all these one-hot vectors together, rather than playing them one at a time, we’d get a binary bag-of-words vector that we could use to represent the whole sentence in a single, reasonable-length vector (only as long as our vocabulary). Of course, this throws away the order of the words, just like laying your arm across the piano and hitting all the notes (words) at once doesn’t make for a pleasant, meaningful melody. Nonetheless, this approach turns out to be critical to helping a machine “understand” a whole group of words as a unit.
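
With numpy, that OR is just a column-wise “any” over the one-hot rows of the wordvecs matrix we built earlier. Because every vocabulary word appears exactly once in our Jefferson sentence, the result here happens to be all ones; for a longer document with a bigger shared vocabulary, most entries would be zero:

>>> wordvecs.any(axis=0).astype(int)  # OR the one-hot rows together into one bag-of-words vector
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])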

Fortunately, the words in our vocabulary are sparsely utilized in any given text. So rather than hitting all the notes on a piano at once, our bag-of-words vector is more like a broad and pleasant piano chord, a combination of notes (words) that work well together, containing structure and meaning. Our chatbot can handle these chords even if there’s a lot of “dissonance” from words in the same statement that aren’t normally used together. Even dissonance (odd word usage) is information about a statement that can be used within a machine learning pipeline.

Here’s how we can put the tokens into a binary vector indicating the presence or absence of a particular word in a particular sentence. This vector representation of a set of sentences could be “indexed” to indicate which words were used in which document. This index is equivalent to the index you find at the end of many textbooks, except that instead of keeping track of which page a word occurs on, we can keep track of the sentence (or the associated vector) where it occurred. We can also keep track of every single word (at least for now), whereas a textbook index generally only cares about important words relevant to the subject of the book, i.e. words that a reader may want to try to find later.

Here’s what our single text document, the sentence about Thomas Jefferson, looks like as a binary bag-of-words vector.

  
>>> sentence_bow = {}
>>> for token in sentence.split():
...     sentence_bow[token] = 1
>>> sorted(sentence_bow.items())
[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]
 

One thing you might notice is that Python’s sorted() puts digits before letters, and capitalized words before lowercase words. This is the ordering of characters in the ASCII and Unicode character sets: digits come before capital letters, which come before lowercase letters. The order of our vocabulary is unimportant, and, as long as we are consistent across all the documents we tokenize this way, a machine learning pipeline will work equally well with any vocabulary order.
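
You can see that ordering with a quick experiment:

>>> sorted(['apple', 'Apple', '42'])  # digits sort before uppercase, uppercase before lowercase
['42', 'Apple', 'apple']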

You might also notice that using a dict (or any paired mapping of words to their 0/1 values) to store a binary vector shouldn’t waste much space. Using a dictionary to represent our vector ensures that it only has to store a 1 when any one of the thousands, or even millions, of possible words in our dictionary appears in a particular document. You can see how it would be much less efficient to represent a bag of words as a continuous list of 0’s and 1’s with an assigned location in a “dense” vector for each of the words in a vocabulary of, say, 100,000 words. This dense binary vector representation of our “Thomas Jefferson” sentence would require 100 kB of storage. Because a dictionary “ignores” the absent words, the words labeled with a 0, the dictionary representation only requires a few bytes for each word in our 10-word sentence. Additionally, this dictionary could be made even more efficient if we represented each word as an integer pointer to each word’s location within our lexicon, the list of words that makes up our vocabulary for a particular application.
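
Here’s a hypothetical sketch of that integer-pointer idea, using the tiny vocabulary from our one sentence. Each token is replaced by its index in the lexicon, so the whole sentence fits in ten small integers:

>>> lexicon = sorted(sentence_bow)                    # the sorted vocabulary from our bag of words
>>> [lexicon.index(tok) for tok in sentence.split()]  # each token as an integer pointer into the lexicon
[3, 1, 6, 7, 2, 5, 9, 4, 8, 0]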

Let’s use an even more efficient form of a dictionary, a Pandas Series, and we’ll wrap it up in a Pandas DataFrame so we can add more sentences to our binary vector “corpus” of texts about Thomas Jefferson. All this hand waving about gaps in the vectors and sparse vs. dense bags of words should become clear as we add more sentences, and their corresponding bag-of-words vectors, to our DataFrame (table of vectors corresponding to texts in a corpus).

  
>>> import pandas as pd
>>> df = pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()])), columns=['sent']).T
>>> df
      26.  Jefferson  Monticello  Thomas  age  at  began  building  of  the
sent    1          1           1       1    1   1      1         1   1    1
 

Let’s add a few more texts to our corpus to see how a DataFrame stacks up. A DataFrame indexes both the rows (documents) and columns (words), so it can serve as an “inverted index” for document retrieval, as we’ll show right after the next listing, in case we want to find a Trivial Pursuit answer in a hurry.

  
>>> sentences = "Construction was done mostly by local masons and carpenters.\n" \
...             "He moved into the South Pavilion in 1770.\n" \
...             "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."
>>> corpus = {}
>>> corpus['sent0'] = dict((tok, 1) for tok in sentence.split())  # keep the "26." token from our naive tokenizer
>>> for i, sent in enumerate(sentences.split('\n')):
...     corpus['sent{}'.format(i + 1)] = dict((tok.strip('.'), 1) for tok in sent.split())  # strip trailing periods from the new sentences
>>> df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
>>> df[df.columns[:7]]  # show just the first 7 tokens (columns)
       1770  26.  Construction  He  Jefferson  Jefferson's  Monticello
sent0     0    1             0   0          1            0           1
sent1     0    0             1   0          0            0           0
sent2     1    0             0   1          0            0           0
sent3     0    0             0   0          0            1           1
 
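
Here’s that “inverted index” lookup in action. Selecting the column for a token tells us which sentences contain it:

>>> list(df.index[df.Monticello > 0])  # which sentences mention "Monticello"?
['sent0', 'sent3']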

With a quick scan of the table you can see very little overlap in word usage among these sentences. Of the first seven words in our vocabulary, only Monticello appears in more than one sentence. Now we just need to be able to compute this overlap within our pipeline whenever we want to compare documents or search for similar documents. One way to check for similarity between sentences is to count the number of overlapping tokens using a dot product:

  
>>> df = df.T
>>> df.sent0.dot(df.sent1)
0
>>> df.sent0.dot(df.sent2)
1
>>> df.sent0.dot(df.sent3)
1
 

From this we can tell that one word was used in both sent0 and sent2. Likewise, one of the words in our vocabulary was used in both sent0 and sent3. This overlap of words is a measure of their similarity. Interestingly, that “oddball” sentence, sent1, was the only sentence that shared no words at all with sent0; it used a completely different set of words to convey information about the anonymous masons and carpenters who did the construction. Here’s one way to find the word that is shared by sent0 and sent3, i.e. the word that gave us that last dot product of 1:

  
>>> [(k, v) for (k, v) in (df.sent0 & df.sent3).items() if v]
[('Monticello', 1)]
 

This is our first “Vector Space Model” (VSM) of natural language documents (sentences). Not only are dot products possible, but other vector operations are defined for these bag-of-words vectors: addition, subtraction, OR, AND, etc. We can even compute things like the Euclidean distance or the angle between these vectors. This representation of a document as a binary vector has a lot of power. It was a mainstay for document retrieval and search for many years. All modern CPUs have hardwired memory addressing instructions that can efficiently hash, index, and search a large set of binary vectors like this. Though these instructions were built for another purpose (indexing memory locations to retrieve data from RAM), they are equally efficient at binary vector operations for search and retrieval of text.
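
For instance, here’s a quick sketch of those distance and angle computations, using numpy and the bag-of-words DataFrame from above (rounded so the output stays readable):

>>> import numpy as np
>>> v0, v3 = df.sent0.values, df.sent3.values         # the bag-of-words vectors for sent0 and sent3
>>> round(float(np.linalg.norm(v0 - v3)), 4)          # Euclidean distance between the two vectors
4.1231
>>> cos_sim = v0.dot(v3) / (np.linalg.norm(v0) * np.linalg.norm(v3))
>>> round(float(np.degrees(np.arccos(cos_sim))), 2)   # the angle between them, in degrees
83.95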

That’s all for now.


For more, check out the whole book on liveBook here and see this Slideshare Presentation.