From Getting Started with Natural Language Processing by Ekaterina Kochmar

This article shows you how to extract the meaningful bits of information from raw text and how to identify their roles. Let’s first look into why identifying roles is important.


Take 40% off Getting Started with Natural Language Processing by entering fcckochmar into the discount code box at checkout at manning.com.


Understanding word types

The first fact to notice is that there‘s a conceptual difference between the bits of the expression like “[Harry] [met] [Sally]”: “Harry” and “Sally” both refer to people participating in the event, and “met” represents an action. When we humans read text like this, we subconsciously determine the roles each word or expression plays along those lines: to us, words like “Harry” and “Sally” can only represent participants of an action but can’t denote an action itself, and words like “met” can only denote an action. This helps us get at the essence of the message quickly: we read “Harry met Sally” and we understand [HarryWHO] [metDID_WHAT] [SallyWHOM].

This recognition of word types has two major effects: the first effect is that the straightforward unambiguous use of words in their traditional functions helps us interpret the message. Funnily enough, this applies even when we don’t know the meaning of the words. Our expectations about how words are combined in sentences and what roles they play are strong, and when we don’t know what a word means such expectations readily suggest what it might mean: e.g., we might not be able to exactly pin it down, but we can still say that an unknown word means some sort of an object or some sort of an action. This “guessing game” is familiar to anyone who has ever tried learning a foreign language and had to interpret a few unknown words based on other, familiar words in the context. Even if you are a native speaker of English and never tried learning a different language, you can still try playing a guessing game, for example, with nonsensical poetry. Here’s an excerpt from “Jabberwocky”, a famous nonsensical poem by Lewis Carroll:[1]


Figure 1. An example of text where the word meaning can only be guessed


Some of the words here are familiar to anyone, but what do “Jabberwock”, “Bandersnatch” and “frumious” mean? It’s impossible to give a precise definition for any of them because these words don’t exist in English or any other language, and their meaning is anybody’s guess. One can say with high certainty that “Jabberwock” and “Bandersnatch” are some sort of creatures, and “frumious” is some sort of quality.[2] How do we make such guesses? You might notice that the context for these words gives us some clues: for example, we know what “beware” means. It’s an action, and as an action it requires some participants: one doesn’t normally “beware”, one needs to beware of someone or something. We expect to see this someone or something, and here comes “Jabberwock”. Another clue is given away by the word “the” which normally attaches itself to objects (like “the car”) or creatures (like “the dog”), and we arrive at an interpretation of “Jabberwock” and “Bandersnatch” being creatures. Finally, in “the frumious Bandersnatch” the only possible role for “frumious” is some quality because this is how it typically works in language: e.g. “the red car” or “the big dog”.

The second effect that the expectations about the roles that words play have on our interpretation is that we tend to notice when these roles are ambiguous or somehow violated, because such violations create a discordance. This is why ambiguity in language is a rich source of jokes and puns, intentional or not. Here’s one expressed in a news headline:


Figure 2. An example of ambiguity in action


What is the first reading that you get? You wouldn’t be the only one if you read this as if “Police help a dog to bite a victim”, but common sense suggests that the intended meaning is probably “Police help a victim with a dog bite (or, that was bitten by a dog)”. News headlines are rich in ambiguities like that because they use a specific format aimed at packing the maximum amount of information in a shortest possible expression. This sometimes comes at a price as both “Police help a dog to bite a victim” and “Police help a victim with a dog bite (that was bitten by a dog)” are clearer but longer than “Police help dog bite victim” that a newspaper might prefer to use. This ambiguity isn’t necessarily intentional, but it’s easy to see how this can be used to make endless jokes.

What exactly causes confusion here? It’s clear that “police” denotes a participant in an event, and “help” denotes the action. “Dog” and “victim” also seem to unambiguously be participants of an action, but things are less clear with “bite”. “Bite” can denote an action as in “Dog bites a victim” or a result of an action as in “He has mosquito bites.” In both cases, what we read is a word “bites”, and it doesn’t give away any further clues as to what it means, but in “Dog bites a victim” it answers the question “What does the dog do?” and in “He has mosquito bites” it answers the question “What does he have?”. Now, when you see a headline like “Police help dog bite victim”, your brain doesn’t know straight away which path to follow:

  • Path 1: “bite” is an action answering the question “what does one do?” “Police help dog [biteDO_WHAT] victim”
  • Path 2: “bite” is the result of an action answering the question “what happened?” “Police help dog [biteWHAT] victim”.

Apart from the humorous effect of such confusions, ambiguity may also slow the information processing down and lead to misinterpretations. Try solving Exercise 1 to see how the same expression may lead to completely different readings.

Exercise 1:

      (1) I can fish

      (2) I saw her duck

Solution: These are quite well-known examples that are widely used in NLP courses to exemplify ambiguity in language and its effect on interpretation.

In (1), “I” certainly denotes a person, and “can” certainly denotes an action, but “can” as an action has two potential meanings: it can denote ability “I can” = “I am able to” or the action of putting something in cans.[3] “Fish” can denote an animal as in “freshwater fish” (or a product as in “fish and chips”), or it can denote an action as in “learn to fish”. In combination with the two meanings of “can” these can produce two completely different readings of the same sentence: either “I can fish” means “I am able / I know how to fish” or “I put fish in cans”.

In (2), “I” is a person and “saw” is an action, but “duck” may mean an animal or an action of ducking. In the first case, the sentence means that I saw a duck that belongs to her, and in the second it means that I witnessed how she ducked – once again, completely different meanings of what seems to be the same sentence!


Figure 3. Ambiguity might result in some serious misunderstanding[4]


This far, we’ve been using the terminology quite frivolously: we’ve been defining words as denoting actions or people or qualities, but in fact there are more standard terms for that. The types of words defined by the different functions that words might fulfill are called parts-of-speech, and we distinguish between a number of such types:

  • words that denote objects, animals, people, places and concepts are called nouns;
  • words that denote states, actions and occurrences are called verbs;
  • words that denote qualities of objects, animals, people, places and concepts are called adjectives;
  • those for qualities of actions, states and occurrences are called adverbs.

Table 1 provides some examples and descriptions of different parts-of-speech:

Table 1. Examples of words of different parts-of-speech

Part-of-speech

Nouns

What it denotes

Objects, people, animals, places, concepts, time references

Examples

car, Einstein, dog, Paris, calculation, Friday

Verbs

Actions, states, occurrences

meet, stay, become, happen

Adjectives

Qualities of objects, people, animals, places, concepts

red car, clever man, big dog, beautiful city, fast calculation

Adverbs

Qualities of actions, states, occurrences

meet recently, stay longer, just become, happen suddenly

Articles

Don’t have a precise meaning of their own, but show whether the noun they are attached to is identifiable in context (it is clear what / who the noun is referring to) or not (the noun hasn’t been mentioned before)

I saw a man = This man is mentioned for the first time (“a” is an indefinite article)

The man is clever = This suggests that it should be clear from the context which particular man we are talking about (“the” is a definite article)

Prepositions

Don’t have a precise meaning of their own, but serve as a link between two words or groups of words: for example, linking a verb denoting action with nouns denoting participants, or a noun to its attributes

meet on Friday – links action to time

meet with administration – links action to participants

meet at house – links action to location

a man with a hat – links a noun to its attribute

This isn’t a comprehensive account of all parts-of-speech in English, but with this brief guide you should be able to recognize the roles of the most frequent words in text and this suite of word types should provide you with the necessary basis for implementation of your own information extractor.

Why do we care about the identification of word types in the context of information extraction and other tasks? You’ve seen above that correct and straightforward identification of types helps information processing, although ambiguities lead to misunderstandings. This is precisely what happens with the automated language processing: machines like humans can extract information from text better and more efficiently if they can recognize the roles played by different words, although misidentification of these roles may lead to mistakes of various kinds. For instance, having identified that “Jabberwock” is a noun and some sort of a creature, a machine might be able to answer a question like “Who is Jabberwock?” (e.g., “Someone / Something with jaws that bite and claws that catch”), although if a machine processed “I can fish” as “I know how to fish” it wouldn’t be able to answer the question “What did you put in cans?”

Luckily, there are NLP algorithms that can detect word types in text, and such algorithms are called part-of-speech taggers (or POS taggers). Figure 4 presents a mental model to help you put POS taggers into the context of other NLP techniques:


 Figure 4. Mental Model that visualizes the flow of information between different NLP components


As POS tagging is an essential part of many tasks in language processing, all NLP toolkits contain a tagger and often you need to include it in your processing pipeline to get at the essence of the message. Let’s now look into how this works in practice.

Part-of-speech tagging with spaCy

I want to introduce spaCy[5] – a useful NLP library that you can put under your belt. A number of reasons to look into spaCy in this book are:

  • NLTK and spaCy have their complementary strengths, and it’s good to know how to use both;
  • spaCy is an actively supported and fast-developing library that keeps up-to-date with the advances in NLP algorithms and models;
  • A large community of people work with this library, and you can find code examples of various applications implemented with or for spaCy on their webpage,[6] as well as find answers to your questions on their github;
  • spaCy is actively used in industry; and
  • It includes a powerful set of tools particularly applicable to large-scale information extraction.

Unlike NLTK that treats different components of language analysis as separate steps, spaCy builds an analysis pipeline from the beginning and applies this pipeline to text. Under the hood, the pipeline already includes a number of useful NLP tools which are run on input text without you needing to call on them separately. These tools include, among others, a tokenizer and a POS tagger. You apply the whole lot of tools with a single line of code calling on the spaCy processing pipeline, and then your program stores the result in a convenient format until you need it. This also ensures that the information is passed between the tools without you taking care of the input-output formats. Figure 5 visualizes spaCy’s NLP pipeline, that we’re going to discuss in more detail next:


Figure 5. spaCy’s processing pipeline with some intermediate results[7]


Machines, unlike humans, don’t treat input text as a sequence of sentences or words – for machines, text is a sequence of symbols. The first step that we applied before was splitting text into words – this step is performed by a tool called tokenizer. Tokenizer uses raw text as an input and returns a list of words as an output. For example, if you pass it a sequence of symbols like “Harry, who Sally met”, it returns a list of tokens [“Harry”, “,”, “who”, …] Next, we apply a stemmer that converts each word to some general form: this tool takes a word as an input and returns its stem as an output. For instance, a stemmer returns a generic, base form “meet” for both “meeting” and “meets”. A stemmer can be run on a list of words, where it treats each word separately and returns a list of correspondent stems. Other tools require an ordered sequence of words from the original text: for example, we’ve seen that it’s easier to figure out that Jabberwock is a noun if we know that it follows a word like “the”; order matters for POS tagging. This means that each of the three tools – tokenizer, stemmer, POS tagger – requires a different type of input and produces a different type of output, and in order to apply them in sequence we need to know how to represent information for each of them. This is what spaCy’s processing pipeline does for you: it runs a sequence of tools and connects their outputs together.

For information retrieval we opted for stemming that converts different forms of a word to a common core. We said that it’s useful because it helps connect words together on a larger scale, but it also produces non-words: you won’t always be able to find stems of the words (e.g. something like “retriev”, the common stem of retrieval and retrieve) in a normal dictionary. An alternative to this tool is lemmatizer, which aims at converting different forms of a word to its base form which can be found in a dictionary: for instance, it returns a lemma retrieval that can be found in a dictionary. Such base form is called lemma. In its processing pipeline, spaCy uses a lemmatizer.

The starting point for spaCy’s processing pipeline is, as before, raw text: for example, “On Friday board members meet with senior managers to discuss future development of the company.” The processing pipeline applies tokenization to this text to extract individual words: [“On”, “Friday”, “board”, …]. The words are then passed to a POS tagger that assigns parts-of-speech (or POS) tags like [“ADP”, “PROPN”,[8] “NOUN”, …], to a lemmatizer that produces output like [“On”, “Friday”, …, “member”, …, “manager”, …], and to a bunch of other tools.

You may notice that the processing tools in Figure 5 are comprised within a pipeline called nlp. As you’ll shortly see in the code, calling on nlp pipeline makes the program first invoke all the pre-trained tools and then applies them to the input text. The output of all the steps gets stored in a “container” called Doc – it contains a sequence of tokens extracted from input text and processed with the tools. Here’s where spaCy implementation comes close to object-oriented programming: the tokens are represented as Token objects with a specific set of attributes. If you’ve done object-oriented programming before, you’ll hopefully see the connection soon. If not, here’s a brief explanation: imagine you want to describe a set of cars. All cars share the list of attributes they have: with respect to cars, you may want to talk about the car model, size, color, year of production, body style (e.g. saloon, convertible), type of engine, etc. At the same time, such attributes as wingspan or wing area won’t be applicable to cars – they rather relate to planes. You can define a class of objects called Car and require that each object car of this class should have the same information fields, for instance calling on car.model should return the name of the model of the car, for example car.model=“Volkswagen Beetle”, and car.production_year should return the year the car was made, for example car.production_year=“2003”, etc.

This is the approach taken by spaCy to represent tokens in text: after tokenization, each token (word) is packed up in an object Token that has a number of attributes. For instance:

  • token.text contains the original word itself;
  • token.lemma_ stores the lemma (base form) of the word;[9]
  • token.pos_ – its part-of-speech tag;
  • token.i – the index position of the word in text;
  • token.lower_ – lowercase form of the word;

and so on.

The nlp pipeline aims to fill in the information fields like lemma, pos and others with the values specific for each particular token. Because different tools within the pipeline provide different bits of information, the values for the attributes are added on the go. Figure 6 visualizes this process for the words “on” and “members” in the text “On Friday board members meet with senior managers to discuss future development of the company”:


Figure 6. Processing of words “On” and “members” within the nlp pipeline


Now, let’s see how this is implemented in Python code. Listing 1 provides you with an example.

Listing 1. Code exemplifying how to run spaCy’s processing pipeline

 
 import spacy    #A
  
 nlp = spacy.load("en_core_web_sm")    #B
 doc = nlp("On Friday board members meet with senior managers " +
           "to discuss future development of the company.")    #C
  
 rows = []
 rows.append(["Word", "Position", "Lowercase", "Lemma", "POS", "Alphanumeric", "Stopword"])    #D
 for token in doc:
     rows.append([token.text, str(token.i), token.lower_, token.lemma_,
                  token.pos_, str(token.is_alpha), str(token.is_stop)])    #E
  
 columns = zip(*rows)    #F
 column_widths = [max(len(item) for item in col) for col in columns]    #G
 for row in rows:
     print(''.join(' {:{width}} '.format(row[i], width=column_widths[i])
                                  for i in range(0, len(row))))    #H
  

#A Start by importing spaCy library

#B spacy.load command initializes the nlp pipeline. The input to the command is a particular type of data (model) that the language tools were trained on. All models use the same naming conventions (en_core_web_), which means that it’s a set of tools trained on English Web data; the last bit denotes the size of data the model was trained on, where sm stands for ‘small’[10]

#C Provide the nlp pipeline with input text

#D Let’s print the output in a tabular format. For clarity, add a header to the printout

#E Add the attributes of each token in the processed text to the output for printing

#F Python’s zip function[11] allows you to reformat input from row-wise representation to column-wise

#G As each column contains strings of variable lengths, calculate the maximum length of strings in each column to allow enough space in the printout

#H Use format functionality to adjust the width of each column in each row as you print out the results[12]

Here’s the output that this code returns for some selected words from the input text:

  
 Word     Position    Lowercase    Lemma    POS    Alphanumeric    Stopword
 On       0           on           on       ADP    True            False
 Friday   1           friday       friday   PROPN  True            False
 ...
 members  3           members      member   NOUN   True            False
 ...
 to       8           to           to       PART   True            True
 discuss  9           discuss      discuss  VERB   True            False
 ...
 .        15          .            .        PUNCT  False           False
 

This output tells you:

  • The first item in each line is the original word from text – it’s returned by token.text;
  • The second is the position in text, which starts as all other indexing in Python from zero – this is identified by token.i;
  • The third item is the lowercase version of the original word. You may notice that it changes the forms of “On” and “Friday”. This is returned by token.lower_;
  • The fourth item is the lemma of the word, which returns “member” for “members” and “manager” for “managers”. Lemma is identified by token.lemma_;
  • The fifth item is the part-of-speech tag. Most of the tags should be familiar to you by now. The new tags in this piece of text are PART, which stands for “particle” and is assigned to particle “to” in “to discuss”, and PUNCT for punctuation marks. POS tags are returned by token.pos_;
  • The sixth item is a True/False value returned by token.is_alpha, which checks whether a word contains alphabetic characters only. This attribute is False for punctuation marks and some other sequences that don’t consist of letters only, and it’s useful for identifying and filtering out punctuation marks and other non-words;
  • Finally, the last, seventh item in the output is a True/False value returned by token.is_stop, which checks whether a word is in a stopwords list – a list of highly frequent words in language that you might want to filter out in many NLP applications, as they aren’t likely to be informative. For example, articles, prepositions and particles have their is_stop values set to True as you can see in the output above.

Exercise 2:

Now run the code on other input texts. For example, use the “Jabberwocky” poem as an input. Does the code recognize the roles words play (i.e., their parts-of-speech) correctly?

Solution: Despite the fact that a text like “Jabberwocky” contains non-English words, or possibly non-words at all, this Python code is able to tell that “Jabberwock” and “Bandersnatch” are some creatures that have specific names (it assigns a tag PROPN, proper noun to both of them), and that “frumious” is an adjective. How does it do that? Here’s a glimpse under the hood of a typical POS tagging algorithm (see Figure 7):


Figure 7. A glimpse under the hood of a typical POS tagging algorithm


We’ve said earlier that when we try to figure out what type of a word something like “Jabberwock” is we rely on the context. In particular, the previous words are important to take into account: if we see “the”, chances that the next word is a noun or an adjective are high, but a chance that we see a verb next is minimal – verbs shouldn’t follow articles in grammatically correct English. Technically, we rely on two types of intuitions: we use our expectations about what types of words typically follow other types of words, and we also rely on our knowledge that words like “fish” can be nouns or verbs but hardly anything else. We perform the task of word type identification in sequence. For instance, in the example from Figure 7, when the sentence begins, we already have certain expectations about what type of a word we may see first – quite often, it’s a noun or a pronoun (like “I”). Once we’ve established that it’s likely for a pronoun to start a sentence, we also rely on our intuitions about how likely it is that such a pronoun will be exactly “I”. Then we move on and expect to see a particular range of word types after a pronoun – almost certainly it should be a normal verb or a modal verb (as verbs denoting obligations like “should” and “must” or abilities like “can” and “may” are technically called). More rarely, it may be a noun (like “I, Jabberwock”), an adjective (“I, frumious Bandersnatch”), or some other part of speech. Once we’ve decided that it’s a verb, we assess how likely it is that this verb is “can”; if we’ve decided that it’s a modal verb, we assess how likely it is that this modal verb is “can”, etc. We proceed like that until we reach the end of the sentence, and this is where we assess which interpretation we find more likely. This is one possible step-wise explanation of how our brain processes information, on which part-of-speech tagging is based.

The POS tagging algorithm takes into account two types of expectations: an expectation that a certain type of a word (like modal verb) may follow a certain other type of a word (like pronoun), and an expectation that if it’s a modal verb such a verb may be “can”. These “expectations” are calculated using the data: for example, to find out how likely it is that a modal verb follows a pronoun, we calculate the proportion of times we see a modal verb following a pronoun in data among all the cases where we saw a pronoun. For instance, if we saw ten pronouns like “I” and “we” in data before, and five times out of those ten these pronouns were followed by a modal verb like “can” or “may” (as in “I can” and “we may”), what’s the likelihood, or probability, or seeing a modal verb following a pronoun be? Figure 8 gives a hint on how probability can be estimated:


Figure 8. If modal verb follows pronoun 5 out of 10 times, the probability is 5/10


We can calculate it as:

        Probability(modal verb follows pronoun) = 5 / 10

or in general case:

         Probability(modal verb follows pronoun) =
         How_often(pronoun is followed by verb) /
         How_often(pronoun is followed by any type of word, modal verb or not)

To estimate how likely (or how probable) it is that the pronoun is “I”, we need to take the number of times we’ve seen a pronoun “I” and divide it by the number of times we’ve seen any pronouns in the data. If among those ten pronouns that we’ve seen in the data before seven were “I” and three were “we”, the probability of seeing a pronoun “I” is estimated as Figure 9 illustrates:


Figure 9. If 7 times out of 10 the pronoun is “I”, the probability of a word being “I’ given that we know the POS of such a word is pronoun is 7/10


        Probability(pronoun being “I”) = 7 / 10

or in general case:

         Probability(pronoun being “I”) =
         How_often(we’ve seen a pronoun “I”) /
         How_often(we’ve seen any pronoun, “I” or other)

In the end, the algorithm goes through the sequence of tags and words one by one, and takes all the probabilities into account. Because the probability of each decision, each tag and each word is a separate component in the process, these individual probabilities are multiplied. To find out how probable it is that “I can fish” means “I am able / know how to fish”, the algorithm calculates:

         Probability(“I can fish” is “pronoun modal_verb verb”) =
         probability(a pronoun starts a sentence) *
         probability(this pronoun is “I”) *
         probability(a pronoun is followed by a modal verb) *
         probability(this modal verb is “can”) *
         … *
         probability(a verb finishes a sentence)

This probability gets compared with the probabilities of all the alternative interpretations, like “I can fish” = “I put fish in cans”:

         Probability(“I can fish” is “pronoun verb noun”) =
         probability(a pronoun starts a sentence) *
         probability(this pronoun is “I”) *
         probability(a pronoun is followed by a verb) *
         probability(this verb is “can”) *
         … *
         probability(a noun finishes a sentence)

In the end, the algorithm compares the calculated probabilities for the possible interpretations and chooses the one which is more likely, i.e. has higher probability.

That’s all for this article. We’re going to move onto syntactic parsing in part 2.

If you want to learn more about the book, you can preview its contents on our browser-based liveBook platform here.

 


[2] A blend of “fuming” and “furious”, according to Lewis Carroll himself.

[3] Formally, when a word has several meanings this is called lexical ambiguity.

[5] To get more information on the library, check https://spacy.io. Installation instructions walk you through the installation process depending on the operating system you’re using: https://spacy.io/usage#quickstart.

[6] See various projects implemented using spaCy or developed for spaCy here: https://spacy.io/universe

[7] The diagram follows closely the one in the spaCy’s documentation: https://spacy.io/usage/processing-pipelines

[8] In the scheme used by spaCy, prepositions are referred to as “adposition” and use a tag ADP. Words like “Friday” or “Obama” are tagged with PROPN, which stands for “proper nouns” reserved for names of known individuals, places, time references, organizations, events and such. For more information on the tags, see documentation here: https://spacy.io/api/annotation.

[9] You may notice that some attributes are called on using an underscore, like token.lemma_. This is applicable when spaCy has two versions for the same attribute: for example, token.lemma returns an integer version of the lemma, which represents a unique identifier of the lemma in the vocabulary of all lemmas existing in English, and token.lemma_ returns a Unicode (plain text) version of the same thing – see the description of the attributes on https://spacy.io/api/token.

[10] Check out the different language models available for use with spaCy: https://spacy.io/models/en. Small model (en_core_web_sm) is suitable for most purposes and it’s more efficient to upload and use, but larger models like en_core_web_md (medium) and en_core_web_lg (large) are more powerful and some NLP tasks require the use of such larger models.

[11] Check out documentation on Python’s functions here: https://docs.python.org/3/library/functions.html

[12] Check out string formatting techniques in Python 3: https://docs.python.org/3/library/string.html