From Getting Started with Natural Language Processing by Ekaterina Kochmar

This article shows you how to extract the meaningful bits of information from raw text and how to identify their roles. Once you have roles identified, you can move on to syntactic parsing.


Take 40% off Getting Started with Natural Language Processing by entering fcckochmar into the discount code box at checkout at manning.com.


See part 1 here.

Understanding sentence structure with syntactic parsing

In this section you learn how to automatically establish the types of relations that link meaningful words together.
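The listings below assume spaCy and its small English model are already installed. If the model isn't available in your environment yet, spaCy's standard command-line downloader fetches it (a setup step not shown in the original listings):

 python -m spacy download en_core_web_sm
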

Listing 1. Code exemplifying how to run spaCy’s processing pipeline

 
 import spacy    #A
  
 nlp = spacy.load("en_core_web_sm")    #B
 doc = nlp("On Friday board members meet with senior managers " +
           "to discuss future development of the company.")    #C
  
 rows = []
 rows.append(["Word", "Position", "Lowercase", "Lemma", "POS", "Alphanumeric", "Stopword"])    #D
 for token in doc:
     rows.append([token.text, str(token.i), token.lower_, token.lemma_,
                  token.pos_, str(token.is_alpha), str(token.is_stop)])    #E
  
 columns = zip(*rows)    #F
 column_widths = [max(len(item) for item in col) for col in columns]    #G
 for row in rows:
     print(''.join(' {:{width}} '.format(row[i], width=column_widths[i])
                                  for i in range(0, len(row))))    #H
  

#A Start by importing the spaCy library

#B The spacy.load command initializes the nlp pipeline. The input to the command is a particular model, i.e. the type of data the language tools were trained on. All models follow the same naming convention: en_core_web_ means a set of tools trained on English Web data, and the last bit denotes the size of the model, where sm stands for ‘small’[1]

#C Provide the nlp pipeline with input text

#D Let’s print the output in a tabular format. For clarity, add a header to the printout

#E Add the attributes of each token in the processed text to the output for printing

#F Python’s zip function[2] allows you to reformat input from row-wise representation to column-wise

#G As each column contains strings of variable lengths, calculate the maximum length of strings in each column to allow enough space in the printout

#H Use format functionality to adjust the width of each column in each row as you print out the results[3]
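To see what the zip-based transposition in #F and #G does, here's a minimal standalone illustration that doesn't depend on spaCy (the values are made up for the example):

 rows = [["Word", "POS"], ["Friday", "PROPN"], ["meet", "VERB"]]
 
 # zip(*rows) turns the list of rows into a sequence of columns
 columns = list(zip(*rows))
 print(columns)          # [('Word', 'Friday', 'meet'), ('POS', 'PROPN', 'VERB')]
 
 # the longest string in each column determines that column's width in the printout
 column_widths = [max(len(item) for item in col) for col in columns]
 print(column_widths)    # [6, 5]
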

Why sentence structure is important

Now you know how to detect which words belong to which types. Your algorithm from Code Listing 1 is able to tell that in the sentence “On Friday, board members meet with senior managers to discuss future development of the company.”, words like “Friday”, “board”, “members” and “managers” are likely to be participants of some action because they are nouns, and words like “meet” and “discuss” denote actions themselves because they are verbs. This brings you one step closer to solving the task, but one bit is still missing – how are these words related to each other, and which of the potential participants take part in the action in question? More specifically, who met with whom?

We said before that, in the simplest case, returning the words immediately before and after the word denoting the action can work, but it doesn’t work for the sentence at hand: part-of-speech tagging helps you identify that “meet (with)” is an action, but you need to return “board members” and “senior managers” as the two participants. So far, the algorithm is only able to detect that “board”, “members” and “managers” are nouns and “senior” is an adjective, but it hasn’t linked the words together yet. The next step is to identify that “board” and “members” together form one group of words and “senior” and “managers” another, and that these two groups represent the participants in the action because they are both directly related to the verb “meet (with)”. These aren’t the only words related to the action of meeting: in fact, the group of words “On Friday” tells us about the time of the meeting, and “to discuss future development of the company” tells us about the purpose. Ideally, we want to get all these bits of information. Figure 1 visualizes this idea:


Figure 1. All bits of information related to the action of meeting as we expect them to be identified by a parser


In this representation, we put the action “meet (with)” at the center or root of the whole account of events because, starting from the verb, it is easier to detect the other participants involved in this action and the other bits of information related to it.

The word types that we defined in the previous step help us identify the groups of words and their relations to the main action here: it’s common for participants to be expressed with groups of words involving nouns, for locations and time references to be attached to the verb with a preposition (like “on Friday” or “at the office”), and for the purpose of the meeting to be introduced with “to” followed by a further expression involving a verb (like “discuss” in this example). We rely on such intuitions when we detect which words are related to each other, and machines use a similar approach.

Definition: Parser

The tool that helps identify which words are related to each other is called a parser.

To give you a flavor of the task before we move on to using this tool in practice, here’s an example illustrating why parsing and identification of relations between words isn’t a trivial task and may lead to misunderstandings, just as POS tagging can. The example comes from a Groucho Marx joke: “One morning I shot an elephant in my pajamas. How he got into my pajamas I’ll never know”. What exactly produces the humorous effect here? It’s precisely the identification of relation links between the groups of words! Under one interpretation, “in my pajamas” is attached to “an elephant”, and, because the two groups of words are next to each other in the sentence, this is a much easier interpretation to process, and our brain readily suggests it. Common sense, however, tells us that “in my pajamas” should be attached to “shot” and that it was “I” who was wearing the pajamas, not the elephant. The problem is that these components are separated from each other by other words, and based on the structure of the sentence, this isn’t the first interpretation that comes to mind. What adds to the ambiguity here is the fact that prepositional phrases (the ones that start with prepositions like “in” or “with”) are frequently attached to nouns (“to a man [NOUN] with a hammer, everything looks like a nail”) as well as to verbs (“drive [VERB] nails with a hammer”). Parsers, like humans, rely on the patterns of use in language and use the information about the types of words to identify how words are related to each other, but it’s by no means a straightforward task.
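If you’re curious which reading the parser actually picks, you can inspect each token’s head directly; here’s a small exploratory sketch that reuses the nlp pipeline from Listing 1 (the exact attachments depend on the model, so your output may differ):

 doc = nlp("One morning I shot an elephant in my pajamas.")
 for token in doc:
     # for each word, show the relation type and the word it attaches to
     print(token.text, token.dep_, token.head.text)

Whether “in” attaches to “shot” or to “elephant” tells you which interpretation the parser has chosen.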

Dependency parsing with spaCy

We said before that, because we’re interested in the action expressed by a verb (like “meet (with)”) and its participants, it’s the action that we put at the center or at the root of the whole expression. Having done that, we start working from the verb at the root, trying to identify which words or groups of words are related to this action verb. We also say that when the words are related to this verb, they depend on it. If the action is denoted by the verb “meet”, starting from this verb we try to find words that answer the relevant questions: e.g. “who meet(s)?” “board members”, “meet with whom?” “senior managers”, “meet when?” “on Friday”, and so on. It’s as if we are saying that “meet” is the most important, most indispensable, core bit of information here, and the other bits depend on it. After all, if it wasn’t for the verb “meet”, the meeting wouldn’t take place and there would be no need to extract any further information! Similarly, in the expression “board members” the core bit is “members”, as “board” only provides further clarification (“what type of members?” “board”), but without “members” there would be no need to provide this clarifying information. Figure 2 visualizes the dependencies between the words in this sentence, where the arrows explicitly show the direction of each relation – they go from the head to the dependent in each pair:


Figure 2. Flow of dependencies in “On Friday board members meet with senior managers”. Arrows show the direction of the dependency, from the head to the dependent.


Putting verbs at the root of the whole expression as well as dividing words into groups of more important ones (such words are technically called heads) and the ones that provide additional information depending on the heads (such words are called dependents) is a convention adopted in NLP. The approach to parsing that relies on this idea is therefore called dependency parsing.

Exercise 3:

To summarize, heads are words that express the core bit of information in a group of words; they’re the indispensable ones. Dependents are the ones that attach themselves to heads providing additional clarifications or complementing the heads. Try to identify heads and dependents in the following expressions:

(1) senior managers

(2) recently met

(3) the government

(4) talk to the government

Solution:

(1) In “senior managers”, “managers” is the main bit and “senior” provides further clarification. We ask “what type of managers?” “senior”, and “managers” is the head and “senior” is the dependent.

(2) In “recently met”, “met” is the main bit and “recently” provides further information about the action. We can ask “met when?” “recently”, and “met” is the head and “recently” is the dependent.

(3) In “the government”, “government” is the main bit and “the” tells us that it’s some particular government identifiable from the context. “Government” is the head and “the” is the dependent.

(4) In “talk to the government”, the overall head is “talk” – this is the action that we start with. “Talk” directly attaches “to”, and “to” is dependent on the head “talk”. “To”, in turn, attaches “government”, so within this pair “to” is the head and “government” is the dependent. Finally, as before, “the” is the dependent of the head “government” within the pair of words “the government”. Figure 3 visualizes this chain of heads and dependents, where the arrows explicitly show the direction of relation as before:


Figure 3. The full chain of dependencies in “talk to the government”. Arrows show the direction of the dependency, from the head to the dependent
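You can check the chain from Figure 3 against the parser itself; here’s a quick sketch that reuses the nlp pipeline from the earlier listings (parser output on a short fragment like this isn’t guaranteed to match a hand-drawn analysis exactly, so treat it as exploratory):

 doc = nlp("talk to the government")
 for token in doc:
     # print each word, the relation to its head, and the head itself
     print(f"{token.text:<12} {token.dep_:<8} {token.head.text}")
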


Now it’s time to add one more tool, the parser, to the suite:


Figure 4. A larger suite of spaCy tools


Let’s see how spaCy performs parsing on the sentence “On Friday, board members meet with senior managers to discuss future development of the company”. First of all, let’s identify all groups of words that may be participants in the meeting event: that is, let’s identify all nouns and the words attached to these nouns in this sentence. Such groups of words are called noun phrases because they have nouns as their heads:

Listing 2. Code to identify all groups of nouns and the way they are related to each other

 
 import spacy
  
 nlp = spacy.load("en_core_web_sm")    #A
 doc = nlp("On Friday, board members meet with senior managers " +
           "to discuss future development of the company.")    #B
  
 for chunk in doc.noun_chunks:    #C
     print('\t'.join([chunk.text, chunk.root.text, chunk.root.dep_,
                      chunk.root.head.text]))    #D
  

#A As before, start by importing the spaCy library and initializing the pipeline. If you’re working in the same notebook, you don’t need to do it more than once

#B Provide the nlp pipeline with input text

#C You can access groups of words involving nouns with all related words (aka noun phrases) by doc.noun_chunks

#D Within each noun phrase, print out the phrase itself (e.g., senior managers), print the head of the noun phrase (the head noun, e.g. managers), print the type of relation that links this head noun to the next most important word in the sentence (e.g., the pobj relation links managers to the preposition with), and finally print that next most important word itself (e.g., with). Join this output with tabs

Let’s discuss these functions one by one:

  • doc.noun_chunks returns the noun phrases – the groups of words that have a noun at their core and all the related words. For instance, “senior managers” is one such group here;
  • chunk.text prints the original text representation of the noun phrase, for instance “senior managers”;
  • chunk.root.text identifies the head noun and prints it out. In “senior managers” it’s “managers” which is the main word – it’s the root of the whole expression;
  • chunk.root.dep_ shows what relates the head noun to the rest of the sentence. Which word is “managers” from “senior managers” directly related to? It’s the preposition “with” (in “with senior managers”). Within this longer expression, “senior managers” is the object of the preposition, or prepositional object – pobj;
  • Finally, chunk.root.head.text prints out the word the head noun is attached to – in this case, “with” itself.

To test your understanding, try to predict what this code produces before running it or looking at the output below.

The code above identifies the following noun phrases in this sentence:

 
       Friday              Friday       pobj   On
       board members       members      nsubj  meet
       senior managers     managers     pobj   with
       future development  development  dobj   discuss
       the company         company      pobj   of
  

There are exactly five noun phrases in this sentence: “Friday”, “board members”, “senior managers”, “future development” and “the company”. Let’s look at the visualization of the chain of dependencies in this sentence, this time with the relation types assigned to the connecting arrows:[4]


Figure 5. Chain of dependencies in “board members meet with senior managers to discuss future development of the company.”


In this sentence, “Friday” directly relates to “on” – it’s the prepositional object (pobj) of “on”. “Board members”, as Figure 5 visualizes, has “members” as its head, and it’s directly attached to “meet” – it’s the subject, i.e. the main participant of the action (denoted nsubj). “Senior managers” has “managers” as its head, and it’s attached to “with” as pobj. “Future development” is a direct object (dobj) of the verb “discuss”, because it answers the question “discuss what?” The head of this noun phrase is “development”. Finally, “the company” has “company” as its head; it depends on the preposition “of”, and the relation that links it to “of” is pobj.

spaCy allows you to visualize the dependency information and produce graphs like the one in Figure 5. The code in Listing 3 prints out the visualization of the dependencies in the input text and stores it to a file. If you run it on the sentence “Board members meet with senior managers to discuss future development of the company”, this file contains exactly the graph from Figure 5:

Listing 3. Code to visualize the dependency information

 
 from spacy import displacy    #A
 from pathlib import Path    #B
  
 svg = displacy.render(doc, style='dep', jupyter=False)    #C
 file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
 output_path = Path(file_name)    #D
 output_path.open("w", encoding="utf-8").write(svg)    #E
  

#A Import spaCy’s visualization tool displacy[5]

#B Path helps you define the location for the file to store the visualization

#C Use displacy to visualize dependencies over the input text; jupyter=False tells the program to store the output to an external file, and jupyter=True displays it within the notebook

#D The file the output is stored to uses the words from the sentence in its name, e.g. “On-Friday-board-…svg”. You can change the file naming in this line of code

#E This line writes the output to the specified file
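As a side note, pathlib also offers a one-line alternative that opens and closes the file for you; this is an equivalent way to write the visualization, not part of the original listing:

 output_path.write_text(svg, encoding="utf-8")    # same effect as #E, but the file handle is closed automatically
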

Why is it useful to know about the noun phrases and the way they’re related to the rest of the sentence? Because this way your algorithm learns about the groups of words consisting of nouns and their attached attributes (i.e., noun phrases), which are potential participants in the action, and it also learns what these noun phrases are themselves attached to: for instance, note that “board members” is linked to “meet” directly – it’s the main participant of the action, the subject. “Senior managers” is connected to the preposition “with”, which is directly linked to the action verb “meet”, so it’s possible to detect that “senior managers” is the second participant in the action within one small step.
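That “one small step” can be taken in code by following the head link twice: from the noun chunk’s root to the preposition, and from the preposition to the verb. Here’s a minimal sketch that assumes doc still holds the sentence from Listing 2:

 for chunk in doc.noun_chunks:
     # chunk.root.head is the preposition (e.g. "with"),
     # chunk.root.head.head is the word that preposition attaches to (e.g. "meet")
     if chunk.root.dep_ == "pobj" and chunk.root.head.head.pos_ == "VERB":
         print(chunk.text, "->", chunk.root.head.text, "->", chunk.root.head.head.text)

For the example sentence, this should single out “senior managers” (via “with”) and “Friday” (via “On”) as phrases attached to the verb “meet” through a preposition.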

Before we put these components together and identify the participants of the meeting action, let’s iterate through the sentence and print out the relevant information about each word: the word itself using token.text, the relation that links this word to its head using token.dep_, the head the word depends on using token.head.text and this head’s part of speech using token.head.pos_, and finally all the dependents of the word, obtained by iterating through token.children:

Listing 4. Code to print out the information about head and dependents for each word

 
 for token in doc:
     print(token.text, token.dep_, token.head.text, token.head.pos_,
                                 [child for child in token.children])    #A
  

#A This code assumes that spaCy is imported and input text is already fed into the pipeline

This code produces the following output for the sentence “On Friday board members meet with senior managers to discuss future development of the company.”:

 
   On        prep      meet     VERB    [Friday]
   Friday    pobj      On       ADP     []
   ,         punct     meet     VERB    []
   board     compound  members  NOUN    []
   members   nsubj     meet     VERB    [board]
   meet      ROOT      meet     VERB    [On, ,, members, with, discuss, .]
   ...
 

This output shows that “Friday” is the prepositional object of “On”, which itself has an adposition (ADP) POS tag. “Friday” doesn’t have any dependants, and an empty list [] is returned. “Board” is dependent on the noun “members” but has no dependants of its own. “Members” is the subject of the verb “meet” and has “board” as its single dependant. “Meet”, in its turn, doesn’t depend on any other word – it’s the ROOT of the whole sentence, and it has a number of dependants, including “On” (time reference, “On Friday”), “members” (subject, the main participant of the action, “board members”), “with” (introducing the second participant, “with senior managers”), and “discuss” (indicating the purpose of the meeting, “to discuss future development …”).
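Since the ROOT plays such a central role in the extraction steps that follow, it can be handy to grab it directly rather than scanning the whole table by eye; here’s a small sketch (spaCy also exposes the same token as sent.root for each sentence in the doc):

 root = [token for token in doc if token.dep_ == "ROOT"][0]    # the main verb of the sentence
 print(root.text, root.pos_)                                   # e.g. meet VERB
 print([child.text for child in root.children])                # its direct dependants
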

Building your own Information Extraction algorithm

Now let’s put all these components together and run your information extractor on a list of sentences to extract only the information about who met with whom. Based on what you’ve done so far, you need to implement the steps outlined in Figure 6:


Figure 6. Extraction of participant1 and participant2 if the action verb is “meet”


To summarize, this means that you need to:

  1. Identify sentences where “meet” is the main verb, i.e. the ROOT of the sentence.
  2. Extract dependents of this verb using token.children.
  3. Identify participant1 of the action – it’s a noun linked to the verb with nsubj relation.
  4. Add all the attributes this noun has (e.g. “board” for “members”) to build a noun phrase (NP). This is participant1.
  5. If the verb has a dependent preposition “with” (e.g. “meet with managers”), extract the noun dependent on “with” together with all its attributes – these constitute participant2.
  6. Otherwise, if the verb doesn’t have a preposition “with” attached to it but has a directly related noun (as in “meet managers”), extract this noun and its attributes as participant2. The directly related noun is attached to the verb with dobj relation.

Now, let’s implement this in Python and apply the code to the sentence “On Friday, board members meet with senior managers to discuss future development of the company.” Note that if you’re working in the same notebook and used this sentence as input before, all the processing outputs are stored in the Doc container, and you don’t need to redefine it. Because this sentence contains the preposition “with”, let’s start by implementing the approach that extracts the noun dependent on “with” together with its attributes and identifies this noun phrase as participant2. Code Listing 5 shows this implementation:

Listing 5. Code to extract participants of the action

 
 for token in doc:    #A
     if token.lemma_=="meet" and token.pos_=="VERB" and token.dep_=="ROOT":    #B
         action = token.text    #C
         children = [child for child in token.children]    #D
         participant1 = ""
         participant2 = ""
         for child1 in children:
             if child1.dep_=="nsubj":
                 participant1 = " ".join([attr.text for
                                          attr in child1.children]) + " " + child1.text    #E
             elif child1.text=="with":    #F
                 action += " " + child1.text
                 child1_children = [child for child in child1.children]
                 for child2 in child1_children:
                     if child2.pos_ == "NOUN":
                         participant2 = " ".join([attr.text for
                                              attr in child2.children]) + " " + child2.text    #G
 print (f"Participant1 = {participant1}")
 print (f"Action = {action}")
 print (f"Participant2 = {participant2}")    #H
  

#A This code assumes that spaCy is imported and input text is already fed into the pipeline

#B Check that the ROOT of the sentence is a verb with the base form (lemma) “meet”

#C This verb expresses the action

#D Extract the list of all dependants of this verb using token.children

#E Find the noun which is the subject of the action verb using nsubj relation. This noun, together with its attributes (children), expresses participant1 of the action

#F Check if the verb has preposition “with” as one of its dependants

#G Extract the noun which is dependent on this preposition together with its attributes. This is participant2 of the action

#H Print out the results

For the input text “On Friday, board members meet with senior managers to discuss future development of the company.” this code correctly returns the following output:

 
        Participant1 = board members
        Action = meet with
        Participant2 = senior managers
 

What if we provide it with more diverse sentences? For example:

  • “Boris Johnson met with the Queen last week.” – “Queen” is a proper noun, and its tag is PROPN rather than NOUN. Let’s make sure that proper nouns are also covered by the code. Note that “met” is the past form of “meet”, and because your algorithm uses the lemma (base form) of the word, it’s correctly identified here.
  • “Donald Trump meets the Queen at Buckingham Palace.” – “the Queen” is attached to the verb “meet” as dobj. Let’s make sure your code covers this case, too.

Code Listing 6 shows how to add these two modifications to the algorithm:

Listing 6. Code for Information Extractor

 
 sentences = ["On Friday, board members meet with senior managers " +
              "to discuss future development of the company.",
              "Boris Johnson met with the Queen last week.",
              "Donald Trump meets the Queen at Buckingham Palace.",
              "The two leaders also posed for photographs and " +
              "the President talked to reporters."]    #A
  
  
 def extract_information(doc):    #B
     action=""
     participant1 = ""
     participant2 = ""
     for token in doc:
         if token.lemma_=="meet" and token.pos_=="VERB" and token.dep_=="ROOT":
             action = token.text
             children = [child for child in token.children]  
             for child1 in children:
                 if child1.dep_=="nsubj":
                     participant1 = " ".join([attr.text for
                                              attr in child1.children]) + " " + child1.text
                 elif child1.text=="with":
                     action += " " + child1.text
                     child1_children = [child for child in child1.children]
                     for child2 in child1_children:
                         if child2.pos_ == "NOUN" or child2.pos_ == "PROPN":    #C
                             participant2 = " ".join([attr.text for
                                                  attr in child2.children]) + " " + child2.text
                 elif child1.dep_=="dobj" and (child1.pos_ == "NOUN"
                                               or child1.pos_ == "PROPN"):    #D
                     participant2 = " ".join([attr.text for
                                              attr in child1.children]) + " " + child1.text
     print (f"Participant1 = {participant1}")
     print (f"Action = {action}")
     print (f"Participant2 = {participant2}")
  
 for sent in sentences:
     print(f"\nSentence = {sent}")
     doc = nlp(sent)
     extract_information(doc)    #E
  

#A Provide your code with a diverse set of sentences. Note that all but the last sentence contain the verb “meet” and are relevant for your information extraction algorithm

#B Define a method to apply all the steps in the information extraction algorithm

#C Note that this code is similar to Listing 5. One of the differences is that it applies to participants expressed with proper nouns (PROPN) as well as nouns (NOUN)

#D Add the elif branch that covers the direct object (dobj) case

#E Apply extract_information method to each sentence and print out the actions and participants

The code above identifies the following actions and participants in each sentence from the set:

 
        Sentence = On Friday, board members [...]
         Participant1 = board members
         Action = meet with
         Participant2 = senior managers
  
         Sentence = Boris Johnson met with [...]
         Participant1 = Boris Johnson
         Action = met with
         Participant2 = the Queen
  
         Sentence = Donald Trump meets [...]
         Participant1 = Donald Trump
         Action = meets
         Participant2 = the Queen
        
         Sentence = The two leaders also [...]
         Participant1 =
         Action =
         Participant2 =
  

Note that the code correctly identifies the participants of the meeting event in each case and returns nothing for the last sentence that doesn’t describe a meeting event.

Congratulations! You built your first information extraction algorithm. Now try to use it in practice.

Exercise:

Apply the information extraction algorithm to your own data to extract the information about all meetings that took place between different participants. Alternatively, apply it to a different type of event, expressed with a verb other than “meet”.
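As a starting point for the second variant of the exercise, one possible way to generalize the extractor is to pass the verb lemma in as a parameter instead of hard-coding “meet”. Here’s a hedged sketch of that idea (the helper name find_event and the test sentence are illustrative, not from the original code):

 def find_event(doc, verb_lemma):
     # return the ROOT token if it expresses the verb we're interested in, else None
     for token in doc:
         if token.lemma_ == verb_lemma and token.pos_ == "VERB" and token.dep_ == "ROOT":
             return token
     return None
 
 doc = nlp("The President talked to reporters.")
 root = find_event(doc, "talk")
 if root is not None:
     print(root.text, [child.text for child in root.children])

From the returned root you can reuse the same participant-extraction logic as in Listing 6, adapting the preposition check (for example, “to” instead of “with”) to the verb you’re interested in.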

That’s all for this article.

If you want to learn more about the book, you can preview its contents on our browser-based liveBook platform here.


[1] Check out the different language models available for use with spaCy: https://spacy.io/models/en. The small model (en_core_web_sm) is suitable for most purposes and is more efficient to load and use, but larger models like en_core_web_md (medium) and en_core_web_lg (large) are more powerful, and some NLP tasks require the use of such larger models.

[2] Check out documentation on Python’s functions here: https://docs.python.org/3/library/functions.html

[3] Check out string formatting techniques in Python 3: https://docs.python.org/3/library/string.html

[4] See the description of different relation types on https://spacy.io/api/annotation#dependency-parsing.

[5] To find out more about the tool, check https://spacy.io/usage/visualizers.