Description

From Mastering Large Datasets with Python by J.T. Wolohan

This article covers

  • Using map to do complex data transformations
  • Chaining together small functions into pipelines
  • Applying these pipelines in parallel on large datasets


Take 37% off Mastering Large Datasets with Python. Just enter fccwolohan into the discount code box at checkout at manning.com.


You can use map to replace for loops, and using map makes parallel computing straightforward: a small modification to map, and Python takes care of the rest. If we’re going to make parallel programming more useful, though, we’ll want to use map in more complex ways. This article introduces how to do complex things with map.

Specifically, we’re going to introduce two new concepts in this article:

  1. Helper functions
  2. Function chains (also known as pipelines)

We’ll tackle those topics by looking at two examples. In the first, we decode the secret messages of a malicious group of hackers. In the second, we help our company do demographic profiling on its social-media followers. Ultimately, though, we solve both of these problems the same way: by creating function chains out of small helper functions.

Helper functions and function chains

Helper functions are small, simple functions that we rely on to do complex things. If you’ve heard the (rather gross) saying “the best way to eat an elephant is one bite at a time”, then you’re already familiar with the idea of helper functions. With helper functions, we can break down large problems into small pieces which we can code quickly. In fact, let’s put forth this as a possible adage for programmers: “the best way to solve a complex problem is one helper function at a time.”

Function chains and pipelines are two names for the same thing; different people favor one term or the other, and I’m going to use both interchangeably to keep from overusing either. Pipelines are the way helper functions are put to work. For example, if we’re baking a cake (a complex task for the baking-challenged among us), we’ll want to break that process up into lots of small steps:

  • Add flour
  • Add sugar
  • Add shortening
  • Mix the ingredients
  • Put in the oven
  • Take the cake from the oven
  • Let the cake set
  • Frost the cake

Each of these steps is small and easily understood. These are our helper functions. None of these helper functions by themselves can take us from having raw ingredients to having a cake. We need to chain these actions (functions) together to bake the cake. Another way of saying this is we need to pass the ingredients through our cake making pipeline, along which they’ll be transformed.

To put this another way, let’s take a look at our simple map statement again, this time in figure 1.


Figure 1 The standard map statement shows how we can apply a single function to several values to return a sequence of values transformed by the function.


As we’ve seen many times, our input values are on the left, we have a function which we’re passing these values through, and on the right are our output values. In this case, x+7 is our helper function. x+7 “does the work” in this situation, not map. map applies the helper function to all of our input values and provides us with output values, but on its own, it doesn’t do us much good. We need a specific output, and for that we need x+7.
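That map statement can be sketched in a couple of lines of Python (the input values here are made up for illustration):

```python
# The helper function from figure 1: it does the actual work.
def add_seven(x):
    return x + 7

# map applies the helper to every input value; list() evaluates it.
result = list(map(add_seven, [1, 2, 3]))
print(result)  # → [8, 9, 10]
```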

It’s also worth taking a look at function chains, sequences of (relatively) small functions which we apply one after another. These also have their basis in math. We get them from a rule that mathematicians call function composition.

Function composition says that a complex function like j(x) = ((x+7)² – 2) * 5 is the same as smaller functions, each doing one piece, chained together. For example, if we had four functions:

  1. f(x) = x+7
  2. g(x) = x²
  3. h(x) = x – 2
  4. i(x) = x * 5

We could chain them together as i(h(g(f(x)))) and have that equal j(x). We can see that play out in figure 2.


Figure 2 Function composition says that if we apply a series of functions in sequence then it’s the same as if we applied them all together as a single function.


As we move through the pipeline in Figure 2, we can see our four helper functions: f, g, h and i. We can see what happens as we input 3 for x into these functions chained together. First, we apply f to x and get 10 (3+7), then g to 10 and get 100 (102), h to 100 and get 98 (100-2), and lastly, we apply i and get 490 (98 * 5). The resulting value is the same as if we had input 3 into our original function j.
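The walkthrough above can be checked directly in Python; the function names mirror the math:

```python
def f(x): return x + 7   # f(x) = x + 7
def g(x): return x ** 2  # g(x) = x²
def h(x): return x - 2   # h(x) = x - 2
def i(x): return x * 5   # i(x) = x * 5

def j(x):                # the composed function, all at once
    return ((x + 7) ** 2 - 2) * 5

# Chaining the helpers gives the same result as applying j directly.
print(i(h(g(f(3)))))  # → 490
print(j(3))           # → 490
```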

With these two simple ideas—helper functions and pipelines—we can achieve complex results. In this article, you’ll learn how to implement these two ideas in Python. Specifically, we’re going to explore the power of these ideas in two scenarios:

  1. Cracking a secret code
  2. Predicting the demographics of social media followers

Unmasking Hacker Communications

Situation A malicious group of hackers has started using numbers in place of common characters and Chinese characters to separate words in order to foil automated attempts to spy on them. In order to read their communications—and find out what they’re saying—we need to write some code that will undo their trickery. Let’s write a script that turns their hacker speak into a list of English words.

We’re going to solve this problem by starting with map. Specifically, we’re going to use the idea of map to set up the big picture data transformation that we’re doing. For that, we’ll visualize the problem in figure 3.


Figure 3 Our hacker problem can be expressed as a map transformation where we have hard-to-read hacker messages as input, but after we clean them with our hacker_translate function, they become plain English text.


On the top are our input values: some difficult-to-read hacker communications that at first glance don’t make a lot of sense. In the middle we have our map statement and our hacker_translate function. This is going to be our heavy-lifter function. It does the work of cleaning the texts. And finally, on the bottom are our outputs: plain English.

Now, this problem isn’t a simple problem; it’s more like baking a cake. To accomplish it, let’s split it up into several smaller problems which we can solve easily. For example, for any given hacker string, we’ll want to:

  • Replace all the 7s with “t”s
  • Replace all the 3s with “e”s
  • Replace all the 4s with “a”s
  • Replace all the 6s with “g”s
  • Replace all the Chinese characters with spaces

If we can do these five things for each string of hacker text, we’ll have our desired result of plain English text. Before we write any code, let’s take a look at how these functions transform our text. First, we’ll start with replacing the 7s with “t”s in figure 4.


Figure 4 Part of our hacker translate pipeline is going to involve replacing 7s with “t”s; we’ll accomplish that by mapping a function that does it across all our inputs.


On the top of figure 4 we see our unchanged input texts: garbled, unreadable hacker communications. In the middle’s our function replace_7t, which replaces all the 7s with “t”s, and on the bottom, notice the lack of 7s anywhere in our text. This makes our texts a little more readable.

Moving on, we’ll replace all the 3s in all the hacker communications with “e”s. We can see that in figure 5.


Figure 5 The second step in our hacker translate pipeline is going to involve replacing 3s with “e”s; we’ll take care of this by mapping a function which does that across our inputs.


On the top in figure 5 we see our slightly cleaned hacker texts: we’ve already replaced the 7s with “t”s. In the middle’s our replace_3e function which works to replace the 3s with “e”s. On the bottom is our now more readable text. You’ll notice all the 3s are gone and we have some “e”s.

Continuing on, we’ll do the same thing with 4s and “a”s and 6s and “g”s, until we’ve replaced all the numbers with letters. We’ll skip discussing those functions to avoid repetition. Once we’ve completed those steps, we’re ready to tackle those Chinese characters. We can see that in figure 6.


Figure 6 Substituting spaces for Chinese characters is the last step in our hacker_translate function chain, and we can tackle it with a map statement.


In figure 6, on the top we see mostly English sentences with Chinese characters smushing the words together. In the middle is our splitting function: sub_chinese. And on the bottom, finally, are our fully cleaned sentences.

Creating helper functions

Now that we’ve sketched out our solution, let’s start writing some code. First, we’ll write all our replacement helper functions.

We’ll write all of these functions at once because they follow a similar pattern: we take a string, find all of some character (a number), and replace it with some other character (a letter). For example, in replace_7t, we find all of the 7s and replace them with “t”s. We do this with the built-in Python string method replace, which lets us specify the characters we want to find and the characters we want to replace them with.
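As a quick sketch of replace (the garbled string here is invented for illustration):

```python
# str.replace(old, new) returns a copy with every occurrence swapped.
garbled = "73s7 m3ss4g3"
print(garbled.replace('7', 't'))  # → "t3st m3ss4g3"
```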

Listing 1 Replacement helper functions

  
 def replace_7t(s): #A
     return s.replace('7','t')
 def replace_3e(s): #B
     return s.replace('3','e')
 def replace_6g(s): #C
     return s.replace('6','g')
 def replace_4a(s): #D
     return s.replace('4','a')
  

#A Replace all the 7s with “t”s

#B Replace all the 3s with “e”s

#C Replace all the 6s with “g”s

#D Replace all the 4s with “a”s

This takes care of the first handful of steps. Now we want to split on the Chinese text. This task’s a little more involved because the hackers are using different Chinese characters to represent spaces, not the same one again and again; we can’t use replace here. We have to use a regular expression. Because we’re using a regular expression, we want to create a small class that can compile this regular expression ahead of time. In this case, our sub_chinese function’s going to be a method of that class. We’ll see that play out in listing 2.

Listing 2 Split on Chinese function

 
 import re
  
 class chinese_matcher: #A
   
     def __init__(self):
         self.r = re.compile(r'[\u4e00-\u9fff]+') #B
   
     def sub_chinese(self,s):
         return self.r.sub(' ', s) #C

#A We compile our regular expression on initialization of the class

#B In this case, we want to match one or more Chinese characters. Those characters can be found in the Unicode range from \u4e00 to \u9fff.

#C Now, we can use this compiled regular expression in a method, using the regular expression’s sub method to swap each run of Chinese characters for a space.

The first thing we do is create a class called chinese_matcher. Upon initialization, that class is going to compile a regular expression that matches all the Chinese characters. That regular expression is going to be a range regular expression that looks up the Unicode characters between \u4e00 (the first Chinese character in the Unicode standard) and \u9fff (the last Chinese character in the Unicode standard). If you’ve used regular expressions before, you should already be familiar with this concept for matching capital letters with regular expressions like [A-Z]+ which matches one or more uppercase English characters. We’re using the same concept here, except instead of matching uppercase characters we’re matching Chinese characters. Instead of typing in the characters directly, we’re typing in their Unicode numbers.

Having set up that regular expression, we can use it in a method. In this case, we’ll use it in a method called sub_chinese. This method’s going to apply the compiled regular expression’s sub method to an arbitrary string and return the result. Because our regular expression matches one or more Chinese characters, every time there’s a run of Chinese characters in the string, we’ll substitute a space in its place.
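Here’s a minimal sketch of that idea (the sample string is invented; any characters in the \u4e00–\u9fff range would behave the same way):

```python
import re

# One or more characters in the CJK Unified Ideographs block.
r = re.compile(r'[\u4e00-\u9fff]+')

# Every run of Chinese characters becomes a single space.
print(r.sub(' ', 'h4ck3r你好m3ss463'))  # → "h4ck3r m3ss463"
```

Note the argument order of sub: the replacement string comes first, then the string to operate on.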

Creating a pipeline

Now we have all of our helper functions ready. We’re ready to bake our hacker-foiling cake. The next thing to do is chain these helper functions together. Let’s take a look at three ways to do this:

  1. Using a sequence of maps
  2. Chaining functions together with compose
  3. Creating a function pipeline with pipe

A sequence of maps

In Listing 3 we take all of our functions and map them across the results of one another.

  • We map replace_7t across our sample messages
  • Then we map replace_3e across the results of that
  • Then we map replace_6g across the results of that
  • Then we map replace_4a across the results of that
  • Finally we map C.sub_chinese

This solution isn’t pretty, but it works. If you print the results, you’ll see all of our garbled sample sentences translated into easily readable English, with the words split apart from one another: exactly what we wanted. Remember, you need to evaluate map before you can print it!
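One caveat worth showing: in Python 3, map is lazy and returns a map object, so you have to evaluate it (with list, for example) before the results are visible:

```python
m = map(str.upper, ["tea", "cake"])
print(m)        # prints a map object, not the values
print(list(m))  # → ['TEA', 'CAKE']
```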

Listing 3 Chaining functions by sequencing maps

  
 C = chinese_matcher()
  
 map( C.sub_chinese,
         map(replace_4a,
             map(replace_6g,
                 map(replace_3e,
                     map(replace_7t, sample_messages)))))
  

Constructing a pipeline with compose

Now, although we can chain our functions together in this way, there are better ways. We’ll take a look at two functions which can help us do this:

  1. compose
  2. pipe

Each of these functions is in the toolz package, which you can install with pip like most Python packages: pip install toolz.

First, let’s look at compose. compose is a function which takes our helper functions in the reverse order that we want them applied and returns a function that applies them in the desired order. For example, compose(foo, bar, bizz) applies bizz, then bar, then foo. In the specific context of our problem, that looks like Listing 4.

Here, we call the compose function and pass it all the functions we want to include in our pipeline. We pass them in reverse order because compose is going to apply them backwards. We store the output of our compose function, which is itself a function, to a variable. And then we can call that variable or pass it along to map, which applies it to all the sample messages.

Listing 4 Using compose to create a function pipeline

  
 from toolz.functoolz import compose
  
 hacker_translate = compose(C.sub_chinese, replace_4a, replace_6g,
                            replace_3e, replace_7t)
  
 map(hacker_translate, sample_messages)
  

If you print this, you’ll notice that the results are the same as when we chained our functions together with a sequence of map statements. The major difference is that we’ve cleaned up our code quite a bit, and here we only have one map statement.

Pipelines with pipe

Next, let’s look at pipe. pipe is a function which passes a value through a pipeline. It expects the value to pass and the functions to apply to it. Unlike compose, pipe expects the functions to be in the order we want to apply them. pipe(x, foo, bar, bizz) applies foo to x, then bar to that value, and finally bizz. Another important difference between compose and pipe is that pipe evaluates each of the functions and returns a result, and if we want to pass it to map, we need to wrap it in a function definition. Again, turning to our specific example, that looks something like Listing 5.

Listing 5 Using pipe to create a function pipeline

  
 from toolz.functoolz import pipe
  
 def hacker_translate(s):
     return pipe(s, replace_7t, replace_3e, replace_6g,
                    replace_4a, C.sub_chinese)
  
 map(hacker_translate, sample_messages)
  

Here, we create a function that takes our input and returns that value after it has been “piped” through a sequence of functions which we pass to pipe as parameters. In this case, we’re starting with replace_7t, then applying replace_3e, then applying replace_6g, then applying replace_4a, and lastly applying C.sub_chinese. The result here, as with compose, is the same as when we chained the functions together using a sequence of maps (you’re free to print out the results and prove this to yourself), but the way we get there’s a lot cleaner.
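If you’re curious what compose and pipe are doing under the hood, the core idea of each can be hand-rolled with functools.reduce (the real toolz implementations are more featureful; this is only a sketch):

```python
from functools import reduce

def compose(*fns):
    # Apply right-to-left: compose(foo, bar, bizz)(x) == foo(bar(bizz(x)))
    def composed(x):
        return reduce(lambda acc, fn: fn(acc), reversed(fns), x)
    return composed

def pipe(x, *fns):
    # Apply left-to-right: pipe(x, foo, bar) == bar(foo(x))
    return reduce(lambda acc, fn: fn(acc), fns, x)

add1 = lambda x: x + 1
double = lambda x: x * 2

print(compose(add1, double)(10))  # double first, then add1 → 21
print(pipe(10, add1, double))     # add1 first, then double → 22
```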

Two major advantages to creating pipelines of helper functions are:

  • The code becomes readable and clear
  • The code becomes modular and easy to edit

The former, increasing readability, is true when we have to do complex data transformations or when we want to do a sequence of possibly related, possibly unrelated actions. For example, having been introduced to the notion of compose, I’m confident you could make a guess at what this pipeline does:

  
 my_pipeline = compose(reverse, remove_vowels, make_uppercase)
  

The latter, making code modular and easy to edit, is a major perk when we’re dealing with dynamic situations. For example, let’s say our hacker adversaries change their ruse and are now replacing even more letters! We could add new functions into our pipeline to adjust. If we find that the hackers stop replacing a letter, we can remove that function from the pipeline.

A hacker translate pipeline

Lastly, let’s return to our map example of this problem. At the beginning, we’d hoped for one function, hacker_translate, which took us from garbled hacker secrets to plain English. What we did can be seen in figure 7.


Figure 7 We can solve the hacker translation problem by constructing a chain of functions which each solve one part of the problem.


Figure 7 shows our input values up top, our output values on the bottom, and through the middle we see how our five helper functions change our inputs. Breaking our complicated problem into several small problems made coding the solution rather straightforward, and with map, we can easily apply the pipeline to any number of inputs that we need.

Twitter demographic projections

We just looked at how to foil a group of hackers by chaining small functions together and applying them across all the hackers’ messages. Now, let’s dive even deeper into what can be done using small, simple helper functions chained together.

Scenario The head of marketing has a theory that male customers are more likely to engage with your product on social media than female customers and has asked you to write an algorithm to predict the gender of Twitter users mentioning their product based on the text of their posts. The marketing head has provided you with a list of TweetIDs for each customer. You have to write a script that turns these lists of IDs into both a score representing how strongly we believe them to be of a given gender and a prediction about their gender.

To tackle this problem, again, we’re going to start with a big picture map diagram.


Figure 8 The map diagram for our gender_prediction_pipeline demonstrates the beginning and end of the problem: we’ll take a list of tweet IDs and convert them into predictions about a user.


The map diagram in figure 8 allows us to see our input data on the top and our output data on the bottom, which helps us think about how to solve the problem. On the top, we can see a sequence of lists of numbers, each representing a tweet ID. This is our input format. On the bottom we see a sequence of dicts, each with keys for “score” and “gender”. This gives us a sense of what we need to do with our function gender_prediction_pipeline.

Now, predicting the gender of a Twitter user from several tweet IDs isn’t one task: it’s several tasks. To accomplish this, we’re going to have to:

  • Retrieve the tweets represented by those ids
  • Extract the tweet text from those tweets
  • Tokenize the texts
  • Score the tokens
  • Score users based on their Tweet-scores
  • Categorize the users based on their score

Looking at the list above, we can break down our process into two transformations: transformations which are happening at the user level and transformations which are happening at the tweet level. The user-level transformations include things like scoring the user and categorizing the user. The tweet-level transformations include things like retrieving the tweet, retrieving the text, tokenizing the text, and scoring the text. If we were still working with for loops, this type of situation would mean we’d need a nested for loop. Because we’re working with map, we need a map inside of our map.
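A toy illustration of a map inside a map, with made-up numbers standing in for each user’s tweets:

```python
# Each inner list stands in for one user's tweets.
users = [[1, 2, 3], [4, 5, 6]]

def double_tweets(user):
    # Inner map: transform each tweet for one user.
    return list(map(lambda t: t * 2, user))

# Outer map: apply the per-user transformation to every user.
print(list(map(double_tweets, users)))  # → [[2, 4, 6], [8, 10, 12]]
```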

Tweet-level pipeline

Let’s look at our tweet-level transformation first. At the tweet-level, we’re going to convert a Tweet ID into a single score for that tweet, representing the gender-score of that tweet. We’ll score the tweets by giving them points based on the words they use. Some words make the tweet more of a “man’s tweet”, and some make the tweet more of a “woman’s tweet”. We can see this process playing out in figure 9.

In this article, we only approximate the real thing, but you can find a state-of-the-art classifier on my GitHub page: https://github.com/jtwool/TwitterGenderPredictor


Figure 9 We can chain together four functions into a pipeline that accomplishes each of the sub-parts of our problem.


Figure 9 shows the several transformations our tweets undergo as we turn them from ID into score. Starting in the top left, we begin with tweet IDs as input; we pass them through a get_tweet_from_id function and get tweet objects back. Next, we pass these tweet objects through a tweet_to_text function, which turns the tweet objects into the text of those tweets. Then, we tokenize the tweet text by applying our tokenize_text function. After that, we score the tokens with our score_text function.

Turning our attention to user-level transformations, the process here’s simpler.

  1. We apply the tweet-level process to each of the user’s tweets
  2. We take the average of the resulting tweet scores to get our user-level score
  3. We categorize the user as either “male” or “female”

 


Figure 10 Small functions can be chained together to turn lists of users’ tweet IDs into scores, then into averages, and finally, into predictions about their demographics.


We can see that each user starts as a list of tweet IDs. Applying our score_user function across all of these lists of tweet IDs, we get back a single score for each user. Then, we can use the categorize_user function to turn this score into a dict that includes both the score and the predicted gender of the user, like we wanted at the outset.

These map diagrams give us a roadmap for writing our code. They help us see what data transformations need to take place and where we’re able to construct pipelines. For example, we now know that we need two function chains: one for the tweets and one for the users. With that in mind, let’s start tackling the tweet pipeline.

Our tweet pipeline’s going to consist of four functions. Let’s tackle them in this order:

  1. get_tweet_from_id
  2. tweet_to_text
  3. tokenize_text
  4. score_text

Because this scenario involves Twitter scraping, the automated collection of Twitter data, I’d like to offer you the opportunity to do real Twitter scraping. Doing this requires you to request a Twitter developer account. These developer accounts used to be much easier to get; Twitter’s beginning to restrict who can develop on their platform in order to crack down on bots. If you don’t want to sign up for Twitter or a developer account, or you don’t want to wait, you can proceed without one.

Our get_tweet_from_id function is responsible for taking a Tweet ID as input, looking up that Tweet ID on Twitter and returning a Tweet object that we can use. The easiest way to scrape Twitter data is going to be to use the python-twitter package. You can install python-twitter easily with pip:

  
 pip install python-twitter
  

Once you have python-twitter set up, you’ll need to set up a developer account with Twitter. You can do that at https://developer.twitter.com/. If you have a Twitter account already, there’s no need to create another account; you can sign in with the account you already have. With your account set up, you’re ready to apply for an app. You need to fill out an application form, and if you tell Twitter that you’re using this article to learn parallel programming, they’ll be happy to give you an account. When you’re prompted to describe your use case, I suggest entering the following:

The core purpose of my app is to learn parallel programming techniques. I am following along with a scenario provided in chapter 3 of Practical, Parallel Python by JT Wolohan, published by Manning Publications.

I intend to do a lexical analysis of fewer than 50 Tweets, selected at random by the author.

I do not plan on using my app to Tweet, Retweet, or “like” content.

I will not display any Tweets anywhere online.

Once your Twitter developer account is set up and confirmed by Twitter (this may take an hour or two), you’ll navigate to your app and find your consumer key, your consumer secret, your access token key, and your access token secret. These are the credentials for your app. They tell Twitter to associate your requests with your app.

With your developer account at the ready and python-twitter installed, we’re finally ready to start coding our tweet-level pipeline.

The first thing we do is import the python-twitter library. This is the library we installed. It provides a whole host of convenient functions for working with the Twitter API. Before we can use any of those nice functions, we need to authenticate our app. We authenticate our app by instantiating an Api class from the library. The class takes our application credentials, which we get from the Twitter developers’ website, and uses them when it makes calls to the Twitter API.

With this class ready to go, we can then create a function to return tweets from Twitter IDs. We’ll need to pass our API object to this function in order to use it to make the requests to Twitter. Once we do this, we can use the API object’s GetStatus method to retrieve Tweets by their ID. Tweets retrieved in this way come back as Python objects, perfect for using in our script.

We’ll use that fact in our next function tweet_to_text, which takes the tweet object and returns its text. This function is short. It calls the text property of our tweet object and returns that value. The text property of tweet objects returned by python-twitter contain, as we’d expect, the text of the tweets.

With the tweet text ready, we can tokenize it. Tokenization is a process in which we break up a larger text into smaller units that we’re able to analyze. In some cases, this can be pretty complicated, but for our purpose we’ll split on whitespace to separate words from one another. For a sentence like “This is a tweet”, we’d get a list containing each word: ["This", "is", "a", "tweet"]. We’ll use the built-in string split method to do that.

Once we have our tokens, we need to score them. For that, we’ll use our score_text function. This function’s going to look up each token in a lexicon, retrieve its score, and then add all of those scores together to get an overall score for the tweet. To do that, we need a lexicon: a list of words and their associated scores. We’ll use a dict to accomplish that here. To look up the scores for each word, we can map the dict’s get method across the list of words.

get is a dict method that allows us to look up a key and provide a default value in case we don’t find it. This is useful in our case because we want words that we don’t find in our lexicon to have a neutral value of zero.

In order to turn this method into a function, we use what’s called a lambda function. The lambda keyword allows us to specify variables and how we want to transform those variables. For example, lambda x: x+2 defines a function which adds two to whatever value is passed to it. lambda x: lexicon.get(x, 0) looks up whatever it’s passed in our lexicon and returns either the value or 0 if it doesn’t find anything. We’ll often use lambda for short functions.
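Putting get and lambda together, here’s a tiny scoring sketch using an invented two-word lexicon:

```python
lexicon = {"the": 1, "a": -1}  # hypothetical mini-lexicon

tokens = ["the", "a", "cat"]
# Unknown words ("cat") fall back to the neutral score of 0.
scores = list(map(lambda x: lexicon.get(x, 0), tokens))
print(scores)       # → [1, -1, 0]
print(sum(scores))  # → 0
```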

Finally, with all of those helper functions written, we can construct our score_tweet pipeline. This pipeline’s going to take a tweet ID, pass it through all of these helper functions, and return the result. For this, we’ll use the pipe function from the toolz library. This pipeline represents the entirety of what we want to do at the tweet-level. All of this code can be seen in listing 6.

Listing 6 Tweet-level pipeline

  
 from toolz import pipe #A
 import twitter
  
 Twitter = twitter.Api(consumer_key="", #B
                       consumer_secret="",
                       access_token_key="",
                       access_token_secret="")
  
 def get_tweet_from_id(tweet_id, api=Twitter): #C
     return api.GetStatus(tweet_id, trim_user=True)
  
 def tweet_to_text(tweet): #D
     return tweet.text
  
 def tokenize_text(text): #E
     return text.split()
  
 def score_text(tokens): #F
     lexicon = {"the":1, "to":1, "and":1, # Words with 1 indicate men #G
                "in":1, "have":1, "it":1,
                "be":-1, "of":-1, "a":-1, # Words with -1 indicate women
                "that":-1, "i":-1, "for":-1}
     return sum(map(lambda x: lexicon.get(x, 0), tokens)) #H
  
 def score_tweet(tweet_id): #I
     return pipe(tweet_id, get_tweet_from_id, tweet_to_text,
                           tokenize_text, score_text)
  

#A Import the python-twitter library

#B Authenticate our app

#C Use our app to look up tweets by their ID

#D Get the text from a tweet object

#E Split text on whitespace to analyze words

#F Create our score text function

#G Create a mini-sample lexicon for scoring words

#H Replace each word with its point value

#I Pipe a tweet through our pipeline

User-level pipeline

Having constructed our tweet-level pipeline, we’re ready to construct our user-level pipeline. As we laid out previously, we’re going to need to do three things for our user-level pipeline.

  1. We’ll need to apply the tweet pipeline to all the user’s tweets
  2. We’ll need to take the average of the score of those tweets
  3. We’ll need to categorize the user based on that average

For concision, we’ll collapse actions one and two into a single function, and let number three be a function all on its own. When all’s said and done, our user-level helper functions are going to look like listing 7.

Listing 7 User-level helper functions

  
 from toolz import compose
  
 def score_user(tweets): #A
     N = len(tweets) #B
     total = sum(map(score_tweet, tweets)) #C
     return total/N #D
  
 def categorize_user(user_score): #E
     if user_score > 0: #F
         return {"score":user_score,
                 "gender": "Male"}
     return {"score":user_score, #G
             "gender":"Female"}
  
 gender_prediction_pipeline = compose(categorize_user, score_user)#H
  

#A Our score_user function averages the scores of all a user’s tweets

#B We first find the number of tweets

#C Then we find the sum total of all a user’s individual tweet scores

#D Finally we return the sum total divided by the number of tweets

#E Our categorize_user function takes the score and returns a predicted gender as well

#F If the user_score is greater than 0, we’ll say that the user is male

#G Otherwise, we’ll say the user is female

#H Lastly, we compose these helper functions into a pipeline function

In our first user-level helper function we need to accomplish two things: we need to score all of the user’s tweets, and then we need to find the average of them. We already know how to score their tweets: we built a pipeline for that exact purpose! To score the tweets, we’ll map that pipeline across all the tweets. We don’t need the scores themselves, we need the average.

To find a simple average, we want to take the sum of the values and divide it by the number of values that we’re summing. To find the sum, we can use Python’s built-in sum function on the tweets. To find the number of tweets, we can find the length of the list with the len function. With these two values ready, we can calculate the average by dividing the sum by the length.

This gives us an average tweet score for every user. With that we can categorize the user as being either “Male” or “Female”. To make that categorization, we’ll create another small helper function: categorize_user. This function checks to see if the user’s average score is greater than zero. If it is, it returns a dict with the score and a gender prediction of “Male”. If their average score is zero or less it returns a dict with the score and a gender prediction of “Female”.

These two quick helper functions are all we’ll need for our user-level pipeline. Now we can compose them, remembering to supply them in the reverse of the order in which we want to apply them. That means we put our categorization function first, because we’re using it last, and our scoring function last, because we’re using it first. The result is a new function—gender_prediction_pipeline—that we can use to make gender predictions about a user.
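The right-to-left ordering is easy to see with a tiny example. The compose below is a minimal hand-rolled stand-in that applies functions in the same right-to-left order as toolz’s compose:

```python
def compose(*funcs):
    # Minimal stand-in for toolz.compose: applies functions right-to-left
    def composed(x):
        for f in reversed(funcs):
            x = f(x)
        return x
    return composed

def increment(x):
    return x + 1

def double(x):
    return x * 2

# increment is listed last, so it runs first; double runs last
pipeline = compose(double, increment)
print(pipeline(10))  # → 22, i.e. double(increment(10))
```

Swapping the argument order gives compose(increment, double)(10) == 21 instead, which is why gender_prediction_pipeline lists categorize_user before score_user.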

Applying the pipeline

Now that we have both our user-level and tweet-level function chains ready, all we have left to do is apply the functions to our data. To apply this to our data, we can either use Tweet IDs with our full tweet-level function chain, or—if you decided not to sign up for a Twitter developer account—we can use the text of the tweets. If you’re going to be using the tweet text, make sure to create a tweet-level function chain (score_tweet) that omits the get_tweet_from_id and tweet_to_text functions.

Applying the pipeline to tweet IDs

Applying our pipelines in the first instance might look something like listing 8. We start by initializing our data. The data we’re starting with is four lists of five tweet IDs. Each of the four lists represents a user. The tweet IDs don’t come from the same user; they’re real tweets, randomly sampled from the internet.

Listing 8 Applying the gender prediction pipeline to tweet IDs

  
 users_tweets = [ #A
 [1056365937547534341, 1056310126255034368, 1055985345341251584,
  1056585873989394432, 1056585871623966720],
 [1055986452612419584, 1056318330037002240, 1055957256162942977,
  1056585921154420736, 1056585896898805766],
 [1056240773572771841, 1056184836900175874, 1056367465477951490,
  1056585972765224960, 1056585968155684864],
 [1056452187897786368, 1056314736546115584, 1055172336062816258,
  1056585983175602176, 1056585980881207297]]
  
 with Pool() as P: #B
     print(P.map(gender_prediction_pipeline, users_tweets))
  

#A First we need to initialize our data. Here, we’re using four sets of tweet IDs.

#B Then we can apply our pipeline to our data with map. Here we’re using a parallel map.

With our data initialized, we can now apply our gender_prediction_pipeline, and we’re doing it with a parallel map. To do that, we first call Pool to gather up a group of worker processes, and then we use the map method of that Pool to apply our prediction function in parallel.

If we’re doing this in an industry setting, this is an excellent opportunity to use a parallel map for two reasons.

  1. We’re doing what amounts to the same task for each user
  2. Both retrieving the data from the web and finding the scores of all those tweets are relatively time- and memory-consuming operations

To the first point, whenever we find ourselves doing the same thing over and over again, we should think about using parallelization to speed up our work. This is especially true when we’re working on a dedicated machine (like a personal laptop or a dedicated compute cluster) and don’t need to worry about hoarding processing resources that other people or applications may need.

To the second point, we’re best off using parallel techniques in situations where the calculations are at least somewhat difficult or time consuming. If the work we’re trying to do in parallel is too easy, we may spend more time dividing the work and reassembling the results than we would spend doing it in a standard linear fashion.
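When the per-item work is small, Pool.map’s chunksize parameter can help: it batches several items into each task a worker receives, reducing the coordination overhead described above. A minimal sketch (score here is an arbitrary stand-in computation, not one of the article’s helpers):

```python
from multiprocessing import Pool

def score(x):
    # Stand-in for a nontrivial per-item computation
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    with Pool() as P:
        # chunksize=100 hands each worker 100 items at a time,
        # so 1000 items become 10 tasks instead of 1000
        results = P.map(score, range(1000), chunksize=100)
    print(results[:3])  # → [0, 0, 1]
```

The results are identical with or without chunksize; only the amount of inter-process bookkeeping changes.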

Applying the pipeline to tweet text

Applying the pipeline to tweet text directly is going to look similar to applying the pipeline to tweet IDs.

Listing 9 Applying the gender prediction pipeline to Tweet text

  
 #A
 users_tweets = [
         ["i think product x is so great", "i use product x for everything",
         "i couldn't be happier with product x"],
         ["i have to throw product x in the trash",
         "product x... the worst value for your money"],
         ["product x is mostly fine", "i have no opinion of product x"]]
  
 #B
 with Pool() as P:
     print(P.map(gender_prediction_pipeline, users_tweets))
  

The only change in Listing 9 versus Listing 8 is our input data. Instead of having tweet IDs that we want to retrieve from Twitter and score, we can score the tweet text directly. Because we’ve modified our score_tweet function chain to remove the get_tweet_from_id and tweet_to_text helper functions, the gender_prediction_pipeline works exactly as we want.

That it’s easy to modify our pipelines is one of the major reasons we want to assemble pipelines in the first place. When conditions change, as they often do, we can quickly and easily modify our code to respond to them. We could even create two function chains if we envisioned handling both situations. One function chain could be score_tweet_from_text and work on tweets provided in text form. Another function chain could be score_tweet_from_id and categorize tweets provided in tweet ID form.
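A sketch of that two-chain idea, using stand-in helpers (the real get_tweet_from_id would call the Twitter API, and the real scoring function is the article’s sentiment scorer; compose here is a minimal hand-rolled version of toolz’s, applying right-to-left):

```python
def compose(*funcs):
    # Minimal stand-in for toolz.compose: applies functions right-to-left
    def composed(x):
        for f in reversed(funcs):
            x = f(x)
        return x
    return composed

# Stand-in helpers for illustration only
def get_tweet_from_id(tweet_id):
    # A real version would fetch the tweet from the Twitter API
    return {"text": "i think product x is so great"}

def tweet_to_text(tweet):
    return tweet["text"]

def score_text(text):
    # Toy scoring: +1 for each occurrence of "great"
    return text.split().count("great")

# One chain for raw text, one for tweet IDs; both share score_text
score_tweet_from_text = score_text
score_tweet_from_id = compose(score_text, tweet_to_text, get_tweet_from_id)
```

Both chains end in the same scoring helper, so changing the scoring logic in one place updates both entry points.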

Looking back through this example, we created six helper functions and two pipelines. For those pipelines, we used both the pipe function and the compose function from the toolz package. We also used these with a parallel map to pull down tweets from the internet in parallel. Using helper functions and function chains makes our code easy to understand and easy to modify, and it plays nicely with our parallel map, which wants to apply the same function over and over again.

That’s all for now. If you want to learn more about the book, check it out on liveBook here and see this slide deck.