From Advanced Algorithms and Data Structures, by Marcello La Rocca
This article uses the example of a well-known family of animated ducks to explain MapReduce.
For a long time, engineers struggled to efficiently parallelize the execution of many computationally intensive algorithms, because of both hardware limits and a lack of software infrastructure. The most widely used distributed programming model was grid computing, until it was realized that a different approach would make computation not only faster but potentially more robust; that’s when the MapReduce programming model was adopted on a large scale, after being patented and used by Google in the early 2000s.
Although there are several implementations of MapReduce (more accurately, several products using MapReduce to provide tools that orchestrate distributed resources to solve tasks, like Apache Hadoop, Hive, or CouchDB), I believe its main value is in the model it provides, which can be applied to a plethora of tasks.
And, for this reason, we’ll try to explain how it works through an example.
Parallel vs Distributed
Before we get to the point, though, a disclaimer is due: usually the term parallel computing only addresses computations which run on multiple CPUs in the same system (multi-threading, in short); when we think about using multiple CPUs across several machines communicating through a network, we’re referring to what’s called distributed computing. Figure 1 shows a diagram that illustrates this difference.
Parallel computing is limited by the number of CPUs on a single machine, while distributed computing is a better approach to scale out systems and process huge datasets. On the other hand, if a dataset can fit into a single machine’s memory, the parallel computing approach is significantly faster, because the processes can communicate through shared memory, while nodes in distributed systems need to exchange information through a network (at the time of writing, the latency is 100ns vs 150ms: a factor of more than 10⁶).
In this article, we’ll often use the term parallel computer to refer to both: the computational models we present are software abstractions which could run seamlessly on threads on a single machine or on a distributed system; the only discriminating factor would be the size of the input and the resources needed, not the algorithms we use.
Figure 1. A schematic view of the difference between parallel and distributed computing models.
Now we’re ready to delve into our example!
Imagine You Are Donald Duck…
Imagine Donald Duck dozing in his hammock – as always – on a lazy afternoon, when suddenly the phone rings louder (and more annoyingly!) than normal and wakes him up: he knows even before answering that he is being gently summoned by his lovely uncle Scrooge McDuck and that he needs to rush to the Money Bin, a kind request to which he gladly answers, in order to avoid being disowned (and overwhelmed by debts).
Long story short, as always, his good old uncle Scrooge has a long, boring task for Donald to attend to; this time, because he is securing the Money Bin’s main room and has to move all the coins to a different (giant) safe, he wants to take advantage of the situation (surprise, surprise!) and count and catalogue all the coins in his Money Bin… by the next morning.
We’re talking about millions of coins, and it would be humanly impossible to make it on one’s own. When Donald Duck regains his senses (he understandably fainted when uncle Scrooge broke the news), he figures out he’ll need all the help he can get, and he runs to Gyro Gearloose’s and convinces him to create a horde of robo-clones which can learn to recognize different coins and catalogue them.
Figure 2. A first attempt to parallelize coin counting. In this configuration, poor Donald still has to sum hundreds of lists with hundreds of entries each.
This step is the “classical” parallelization step: you break the work down (into piles of coins), hand it to several copies of your software (the counting/cataloguing routine), and write down, for each pile, which coins you found and how many there are; for instance, a machine could produce a list like this:
£1: 1034 pieces
50¢: 53982 pieces
20p: 679 pieces
$1: 11823 pieces
1¢: 321 pieces
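As a rough sketch (mine, not the book’s code), this “flat” parallel approach could look as follows in Python: each worker counts its own pile independently, but a single process must merge every entry of every resulting list.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_pile(pile):
    """One robo-clone: catalogue a single pile of coins."""
    return Counter(pile)

def flat_parallel_count(piles):
    """'Flat' parallelism: workers count in parallel, one process merges."""
    with ThreadPoolExecutor() as pool:
        partial_lists = list(pool.map(count_pile, piles))
    # This merge is Donald's job: a sequential bottleneck that
    # touches every entry of every partial list.
    total = Counter()
    for partial in partial_lists:
        total += partial
    return total
```

With a hundred piles each producing hundreds of entries, the final loop is exactly the work poor Donald is left with.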
Problem solved? Well… not really. Robo-clones are expensive and take time to build, and even a genius like Gyro could only provide a hundred of them by quickly rewiring some misbehaving robo-waiters he created in one of his experiments. They are now quite fast at counting money, but each of them has a huge pile of coins, resulting in a long list of coin types with their quantities. Figure 2 illustrates the situation: once they finish, it’s up to Donald to add up the hundreds of entries in those lists.
After fainting again and being woken up using Uncle Scrooge’s ammonia (he’ll be charged for it, it goes without saying) good old Donald crawls to Gyro’s lab for a desperate cry for (more) help.
Unfortunately, Gyro can’t afford to build more counting machines, but he wouldn’t be a genius if he couldn’t solve this problem.
To do this, he won’t have to build anything; some help and a different algorithm will do. After a quick computation in his head, he estimates that there are about two hundred different types of coins. He rounds up the whole family and gives a task to each of them: they must handle five types of coins each, but they won’t have to count them. Instead, each of them will receive a hundred lists from the counting machines, where each list only has five entries, one for each of the same five types of coins, with the counts found by the individual counting machine.
Figure 3. Revised coin counting process, using MapReduce, and a little help. Now each counting machine produces several lists, one for each member of the McDuck family at level two. They, in turn, need to check a hundred lists, but each list has only a few entries, and sum up the values in each list by entry.
To achieve this, he provides each counting machine with an address to locate each member of the McDuck family (for instance an email address, like [email protected]), plus a dictionary, the same for each machine, which lists the types of coins handled by each member of the family. To simplify things, we can imagine that each member is assigned all the coins from a single country: for instance, as shown in figure 3, Huey could get all US dollars, Dewey all UK pounds sterling, Louie all euros, and so on. In a real application, each of them could get any combination of coin denominations.
Once a machine is done counting, it goes through the members of the family and sends each one an email with the total number of coins found for each of the types they’re responsible for. Then each family member must sum the values across the lists, for each type of coin, and send the final result to Uncle Scrooge: a few hundred integer additions per duck, a tedious job maybe, but it shouldn’t take too long (just make sure not to let Fethry anywhere near a computer!).
For instance, if Daisy Duck is assigned the $1, 50¢, 25¢ and 1¢ coins, then all the machines will send her a short list that looks like this:
25¢: 1034 pieces
$1: 11823 pieces
50¢: 53982 pieces
1¢: 321 pieces
Figure 3 shows the shift in paradigm: before, whoever had to make sense of all the lists became the bottleneck of the computation; now, introducing a new intermediate level in the workflow breaks the work down further and limits the amount of work each entity at this level must do, and that makes the difference.
The key, though, is that the results output at level one can be partitioned into groups, and each of these groups can then be handled separately at level two.
First Map, Then Reduce
Time to abandon our comic-book heroes for a more realistic example, where both levels of this parallel computation are performed by machines. The operation at level one is called Map, because it maps each entry in the input dataset (more exactly, in the portion of the dataset handled by a machine) into something else, extracting the information which is relevant for computing the final result. The counting machine, in our example, could simply run coin-recognition software, without keeping a count, and send the machines at level two lists containing unsorted occurrences of coins, something like:
$100: 1
50¢: 1
$100: 1
$1: 1
25¢: 1
…
Here, the only information extracted by the mappers is the presence of a coin.
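Such a mapper can be sketched as a function that only emits one (coin, 1) pair per recognized coin; the function name and the pair format here are illustrative assumptions, not the book’s code.

```python
def map_coins(pile):
    """Map step: emit a (coin_type, 1) pair for every coin seen.
    No counting happens here -- that's the reducers' job."""
    for coin in pile:
        yield (coin, 1)
```

Note that the output preserves duplicates: three $100 coins yield three separate ("$100", 1) pairs.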
The machines at level two, instead, specialize in counting. Every machine at level two receives all the entries for occurrences of a certain group of coins and does something with them (counts them, for example, but it could also sum up their values or filter them). This step is called Reduce, because it takes information limited to a homogeneous group of entries and combines (aka reduces) them to get our final result.
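A matching counting reducer could be sketched like this (again an illustrative sketch of mine, assuming the (coin, 1) pair format from the mappers):

```python
from collections import defaultdict

def reduce_counts(pairs):
    """Reduce step: collapse all (coin_type, count) pairs routed to this
    reducer into one total per coin type."""
    totals = defaultdict(int)
    for coin, count in pairs:
        totals[coin] += count
    return dict(totals)
```

Changing the loop body is all it would take to turn this counting reducer into one that sums coin values or filters entries instead.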
As mentioned, the key disadvantage of the classic, “flat” parallel computation is that composing the results of all the parallel threads/processes/machines creates the bottleneck of the whole process: if a single process has to spawn the threads and then gather and combine their results, it still needs to sequentially access the whole dataset at least once; and even if it also parallelizes the step which combines the intermediate results, it remains a bottleneck, as shown in figure 4. On the left half, you can see how, with basic parallelism, the “intermediate output” is all sent to the orchestrator, which gathers it and sorts it out to the machines in layer two.
In MapReduce every step is intrinsically parallel: data is already broken down into pieces which can be processed independently, and results are routed to the reducers by each mapper independently, without passing through a central orchestrator.
Well, to be exact, the reducers are the ones which read the information from each mapper, while the mappers’ task is to create temporary files for each reducer in a specific location, different for each reducer: imagine that each mapper creates a separate folder or a dedicated virtual disk for each reducer.
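One common way to implement this routing (a sketch under my own assumptions, not the book’s implementation) is a stable partition function: every mapper independently hashes each key to decide which reducer’s “file” the pair belongs to, so no central coordinator is involved.

```python
import zlib

def reducer_for(key, num_reducers):
    """Stable hash partition: every mapper, on every machine, sends the
    same coin type to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(mapper_output, num_reducers):
    """Simulate one mapper writing a temporary 'file' (here, a list)
    per reducer."""
    files = [[] for _ in range(num_reducers)]
    for key, value in mapper_output:
        files[reducer_for(key, num_reducers)].append((key, value))
    return files
```

Using a stable hash like CRC32 (rather than Python’s built-in `hash`, which is randomized across runs) guarantees that all mappers agree on the routing without talking to each other.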
Besides speed, the MapReduce approach has another advantage: if a machine crashes, that single machine can be replaced without having to restart the whole computation. This, in turn, can also help us increase availability and reduce latency by allocating redundant resources to preventively cope with malfunctions.
One thing needs to be clear: MapReduce has an orchestrator, a master which controls the computation (spinning up the computational nodes, or requesting existing resources, assigning the input chunks to the mappers, planning how to route intermediate results to reducers, and handling/recovering from errors). The difference with canonical parallelism, though, is that this special node neither reads nor processes the input, and intermediate results don’t have to pass through it, so it’s not a bottleneck for computation. One could object that the master node is still a bottleneck for availability, because if the master node crashes the computation can’t be completed; however, we can get availability guarantees through (limited) redundancy, using replicas (either live copies or a master-slave configuration).
The first catch is that not all computations are suitable for the MapReduce model (or for parallel execution altogether). In general, if data entries are somehow connected and scattered pieces of data influence each other’s contribution to the final result, then parallelization could be impossible: time series are a good example of data that normally needs to be processed sequentially, because the final result depends on the sequence of adjacent data.
For MapReduce, requirements are even higher: to gain an advantage from applying it, we need data which can be grouped by some attributes/fields and reduced for each of these groups separately.
The operation performed in reducers, moreover, must be associative, and the order in which the intermediate sub-lists are output by mappers must not matter.
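A quick property check illustrates the requirement: splitting the intermediate values into sub-lists, reducing each separately in any order, and then reducing the partial results must give the same answer as reducing everything at once. The sketch below (my own, using addition as the reduce operation) satisfies both conditions.

```python
import random

def reduce_op(values):
    """A counting/summing reducer: addition is associative and
    insensitive to the order of its inputs."""
    total = 0
    for v in values:
        total += v
    return total

counts = [3, 1, 4, 1, 5, 9]
expected = reduce_op(counts)

# Reduce two arbitrary chunks, in shuffled order, then reduce the partials:
random.shuffle(counts)
partials = [reduce_op(counts[:2]), reduce_op(counts[2:])]
assert reduce_op(partials) == expected  # grouping and order don't matter
```

An operation like subtraction would fail this check, and with it the MapReduce model.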
Figure 4. Comparing the “classical” approach to parallel computing with MapReduce. We assume that in both cases data is already broken down into chunks and can be passed “by location”, i.e. by providing something like a file handle, without the need for the orchestrator to read the data.
In the basic parallel approach, using processes (either running on threads or on different machines), the orchestrator needs to spin up the threads and make sense of their results; it can either combine these results itself or spawn more processes to combine them (as shown in the figure), but it becomes the bottleneck either way.
It’s worth noting that if, instead of cataloguing all the coins, we counted how many there are (without distinguishing their type) or computed the total value, we wouldn’t need reducers: each parallel process would output its total, and then the totals would be added by a single central process. If that were the case, applying MapReduce would be overkill.
The second catch is that there is no centralized entity which splits the work and distributes it evenly to the reducers. One can become busy while another sits idle: going back to our story, while Daisy Duck, as we have seen, must worry about US currency, Gladstone Gander is assigned all the rare coins from small countries (lucky him – how could it be any different?) and the lists he gets are almost all empty; he only has to perform a few additions.
There is More, Under the Hood
We have seen a few advantages of the MapReduce model, but there’s more to its success than can be seen at a high level. Every parallel computation with a shared state needs synchronization to aggregate results, break down data and assign it to computing units (threads or machines), or check that the processing is complete.
The key advantage of MapReduce is that it intrinsically limits shared state to a minimum (by embracing functional programming concepts like immutability and pure functions): it provides a programming paradigm, a way to specify the problem, which forces us to state it in a way that eliminates shared state, and it handles the little synchronization still needed under the hood.
That’s all for now. If you want to learn more about the book, check it out on liveBook here.
 References: https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html and originally http://norvig.com/21-days.html#answers.
 Considering a WAN or high-performance cloud service. Local clusters in datacenters, when properly configured, can lower this latency by two orders of magnitude, down to 1ms.
 Typically a per-key count on the mappers’ output is done by a third abstraction, the combiners: mini-reducers operating on the same machines as the mappers, and only on the output of individual mappers. Instead of a list with a ton of entries with value 1, the mapper sends out a list with a few entries, reducing both bandwidth and the workload for reducers.
 Methods of an immutable data structure, rather than changing the object A on which they are called, create a new object B whose state is the result of applying the called method to A. For instance, the method that appends an element to a list L1 would create a brand-new list L2 with |L1| + 1 elements, and leave L1 unchanged.
 A pure function is any function that doesn’t have side effects: it takes 0, 1 or more inputs and returns an output (possibly in the form of a tuple), without relying on or causing any change to the input or to the global state. In a sense, a pure function is exactly like a mathematical function.
 That’s also the reason why, as discussed in the previous sub-sections, it can’t be applied to all those problems that cannot be stated in a way that eliminates shared state.