From Machine Learning Systems by Jeff Smith

Many people don’t even mention data collection when discussing the work of building machine learning systems. This article discusses the collection of uncertain data and collecting data at scale.


In this article, you will play the role of a noble lion queen. As matriarch of your pride, you take your job very seriously.

  You have one age-old problem, though: your food doesn’t stay still. Every spring, the wildebeests and zebras that you feed on have the annoying habit of leaving the southern grassland plains to migrate north. Then, every fall, these herbivores just turn around and head back south for the rainy season.


As a responsible queen and mother, you must track the movements of this mass migration of your prey. If you didn’t, then your pride would have nothing to eat. But the data management problems of this job are severe (Figure 1).

Figure 1. Great Migration Data

To get any sort of handle on this big, fast, and hairy data, you’re going to need to deploy some advanced technology. You have a long-term vision that you will one day be able to use this data to build a large-scale machine learning system that will predict where the prey will be next, so that your lionesses can be there waiting for them. But before you can even consider building any systems on top of this data, you’re going to need to collect this data and persist it somehow. Thanks to a recently signed contract with technology consultants Vulture Corp., you now have access to some sensor data about the movement of land-bound animals (Figure 2).

Figure 2. Vulture Corp.

The Vulture Corp. “Eyes in the Skies” system is based on an aerially deployed, distributed network of sensors. These sensors use a combination of body heat detection and movement patterns to report back the number and kinds of animals at any given location. An example of the sort of data that this system provides is shown in Table 1.

Table 1. Raw Animal Sensor Data

123456 37 83 1442074486 17 3

❶  A unique identifier for the reading

❷  A unique identifier for the sensor

❸  A unique identifier for the location of the reading

❹  The time when the reading was taken

❺  The number of wildebeests detected

❻  The number of zebras detected
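As a rough sketch, a row like the one in Table 1 could be modeled as an immutable case class. The names `RawReading` and `parseRawReading` are hypothetical and not part of the Eyes in the Skies system; the whitespace-separated layout and the reading of the time field as epoch seconds are assumptions based on the table.

```scala
// Hypothetical model of one row of Table 1; field names are illustrative.
case class RawReading(
  readingId: Long,   // unique identifier for the reading
  sensorId: Int,     // unique identifier for the sensor
  locationId: Int,   // unique identifier for the location of the reading
  timestamp: Long,   // time of the reading (assumed here to be epoch seconds)
  wildebeests: Int,  // number of wildebeests detected
  zebras: Int        // number of zebras detected
)

// Parses a whitespace-separated row; returns None if the row is malformed.
def parseRawReading(row: String): Option[RawReading] =
  row.trim.split("\\s+") match {
    case Array(r, s, l, t, w, z) =>
      try Some(RawReading(r.toLong, s.toInt, l.toInt, t.toLong, w.toInt, z.toInt))
      catch { case _: NumberFormatException => None }
    case _ => None
  }

val parsed = parseRawReading("123456 37 83 1442074486 17 3")
```

Returning an `Option` rather than throwing keeps a single bad row from stopping the collection of every other reading.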


But this raw schema isn’t really what the Eyes in the Skies system understands about the animals below. The sensors only provide an approximate view of what’s going on using body heat and movement. There is always going to be some uncertainty in that process.

You need more data, so you negotiate with (intimidate) Vulture Corp. for a better feed. That data feed, shown in Table 2, is more explicit about the difficulty of being precise with raw sensor data like this.

Table 2a. Even More Raw Animal Sensor Data

123456 37 83 1442074486

Table 2b

16 24 0.15

❶  The lower bound of animals detected

❷  The upper bound of animals detected

❸  The fraction of animals detected that are zebras


This data model has far more statistical richness. It expresses that there may have been many fewer or many more than 20 animals at that location. A consultation of the Eyes in the Skies API documentation reveals that these are the lower and upper bounds of the probability distribution of the sensed data, at a 95% confidence level.

Confidence Intervals

The uncertain data model shown in Table 2 uses a confidence interval. This is a measure of the amount of uncertainty around the sensor reading. The confidence level refers to the percentage of such intervals, across all possible readings, that would be expected to include the true count of animals.

The difference between 16 and 24 prey animals might not sound big, and in some contexts it might not be. But some readings have lower bounds of zero and upper bounds that are non-zero. For those locations, you could send lionesses out expecting to find one or two wildebeests, only to find none at all when they arrive. Thanks to this explicitly uncertain data model, you as the lion queen can now make more informed decisions about where to allocate your scarce hunting resources.
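One way to picture the uncertain feed in code is sketched below. `UncertainReading` and its members are hypothetical names chosen for illustration, not part of the Eyes in the Skies API; the sketch only assumes the three columns of Table 2b.

```scala
// Hypothetical model of the uncertain columns in Table 2b; names are illustrative.
case class UncertainReading(
  lowerBound: Int,       // lower bound of animals detected (95% confidence level)
  upperBound: Int,       // upper bound of animals detected (95% confidence level)
  zebraFraction: Double  // share of detected animals that are zebras
) {
  require(lowerBound >= 0 && lowerBound <= upperBound, "bounds must be ordered")

  // Centre of the interval, the "20 animals" the prose refers to.
  def midpoint: Double = (lowerBound + upperBound) / 2.0

  // Width of the interval: a rough measure of how uncertain the reading is.
  def width: Int = upperBound - lowerBound

  // True when the interval includes zero, so the location may hold no prey at all.
  def mightBeEmpty: Boolean = lowerBound == 0
}

val table2Reading = UncertainReading(16, 24, 0.15)
val riskyReading  = UncertainReading(0, 2, 0.0) // hypothetical reading with a zero lower bound
```

A check like `mightBeEmpty` is only possible because the bounds were kept; a single point estimate would hide that risk.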

The difference between these two data models is an example of a data transformation. In a data processing system like a machine learning system, many of the operations of the system are effectively data transformations. The original sensor data feed from the Eyes in the Skies system was actually a transformation of a more raw form of the data that was originally kept internal to Vulture Corp. As we saw, transforming this raw data actually caused us to lose information about the intrinsic uncertainty in the sensor readings. This is a common mistake that people make when implementing data processing systems: people will destroy valuable data by only persisting some derived version of the raw data.

Consider this proposal from a young lioness working on the sensor data from the Eyes in the Skies system shown in Table 3.

Table 3. Transformed Animal Sensor Data

123456 37 83 1442074486 TRUE

❶  Whether there is enough prey to bother hunting in this location or not.


In this heavily transformed version of the data, the young lioness developer has decided to simplify things down to just a boolean value representing whether there are more than 10 wildebeests in a given location. That’s the cutoff value that you’ve been using lately to decide if a location should be hunted. Her thinking was that this was all you really needed to know to make a decision anyway. But you are an experienced lion queen. In a bad year, you might go out to stalk a single zebra foal.

You may need all of the richness of the data model in Table 2 to make the hard decisions when the time comes. This is an illustration of a fundamental strategy in data collection. You should always collect data as immutable facts.

A fact is merely a true record about something that happened. In the example of the Eyes in the Skies system, the thing that happened was that an aerial sensor sensed some animals. To be able to know when that fact occurred, you should usually record the time at which it occurred, although there are some interesting choices about how that time can be expressed. In the example above, we used a simple timestamp to say when a sensor reading was recorded. Similarly, it is often a good idea to record the entity or entities that the fact relates to. In the animal sensor data, we recorded both the entity from which the fact originated, the sensor, and the entity being described in the fact, the location. Even though there was some uncertainty around the sensor data collected, the facts in Table 2 will never need to be changed. They will always be a fact about the system’s view of the world at that point in time.
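The young lioness's boolean does not have to be lost; it can be derived from the stored facts whenever it is needed. The sketch below assumes a hypothetical `PreyCount` fact carrying the counts from Table 1, and the function names and the second location are invented for illustration.

```scala
// Hypothetical stored fact: the counts are persisted, not the decision.
case class PreyCount(locationId: Int, timestamp: Long, wildebeests: Int, zebras: Int)

// The hunting decision is a derived view, recomputed with whatever cutoff applies today.
def worthHunting(count: PreyCount, minWildebeests: Int): Boolean =
  count.wildebeests > minWildebeests

// In a bad year the rule itself can change without touching the stored facts.
def worthHuntingInBadYear(count: PreyCount): Boolean =
  count.wildebeests + count.zebras >= 1

val stored = PreyCount(83, 1442074486L, 17, 3)
val lean   = PreyCount(84, 1442074486L, 0, 1) // hypothetical location with a single zebra
```

With only `TRUE` or `FALSE` persisted, the bad-year rule could not be evaluated at all; with the facts kept, it is one more function over the same data.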


Immutable facts are an important part of building a reactive machine learning system. Not only do they help you build a system that can manage uncertainty, they also form the foundation for strategies that deal with data collection problems that only emerge once your system gets to a certain scale.
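A minimal sketch of what "collecting immutable facts" might look like in code follows. `SensorFact` and `FactLog` are hypothetical names, and the in-memory `Vector` stands in for whatever durable store a real system would use.

```scala
// Hypothetical fact type: what was sensed, by which sensor, where, and when.
case class SensorFact(sensorId: Int, locationId: Int, timestamp: Long,
                      lowerBound: Int, upperBound: Int)

// An append-only log: recording a fact returns a new log and never edits old entries.
case class FactLog(facts: Vector[SensorFact] = Vector.empty) {
  def record(fact: SensorFact): FactLog = copy(facts = facts :+ fact)

  // All facts about one location, oldest first by reading time.
  def historyFor(locationId: Int): Vector[SensorFact] =
    facts.filter(_.locationId == locationId).sortBy(_.timestamp)
}

val emptyLog = FactLog()
val log1 = emptyLog.record(SensorFact(37, 83, 1442074486L, 16, 24))
val log2 = log1.record(SensorFact(37, 83, 1442078086L, 0, 2)) // hypothetical later reading
```

Because `record` never modifies an existing log, `log1` still describes the world as it was seen before the second reading arrived.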

Collecting Data at Scale

The amazing thing about the Great Migration is its scale. Millions of animals are all on the move at once. The wildebeests are the most numerous of these animals, but there are still hundreds of thousands of gazelles and zebras. Beyond these top three meals on hooves, there is still a long tail of lesser animals to consider.

From high up on your headquarters on Lion Rock, you can only see so much. The Eyes in the Skies system has given you a starting point for understanding the state of the savannah, but it’s just a starting point. It’s clear that you need to start processing this data into more useful representations if you want to be able to take action on any of this data.

Maintaining State in a Distributed System

As your first project, you’ve decided to try to track the density of prey in each region. A region is a geographically contiguous set of locations, each with its own sensor feed. The density statistic that you’d like maintained could be expressed simply, as in Listing 1.

Listing 1 Calculating Location Densities

 case class LocationReading(animals: Int, size: Double)  ❶
 val reading1 = LocationReading(13, 42.0)                ❷
 val reading2 = LocationReading(7, 30.5)
 val readings = Set(reading1, reading2)                  ❸
 val totalAnimals = readings.foldLeft(0)(_ + _.animals)  ❹
 val totalSize = readings.foldLeft(0.0)(_ + _.size)      ❺
 val density = totalAnimals / totalSize                  ❻


❶  A case class representing a reading of the number of animals at a particular location as well as the size of that location

❷  An example instance of a reading

❸  A collection of readings

❹  The sum of all animals in the region

❺  The sum of the square miles of all of the locations in the region

❻  The density in terms of animals/square mile, approximately 0.276 in this example (20 animals over 72.5 square miles)



The summing operations above are implemented using a fold. Folding is a common functional programming technique. The foldLeft operator, itself a higher-order function, begins with an initial sum of zero. Its second argument is the function to be applied to each item in the set of readings. In a sum, this function is just the addition of the running sum with the next item. But folding is a powerful technique that can be used for more than just summing.
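To illustrate folding beyond a plain sum, here is a small sketch that reuses the readings from Listing 1. The names `busiest` and the tuple accumulator are choices made for this example only.

```scala
case class LocationReading(animals: Int, size: Double)

val readings = Set(LocationReading(13, 42.0), LocationReading(7, 30.5))

// One pass that accumulates both totals in a tuple instead of folding twice.
val (totalAnimals, totalSize) =
  readings.foldLeft((0, 0.0)) { case ((animals, size), reading) =>
    (animals + reading.animals, size + reading.size)
  }

// A fold that is not a sum: the largest animal count seen at any one location.
val busiest = readings.foldLeft(0)((best, reading) => math.max(best, reading.animals))
```

In both cases the shape is the same: an initial value, plus a function that combines the running result with the next reading.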

This sort of approach is reasonable in some ways. The code above uses only immutable data structures, so all values used in calculations can be thought of as facts. Summing is done by passing a pure function, +, to the higher-order foldLeft, so there is no chance of a side effect in the summing function causing unforeseen problems. But it’s not yet time for the lions to sleep tonight. Things are going to quickly get trickier.

For you to know which regions have the highest density of prey without going out to where the sensor readings are being reported, you’ll need to aggregate all of those readings in a single location – your headquarters at Lion Rock.

So, you sign a contract with the Cheetah Post message delivery company (Figure 3).

Figure 3. Cheetah Post

They’ll go out to each of the data collection stations and get the latest reading. Then they’ll rush the message about that information back to Lion Rock.

  That latest sensor reading will then be added to the aggregate view of all of the locations. Anticipating problems with a bunch of cheetahs running back and forth, you decide to do what any seasoned leader would do: you put a pangolin in charge.

In exchange for you agreeing not to eat him, the pangolin has agreed to maintain the current state of the density data as part of the system shown in Figure 4.

Figure 4. Simple Density Data System Architecture

He implements the state management process in Listing 2. It shows an example of how this aggregate view of the savannah can be maintained. The example update scenario is the receipt of a message about there being a high density of animals in Region 4.

Listing 2 Aggregating Regional Densities


 case class Region(id: Int)                    ❶

 import collection.mutable.HashMap
 val densities = new HashMap[Region, Double]() ❷

 densities.put(Region(4), 52.4)                ❸


❶  A case class representing a region

❷  A mutable hashmap storing the latest density values recorded by region

❸  This update will overwrite the previous value for Region 4 with the new value

By putting a single pangolin in charge of this process, you’ve ensured that the cheetahs will never fight about making updates. Moreover, by making all of the cheetahs line up to talk to the pangolin, you’ve ensured that all updates are processed in order of their arrival. But the constant change in the number of animals leads to there being more cheetahs (with more updates) than you originally planned for. The process of recording these updates quickly becomes too much for one pangolin to handle.

So you decide to hire another pangolin.

  Now there are two pangolins and two queues that cheetahs can line up in to get their updates applied as shown in Figure 5.

Figure 5. Adding Multiple Queues

At first this seems like a solution. After all, you can keep hiring more pangolins as your scale of data collection goes up. This gives you some ability to continue to apply your updates in the same amount of time despite increasing load.

  But that initial elasticity quickly goes away. Part of the reason is that while one pangolin is making an update to the system, the other pangolin can only wait. So while pangolins do spend some time walking back and forth from queues of cheetahs to the update computer, they quickly end up spending most of their time waiting for access to the one computer. This means that updates about one region end up blocking updates about another region.

You decide to try adding more pangolins with more computers and implement the system shown in Figure 6.

Figure 6. Concurrent Access to Shared Mutable State

To enable multiple pangolins to make updates concurrently you decide to change the data structure used to store the densities data as shown in Listing 3.

Listing 3 Concurrently Accessible Densities


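 import collection.mutable._

 // SynchronizedMap guards each map operation with a lock (the trait is deprecated since Scala 2.11)
 val synchronizedDensities = new LinkedHashMap[Region, Double]() with SynchronizedMap[Region, Double]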


This implementation now allows for concurrent access to make updates to the densities using a system of locks that ensures that each thread of execution has the latest view of the data. So, different pangolins can be at different computers making different updates, but each pangolin holds a lock for the time it takes to make his or her update. At first, this looks like an improvement, but the performance ultimately ends up being similar to the old system. The synchronization process and its locking mechanism are actually quite similar to the old single computer bottleneck. You’ve merely narrowed the scope of what the scarce resource is down to a lock on the mutable data structure. With this bottleneck, you can no longer add more pangolins to get more throughput; they would just contend for locks on the densities hashmap. There’s another unfortunate outcome of this new system. Cheetahs can get in any queue they want, and some pangolins work faster than others. This system will now allow some updates to be processed out of order.
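The locking described above can be made explicit with a hand-rolled sketch. `LockedDensities`, `Pangolin`, and `runPangolins` are hypothetical names, and the class only approximates what a synchronized map does by taking one lock around every operation.

```scala
import scala.collection.mutable

case class Region(id: Int)

// Roughly what a synchronized map does: every operation takes the same lock,
// whichever region it touches.
class LockedDensities {
  private val underlying = mutable.LinkedHashMap.empty[Region, Double]

  def put(region: Region, density: Double): Option[Double] =
    this.synchronized { underlying.put(region, density) }

  def get(region: Region): Option[Double] =
    this.synchronized { underlying.get(region) }

  def size: Int = this.synchronized { underlying.size }
}

// One "pangolin": a thread that applies a single update.
class Pangolin(target: LockedDensities, region: Region, density: Double) extends Thread {
  override def run(): Unit = { target.put(region, density); () }
}

// Four pangolins update four different regions, yet all queue on the one lock.
def runPangolins(): LockedDensities = {
  val densities = new LockedDensities
  val pangolins = (1 to 4).map(id => new Pangolin(densities, Region(id), id * 10.0))
  pangolins.foreach(_.start())
  pangolins.foreach(_.join())
  densities
}

val lockedResult = runPangolins()
```

The result is correct, but an update to Region 1 still has to wait for an update to Region 4 to release the lock, which is the bottleneck the pangolins keep running into.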

For example, Region 6 of the savannah had a high density of animals this morning before all the zebras moved on. If the updates about these sensor readings are applied in order, you will have an accurate view of this region as shown in Listing 4.


Listing 4 In Order Updates

 densities.put(Region(6), 73.6) ❶
 densities.put(Region(6), 0.5)  ❷
 densities.get(Region(6)).get   ❸


❶  The morning update

❷  The afternoon update

❸  Returns 0.5


But now it’s also possible for updates to be applied out of order. A sequence of out of order updates gives you a very different view of the situation (Listing 5).

Listing 5 Out of Order Updates

 densities.put(Region(6), 0.5)  ❶
 densities.put(Region(6), 73.6) ❷
 densities.get(Region(6)).get   ❸


❶  The afternoon update, applied first

❷  The morning update, applied late

❸  Returns 73.6


In this second scenario, you’re sending out your valuable lionesses to go hunt in an area where you should already know that all of the animals have moved on. If you look back at the first sequence of updates, it also has deficiencies. In the afternoon, you have an accurate view of the lack of prey in Region 6, if the updates are applied as in Listing 4.

But what happened in the morning?

There should have been lionesses out there in such a prey-rich region, but they were just lounging around. By the time the afternoon rolled around, all you knew was that there was no more prey in Region 6. You had no idea that a few lazy lionesses missed the day’s best opportunity to hunt. There has to be a better way of organizing a hunt.
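As a hedged illustration of how the immutable facts idea from earlier might bear on this problem, the sketch below keeps every update as a timestamped fact instead of overwriting a single value. `DensityFact`, the two functions, and the timestamps are all hypothetical; this is a sketch of one possible direction, not the design the book goes on to describe.

```scala
case class Region(id: Int)

// Hypothetical fact: the density observed in a region at a given reading time.
case class DensityFact(region: Region, timestamp: Long, density: Double)

// Latest density for a region, chosen by reading time rather than by arrival order.
def latestDensity(facts: Seq[DensityFact], region: Region): Option[Double] = {
  val forRegion = facts.filter(_.region == region)
  if (forRegion.isEmpty) None else Some(forRegion.maxBy(_.timestamp).density)
}

// Highest density ever recorded for a region: the missed opportunity an overwrite hides.
def peakDensity(facts: Seq[DensityFact], region: Region): Option[Double] = {
  val forRegion = facts.filter(_.region == region)
  if (forRegion.isEmpty) None else Some(forRegion.map(_.density).max)
}

// Hypothetical timestamps standing in for the morning and afternoon readings.
val morning   = DensityFact(Region(6), 1442052000L, 73.6)
val afternoon = DensityFact(Region(6), 1442080800L, 0.5)

val arrivedInOrder    = Seq(morning, afternoon)
val arrivedOutOfOrder = Seq(afternoon, morning)
```

Under these assumptions both arrival orders yield the same latest density for Region 6, and the morning's high reading remains available to explain what the lounging lionesses missed.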

That’s all for this article.
