From Machine Learning Systems by Jeff Smith

Actually using models in the real world is tough. To learn about all the complexity of using models in the real world, we’re going to need to move to the bustle of the big city. This article considers the fastest moving animals in the city, turtles.

Save 37% on Machine Learning Systems. Just enter code fccsmith2 into the discount code box at checkout at

An approach to building predictive microservices that wraps models and then puts those microservices into containers is an interesting way of doing things. This article enhances this approach by using containerized predictive services in systems that are actually being exposed to real requests for predictions. As promised, we’re going to talk about this approach with the help of the speediest of the city’s creatures – the turtle.

Figure 1

Moving at the Speed of Turtles

One of the most successful startups in the entire animal kingdom is Turtle Taxi. They’re a technologically-sophisticated take on the business model of taxis. In many major cities, they’ve largely displaced legacy transportation businesses like Caribou Cabs. Part of their success is due to the ease of use of their mobile app, which allows riders to hail a taxi from anywhere at any time. A less obvious part of their success is machine learning. Those looking to create a success of their business idea may want to check out something like the Jack Rabbit Jakarta website for some advice and inspiration. Turtle Taxi employs a large team of semi-aquatic data scientists and engineers who perform sophisticated online optimization of their transportation infrastructure. In comparison to something like a city bus or rail system, the Turtle Taxi fleet is a much harder system to manage. Since drivers have no fixed schedules and can drive whenever they choose to, the fleet of vehicles available to serve customers is always changing. For transportation businesses that deal with truck deliveries and pickups, using things like Titan Winds TMS software can help to organize and manage their fleet, so that they run much more efficiently for employees and customers. Similarly, customers choose to hail a ride whenever they need one, so there are no static schedules like in a traditional public transit system. This highly dynamic environment creates huge challenges for the data team at Turtle Taxi, however, other ride-sharing companies seem to make it work. For example, one of the biggest taxi-like companies is currently Uber. They allow people to get a driver in a matter of minutes, helping them to get to their desired location instantly. Uber employs a huge number of drivers in major cities to make sure people can get lifts when they want them. As the drivers are constantly rushing around to get from one part of the city to the next, there is a heightened chance of an accident occurring. If an accident did ever happen, passengers could always contact this uber accident lawyer for help. However, Uber is a huge business, so drivers should know that they need to take precautions.

They need to answer important business questions like:

  • Are enough drivers on the road to serve demand?

  • Which driver should serve which request?

  • Should the price of a ride move up or down, based on demand?

  • Are customers getting good or bad service?

  • Are drivers in the right part of town to serve the demand?

So, of course, the Turtle Taxi team has spent a lot of time and effort ensuring that their machine learning system holds to the properties of a reactive system. They use many different models to help their systems make autonomous decisions about their complex business problems, and their infrastructure helps them use those models in real-time, at scale. Other things to take into consideration with the business include the administrative side of things, such as making sure the insurance is correct, or learning how to file a confirmation statement.

Building Services with Tasks

Akka HTTP (which is an excellent choice for a lot of applications) grew out of some work using the actor model of concurrency, http4s (which we are going to use) is more focused on providing a programming model more influenced by functional programming. The main difference between the two design philosophies manifests in the user API. When you use http4s, you won’t be using actors or setting up an execution context for an actor system. http4s is part of the Typelevel family of projects and uses many other libraries from that group in its implementation. While you can definitely build very complex services using http4s, we’ll be primarily using it for its simplicity here.

However, there is one new concept that we should look into before starting to explore how to build services with http4s: Tasks. Tasks are related to futures but are a more sophisticated construct that allow us to reason about things like failure and timeouts. When implemented and used properly, tasks can also be more performant than standard Scala futures, due to how they interact with the underlying concurrency facilities provided by the JVM. In particular, tasks allow you to express computation you might not ever execute. Let’s see how to use this capability in your programs.


Tasks, like futures, are a powerful concurrency abstraction that can be implemented in various different ways. Monix is an alternative implementation of the concept of tasks, also from the Typelevel family of projects. Like futures, tasks allow you to execute asynchronous computation. However, futures are actually eager, as opposed to lazy, by default. In this context, eager simply means that execution begins immediately. Assuming that futures are lazy is a common and logical mistake, but while they are asynchronous, they are in fact eager. So, they actually start executing immediately, even if you might like to delay the start of execution. Listing 1 demonstrates this sometimes-undesirable property. In this listing and the one demonstrating tasks, we’ll presume that doStuff is an expensive, long-running computation that we only want to trigger when we’re ready to.

Listing 1 Eager Futures

 import     ?
 import scala.concurrent.Future
 def doStuff(source: String) = println(s"Doing $source stuff") ?
 val futureVersion = Future(doStuff("Future"))                 ?
 Thread.sleep(1000)                                            ?
 println("After Future instantiation")                         ?

? Importing an execution context to be used for the Future

? Defining a function to represent our expensive work

? Instantiating the Future (and actually starting the work)

? Waiting around for 1 second to make it clear that the previous line has already started working

? Showing that the next line of code will only executes after work from the future has been submitted for execution.

If you execute this code, on your console, you should see output like Listing 2.

Listing 2 Eager Futures Output

 Doing Future stuff
 After Future instantiation

Sometimes this property isn’t an issue, but sometimes it is. In the event that we want to define the long running computation for all of the requests that the service gets, but only actually run that computation 1% of the time, a Future would have our service doing 100 times the amount of work that we would want to do. For example, in a system where you have many models available to make predictions, you might only want to actually perform the prediction on a small subset of qualifying requests for any given model. So, clearly, it would be nice to have another option that doesn’t do work that we don’t want the system to do. Listing 3 shows how tasks behave differently than futures.

Listing 3 Lazy Tasks

 import scalaz.concurrent.Task
 val taskVersion = Task(doStuff("Task")) ?
 Thread.sleep(1000)                      ?
 println("After Task instantiation")     ?                         ?

? Instantiating a Task, but not actually starting the work

? Waiting around for 1 second to make it clear that the previous line has not yet started working

? Showing that the entire second has passed

? Executing the Task

In contrast to Listing 2, this code, if executed, should produce output that looks like Listing 4.

Listing 4 Lazy Tasks Output

 After Task instantiation
 Doing Task stuff

Now, we have control over when and if we do work like long-running computations. This is clearly a powerful feature of tasks, and it’s the basis for many of the rest of the more advanced features of tasks such as cancelability.

When we use http4s to build services, we won’t need to know too much more than this about tasks, but it’s helpful to understand the basis of the performance properties of the library. Tasks are just one aspect of the functionality that http4s provides for building performant services. The library also uses scalaz streams to process arbitrary amounts of data.

That’s just a taste of what you can do with these libraries, but it should be enough for us to start building predictive services.

Predicting Traffic

Now that we’ve introduced the tools, let’s get back to solving the problem. In particular, let’s consider the problem of matching taxi drivers to riders. The Turtle Taxi team uses machine learning to predict successful driver-rider matches. For a given rider, their system will attempt to predict from a set of available drivers, which one they will most likely enjoy riding with (as recorded the driver rating on the mobile app).

To begin, we’ll need to create some models to work with. The Turtle Taxi team uses a lot of models, so we’ll start with building support for multiple models from the very beginning. Instead of using our models in combination (an ensemble), we’ll use one or the other on any given prediction request. Real world machine learning systems use models in lots of different ways. Big data teams like Turtle Taxi’s are often producing many models for different purposes. These models may have different strengths and weaknesses that the team is figuring out, by using the models. Experimentation on different modeling techniques is an important part of the machine learning process. The Turtle Taxi team has built their system to allow them to test different learned models in production, so we’ll approximate that implementation here. In particular, we’ll build a simple model experimentation system that will send some traffic to one model and one to another, to evaluate their performance. Figure 2 shows a simple form of what we’re going to build.

Figure 2. Model Experimentation Architecture

Listing 5 shows how to create two simple stub models that just predict true or false, structured as services. They represent two different models of driver-rider match success.

Listing 5 Stub Models

 import org.http4s._
 import org.http4s.dsl._
 object Models {
   val modelA = HttpService { (1)
     case GET -> Root / "a" / inputData => ?
       val response = true                 ?
       Ok(s"Model A predicted $response.") ?
   val modelB = HttpService {              ?
     case GET -> Root / "b" / inputData =>
       val response = false                ?
       Ok(s"Model B predicted $response.")

? Defining the model as a HTTP service

? Using pattern matching to determine that a request to model A has been received

? Always responding true for model A

? Returning an OK status code with the model’s prediction

? Defining a similar stub model service for model B

? Always returning false

Note that these models are constructed as HTTP services. Once we finish building all the necessary infrastructure, they will be independently accessible via any client capable of sending HTTP to them on the network (e.g. your local computer to start). While we haven’t done everything necessary to expose these models as services, let’s start to flesh out how we would like to call these services. For development purposes, let’s assume that we’re serving all of our predictive functionality from our local computer (localhost) on 8080. Let’s also namespace all of our models by the name of the model under a path named models.

Using those assumptions, we can create some client helper functions to call our models from other parts of our system. It’s important to note that we’re defining this client functionality in Scala in the same project purely as a convenience. Since we’re constructing these services as network accessible HTTP services, other clients could easily be mobile apps implemented in Swift or Java or a web front-end implemented in Javascript. The client functionality in Listing 6 is just an example of what a consumer of the success match predictions functionality might look like.

Listing 6 Predictive Clients

 import org.http4s.Uri
 import org.http4s.client.blaze.PooledHttp1Client
 object Client { (1)
   val client = PooledHttp1Client()                   ?
   private def call(model: String, input: String) = { ?
     val target = Uri.fromString(s"http://localhost:8080/models/$model/$input").toOption.get ?
     client.expect[String](target)                    ?
   def callA(input: String) = call("a", input)        ?
   def callB(input: String) = call("b", input)        ?

? Creating an object to contain client helpers

? Instantiating an HTTP client to call modeling services

? Factoring out the common steps of calling models to a helper function

? Creating a URI to call the model from dynamic input and forcing immediate (optimistic) parsing

? Creating a Task to define the request

? Creating a function to call model A

? Creating a function to call model B

The usage of .toOption.get is not really good style; we’re just using it as a development convenience. The implementation of the URI building functionality in http4s is trying to be a bit safer about dynamically generated values like the name of the model and the input data. A future refactor of this code could focus on more sophisticated error handling or use a statically defined route, but for now, we’ll accept that we could receive unprocessable input that would throw errors.

We want to expose a public API that abstracts over how many models we might have published to the server at any given time. Right now, the turtles want to have model A receiving 40% of the requests for predictions and model B receiving the remaining 60%. This is just an arbitrary choice that they’ve made for preferring model B, until model A demonstrates superior performance. We’ll encode that split using a simple splitting function to divide our traffic based on the hash code of the input data. Listing 7 shows the implementation of this hashing function.

Listing 7 Splitting Prediction Requests

 def splitTraffic(data: String) = {        ?
   data.hashCode % 10 match {              ?
     case x if x < 4 => Client.callA(data) ?
     case _ => Client.callB(data)          ?

? A function to split traffic based on input

? Hashing the input and taking the modulus to determine which model to use

? Using pattern matching to select model A 40% of the time

? Using model B in the remainder of cases

If we had more models deployed, we could simply extend this approach to something more dynamic, based on the total number of models deployed and the amount of traffic that they should receive. Now that we have these pieces in place, we can bring all of this together into a unified model server. In this case, we’re going to define our public API being located at a path named api and the prediction functionality specifically being located under the predict path thereof (Listing 8).

Listing 8 A Model Service

 import org.http4s.server.{Server, ServerApp}
 import org.http4s.server.blaze._
 import org.http4s._
 import org.http4s.dsl._
 import scalaz.concurrent.Task
 object ModelServer extends ServerApp { (1)
   val apiService = HttpService {                 ?
     case GET -> Root / "predict" / inputData =>  ?
       val response = splitTraffic(inputData).run ?
       Ok(response)     ?
   override def server(args: List[String]): Task[Server] = { ?
     BlazeBuilder       ?
       .bindLocal(8080) ?
       .mountService(apiService, "/api")          ?
       .mountService(Models.modelA, "/models")    ?
       .mountService(Models.modelB, "/models")    ?
       .start           ?

? Defining the model serving service as a ServerApp for graceful shutdown

? Defining another HttpService to be the primary API endpoint for external use

? Using pattern matching to define when a prediction request has been received

? Passing the input data to the traffic splitting function and immediately invoking it

? Passing through the response with an OK status

? Defining the behavior of the of the server

? Using the built-in backend from Blaze to build up the server

? Binding to port 8080 on the local machine

? Mounting the API service to a path at /api

? Attaching the service for model A to the server at /models

? Attaching the service for model B to the server at /models

? Starting the server

Now, we can actually see our model server in action. If you’ve defined a way build to the application, you can now build and run your application. For an example of how to setup a build for this application, see the online resources for Reactive Machine Learning Systems. Once your application can be built, you can simply issue sbt run and your service should start up and bind to port 8080 on your local machine. You can test your service using a standard web browser and hitting the API endpoint with various endpoints. For example, if the string abc represented a valid feature vector for this service, then hitting http://localhost:8080/api/predict/abc produces a prediction of false (no match) from a prediction from model B.

Looking back on what we’ve just built, it has some useful functionality. It has a simple way of handling multiple models. Moreover, it should be pretty obvious how we could get at least some elasticity by just starting up more instances of our model services and maybe putting them behind a load balancer.

Figure 3. Load Balanced Model Services

It’s not a bad approach, but it still lacks some realism. Turtles are tough creatures who know how to prepare for the worst that life can throw at them. If you want to know more about how these crafty turtles are enhancing their machine learning systems, go grab the book!

That’s all for this article.

For more, check out the book on liveBook here.