A Q & A with Jeff Smith

Learn more about why making machine learning systems reactive matters with Jeff Smith, author of Machine Learning Systems.

Jeff Smith builds large-scale machine learning systems using Scala and Spark. For the past decade, he has been working on data science applications at various startups in New York, San Francisco, and Hong Kong. He blogs and speaks about various aspects of building real world machine learning systems.

Lynn: Where did the concept for Machine Learning Systems come from?

Jeff: The ideas in this book come from several places, not just my personal experience. There is a rich literature on distributed systems, data management, uncertainty, and functional programming that has been built up over decades of computer science research. Only recently have people started to apply all of these other advances to the problems of machine learning systems, so there’s still a lot of exciting cross-pollination for people to achieve.

Lynn: Why would you try to make your machine learning system reactive?

Jeff: Making any system reactive is always about the same thing: actually achieving the mission of the system. The Reactive Manifesto is simply an observation about the new standards that applications are being held to. They must be highly available, yet capable of dynamically scaling their infrastructure and its associated cost. Even seemingly simple applications are now expected to be distributed, so that the failure of one component doesn’t contaminate the state of another. The world expects a lot more out of software these days, and users are far less tolerant of poorly behaved software than they used to be, for all sorts of applications.

The story is no different in machine learning. Whether your machine learning system is trying to recommend products, converse in natural language through a conversational UI, or trade stocks, the default expectation is that your system will still return predictions despite failures or changes in load. But these requirements can be really hard to achieve, because machine learning systems are so intrinsically complicated.

So, organizations that build truly reactive machine learning systems are going to be more capable of delivering what their users want. Whether you’re in a startup, a big company, or even research, that’s going to make you a whole lot more likely to succeed.

Lynn: Do you have a success story about a reactive machine learning system you developed?

Jeff: Personally, I’ve been able to apply the techniques in this book to my work on building Amy, the artificial intelligence that powers x.ai. At x.ai, we’re using machine learning to build a personal assistant that works over email to schedule your meetings. It’s probably the most amazingly sophisticated machine learning system I’ve ever encountered. It takes so much complexity behind the scenes to replicate the experience of a competent human personal assistant. Having consistent behavior at massive scale is absolutely necessary, and so we take advantage of a lot of the techniques that I describe in the book.

Before that, I applied some of these techniques to the problems of ad segmentation and valuation at Intent Media. Ensuring that the system was reactive was core to its mission, because it was making decisions in milliseconds that directly converted into revenue for our customers and our company. If that system had been anything less than perfectly responsive, it would have cost us money.

And before that, I worked on predicting the stock market at Aidyia, an AI hedge fund in Hong Kong. As you might imagine, you really need to build a very sound machine learning system before you’re going to give it access to trading millions of dollars using machine learning models.

But I can’t actually tell you too much more about those systems than those generalities. I certainly can’t show you the source code from any of those systems, which is why I decided to write a book with a bunch of cartoon animals doing machine learning. I want to be able to share techniques that are useful in the real world, even though I can’t show the source code of the systems that I work on every day. The fictitious systems in the book are exactly like real world systems, with all of the same characteristics and challenges, but they’re newly created as a way of talking about how to make real world machine learning systems into reactive ones.

I’ve been working in data science since 2008, and I’ve always felt like the machine learning community has not done a good job of sharing examples of canonical reference architectures or even basic programming techniques useful in building ML systems. One of the best examples of a canonical machine learning architecture is Sean Owen’s work on Oryx, and Mahout before it. Though not conceived as such, it is absolutely a reactive machine learning system. More recently, Simon Chan’s work on prediction.io is a similar sort of reference machine learning architecture that has all sorts of reactive properties. These examples have a lot in common: event-based data modeling, Spark pipelines, prediction servers, etc.

Reference architectures like this make it clear that machine learning systems can be built in ways that are really sophisticated and actually achieve the goals of a reactive system. We’re starting to see more tech companies open source parts of their data and machine learning stacks like Google’s TensorFlow and Airbnb’s Aerosolve. I’m excited that these open source implementations give us as a professional community a basis for discussion about desirable system properties and how to achieve them.

It’s my hope that the book will provide a clear introduction to those insights that we’ve learned about how to build machine learning systems that actually achieve their missions in the real world.

Lynn: Can you really make a machine learning system as reliable as a web app?

Jeff: Yes, you totally can. By employing the principles of reactive systems design, you can ensure that your machine learning application is responsive, resilient, and elastic. But you’ll need more than just principles to guide you. You’ll want to use the best tools and techniques for the job. I would suggest using things like Scala, Spark, Akka, and distributed databases.

But it’s not just about tools. You need to really be able to break down what a machine learning system is, and understand how to build each of those components properly. A lot of the time, you’ll find that you need to use techniques from functional programming or reactive programming. They’re a natural fit for building a sophisticated distributed system like a machine learning system.

There’s also a collection of different design patterns that you can bring to bear on common problems in building machine learning systems. These are things like model learning facades, model supervisors, and predictive microservices.
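As a rough illustration of the resilience idea behind these patterns, here is a minimal sketch in plain Scala. It is not code from the book and does not use Akka; the names `SupervisedPrediction`, `Model`, `predict`, and `demo` are hypothetical, and a real supervisor would track much more than a single failure. The sketch only shows one way a caller can always get a prediction back, by falling back to a simpler model when the primary one fails.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.control.NonFatal

object SupervisedPrediction {
  // Hypothetical model interface (an assumption): features in, prediction out.
  type Model = Vector[Double] => Double

  // Run the primary model asynchronously. If it throws a non-fatal error,
  // answer with the fallback model instead so the caller still gets a prediction.
  def predict(primary: Model, fallback: Model)(features: Vector[Double]): Future[Double] =
    Future(primary(features)).recover { case NonFatal(_) => fallback(features) }

  // Demonstration only: blocking with Await is not something a real service would do.
  def demo(): Double = {
    val broken: Model = _ => throw new RuntimeException("model unavailable")
    val constant: Model = _ => 0.5
    Await.result(predict(broken, constant)(Vector(1.0)), 5.seconds)
  }
}
```

Calling `SupervisedPrediction.demo()` exercises the failure path: the broken primary model throws, and the constant fallback supplies the answer. Whether a constant fallback is acceptable depends entirely on the application.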

A machine learning system is certainly complex, but you can build one that does exactly what you want it to, no matter what happens.

Lynn: Why use Scala to build a machine learning system? Can’t you just use Python?

Jeff: Maybe you could, and certainly, I’ve tried. But there are just some fundamental deficiencies in Python as a runtime that affect what you can achieve with it “out of the box.” Applications built on the PyData stack (scikit-learn, pandas, NumPy, etc.) lack simple concurrency, parallelism, futures, actors, etc.

The multithreaded runtime of the JVM puts Scala in a much better place to handle functionality like parallelism and futures. Scala is also a really well-designed functional programming language with a sophisticated type system. Both of those features are really helpful when structuring a semantically sophisticated application like a machine learning system.
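To make the point about futures concrete, here is a small sketch using only the Scala standard library. It is an illustration, not an example from the book: the `ParallelScoring` object, the dot-product `score` function, and the averaging ensemble are all assumptions chosen to keep the example short.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelScoring {
  // Hypothetical stand-in for a trained model: a dot product of weights and features.
  def score(weights: Vector[Double], features: Vector[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum

  // Start one Future per model so the scores can be computed concurrently,
  // then combine them into a single Future holding their average.
  def ensembleScore(models: Seq[Vector[Double]], features: Vector[Double]): Future[Double] = {
    val scores = models.map(weights => Future(score(weights, features)))
    Future.sequence(scores).map(all => all.sum / all.size)
  }

  // Demonstration only: blocking with Await keeps the example self-contained.
  def demo(): Double = {
    val models = Seq(Vector(1.0, 0.0), Vector(0.0, 1.0))
    Await.result(ensembleScore(models, Vector(2.0, 4.0)), 5.seconds)
  }
}
```

Each call to `Future(...)` is scheduled on the JVM thread pool behind the global execution context, and `Future.sequence` turns the collection of pending scores into one future result without any explicit thread management.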

Beyond the language itself, Scala has some really useful libraries. Akka is an incredibly powerful toolkit for building actor systems for all sorts of applications.

Of course, the biggest project in all of big data is Spark, which is also written in Scala. Although you can use Spark in a bunch of languages, building your Spark app in Scala is just easier and more natural. I once tried to build the same Spark application in Scala, Java, Python, and Clojure. The Scala one was by far the easiest one to work with and grow, as I wanted to enrich its functionality.

People who are particularly concerned with the questions of Scala versus Python for machine learning are often thinking about all of the powerful machine learning functionality provided by Python libraries like scikit-learn. MLlib, Spark’s machine learning library, doesn’t have as many algorithms as scikit-learn has, but it has a wide range of advantages: scalability, simpler extensibility, a consistent API, and more. If you’re able to build the system that you want in Scala, you will often wind up with something that is far more “production ready” than the comparable Python app, and that’s valuable.

Lynn: Is Spark really better than Hadoop MapReduce for machine learning?

Jeff: Yes, it really is. Spark is much easier to use than Hadoop MapReduce, and that includes Hadoop abstractions like Pig, Cascading, and Mahout. Spark’s API is extremely well-designed and high-level. Spark applications are much easier to build and run locally to prove correctness before executing on a cluster, making for faster experimentation iterations. Once executed on a cluster, Spark applications scale much better and run much faster than their Hadoop equivalents, allowing for the use of more iterations, more models, larger ensembles, and everything else required to improve modeling performance.

And things are only getting better for machine learning on Spark. MLlib, Spark’s machine learning library, continues to add functionality, get easier to use, and run faster. The progress that the Spark community has made with building machine learning functionality is really pretty incredible.

Incidentally, Spark and MLlib are also the best way that I know of to build a large-scale machine learning system in languages that aren’t Scala, such as Python or R. Spark gives those languages a degree of scale and reliability that is hard to achieve any other way.

Lynn: Do you have to convert your whole system to this design to get the benefits of having a reactive machine learning system?

Jeff: Absolutely not. The book covers a very broad territory, an entire reactive machine learning system, but you can totally get started with just a part of your system. If you wanted to have a data engineer focus on just building a reactive feature generation pipeline that ensured properties like resilience and elasticity, then you would get the benefits of improving that part of your system.

At the same time, you could have a data scientist working through how to publish models using a possible worlds semantic that was fully aware of the uncertainty of the data in your system. That would be a different project, with a different set of benefits. Parts of this book are focused on DevOps/infrastructure concerns; other parts are focused on testing. Everyone involved in building your machine learning system can be doing something to make your machine learning system more reactive.

Machine Learning Systems gives you a way of thinking about your whole system, but it’s not an all-or-nothing proposition. The book tries to provide you with a coherent roadmap for all of the improvements you might want to make to your system, but you could get started with reactive machine learning in just an afternoon.