From Engineering Deep Learning Systems by Chi Wang and Donald Szeto

This article presents what you can expect to learn from this book and why it is worth learning.

Read it if you’re a software developer interested in transitioning your skills to the field of deep learning system design or an engineering-minded data scientist who wants to build more effective delivery pipelines.

In the deep learning field, it is the models that get all the attention. Perhaps rightly so, when you consider that new applications developed from those models are coming onto the market regularly. These are applications that get consumers excited, such as human-detecting security cameras, accurate voice-recognizing phone menus, fast character recognition in multiple languages with almost instant translation, and advanced driver assistance systems that may one day lead to fully autonomous cars. Within a very short period of time, the deep learning field has filled with immense excitement and promising potential waiting to be fully realized.

But the model does not act alone. In order to bring a product or service to fruition, a model needs to be situated within a system or platform (we use these terms interchangeably) that supports the model with various services and stores. It needs, for instance, an API, a dataset manager, and storage for artifacts and metadata, among others. So, behind every team of deep learning model developers is a team of non-deep learning developers creating the infrastructure that holds the model and all the other components.

The problem we have observed in the industry is that the developers tasked with designing the deep learning system and its components often have only a cursory knowledge of deep learning. They do not understand the specific requirements that deep learning places on system engineering, so they tend to follow generic approaches when building the system. For example, they might choose to leave all work related to deep learning model development to the data scientists and focus only on automation. The system they build then relies on a traditional job scheduling system or a business intelligence data analysis system, neither of which is optimized for how deep learning training jobs are run or for deep learning-specific data access patterns. As a result, the system is hard to use for model development, and model shipping velocity is slow. Essentially, engineers who lack a deep understanding of deep learning are being asked to build systems to support deep learning models, and the systems they engineer are inefficient and ill-suited to deep learning.

Our goal is to help those developers and engineers design and build more effective systems to support deep learning. These developers, along with developers who want to move into the deep learning field, should understand how deep learning systems are designed and put together, as well as how to gather relevant requirements, translate those requirements into system component design choices, and integrate the components into a cohesive system that works well for all users.

The first step is to understand the system as a whole, which supports both deep learning models and deep learning product development. That’s what we’ll look at here: a typical, generic deep learning system and all its components.

Let’s start with a picture. In Figure 1, you’ll see an overview of a typical basic deep learning system.

Figure 1. An overview of a typical deep learning system that includes basic components to support a deep learning development cycle. In later chapters we discuss each component in detail and explain how they fit into this big picture.

The system in question consists of all the rectangular boxes inside the dashed outline, together with its application programming interface (API). Each box represents a system component:

  • Application programming interface (API)
  • Dataset manager
  • Model trainer
  • Model server
  • Metadata and artifacts store
  • Workflow manager
  • Model metrics store

In this book, we assume that they are microservices. This provides the convenient assumption that these components can reasonably support multiple users with different roles securely, and are readily accessible over a network or the Internet. This book, however, will not cover all engineering aspects of how microservices are designed or built. We will focus our discussion on specifics that are relevant to deep learning.

NOTE  Hosted services

You may wonder if you need to design, build, and host all deep learning system components on your own. Indeed, there are open source and hosted alternatives for them. We hope that learning the fundamentals of each component, how they fit into the big picture, and how they are used by different roles will help you make the best decision for your use case.


Let’s take a quick tour through the system components, as shown in Figure 1.

Application Programming Interface

The entry point of our deep learning system is an application programming interface (API) that is accessible over a network. We opted for an API because the system needs to support not only human user interfaces, but also applications and possibly other systems.

While conceptually the API is the single point of entry to the system, it can equally be defined as the sum of the APIs provided by each component, without an extra layer that aggregates everything under a single service endpoint. Throughout this book, we will use the APIs provided by each component directly and skip the aggregation for simplicity.

Dataset Manager

Before a model can be trained, there must be data. The job of the dataset manager is to help organize data into units called datasets. These datasets are bounded in size and are tagged with metadata that describes them, for example, that a dataset contains images encoded by a certain algorithm. Both the data and the metadata of datasets can be used during model training.

One feature that deserves a special callout is support for dataset versioning. Deep learning is a garbage-in-garbage-out business, and data is the origin of almost all side effects produced by a deep learning system, so it is vital to be able to trace an undesirable change back to the data that caused it.
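To make the idea of dataset versioning concrete, here is a minimal sketch in Python. It is not the book's implementation: the `DatasetManager` and `DatasetVersion` names, the in-memory storage, and the choice to derive a version ID from a content hash are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """An immutable snapshot of a dataset (hypothetical structure)."""
    version_id: str
    files: tuple      # (path, content_hash) pairs, sorted by path
    metadata: dict    # descriptive tags, e.g. {"encoding": "png"}


class DatasetManager:
    """Hypothetical in-memory dataset manager that keeps every version."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of DatasetVersion

    def commit(self, name, files, metadata):
        """Record a new version; `files` maps a path to its raw bytes."""
        hashed = tuple(sorted(
            (path, hashlib.sha256(data).hexdigest())
            for path, data in files.items()
        ))
        # Assumption: the version ID is derived from the file hashes,
        # so identical content always yields the same ID.
        version_id = hashlib.sha256(repr(hashed).encode()).hexdigest()[:12]
        version = DatasetVersion(version_id, hashed, dict(metadata))
        self._versions.setdefault(name, []).append(version)
        return version

    def diff(self, name, old_id, new_id):
        """Return the paths whose content differs between two versions."""
        by_id = {v.version_id: v for v in self._versions[name]}
        old, new = dict(by_id[old_id].files), dict(by_id[new_id].files)
        return sorted(p for p in old.keys() | new.keys()
                      if old.get(p) != new.get(p))


manager = DatasetManager()
v1 = manager.commit("street-signs", {"a.png": b"\x01", "b.png": b"\x02"},
                    {"encoding": "png"})
v2 = manager.commit("street-signs", {"a.png": b"\x01", "b.png": b"\xff"},
                    {"encoding": "png"})
changed = manager.diff("street-signs", v1.version_id, v2.version_id)
print(changed)  # ['b.png']
```

If a model trained on the second version behaves worse than one trained on the first, a diff like this narrows the investigation to the files that actually changed.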

Model Trainer

Once you have good data, the logical next step is to train on it to produce a model. Most of this functionality is provided by frameworks such as TensorFlow or PyTorch, and in this book we do not aim to reinvent that. Rather, we will focus on how to perform model training efficiently and securely in a resource-constrained scenario, and explore advanced training techniques such as hyperparameter tuning and distributed training. We will also talk about experimentation, where multiple models are trained and their performance is compared.

Model Server

Once models are trained, they can be used to produce inferences on data that the trainer has not previously seen. As with training, many frameworks provide the functionality to produce inferences using models built within the same framework. Again, in this book, we are not going to explain how to produce inferences from models. We will instead focus on serving architectures that can serve multiple models at high traffic volumes.

Metadata and Artifacts Store

This is the store where trainer code, inference code, and trained models are kept together with the metadata that describes them. This metadata helps preserve the relationships between datasets, trainer code, inference code, trained models, inferences, and metrics to provide complete traceability in the system. Certain static metrics, such as model training metrics, may also reside in this store. Later in the book, we will discuss the importance of this store for experimentation and advanced training techniques.
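The traceability described above can be sketched as a small lineage record. This is an illustrative sketch only: `ModelRecord`, `MetadataStore`, and their fields and methods are hypothetical names, and a real store would persist records rather than hold them in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRecord:
    """Metadata linking a trained model to what produced it (hypothetical)."""
    model_id: str
    dataset_version: str
    trainer_code_version: str
    training_metrics: dict   # static metrics captured at training time


class MetadataStore:
    """Hypothetical in-memory metadata store."""

    def __init__(self):
        self._models = {}  # model_id -> ModelRecord

    def register_model(self, record):
        self._models[record.model_id] = record

    def lineage(self, model_id):
        """Trace a model back to its dataset and trainer code versions."""
        record = self._models[model_id]
        return {
            "dataset_version": record.dataset_version,
            "trainer_code_version": record.trainer_code_version,
        }

    def models_trained_on(self, dataset_version):
        """Find every model affected by a given dataset version."""
        return sorted(m.model_id for m in self._models.values()
                      if m.dataset_version == dataset_version)


store = MetadataStore()
store.register_model(ModelRecord("model-1", "ds-v1", "trainer-abc", {"accuracy": 0.91}))
store.register_model(ModelRecord("model-2", "ds-v2", "trainer-abc", {"accuracy": 0.88}))
store.register_model(ModelRecord("model-3", "ds-v2", "trainer-def", {"accuracy": 0.93}))

origin = store.lineage("model-2")
affected = store.models_trained_on("ds-v2")
print(origin)    # {'dataset_version': 'ds-v2', 'trainer_code_version': 'trainer-abc'}
print(affected)  # ['model-2', 'model-3']
```

The two queries run in opposite directions: from a model back to its inputs, and from a suspect dataset version forward to every model that used it.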

Workflow Manager

The workflow manager is the glue that ties together all executions within the system. A typical workflow would be:

  1. Detect creation of new datasets
  2. Launch model training on new datasets
  3. Deploy the trained model to the model server if it passes some predefined criteria

We will also talk about how this component helps with advanced scenarios such as experimentation and hyperparameter tuning automation.
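The three-step example above can be sketched as a single function. This is a simplified illustration, not the workflow manager discussed in the book: `run_workflow`, its callback parameters, and the `min_accuracy` threshold are hypothetical, and the stand-in callbacks take the place of calls to the dataset manager, model trainer, and model server.

```python
def run_workflow(new_datasets, train, evaluate, deploy, min_accuracy=0.9):
    """Run the detect -> train -> conditionally deploy sequence.

    `train`, `evaluate`, and `deploy` are callables supplied by the caller;
    `min_accuracy` is an assumed example of a predefined criterion.
    """
    deployed = []
    for dataset in new_datasets:               # 1. newly detected datasets
        model = train(dataset)                 # 2. launch model training
        if evaluate(model) >= min_accuracy:    # 3. deploy only if criteria pass
            deploy(model)
            deployed.append(model)
    return deployed


# Stand-in callbacks so the sketch runs without any real services.
scores = {"model-for-ds-1": 0.95, "model-for-ds-2": 0.80}
served = []

result = run_workflow(
    ["ds-1", "ds-2"],
    train=lambda dataset: f"model-for-{dataset}",
    evaluate=lambda model: scores[model],
    deploy=served.append,
)
print(result)  # ['model-for-ds-1']
```

Passing the steps in as callables keeps the orchestration logic separate from the components it coordinates, which is the role the workflow manager plays in the overall system.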

Model Metrics Store

In contrast to the static model training metrics that may live in the metadata and artifacts store, the model metrics store holds time series metrics that are generated from serving models. In this book, we will not talk about how to build the store, but we will talk about important metrics that should be captured, and explore existing options that can be used to store them.
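To show what "time series metrics" means in practice, the sketch below records timestamped values per model and aggregates them over a time window. It only illustrates the shape of the data and is not a recommendation to build such a store by hand: the `ModelMetricsStore` class and its methods are hypothetical, and the metric name and values are made up for the example.

```python
from collections import defaultdict
from statistics import mean


class ModelMetricsStore:
    """Hypothetical in-memory store for metrics emitted while serving models."""

    def __init__(self):
        # (model_id, metric name) -> list of (timestamp_seconds, value)
        self._series = defaultdict(list)

    def record(self, model_id, metric, timestamp, value):
        self._series[(model_id, metric)].append((timestamp, value))

    def window_mean(self, model_id, metric, start, end):
        """Mean of values with start <= timestamp < end, or None if empty."""
        points = self._series.get((model_id, metric), [])
        values = [v for t, v in points if start <= t < end]
        return mean(values) if values else None


metrics = ModelMetricsStore()
metrics.record("model-1", "latency_ms", 0, 10.0)
metrics.record("model-1", "latency_ms", 60, 20.0)
metrics.record("model-1", "latency_ms", 120, 30.0)

first_two_minutes = metrics.window_mean("model-1", "latency_ms", 0, 120)
print(first_two_minutes)  # 15.0
```

Unlike a training metric, which is written once when a model is produced, each of these series keeps growing for as long as the model is being served.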

Now that you know the components that make up a deep learning system, you are ready to start designing and building those components!