An interview with Andrew Ferlitsch by Frances Lefkowitz

Andrew Ferlitsch, from the developer relations team at Google Cloud AI, is so far out on the cutting edge of machine learning and artificial intelligence that he has to invent new terminology to describe what’s happening in Cloud AI with Google Cloud’s enterprise clients. In this interview with editors at Manning Publications, he talks about the current and coming changes in machine learning systems, starting with the concept of model amalgamation. Ferlitsch is currently writing a book, Deep Learning Design Patterns, which collects his ideas along with the most important composable model components.


Access to pre-publication chapters is available here. Take 40% off by entering intferlitsch into the discount code box at checkout.

Manning: Let’s start with the obvious: What do you mean by model amalgamation?

Andrew Ferlitsch: To explain that, I need to give a little history. In 2017, we were working under a paradigm of intelligent automation: you took a process, you broke it down into steps, you’d try to figure out how to automate the individual steps using machine learning and then you would chain them together. So each model is independent: independently trained, independently designed, and independently called. And it’s all being orchestrated by some backend software application.

By 2018, at least at the production level, we got rid of the idea of having a separate model for every task, and started making multi-task models. Multi-task or multi-stage models solve multiple tasks at the same time, within one model, using different stages and components. That’s pretty much how we do it today.

But in 2019, we started to realize that we can expand this idea even further and do distributed multi-tasking, where there are multiple models, each performing different tasks, directly communicating with each other without a backend orchestrating it. They share layers between them, and information, through all kinds of connections. In other words, there is no backend application anymore; the models become the entire application.

That’s what I call model amalgamation.

M: What’s the advantage of stringing together models? For the developers, is it saving time because they can reuse composable model components or patterns written by someone else, rather than building every component from scratch?

AF: There are several advantages. You are correct that reuse is one of them. Assembling configurable components simplifies the design and speeds up development. When you say stringing together models, I believe you are referring to models sharing layers among themselves. By sharing layers, models can be trained faster and respond faster with smaller memory footprints, which allows them to scale further.
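The sharing Ferlitsch describes can be sketched in plain Python; the class and function names here are hypothetical stand-ins, not a real framework API. The point is that one shared component runs once per input and both task heads reuse its output:

```python
# Sketch of layer sharing between two "models" (hypothetical names).
# A single shared encoder runs once per input; each task head reuses
# its output, so the expensive computation is not duplicated.

class SharedEncoder:
    """Stands in for shared convolutional layers; counts invocations."""
    def __init__(self):
        self.calls = 0

    def __call__(self, frame):
        self.calls += 1
        return {"features": f"features({frame})"}  # placeholder feature map

def detector_head(features):
    return f"boxes from {features['features']}"

def captioner_head(features):
    return f"caption from {features['features']}"

encoder = SharedEncoder()
shared = encoder("frame_001")     # encoder runs once...
boxes = detector_head(shared)     # ...and both heads reuse its output
caption = captioner_head(shared)
```

With a separate model per task, the encoder work would be duplicated per task; sharing it is where the memory and latency savings come from.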

M: Are there any advantages outside of the engineering department?

AF: Any advance in the engineering of deep learning has effects elsewhere. I think Jeff Dean, head of Google AI, summed it up well at a 2018 conference when he described how, by using deep learning, 500,000 lines of open source OCR (Optical Character Recognition) code were replaced with a mere 500 lines of TensorFlow. The code base was a thousand times smaller, a thousand times less costly to build and maintain–and more reliable! It’s not just the dramatic cost reduction; deep learning frees up engineering staff to tackle higher-value challenges that benefit both the company and society.

M: So, in the evolution of deep learning, what comes after model amalgamation?

AF: As we bring in more and more automatic learning, we are entering into an approach we call machine design, where now the system itself learns how to put itself together, and learns the best way to communicate across interfaces.

Think of it like computer-aided design, where a computer program built on expert rules assists the designer in designing a system. In machine design, we are advancing to the next step, where the designer sets the objectives and guides the search space, and the machine learns the optimal way to design a model amalgamation.

M: OK, so this is sort of opposite, or maybe inside-out, from the generic use of the term “machine design,” where we would design and create a machine. For you, machine design means the machine is doing the designing?

AF: Absolutely. With the automatic learning that’s emerging into production, the machine is designing, excuse me, it’s learning to design the entire application. That’s what I mean by “machine design.”

M: So if the machine is building itself, what’s left for the developers to do?

AF: We use our skills more and more to guide this process. But the process itself has started building the applications: learning the problem, learning the solution, and learning to build the application. And as I’ve said, this frees up engineering staff to tackle other challenges.

M: Let’s get more concrete to see how these approaches play out. Would you walk me through a real-life model amalgamation system?

AF: An early amalgamation I worked on is in sports broadcasting. It’s a little hard to describe without visuals, but I’ll give it a try.

You start off with live-streaming video of, say, a baseball game. As it’s streaming, you’re pulling frames out, and you might start off at one stage doing object detection. So the system starts to recognize this object as a player and another object as a bat, maybe another object as a ball. Another model in the system could segment out the objects identified as players and put a bounding box around them, and then send it to another model that does facial recognition. Downstream, another model takes this information from a sequence of frames along with pose detection and creates a caption of that part of the video that says, “Player X holding a bat at the plate” or “Player X swinging the bat.”

For different language markets, another model can translate the captions in real time to other languages. And finally, for the visually impaired and radio audiences, the language-translated captions can be converted to voice by a text-to-speech model.
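The chain of models above can be sketched as stub stages; every function here is a hypothetical placeholder for a real model, shown only to make the data flow from frame to audio concrete:

```python
# Stub sketch of the broadcast amalgamation described above; each
# function stands in for a real model (detection, face recognition,
# captioning, translation, text-to-speech).

def detect_objects(frame):
    return [{"label": "player", "box": (10, 20, 50, 80)}]

def recognize_faces(detections):
    return [{**d, "name": "Player X"} for d in detections if d["label"] == "player"]

def caption(identified):
    return f"{identified[0]['name']} holding a bat at the plate"

def translate(text, lang):
    return f"[{lang}] {text}"     # real system: translation model

def text_to_speech(text):
    return f"<audio of: {text}>"  # real system: text-to-speech model

frame = "frame_0421"
audio = text_to_speech(
    translate(caption(recognize_faces(detect_objects(frame))), "es"))
```

In the amalgamation Ferlitsch describes, these stages would communicate directly through shared layers rather than through a backend orchestrator as in this linear sketch.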

You can see now how the model amalgamation is the entire application!

M: Wow. That’s a lot of tasks coming from one initial set of data.

AF: Yes, models in production today don’t have a single output layer. Instead, they have multiple output layers–feature extraction, feature vectors, encodings, probability distributions…numerous layers of information being processed at the same time.
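A toy sketch of what multiple output layers look like to downstream consumers; the function and key names are hypothetical, and the values are placeholders rather than real tensors:

```python
# Hypothetical model exposing several output layers at once: an
# intermediate feature vector, a compact encoding, and a probability
# distribution, all from a single forward pass.

def multi_output_model(frame):
    features = [0.1, 0.5, 0.2]             # intermediate feature vector
    total = sum(features)
    probs = [f / total for f in features]  # probability-distribution head
    return {
        "feature_vector": features,        # reusable by downstream models
        "encoding": hash(frame) % 997,     # toy stand-in for a latent code
        "probabilities": probs,            # classification output
    }

out = multi_output_model("frame_7")
```

A downstream model in the amalgamation can pick whichever output suits its task, rather than re-deriving it from raw pixels.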

M: How exactly does it do pose identification of a batter in a baseball game?

AF: By identifying 18 key points on the human body–major joints and facial landmarks–this model looks at the relationship of these points to each other. And from that, it can be trained to say what the person is doing, whether the player is swinging the bat, say, or getting ready to swing. And then all that data can be put downstream to something that takes that information and automatically generates text, then to a translation model, text-to-voice, and the voice could be connected to a WaveNet audio model which does a natural synthesis of the voice in a target language and accent.
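A real pose model learns the mapping from key-point geometry to action; the hand-written rule below is only a toy stand-in using two of the relationships among the 18 key points, with made-up threshold and labels:

```python
import math

# Toy sketch of pose-based action recognition: a hand-written rule on
# the elbow angle stands in for what a trained model would learn.

def arm_angle(shoulder, elbow, wrist):
    """Angle at the elbow, in degrees, from three (x, y) key points."""
    def direction(a, b):
        return math.atan2(b[1] - a[1], b[0] - a[0])
    ang = direction(elbow, shoulder) - direction(elbow, wrist)
    return abs(math.degrees(ang)) % 360

def classify(keypoints):
    ang = arm_angle(keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"])
    return "swinging" if ang > 150 else "ready to swing"
```

For example, a fully extended arm (shoulder, elbow, and wrist collinear) classifies as "swinging", while a bent arm classifies as "ready to swing".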

M: Ah, so each downstream model benefits from the work previous models in the system have done; they don’t have to go through the trouble of doing it themselves. Does this sharing of processed data happen at a specific layer in a deep learning network? Is it, for instance, taking place between the learner component and the task component?

AF: You’re correct that in many cases–though not all–the sharing of output happens in the area between the learner and task components, which we refer to as the latent space. In other cases, we may need higher dimensionality and share higher-dimension feature maps from earlier in the model. For example, in object detection, by the time we get to the latent space, the feature maps are typically 4×4, or 16 pixels, in size. If we try to crop out a small object, we might be left with only one pixel–way too small to work with. In this case, we would crop from larger feature maps at earlier stages, and then do additional convolutions for small-object detection.
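The arithmetic behind that 4×4 example can be made concrete; the 256×256 input size and 16-pixel object below are assumed numbers, chosen only to illustrate the scaling:

```python
# An object's apparent size shrinks by the total downsampling factor at
# each stage, which is why small objects must be cropped from earlier,
# larger feature maps rather than from the latent space.

def object_size_in_map(object_px, input_px, map_px):
    """Approximate object size (pixels) projected into a feature map."""
    stride = input_px / map_px   # total downsampling at this stage
    return object_px / stride

ball_px = 16                                        # small object in the input
in_latent = object_size_in_map(ball_px, 256, 4)     # 4x4 latent space
in_earlier = object_size_in_map(ball_px, 256, 64)   # earlier 64x64 feature map
```

Under these assumptions the object occupies a quarter of a pixel in the latent space, but a workable 4 pixels in the earlier 64×64 map.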

We used to describe models as a single directed acyclic graph (DAG). But today, they are more like a graph of graphs. That is, a node in the graph may itself be a subgraph, and communication can occur between the underlying subgraphs.
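A minimal data-structure sketch of a graph of graphs, with hypothetical node names; a node at the top level is an entire model (a subgraph), and edges can connect internal nodes of different subgraphs:

```python
# A "graph of graphs": top-level nodes are whole subgraphs, and edges
# may connect a node inside one subgraph to a node inside another.

detector = {"nodes": ["stem", "learner", "boxes"],
            "edges": [("stem", "learner"), ("learner", "boxes")]}
captioner = {"nodes": ["encoder", "decoder"],
             "edges": [("encoder", "decoder")]}

application = {
    "nodes": {"detector": detector, "captioner": captioner},
    # a cross-subgraph edge: the detector's learner output feeds the
    # captioner's encoder directly (sharing through the latent space)
    "edges": [(("detector", "learner"), ("captioner", "encoder"))],
}

def count_ops(graph):
    """Total primitive ops across all nested subgraphs."""
    nodes = graph["nodes"]
    subs = nodes.values() if isinstance(nodes, dict) else []
    return sum(count_ops(g) for g in subs) if subs else len(nodes)
```

The cross-subgraph edge is what distinguishes an amalgamation from independent models wired together by a backend.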

M: You mentioned the WaveNet model in your sports broadcasting use case, where you can send audio to be translated. Is WaveNet one of those patterns we might find in your Deep Learning Design Patterns book?

AF: The book uses computer vision to teach design patterns. WaveNet is an audio model, so we won’t be covering it in the book. But we look at almost TK# of computer vision models.

M: So all of these models can be downloaded and fitted into just about any image-based machine learning system that uses a deep learning set-up with the three basic components of a stem, a learner, and a task?

AF: Basically, yes. Generally a model is composed of these components. But with model amalgamation, some of these can stand alone and be used separately, for instance, reused for other purposes, like a communication interface between two models or solving a subtask of a larger task.
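The stem / learner / task decomposition can be sketched with plain callables (all names and string outputs here are illustrative placeholders); because each component is independent, any one of them can also be reused on its own:

```python
# Sketch of the stem / learner / task components as plain callables.

def stem(image):           # entry point: coarse feature extraction
    return {"coarse": f"coarse({image})"}

def learner(features):     # middle: learned representation (latent space)
    return {"latent": f"latent({features['coarse']})"}

def classify_task(latent):  # one possible task head
    return f"label from {latent['latent']}"

def caption_task(latent):   # the same learner output reused by another task
    return f"caption from {latent['latent']}"

latent = learner(stem("img.png"))
label = classify_task(latent)
caption = caption_task(latent)
```

Both task heads consume the same learner output, which is the stand-alone reuse Ferlitsch describes: the learner's latent representation serves as a communication interface between tasks.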

M: Do you present composable models for all three components?

AF: Yes. I first walk the reader through the easy-to-understand procedural reuse style of design patterns, which I call Idiomatic. Then we look at patterns that learn to self-configure, which I call Composable.

M: So these design patterns can work as both independent models and as composable parts of a larger amalgamation of models acting as one?

AF: Yes.

M: And they’re kind of generic, not data specific?

AF: The design of a model is not specific to the dataset; it’s specific to the task. How you train it, on the other hand, is very specific to the dataset, not the task.

M: Which is precisely why these models are so broadly useful, right? What kinds of things do these composable models do? Are there categories?

AF: Let’s go back to this idea of an orchestration of models working together and sharing things in an amalgamation: there are patterns, or models, for different stages and different tasks. So you have patterns for architecture, for connectivity, and then for applications; models for data augmentation, for the training stage, for fine-tuning parameters, and so on.

The orchestration of all these stages is now handled by another deep learning concept called pipelines. Pipelines have design patterns as well, and can be thought of as a graph of subgraphs.
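A minimal sketch of a pipeline as a graph of subgraphs, with made-up stage names: a stage is either a callable or a nested list of stages, so sub-pipelines (augmentation, training) compose into a larger one:

```python
# A pipeline stage is either a callable or a nested sub-pipeline
# (a list of stages), run in order; this mirrors "a graph of subgraphs".

def run(pipeline, x):
    for stage in pipeline:
        x = run(stage, x) if isinstance(stage, list) else stage(x)
    return x

augment = [lambda d: d + ["flipped"], lambda d: d + ["cropped"]]
train = [lambda d: d + ["trained"]]
full = [augment, train, lambda d: d + ["tuned"]]  # nested sub-pipelines

result = run(full, ["raw"])
```

Each sub-pipeline here is itself a valid pipeline, which is what lets pipeline design patterns compose the same way model components do.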

M: Are there models for amalgamating the models?

AF: That is machine design.

M: Aha! And so we come full circle.

AF: Yes, we do.

M: Let’s end on a different subject altogether. You spent twenty years working in IT in Japan. Did that experience affect the way you think or behave, in your personal or professional life?

AF: That’s my next book! I was fascinated by, and dove deep into, the complex social contracts of Japanese life and work, which are all based on the principle of Harmony.


Andrew Ferlitsch is an expert on computer vision and deep learning at Google Cloud AI Developer Relations. He was formerly a principal research scientist for 20 years at Sharp Corporation of Japan, where he was granted 115 US patents and worked on emerging technologies: telepresence, augmented reality, digital signage, and autonomous vehicles. Currently, he reaches out to developer communities, corporations, and universities, teaching deep learning and evangelizing Google’s AI technologies.

 Pre-publication access to his book Deep Learning Design Patterns is available through Manning Publications on our browser-based liveBook reader here.