From Machine Learning Engineering in Action by Ben Wilson
Before we get into how successful planning phases for ML projects are undertaken, let’s go through a simulation of the genesis of a typical project at a company that doesn’t have an established or proven process for initiating ML work.
Planning: you want me to predict what?!
Let’s imagine that we work at an e-commerce company that is getting a taste for modernizing its website. After years of seeing competitors tout massive sales gains from adding personalization services to their websites, the C-level staff demands that the company go all-in on recommendations. No one in the C-suite is entirely sure of the technical details of how these services are built, but they all know that the first group to talk to is the ML nerds. The business (in this case, the sales department leadership, marketing, and product teams) calls a meeting, inviting the entire ML team, with little added color to the invite apart from the title, “Personalized Recommendations Project Kick-off”.
Management and various departments that you’ve worked with have been happy with the small-scale ML projects that the team built (fraud detection, customer valuation estimation, sales forecasting, and churn probability risk models). Each of the previous projects, although complex in various ways from an ML perspective, was largely insular: handled within the scope of the ML team to come up with a solution that could be consumed by the various business units. At no point in any of these projects was there a need for subjective quality estimations or excessive business rules to influence the results; the mathematical purity of these solutions isn’t open to argument or interpretation. They’re either right or wrong.
Your team is now a victim of its own success: the business approaches the team with a new concept, modernizing the website and mobile applications. They’ve heard about the massive sales gains and customer loyalty that come along with personalized recommendations, and they want you and the team to build a recommendation system for incorporation into the website and the apps. They want each and every user to see a unique list of products greet them when they log in. They want these products to be relevant and interesting to the user and, at the end of the day, they want to increase the chances that the user buys these items.
After a brief meeting where examples from other websites are shown, they ask how long before the system will be ready. You estimate about two months, based on the few papers that you’ve read in the past about these systems, and set off to work. The team creates a tentative development plan during the next scrum meeting, and everyone sets off to try to solve the problem.
You, and the rest of the ML team, assume that what management is looking for is the behavior shown in many other websites in which products are recommended in a main screen. This, after all, is personalization in its purest sense: a unique collection of products that an algorithm has predicted to be relevant to an individual user. It seems straightforward, you all agree, and you quickly begin planning how to build a data set that shows a ranked list of product keys for each user of the website and mobile app, based solely on the browsing and purchase history of each member.
For the next several sprints (see callout above on sprints and agile development in ML), you all studiously work in isolation, testing dozens of implementations that you’ve seen in blog posts and consuming hundreds of papers’ worth of theory on different algorithms and approaches to solve an implicit recommendation problem. You finally build out a minimum viable product (MVP) solution using alternating least squares (ALS) that achieves a root mean squared error (RMSE) of 0.2334, along with a rough implementation of ordered scoring for relevance based on prior behavior. Brimming with confidence that you have something amazing to show the business team sponsor, you head to the meeting armed with the testing notebook, some graphs showing the overall metrics, and some sample inference data that you believe is going to truly impress the team. You start by showing the overall scaled score rating for affinity, displaying the data as an RMSE plot, as shown in figure 1 below.
Figure 1. A fairly standard loss chart of RMSE for the affinity scores to their predicted values.
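For readers less familiar with the metric, RMSE is the square root of the mean squared difference between the observed and predicted values. The following is a minimal sketch in Python; the function name `rmse` and the affinity scores are made up for illustration and aren’t taken from the team’s notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical scaled affinity scores for five user-product pairs.
actual = [0.9, 0.1, 0.4, 0.7, 0.3]
predicted = [0.8, 0.3, 0.5, 0.4, 0.3]
print(round(rmse(actual, predicted), 4))
```

A single number like this summarizes average error well for the modelers, but it says nothing about which products a given person would actually see.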
The response to the chart in figure 1 is lukewarm at best. A bevy of questions arises about what the data means, what the line that intersects the dots means, and how the data was generated, and these questions begin to derail the discussion. Instead of a focused discussion about the solution and the next phase of what you’d like to be working on (increasing the accuracy), the meeting begins to devolve into a mix of confusion and boredom. In an effort to better explain the data, you show a quick table of rank effectiveness using normalized discounted cumulative gain (NDCG) metrics to show the predictive power for a single user who was chosen at random, as shown in figure 2.
Figure 2. NDCG calculations for a recommendation engine for a single user. With no context, presenting raw scores like this does nothing beneficial for the DS team.
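To make the metric in the table concrete: NDCG (normalized discounted cumulative gain) discounts each item’s relevance by the logarithm of its rank position, then divides by the score of the ideal ordering, so that 1.0 means the list is ranked perfectly. The sketch below uses one common formulation with linear gain; the relevance grades are hypothetical and the function names are chosen only for this illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each relevance divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, k=None):
    """DCG of the list as ranked, divided by the DCG of the ideal ordering."""
    ranked = list(relevances)[:k] if k is not None else list(relevances)
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance grades for one user's recommendations, in display order.
shown_order = [3, 2, 3, 0, 1]
print(round(ndcg(shown_order), 4))
```

Other formulations use an exponential gain (2^rel - 1) instead of the raw grade, so scores from different tools aren’t always directly comparable.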
The first chart created a mild sense of perplexity, but the table in figure 2 brings complete and total confusion. No one understands what is being shown or can see its relevance to the project. The only thing on everyone’s mind is: “Is this what weeks of effort can bring? What has the Data Science team been doing all this time?”
During the explanation that the DS team is doing for these two visualizations, one of the marketing analysts begins looking up the product recommendation listing for one of the team members’ accounts in the sample data set that was provided for the meeting. The results of their recommendations (and their thoughts as they bring up the product catalog data for each of the recommendations in their list) are in figure 3 below.
Figure 3. The marketing analyst analyzing their recommendations with the power of imagery. Focusing on presentations which cater to your audience makes complex systems easier to reason about.
The biggest lesson that the DS team learned from this meeting wasn’t the necessity of validating the results of their model in a way that simulates how an end user of the predictions will react. Although that is an important lesson, and one which is discussed in the callout below, it’s trumped quite significantly by the realization that the model was received this poorly because they didn’t plan for the nuances of this project properly.
The DS team hadn’t understood the business problem from the perspective of the team members in the room who knew where all of the proverbial ‘bodies were buried’ in the data and who have cumulative decades of knowledge around the nature of the data and the product.
How could they have done things differently?
The analyst who looked up their own predictions in figure 3 uncovered a great many problems that were obvious to them upon seeing the predictions for their account. They saw the duplicated item data due to the retiring of older product IDs, and they likewise instantly knew that the shoe division used a separate product ID for each color of a style of shoe. Both were core problems that caused a poor demo.
All of the issues found, causing a high risk of project cancellation, were due to improper planning of the project.
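As an illustration of how the two data problems the analyst spotted could be handled once they are known, the sketch below collapses a ranked list so that each underlying product appears only once. It assumes a lookup from raw product IDs to a canonical product exists; that mapping, the IDs, and the function name are all hypothetical, and building such a mapping is exactly the kind of work that needs SME input.

```python
def dedupe_recommendations(ranked_product_ids, canonical_id, limit=None):
    """Keep one entry per canonical product, preserving the ranked order."""
    seen = set()
    result = []
    for product_id in ranked_product_ids:
        # IDs missing from the mapping are treated as their own canonical product.
        key = canonical_id.get(product_id, product_id)
        if key in seen:
            continue
        seen.add(key)
        result.append(key)
        if limit is not None and len(result) == limit:
            break
    return result

# Hypothetical mapping: a retired ID and per-color shoe IDs point at one product.
canonical_id = {
    "SHOE-101-RED": "SHOE-101",
    "SHOE-101-BLUE": "SHOE-101",
    "OLD-778": "LAMP-42",  # retired ID for the same lamp
    "LAMP-42": "LAMP-42",
}
ranked = ["SHOE-101-RED", "SHOE-101-BLUE", "OLD-778", "LAMP-42", "MUG-7"]
print(dedupe_recommendations(ranked, canonical_id))
```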
Basic planning for a project
Planning of any ML project typically starts at a high level. A business unit, executive, or even a member of the DS team comes up with an idea of using the DS team’s expertise to solve a challenging problem. Although typically little more than a concept at this early stage, this is a critical juncture in the lifecycle of a project.
In the scenario we’ve been discussing, the high-level idea is: ‘Personalization’. To an experienced DS, this could mean any number of things. To an SME of the business unit, it could mean many of the same concepts that the DS team could think of, but it may not. From this early point of ‘an idea’ to before even basic research begins, the first thing that everyone involved in this project should be doing is to have a meeting. The subject of this meeting should focus on one fundamental element.
Why are we building this?
It may sound like a hostile or confrontational question to ask. It may take some people aback when hearing it, but it’s one of the most effective and important questions to ask as it opens a discussion into the true motivations for why people want the project to be built. Is it to increase sales? Is it to make our external customers happier? Or is it to keep people browsing on the site for longer durations?
Each of these nuanced answers can help to inform the goal of this meeting: defining the expectations of the output of any ML work. Hand-in-hand with that, the answer also determines the metric criteria for measuring the model’s performance, as well as the attribution scoring of its performance in production (the score which is used to measure A/B testing much later).
In the ‘failed scenario’ that we discussed earlier, the team failed to ask this important ‘why’ question. Figure 4 below shows the divergence in expectations between the business side and the ML side, which arose because neither group was speaking about the essential aspect of the project; both were instead occupied in mental silos of their own creation. The ML team focused entirely on how to solve the problem, while the business team had expectations of what would be delivered, wrongfully assuming that the ML team would ‘just understand it’.
Figure 4. A visual representation of the first main issue with the planning meeting. By not communicating expectations (the ‘why’ of the project), neither side had much common ground to make for a successful MVP.
Figure 4 sums up the planning process for the MVP. With extremely vague requirements, a complete lack of thorough communication about what’s expected as minimum functionality from the prototype, and a failure to rein in the complexity of experimentation, the demonstration is considered an absolute failure. Preventing outcomes like this can be achieved only in these early meetings when the project’s ideas are being discussed.
Continuing with this scenario, let’s take a look at what the MVP demonstration feedback discussion looks like to see the sorts of questions that should have been discussed during that early planning and scoping meeting. Figure 5 shows the questions and the underlying root causes of the misunderstandings present.
Figure 5. The results of the MVP presentation demo. Questions and their subsequent discussions could have happened during the planning phase to prevent all of these five core issues shown.
Although the example in this case is intentionally hyperbolic in nature, I’ve found that there are elements of this confusion present in many ML projects (those outside of primarily ML-focused companies), and this is to be expected. The problems that ML frequently intends to solve are quite complex, full of details which are specific and unique to each business (and business unit within a company), and fraught with misinformation surrounding the minute nuances of said details. What is important is to realize that these struggles are going to be an inevitable part of any project and that the best way to minimize their impact is to have a thorough series of discussions that aim to capture as many details about the problem, the data, and the expectations of the outcome as possible.
Assumption of business knowledge
This is a challenging issue, particularly for a company which is new to utilizing ML, or for a business unit at a company that has never worked with their ML team before.
From the business leadership perspective, an assumption was made that the ML team knew aspects of the business that, to their department, are considered widely held knowledge. Because no clear and direct set of requirements was set out, this knowledge was never identified as a requirement. With no SME from the business unit involved in guiding the ML team during data exploration, there was no way for them to know during the process of building the MVP either.
An assumption of business knowledge is, many times, a dangerous path to tread for most companies. At many companies, the ML practitioners are insulated from the inner workings of a business. With their focus mostly in the realm of providing advanced analytics, predictive modeling, and automation tooling, there’s scant time to devote to understanding the nuances of how and why a business is run. Although some rather obvious aspects of the business are known by all (e.g., “we sell product ‘x’ on our website”), there’s no reasonable expectation that the modelers should know that there’s a business process around the fact that some suppliers of goods are promoted on the site over others.
A good solution for arriving at these nuanced details is by having an SME from the group who is requesting a solution be built for them (in this case, people from the product marketing group) explain how they decide the ordering of products on each page of the website and app. Going through this exercise allows for everyone in the room to understand the specific rules that may be applied to govern the output of a model.
Assumption of data quality
The onus of the issue of duplicate product listings in the output of the demo isn’t entirely on either team. Although the ML team could certainly have planned for this to be an issue, they weren’t aware of it precisely in the scope of its impact. Even had they known, they likely would have wisely mentioned that correcting for this issue wouldn’t be a part of the demo phase (due to the volume of work involved in correcting for that and their request that the prototype not be delayed for too long).
The principal issue here is in not planning for it. Because expectations weren’t discussed, the business’s confidence in the capabilities of the ML team erodes. As a result, the objective measure of the success of the prototype is largely ignored as the business members focus solely on the fact that, in a few users’ sample data, the first three hundred recommendations show nothing but four products in eighty different available shades and patterns.
For our use case here, the ML team believed that the data they used was, as told to them by the Data Engineering (DE) team, quite clean.
Reality, for most companies, is a bit more dire than what most think when it comes to data quality. Figure 6 below summarizes two industry studies, conducted by IBM and Deloitte, whose surveys of thousands of companies showed how those companies were struggling with ML implementations, specifically noting the problems with data cleanliness. With statistics as grim as these, it’s important that at the outset of every project, a thorough evaluation of data sources for modeling be conducted, so that issues are identified and fixed before the development phases of the project begin.
Figure 6. The impact of data quality issues on companies engaging in ML project work. Data quality issues are common, and as such, should always be vetted during the early stages of project work.
It’s unimportant to have ‘perfect data’. Even amongst the companies in figure 6 above which are successful in deploying many ML models to production, they still struggle (75% as reported) with data quality issues regularly. These problems with data are a byproduct of the frequently incredibly complex systems which are generating the data, years (if not decades) of technical debt, and the expense associated with designing systems that don’t generate data that has issues with it. The proper way to handle these known problems is to anticipate them, validate the data which is involved in any project before modeling begins, and to ask questions about the nature of the data to the SMEs who are most familiar with it.
In the case of this recommendation engine, the ML team failed not only to ask questions about the nature of the data that they were modeling (namely, “do all products get registered in our systems in the same way?”), but also to validate the data through analysis. Pulling some rather quick statistical reports may have uncovered this issue quite clearly, particularly if the unique product count of shoes was orders of magnitude higher than that of any other category. “Why do we sell this many shoes?”, posed during a planning meeting, could have instantly uncovered not only the need to resolve this issue with the shoes, but also the need for a deeper inspection and validation of all product categories to ensure that the data going into the models was correct.
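The “quick statistical report” described above can be as simple as counting distinct product IDs per category and flagging any category that dwarfs the rest. The sketch below is one way to do that, assuming the catalog is available as (product_id, category) pairs; the catalog rows, the function name, and the threshold of ten times the median are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

def flag_outlier_categories(catalog, ratio=10):
    """Return categories whose distinct product-ID count exceeds ratio x the median."""
    ids_by_category = defaultdict(set)
    for product_id, category in catalog:
        ids_by_category[category].add(product_id)
    counts = {category: len(ids) for category, ids in ids_by_category.items()}
    if not counts:
        return {}
    typical = median(counts.values())
    return {c: n for c, n in counts.items() if n > ratio * typical}

# Hypothetical catalog rows of (product_id, category).
catalog = (
    [(f"SHOE-{i}", "shoes") for i in range(500)]
    + [(f"LAMP-{i}", "lamps") for i in range(12)]
    + [(f"MUG-{i}", "mugs") for i in range(9)]
    + [(f"RUG-{i}", "rugs") for i in range(15)]
)
print(flag_outlier_categories(catalog))
```

A flagged category isn’t proof of a data problem; it’s a prompt for the “why do we sell this many shoes?” conversation with the people who know the catalog.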
Assumption of functionality
In this instance, the business leaders are concerned that the recommendations show a product which was purchased the week before. Regardless of what the product may be (consumable or not), the planning failure here is that no one expressed beforehand how off-putting it is for the end user to see this happen.
The ML team’s response, that this key element needs to be a part of the final product, is a valid one. At this stage of the process, although it’s upsetting to see results like this from the perspective of the business unit, it’s nearly inevitable. The path forward in this aspect of the discussion should be to scope the feature addition work, make a decision on whether to include it in a future iteration, and move on to the next topic.
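Scoping a feature like this is easier with a concrete picture of what it involves. One simple form is a post-processing filter that removes recently purchased products from the ranked list, sketched below. The 30-day window, the data shapes, and the function name are assumptions for illustration; whether consumables should be exempt, and how long the window should be, are decisions for the business.

```python
from datetime import date, timedelta

def drop_recent_purchases(ranked_product_ids, purchases, today, window_days=30):
    """Remove products the user bought within the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    recently_bought = {pid for pid, bought_on in purchases if bought_on >= cutoff}
    return [pid for pid in ranked_product_ids if pid not in recently_bought]

# Hypothetical purchase history as (product_id, purchase_date) pairs.
today = date(2021, 6, 15)
purchases = [("MUG-7", date(2021, 6, 9)), ("LAMP-42", date(2021, 1, 3))]
ranked = ["MUG-7", "SHOE-101", "LAMP-42"]
print(drop_recent_purchases(ranked, purchases, today))
```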
To this day I haven’t worked on an ML project where this hasn’t come up during a demo. Valid ideas for improvements come from these meetings (this is one of the primary reasons to have them, after all – to make the solution better!). The worst things to do are either dismiss them outright or blindly accept the implementation burden. The best thing to do is to present the cost (time, money, and human capital) for the addition of the improvement and let the internal customer decide if it’s worth it.
Curse of knowledge
The ML team, in this discussion point, instantly went ‘full nerd’. When you communicate the inner details of things that have been tested, it always falls on deaf ears. Assuming that everyone in a room will hear the finer details of a solution as anything but a random collection of pseudo-scientific buzz-word babble does a disservice to yourself as an ML practitioner (you won’t get your point across) and to the audience (they’ll feel ignorant and stupid, frustrated that you assume that they should know such a specific topic).
The better way to discuss the fact that you tried a number of solutions that didn’t pan out: speak in as abstract terms as possible.
“We tried a few approaches, one of which might make the recommendations much better, but it will add a few months to our timeline. What would you like to do?”
Handling of complex topics in layperson context always works much better than delving into deep technical detail. If your audience is interested in more in-depth technical discussion, then gradually ease into deeper technical depth until the question is answered. It’s never a good idea to buffalo your way through an explanation by speaking in terms that are unreasonable to expect them to understand.
Without proper planning, the ML team will likely experiment with a lot of different approaches, choosing the most state-of-the-art ones that they can find in pursuit of the best possible recommendations. Without focus on the important aspects of the solution during the planning phase, this chaotic approach of working solely on ‘model purity’ can lead to a solution that misses the point of the entire project.
After all, sometimes the most accurate model isn’t the best solution. Most of the time, a good solution is one that incorporates the needs of the project, and that generally means keeping the solution as simple as possible to meet those needs. Approaching project work with that in mind helps to alleviate the indecision and complexity surrounding the alternatives for ‘the best solution’ from a modeling perspective.
That first meeting
As we discussed earlier in our simulation about starting the recommendation engine project in the worst possible way, we found a number of problems with how the team approached planning. How did they get to that state of failing to communicate what the project should focus on, though?
Although everyone on the ML team was quietly thinking about algorithms, implementation details, and where to get the data to feed into the model, they were too consumed to ask the questions that should have been posed. No one was asking for details about how it should work, what types of restrictions need to be in place on the recommendations, or whether there should be a consideration of how the products are displayed within a sorted ranked collection. They were all focused on the ‘how’ instead of the ‘why’ and ‘what’.
Conversely, the internal marketing team who brought the project to the ML team didn’t discuss their expectations clearly. With no malicious intent, their ignorance of the methodology of developing this solution, coupled with their intense knowledge of the customer and how they want the solution to behave, created a perfect recipe for an implementation disaster.
How could this have been handled differently? How could that first discussion have been orchestrated to ensure that the greatest number of hidden expectations which the business unit team members hold can be openly discussed in the most productive way? It can be as easy as starting with a single question: “What do you do now to decide what products to display in what places?” In figure 7 below, let’s look at what posing that question may have revealed and how it could have informed the critical feature requirements that should have been scoped for the MVP.
Figure 7. A far more effective planning and scoping meeting for the recommendation engine project.
Figure 7 shows that not every idea is a fantastic one. Some are beyond the scope of budget (time, money, or both). Others are beyond the limits of our technical capabilities (the “things that look nice” request). The important thing to focus on, though, is that two key critical features were identified, and a potential additive future feature that can be put in the backlog for the project.
Although this figure’s dialogue may appear to be quite caricatural, this is a nearly verbatim transcription of a meeting I was part of. Although I was stifling laughter a few times at some of the requests, I found the meeting to be invaluable. Spending a few hours discussing all of the possibilities that SMEs see gave me and my team a perspective that we hadn’t considered, in addition to revealing key requirements about the project that we never would have guessed or assumed without hearing them from the team.
The one thing to make sure to avoid in these discussions is speaking about the ML solution. Keep notes for you and fellow DS team members to discuss later. It’s critical that you don’t drag the discussion away from the primary point of the meeting (gaining insight into how the business solves the problem currently).
One of the easiest ways to approach this subject is, as shown in the callout below, by asking the SMEs who currently do this work (or who interact with the data supporting this functionality) how they do their jobs. This methodology is precisely what informed the line of questioning and discussion in figure 7.
Plan for demos. Lots of demos.
Yet another cardinal sin that the ML team committed in presenting their personalization solution to the business was attempting to show the MVP only once. Perhaps their sprint cadence wasn’t such that they could generate a build of the model’s predictions at times that were convenient, or they didn’t want to introduce slow-down into their progress towards having a true MVP to show to the business. Whatever the reason may be, the team wasted time and effort in trying to save time and effort. They were clearly in the top portion of figure 8 below.
Figure 8. Timeline comparison of feedback-focused demo-heavy project work and internal-only focused development. Although the demonstrations take time and effort, the rework that they save is invaluable.
Even though Agile practices were used within the ML team, to the marketing team, the MVP presentation was the first demo that they had seen in two months of work. At no point in those two months did a meeting take place to show the current state of experimentation, nor was there a plan that was communicated about the cadence of seeing results from the modeling efforts.
Without frequent demos as features are built out, the team at large operates in the dark with respect to the ML aspect of the project. The ML team, meanwhile, is missing out on valuable time-saving feedback from SME members who are able to halt feature development and help to refine the solution.
For most projects involving ML of sufficient complexity, there are far too many details and nuances to confidently approach building out dozens of features without having them reviewed. Even if the ML team is showing metrics for the quality of the predictions, aggregate ranking statistics that ‘conclusively prove’ the power and quality of what they’re building, the only people in the room who care about this are the ML team. In order to effectively produce a complex project, the SME group – the marketing group – needs to provide feedback based on data that they can consume. Presenting arbitrary or complex metrics to that team is bordering on intentional obtuse obfuscation, which only hinders the project and stifles the critical ideas that are required to make the project successful.
By planning for demos ahead of time, at particular cadences, the ML-internal agile development process can adapt to the needs of the business experts to create a more relevant and successful project. The team can embrace a true Agile approach of testing and demonstrating features as they’re built, adapting their future work and adjusting elements in a highly efficient manner. They can help to ensure the project sees the light of day.
Experimentation by solution building: wasting time for pride’s sake
Looking back at the unfortunate scenario of the ML team building a prototype recommendation engine for the website personalization project, their process of experimentation was troubling, and not only for the business.
Without a solid plan in place about what they’re trying and how much time and effort they’re placing on the different solutions they agreed on pursuing, a great deal of time (and code) was unnecessarily thrown away.
Coming out of their initial meeting, they went off on their own as a team, beginning their siloed ideation meeting by brainstorming which algorithms might be best suited for generating recommendations in an implicit manner. Three hundred or more web searches later, they came up with a basic plan of doing a head-to-head comparison of three main approaches: an ALS model, an SVD model, and a Deep Learning recommendation model. Having an understanding of the features that were required to meet the minimum requirements for the project, three separate groups began building what they could in a good-natured competition.
The biggest flaw in approaching experimentation in this way is the sheer scope and size of the waste involved in doing bake-offs like this. Approaching a complex problem by way of a hackathon-like methodology might seem fun to some, not to mention being far easier to manage from a process perspective by the team lead (you’re all on your own – whoever wins, we go with that!), but it’s an incredibly irresponsible way to develop software.
This flawed concept, solution building during experimentation, is juxtaposed with the far more efficient (but, some would argue ‘less fun’) approach of prototype experimentation in figure 9 below. With periodic demos (either internally to the ML team or to the broader external cross-functional team), the project’s experimentation phase can be optimized to have more hands (and minds) focused on getting the project as successful as it can be as fast as is possible.
Figure 9. Comparison of multiple-MVP development (top) vs. experimentation culling development (bottom). By culling options early, more work (at a higher quality and in less time) can get done by the team.
As shown in the top section of figure 9 above, approaching the problem of a model bake-off without planning for prototype culling runs two primary risks. Firstly, in the top portion, the team found it challenging to incorporate into Model A the first primary feature that the business dictated was critical.
Because no evaluation was done after the initial formulation of getting the model to work, a great deal of time was spent trying to get the feature built out to support the requirements. After that was accomplished, when moving on to the second most critical feature, the team members realized that they couldn’t implement the feature in enough time for the demo meeting, effectively guaranteeing that all of the work that was put into model A is going to be thrown away.
Because the groups building them were short-staffed, the other two approaches were both unable to complete the third critical feature. As a result, none of the three approaches satisfies the critical project requirements. This delay to the project, due to the multi-discipline nature of it, affects other engineering teams. What the team should have done instead was follow the path of the bottom section, ‘Prototype Experimentation’.
In this approach, the teams met with the business units early, communicating ahead of time that the critical features wouldn’t be in at this time. They chose instead to make a decision on the raw output of each model type that was under test. After the decision was made in this meeting to focus on a single option, the entire ML team’s resources and time could be focused on implementing the minimum required features (with an added check-in demo between presentations of the core solution to ensure that they were on the right track) and on getting to the prototype evaluation sooner.
Focusing on early and frequent demos, even though features weren’t fully built out yet, helped both to make the most of staff resources and to get valuable feedback from the SMEs. In the end, all ML projects are resource constrained. By narrowly focusing on the fewest and most potentially successful options as early as possible, even a lean set of resources can create successful complex ML solutions.
That’s all for this article. If you want to learn more, check out the book on Manning’s liveBook platform.