From Machine Learning Engineering in Action by Ben T. Wilson

This article talks about the need to carefully plan a machine learning project—before you start it!

Take 40% off Machine Learning Engineering in Action by entering fccwilson4 into the discount code box at checkout at manning.com.

Experimentation needs boundaries

Imagine this: we’ve gone through the exhausting slog of preparing everything that we can up to this point for the recommendation engine project. Meetings have been attended, concerns and risks voiced, and design plans drafted; based on the research phase, we have a clear set of models to try out. It’s finally time to play some jazz, get creative, and see if we can make something that’s not total garbage.

Before we get too excited, though, it’s important to realize that, as with all other aspects of ML project work, we should be doing things in moderation and with a thoughtful purpose behind what we’re doing. This applies to the experimentation phase more than any other aspect of the project – primarily because it is one of the few completely siloed-off phases of the project.

What might we do with this personalized recommendation engine if we had all of the time and resources in the world? Would we research the latest white papers and try to implement a completely novel solution (you may, depending on your industry and company)? Would we think about building a broad ensemble of recommendation models to cover all of the ideas that we have (let’s do a collaborative filtering model for each of our customer cohorts based on their customer lifetime value scores for propensity and their general product-group affinity, then merge that with an FP-Growth market basket model to populate sparse predictions for certain users, and so on)? Perhaps we would build a graph-embedding-based deep learning model that would find relationships in product and user behavior, potentially creating the most sophisticated and accurate predictions that are achievable. All of these are neat ideas and could potentially be worthwhile if the entire purpose of our company was to recommend items to humans. However, they are all very expensive to develop in the currency that most companies are most strapped for: time.

We need to understand that time is a finite resource, as is the patience of the business unit that requested the solution. The scope of the experimentation is tied directly to the resources available: how many Data Scientists are on the team, how many options we are going to compare against one another, and, most critically, the time that we have to complete this in. The final limitation that we need to control for, given these constraints on time and developers, is that there is only so much that can be built within an MVP phase.

It’s tempting to want to fully build out a solution that you have in your head and see it work exactly as you’ve designed it. This works great for internal tools that are helping your own productivity or projects that are internal to the ML team. But pretty much every other thing that an ML Engineer or Data Scientist is going to work on in their careers has a customer aspect to it, be it an internal or external one. This will mean that you have someone else depending on your work to solve a problem. They will have a nuanced understanding of the needs of the solution that might not align with your assumptions.

Not only is it, as we have mentioned earlier, incredibly important to include them in the process of aligning the project to its goals; it’s also potentially very dangerous to fully build out a tightly coupled and complex solution without getting their input on whether what you’re building actually solves the problem.

The way to involve the SMEs in the process is to set boundaries around the prototypes that you’ll be testing.

Set a time limit

Perhaps one of the easiest ways to stall or cancel a project is to pour too much time and effort into the initial prototype. This can happen for any number of reasons, but most of them, I’ve found, are due to poor communication within a team, incorrect assumptions by non-ML team members about how the ML process works (refinement through testing, with healthy doses of trial, error, and reworking mixed in), or an inexperienced ML team assuming that they need to have a ‘perfect solution’ before anyone sees their prototypes.

The best way to prevent this confusion and waste of time is to set limits on the time allotted for experimentation surrounding the vetting of ideas, which, by its very nature, will limit the volume of code that is written at this stage. It should be very clear to all members of the project team that the vast majority of the ideas expressed during the planning stages are not going to be implemented for the vetting phase; rather, in order to make the crucial decision about which implementation to go with, the bare minimum of the project should be tested. Figure 1 below shows the most minimalistic amount of implementation that needs to be done to achieve the goals of the experimentation phase. Any additional work, at this time, does not serve the need at the moment: to decide on an algorithm that will work well at scale, at cost, and that meets objective and subjective quality standards.

Figure 1 Mapping the high-level experimentation phase for the teams testing ideas.

In comparison, figure 2 shows a simplified view of what some of the core features might be based on the initial plan from the planning meeting.

Figure 2 A pseudo architectural plan for the expanded features involved in the development phase, realized by conducting effective experimentation and getting feedback from the larger team.

Based on the comparison of these two figures, figure 1 and figure 2, it should be easy to imagine the increasing scope of work involved in the transition from the first plan to the second. There are entirely new models that need to be built, a great deal of dynamic run-specific aggregations and filtering that need to be done, custom weighting to be incorporated, and potentially dozens of additional data sets to be generated. None of these elements solves the core problem at the boundary of experimentation: which model should we go with for development?

Limiting the time to make this decision will prevent (or at least minimize) the natural tendency of most ML practitioners to want to build a full solution, regardless of the plans that have been laid out. Sometimes forcing less work to get done is a good thing, for the sake of reducing churn and making sure the right elements are being worked on.
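To make the idea of a time-boxed vetting pass concrete, a minimal sketch might fit each candidate model on the same static dataset and record both a quality metric and a rough training cost, since the decision hinges on quality, scale, and cost together. This is an assumption-laden illustration using scikit-learn on synthetic data; the candidate list and metric stand in for whatever the planning phase actually produced.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def vet_candidates(candidates, X, y, seed=42):
    """Fit each candidate on a shared split; return {name: (auc, fit_seconds)}."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    report = {}
    for name, model in candidates.items():
        start = time.perf_counter()
        model.fit(X_tr, y_tr)
        fit_seconds = time.perf_counter() - start
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        report[name] = (auc, fit_seconds)
    return report


# Static, synthetic stand-in for the project's real dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
report = vet_candidates(
    {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    },
    X,
    y,
)
```

Nothing here is production-grade – no pipelines, no tuning, no persistence – and that is the point: it exists only to support the go/no-go decision before the time box closes.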

A note on experimental code quality

Experimental code should be a little ‘janky’. It should be scripted, commented-out, ugly, and nigh-untestable. It should be a script, filled with charts, graphs, print statements, and all manner of bad coding practices.

It’s an experiment, after all. If you’re following a tight timeline to reach an experimental decision, you likely won’t have time to be creating classes, methods, interfaces, enumerators, factory builder patterns, couriering configurations, and so on. You’re going to be using high-level APIs, declarative scripting, and a static data set.

Don’t worry about the state of the code at the end of experimentation. It should serve as a reference for development efforts in which proper coding is done (and under no circumstances should experimental code be expanded upon for the final solution), wherein the team is building maintainable software, using standard software development practices.

But for this stage, and only this stage, it’s usually ok to write some pretty horrible-looking scripts. We all do it sometimes.
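For a sense of what that looks like, here is a deliberately rough sketch of an experimentation-phase script – hard-coded values, print statements, no functions, no tests. The dataset and model are placeholders, not a recommendation; the only job of a script like this is to answer a question quickly.

```python
# Quick-and-dirty vetting script -- NOT to be reused in the final solution.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# hard-coded static snapshot of the data (placeholder for a real extract)
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
preds = model.predict(X_te)
acc = accuracy_score(y_te, preds)

print("rows:", len(X))    # eyeballing the data size
print("accuracy:", acc)   # good enough? we don't know yet
# TODO: try class weights? plot the coefficients later?
```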

Can you put this into production? Would you want to maintain it?

While the primary purpose of an experimentation phase, to the larger team, is to make a decision on the predictive capabilities of a model’s implementation, one of the chief purposes internally, amongst the ML team, is to determine if the solution is tenable for the team.

The ML team lead, architect, or senior ML person on the team should be taking a very close look at what is going to be involved with this project, asking some difficult questions, and producing some very honest answers to them. Some of the most important ones are:

  • How long is this solution going to take to build?
  • How complex is this code base going to be?
  • How expensive is this going to be to train based on the schedule it needs to be retrained at?
  • How much skill is there on my team to be able to maintain this solution? Does everyone know this algorithm / language / platform?
  • How quickly will we be able to modify this solution should something dramatically change with the data that it’s training or inferring on?
  • Has anyone else reported success with using this methodology / platform / language / API? Are we re-inventing the wheel here, or are we building a square wheel?
  • How much additional work is the team going to have to do to make this solution work while meeting all of the other feature goals?
  • Is this going to be extensible? When the inevitable version 2.0 of this is requested, will we be able to enhance this solution easily?
  • Is this testable?
  • Is this auditable?

There have been innumerable times in my career when I’ve either been the one building these prototypes or the one asking these questions while reviewing someone else’s prototype. Although an ML practitioner’s first reaction to seeing results is frequently “let’s go with the one that has the best results”, many times the ‘best one’ ends up being either nigh-impossible to fully implement or a nightmare to maintain. It is of paramount importance to weigh these ‘future thinking’ questions about maintainability and extensibility, whether they concern the algorithm in use, the API that calls the algorithm, or the very platform that it’s running on. Taking the time to properly evaluate the production-specific concerns of an implementation, instead of simply the predictive power of the model’s prototype, can mean the difference between a successful solution and vaporware.

TDD vs RDD vs PDD vs CDD for ML projects

It seems as though there is an infinite array of methodologies to choose from when thinking about how to develop software. From waterfall to the agile revolution (and all of the myriad flavors of that), there are benefits and drawbacks to each. We’re not going to discuss the finer points of which development approach might be best for particular projects or teams. There have been absolutely fantastic books published that are great resources for exploring these topics in depth (Becoming Agile and Test Driven are notable ones), which I highly recommend reading to improve the development processes for ML projects. What is worth discussing here, however, are four general approaches to ML development (one a successful methodology, the others cautionary tales).

Test Driven Development (TDD) or Feature Driven Development (FDD)

Pure TDD is incredibly challenging to achieve for ML projects (mostly due to the non-deterministic nature of models themselves, it certainly cannot reach the test coverage that traditional software development can), and a pure FDD approach can cause significant rework over the course of a project. Most successful approaches to ML project work therefore embrace aspects of both of these development styles. Keeping work incremental, adaptable to change, and focused on modular code that is not only testable but devoted entirely to the features required to meet the project guidelines is a proven approach that helps deliver the project on time while also creating a maintainable and extensible solution.

These Agile approaches will need to be borrowed from and adapted to arrive at an effective development strategy – one that works not only for the development team but also within an organization’s general software development practices. The specific design needs of different projects can also dictate slightly different approaches to how each is implemented.

Why would I want to use different development philosophies?

When discussing ML as a broad topic, one runs the risk of over-simplifying an incredibly complex and dynamic discipline. Since ML is used for such a wide breadth of use cases (as well as having such a broad set of skills, tools, platforms, and languages), the magnitude of difference in complexity amongst various projects is truly astounding.

For a project as simple as “we would like to predict customer churn”, a TDD-heavy approach can be a successful way of developing a solution. The model and inference pipeline for a churn prediction implementation is typically rather simple (the vast majority of the complexity is in the data engineering portion), and as such, modularizing the code so that each component of the data acquisition phase can be independently tested can make for a more efficient implementation cycle and an easier-to-maintain final product.
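As a hypothetical illustration of that modularization (the column name and fill rule below are assumptions for the example, not from the book), a TDD-leaning churn project might isolate each data-preparation step into a small pure function that can be tested without touching a database, a model, or the rest of the pipeline:

```python
def fill_missing_tenure(rows: list, default_tenure: float = 0.0) -> list:
    """Return copies of `rows` (dicts) with missing 'tenure_months' filled.

    Keeping this step as a pure function with no side effects means it
    can be unit-tested in complete isolation from the rest of the pipeline.
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        if new_row.get("tenure_months") is None:
            new_row["tenure_months"] = default_tenure
        filled.append(new_row)
    return filled


# An independent test for just this component:
sample = [
    {"customer_id": 1, "tenure_months": None},
    {"customer_id": 2, "tenure_months": 12.0},
]
result = fill_missing_tenure(sample)
```

Each such function gets its own unit test, so a change to one stage of data acquisition can be verified without re-running the whole pipeline.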

On the other hand, a project as complex as, say, an ensemble recommender engine that uses real-time prediction serving, has hundreds of logic-based reordering features, employs the predictions from several models, and has a large multi-discipline team working on it could greatly benefit from the testability components of TDD while applying the principles of FDD throughout the project to ensure that only the most critical components are developed as needed, which helps to reduce feature sprawl.

Each project is unique, and the team lead or architect in charge of the implementation should plan accordingly for how they would like to control the creation of the solution. With the proper balance of best practices from these proven standards of development in place, a project can reach its required feature-complete state at the lowest possible risk of failure, leaving the solution stable and maintainable while in production.

Prayer Driven Development (PDD)

At one point, all ML projects were this. In many organizations that are new to ML development, it still is. Before the days of well-documented high-level APIs to make modeling work easier, everything was a painful exercise in hoping that what was being scratched and cobbled together would work at least well enough that the model wouldn’t detonate in production. That hoping (and praying) for things to ‘just work please’ isn’t what I’m referring to here, though.

What I’m facetiously alluding to with this title is the act of frantically scanning for clues about how to solve a particular problem: following bad advice on internet forums, or following along with someone (who likely doesn’t have much more actual experience than the searcher) who has posted a blog covering a technology or application of ML that seems somewhat relevant to the problem at hand, only to find out, months later, that the magical solution they were hoping for was nothing more than fluff.

Prayer driven ML development is the process of placing problems that one doesn’t know how to solve into the figurative hands of some all-knowing person who has solved them before, all with the goal of eliminating the odious task of doing proper research and evaluation of technical approaches to the problem. Taking such an easy road rarely ends well: broken code bases, wasted effort (“I did what they did – why doesn’t this work?!”), and, in the most extreme cases, project abandonment. This is a development anti-pattern that has been growing in magnitude and severity in recent years.

The most common effects that I see from this approach of ML ‘copy-culture’ are that people who embrace this mentality want either to use a single tool for every problem (yes, XGBoost is a solid algorithm; no, it’s not applicable to every supervised learning task) or to chase only the latest and greatest fad (“I think we should use TensorFlow and Keras to predict customer churn”).

If all you know is XGBoost, then everything looks like a gradient boosting problem.

Limit yourself in this manner (not doing research, not learning or testing alternate approaches, and restricting experimentation or development to a limited set of tools) and the solution will reflect these limitations and self-imposed boundaries. In many cases, latching onto a single tool or a new fad and forcing it onto every problem creates sub-optimal solutions or, more disastrously, forces one to write far more lines of unnecessarily complex code in order to fit a square peg into a round hole.

A good way of detecting if the team (or yourself) is on the path of PDD is to see what is planned for a prototyping phase for a project. How many models are being tested? How many frameworks are being vetted? If the answer to either of these is, “one”, and no one on the team has solved the particular problem several times before, then you’re doing PDD. And you should stop.

Chaos Driven Development (CDD)

Also known as ‘cowboy development’ (or hacking), CDD is the process of skipping the experimentation and prototyping phases altogether. It may seem easier at first, since there isn’t much refactoring happening early on. However, building ML on an as-needed basis during project work is fraught with peril.

As modification requests and new feature demands arise through the process of developing a solution, the sheer volume of rework (sometimes from scratch) slows the project to a crawl. By the end (if it makes it that far), the fragile state of the ML team’s sanity will entirely prevent any future improvements or changes to the code due to the spaghetti nature of the implementation.

If there is one thing that I hope that readers can take away from this book, it’s to avoid this development style. I’ve not only been guilty of it in my early years of ML project work, but I’ve seen it be one of the biggest reasons for project abandonment in companies that I’ve worked with. If you can’t read your code, fix your code, or even explain how it works, it’s probably not going to work well.

Resume Driven Development (RDD)

By far the most detrimental development practice, designing an over-engineered ‘show-off’ implementation of a problem’s solution is one of the leading causes of projects being abandoned after they are in production.

RDD implementations are generally focused on a few key characteristics:

  • A novel algorithm is involved
      • Unless it’s warranted due to the unique nature of the problem
      • Unless multiple experienced ML experts agree that there is no alternative solution
  • A new (unproven in the ML community) framework for executing the project’s job (with features that serve no purpose in solving the problem) is involved
      • There’s not really an excuse for this nowadays
  • A blog post (or series of blog posts) about the solution is being written during development (after the project is done is fine, though!)
      • This should raise a healthy suspicion amongst the team
      • There will be time to self-congratulate after the project is released to production, has been verified to be stable for a month, and impact metrics have been validated
  • An overwhelming amount of the code is devoted to the ML algorithm as opposed to feature engineering or validation
      • For the vast majority of ML solutions, the ratio of feature engineering code to model code should always be > 4x
  • An abnormal level of discussion in status meetings about the model, rather than the problem to be solved
      • We’re here to solve a business problem, aren’t we?

This isn’t to say that novel algorithm development or incredibly in-depth and complex solutions aren’t called for. They most certainly can be. But they should only be pursued if all other options have been exhausted. If someone were to go from a position of having nothing in place at all to proposing a unique solution that has never been built before, objections should be raised. This development practice and the motivations behind it are not only toxic to the team that will have to support the solution but will poison the well of the project and almost guarantee that it will take longer, be more expensive, and do nothing apart from pad the developer’s resume.

If you want to learn more, check out the book on Manning’s browser-based liveBook reader here.