An excerpt from Bayesian Optimization in Action by Quan Nguyen

What is Bayesian optimization? What problem(s) does it propose to solve? If you deal with machine learning in your job and you’re running into problems with things like black-box optimization and hyperparameter tuning, then Bayesian optimization is something you should learn more about. It isn’t as difficult as you might think!
Take 25% off Bayesian Optimization in Action by entering fccnguyen into the discount code box at checkout at manning.com.
Introduction to Bayesian optimization
I’m very happy that you are interested in this book. At a high level, Bayesian optimization is an optimization technique that may be applied when the function we are trying to optimize is a black box and expensive to evaluate, a setting that encompasses many important tasks, including hyperparameter tuning. Bayesian optimization can accelerate the search for the optimum of such a function, helping us locate it with as few function evaluations as possible.
As a machine learning practitioner, you might have heard the term Bayesian optimization from time to time, or you might never have encountered it before. While Bayesian optimization has enjoyed enduring interest from the machine learning (ML) research community, it’s not as commonly used and talked about as other ML topics in practice. Why? Some might say Bayesian optimization has a steep learning curve: you need to understand calculus, know some probability theory, and overall be an experienced ML researcher to use Bayesian optimization in your application. Our goal for this book is to dispel the notion that Bayesian optimization is difficult to use and to show that the technology is more accessible than one would think.
Throughout this book, we will see a lot of illustrations, plots, and, of course, code, which help make whatever topic is currently being discussed more straightforward and concrete. You will learn how each component of Bayesian optimization works at a high level and how to implement it using state-of-the-art libraries in Python. I also hope the accompanying code helps you hit the ground running with your own projects, as the Bayesian optimization framework is very general and “plug-and-play.” The exercises are also helpful in this regard.
Generally, this book is intended to be useful to your machine learning needs and overall a fun read. Now, let’s take a closer look at the problem that Bayesian optimization sets out to solve.
Finding the optimum of an expensive, black-box function is a difficult problem
As mentioned above, hyperparameter tuning in ML is one of the most common applications of Bayesian optimization. We will explore this problem as an example of the general problem of black-box optimization. This will help us understand why Bayesian optimization is needed.
Hyperparameter tuning as an example of an expensive black-box optimization problem
Hyperparameter tuning is a common task in machine learning, so let’s look at an example of what this task is. This will give us a clearer idea of what factors are involved and what decisions need to be made when tuning the hyperparameters of a machine learning model.
Say you want to train a neural network on a large data set, but you are not sure how many layers this neural net should have. You know that the architecture of a neural net is a make-or-break factor in deep learning, so you perform some initial testing and obtain the results shown in Table 1.
Table 1. An example of a hyperparameter tuning task. Our task is to decide how many layers the neural network should have in the next trial in the search for the highest accuracy. It’s difficult to decide which number we should try next.
| Number of layers | Accuracy on the test set |
|------------------|--------------------------|
| 5                | 0.72                     |
| 10               | 0.81                     |
| 20               | 0.75                     |
The best accuracy you have found, 81%, is good, but you think you can do better with a different number of layers. Unfortunately, your boss has set a deadline for you to finish implementing the model. And since training a neural net on your large data set takes several days, you only have a few trials remaining before you have to decide how many layers your network should have. With that in mind, you want to know what other values you should try so that you can find the number of layers giving the highest possible accuracy.
This task is typically called hyperparameter tuning in ML, where you want to find the best setting (hyperparameter values) for your model so as to optimize some performance metric, such as predictive accuracy. In our example, the hyperparameter of our neural net is its depth (the number of layers). If you are working with a decision tree, common hyperparameters are the maximum depth, the minimum number of samples needed to split a node, and the split criterion. With a support-vector machine, you could tune the regularization term and the kernel. Since the performance of a model very much depends on its hyperparameters, hyperparameter tuning is an important component of any ML pipeline.
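To make these concrete, here is a minimal sketch of how such hyperparameters are set in scikit-learn (the use of scikit-learn here is just for illustration; the book’s own code uses other Python libraries):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Decision tree: maximum depth, minimum samples needed to split a node,
# and the split criterion are all hyperparameters we choose up front.
tree = DecisionTreeClassifier(max_depth=5, min_samples_split=10, criterion="gini")

# Support-vector machine: the regularization term C and the kernel
# are the hyperparameters to tune.
svm = SVC(C=1.0, kernel="rbf")
```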
Figure 1. Compute cost of training large neural networks has been steadily growing, making hyperparameter tuning increasingly difficult.
If this is a typical real-world data set, this process could take a lot of time and resources. Figure 1 from OpenAI (https://openai.com/blog/ai-and-compute/) shows that as neural networks keep getting larger and deeper, the amount of computation necessary (measured in petaflop/s-days) increases exponentially.
This is to say that training a model on a large data set is quite involved and takes significant effort. Further, you want to identify the hyperparameter values that give the best accuracy, so training will have to be done many times. How should you choose which hyperparameter values to try so that you can zero in on the best combination as quickly as possible? That is the central question in hyperparameter tuning.
Getting back to our neural net example in Table 1, what number of layers should we try next so that we can find a higher accuracy than 81%? Some value between 10 and 20 layers is promising, since at both 10 and 20, we do better than with 5 layers. But exactly which value we should inspect next is not obvious, since the accuracy could still vary considerably across the numbers between 10 and 20. When we say “vary,” we are implicitly talking about our uncertainty regarding how the test accuracy of our model behaves as a function of the number of layers. Even though we know that 10 layers leads to 81% and 20 layers leads to 75%, we cannot say for certain what accuracy, say, 15 layers would yield. This is to say we need to account for our level of uncertainty when considering these values between 10 and 20.
Further, what if some number greater than 20 would actually give us the highest accuracy possible? This is the case for many large data sets, where a sufficient depth is necessary for a neural net to learn anything useful. Or, though unlikely, what if a small number of layers (fewer than 5) is actually what we need?
How should we explore these different options in a principled way, so that when our time runs out and we have to report back to our boss, we can be sufficiently confident that we have arrived at the best number of layers for our model? This question is an example of the general problem called “expensive black-box optimization,” which we will discuss next.
The problem of expensive, black-box optimization
The problem of expensive, black-box optimization is what Bayesian optimization aims to solve. Understanding why this is such a difficult problem will help us understand why Bayesian optimization is preferred over simpler, more naïve approaches.
In this problem, we have black-box access to a function (some input–output mechanism), and our task is to find the input that maximizes the output of this function. The function is often called the objective function, as optimizing it is our objective, and what we want to find is the optimum of the objective function, the input that yields the highest function value.
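In code, a black-box objective is simply a callable we can query but whose inner workings we cannot inspect. Here is a minimal sketch for our neural-net example; the helper functions are hypothetical placeholders, not real APIs:

```python
def objective(num_layers: int) -> float:
    """Expensive black-box objective: train a neural network with the
    given number of layers and return its accuracy on the test set.
    A single call may take days on a large data set."""
    model = build_network(num_layers)    # hypothetical helper
    train(model, training_data)          # hypothetical, expensive step
    return evaluate(model, test_data)    # hypothetical helper

# All we ever observe are input-output pairs, e.g., objective(10) == 0.81.
# There is no formula and no gradient to work with.
```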
Hyperparameter tuning belongs to this class of expensive black-box optimization problems, but it is not the only one! Imagine any procedure in which we are trying to find some settings/parameters that optimize a process, but we do not know how the different settings influence and control the result of the process.
Further, trying out a particular setting and observing the result it has on the target process (the objective function) is time-consuming, expensive, or costly in some other sense.
The entire procedure is summarized in Figure 2.
Figure 2. The framework of a black-box optimization problem. We repeatedly query the function values at various locations to find the global optimum.
Introducing Bayesian optimization
With the problem of expensive black-box optimization in mind, we will now introduce Bayesian optimization as a solution to this problem. This will give you a high-level idea of what Bayesian optimization is and how it leverages probabilistic machine learning to optimize expensive, black-box functions.
In a Bayesian optimization procedure, we make decisions based on the recommendation of a Bayesian optimization algorithm. Once we have taken the Bayesian optimization-recommended action, the Bayesian optimization model gets updated based on the result of that action, and proceeds to recommend the next action to take. This process repeats until we are confident that we have zeroed in on the optimal action.
There are two main components to this workflow:
- A machine learning model that learns from the observations we make and makes predictions about the values of the objective function at unseen data points
- An optimization policy that decides where to evaluate the objective function next in order to locate the optimum
Let’s introduce each of these two components.
Modeling with a Gaussian process
Bayesian optimization works by first fitting a predictive machine learning model on the objective function we are trying to optimize. This model is sometimes called the surrogate model, as it acts as a stand-in for the objective, representing what we believe the function looks like based on our observations. The role of this predictive model is very important, as its predictions inform the decisions made by a Bayesian optimization algorithm and therefore directly affect optimization performance.
In almost all cases, a Gaussian process (GP) is employed for this role, which we will examine in this subsection. To an ML practitioner, GPs might not be the most popular class of models, compared to, say, decision trees, support-vector machines, or neural networks. However, as we will see time and again throughout this book, GPs come with a unique and essential feature: they do not produce point-estimate predictions like the models listed above; instead, their predictions are in the form of probability distributions.
Probabilistic predictions are key in Bayesian optimization, allowing us to quantify uncertainty in our predictions, which in turn helps trade off risk for reward when making decisions.
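As a preview, here is a minimal GPyTorch sketch that fits a GP to the three observations from Table 1 and produces a full predictive distribution (a mean and a variance) at an unseen point. The mean and kernel choices below are common defaults assumed for illustration, not the book’s specific setup:

```python
import torch
import gpytorch

class SimpleGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# The observations from Table 1: number of layers -> test accuracy.
train_x = torch.tensor([5.0, 10.0, 20.0])
train_y = torch.tensor([0.72, 0.81, 0.75])

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = SimpleGP(train_x, train_y, likelihood)

# In evaluation mode, the GP returns a Gaussian at each query point:
# a mean prediction AND an uncertainty estimate around it.
model.eval()
likelihood.eval()
with torch.no_grad():
    pred = likelihood(model(torch.tensor([15.0])))
    print(pred.mean.item(), pred.variance.item())
```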
Making decisions with a Bayesian optimization policy
In addition to the GP predictive model, Bayesian optimization also needs a decision-making procedure. This is the second component in Bayesian optimization: a policy that takes in the predictions made by the GP model and reasons about how best to evaluate the objective function so that the optimum may be located efficiently.
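To give a flavor of what such a policy looks like, here is a minimal sketch of one classic example, the Expected Improvement score, which uses the GP’s predictive mean and standard deviation to trade off promising predictions against uncertain ones (this particular policy is chosen here for illustration; the book covers a range of policies):

```python
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far):
    """Expected Improvement (maximization, noiseless case): how much we
    expect a candidate to improve on the best value observed so far."""
    improvement = mean - best_so_far
    z = improvement / std
    return improvement * norm.cdf(z) + std * norm.pdf(z)

# An uncertain candidate can outscore a "safe" one with a similar mean:
print(expected_improvement(mean=0.80, std=0.10, best_so_far=0.81))  # ~0.035
print(expected_improvement(mean=0.81, std=0.01, best_so_far=0.81))  # ~0.004
```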
Combining the Gaussian process and the optimization policy to form the optimization loop
Unlike a supervised learning task, in which we simply fit a predictive model on a training data set and make predictions on a test set, a Bayesian optimization workflow is an instance of what’s typically called active learning, a subfield of machine learning in which we get to decide which data points our model learns from, and that decision-making process is in turn informed by the model itself.
As we have said, the GP and the policy are the two main components of this Bayesian optimization procedure. If the GP does not model the objective well, the policy will be working from a poor summary of the information contained in the training data. On the other hand, if the policy is not good at assigning high scores to “good” points and low scores to “bad” points (where “good” means helpful for locating the global optimum), then our subsequent decisions will be misguided and will most likely achieve bad results.
In other words, without a good predictive model such as a GP, we won’t be able to make good predictions with calibrated uncertainty. Without a good policy, we can make good predictions but we won’t make good decisions.
Care needs to go into both components of the framework; this is why the two main parts of the book will be about modeling with GPs and decision making with Bayesian optimization policies.
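Here is a minimal sketch of how the two components combine into the optimization loop, using BoTorch on the layer-count example. The `objective` call is the hypothetical expensive trial sketched earlier, the search bounds are made up for illustration, and the model-fitting utility may differ slightly across BoTorch versions:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

bounds = torch.tensor([[5.0], [50.0]])            # hypothetical search range
train_x = torch.tensor([[5.0], [10.0], [20.0]])   # observations from Table 1
train_y = torch.tensor([[0.72], [0.81], [0.75]])

for _ in range(5):  # each iteration is one expensive trial
    # 1. Fit the GP surrogate to everything observed so far.
    model = SingleTaskGP(train_x, train_y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

    # 2. The policy scores candidate inputs and picks the most promising one.
    policy = ExpectedImprovement(model, best_f=train_y.max())
    candidate, _ = optimize_acqf(
        policy, bounds=bounds, q=1, num_restarts=10, raw_samples=100
    )

    # 3. Run the expensive trial at the recommended point, record the result.
    new_y = torch.tensor([[objective(candidate.item())]])  # hypothetical call
    train_x = torch.cat([train_x, candidate])
    train_y = torch.cat([train_y, new_y])
```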
Next, we summarize the key skills that you will be learning throughout the book.
What will you learn in this book?
This book gives you a deep understanding of the Gaussian process model and the Bayesian optimization task. You will learn how to implement a Bayesian optimization pipeline in Python using state-of-the-art tools and libraries. You will be exposed to a wide range of modeling and optimization strategies when approaching a Bayesian optimization task. By the end of the book, you will be able to:
- Implement high-performance Gaussian process models using GPyTorch, the premier GP modeling tool in Python; visualize and evaluate their predictions; choose appropriate parameters for these models; and implement extensions such as variational Gaussian processes and Bayesian neural networks to scale to big data,
- Implement a wide range of Bayesian optimization policies using the state-of-the-art Bayesian optimization library BoTorch, which integrates nicely with GPyTorch, and inspect as well as understand their decision-making strategies,
- Approach different specialized settings such as batch, constrained, and multiobjective optimization using the Bayesian optimization framework, and
- Apply Bayesian optimization to a real-life task such as tuning the hyperparameters of a machine learning model.
Furthermore, we will be using real-world examples and data in the exercises to consolidate what we learn in each chapter. We will run our algorithms on the same data in many different settings so that we can compare and analyze the different approaches taken.
You can learn more about the book on its product page at manning.com.