From Distributed Machine Learning Patterns by Yuan Tang
In this article, we introduce the parameter server pattern which comes handy for situations where the model is too large to fit in a single machine such as one we would have to build for tagging entities in the 8 millions of YouTube videos.
Assume that we have a dataset YouTube-8M at hand that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities (such as food, car, music, etc.) as shown in Figure 1, and we’d like to train a machine learning model to tag the main themes of new YouTube videos that the model hasn’t seen before.
Figure 1. The website that hosts the YouTube-8M dataset with millions of YouTube videos from a diverse vocabulary of 3,800 visual entities.
In this dataset, there are more than three thousands entities, including both coarse and fine-grained entities where coarse entities are the ones that can be recognized by non-domain experts after studying some existing examples and fine-grained entities are identified by domain experts who know how to differentiate between even extremely similar entities.
These entities have been semi-automatically curated and manually verified by 3 raters to be visually recognizable and each one of them has at least 200 corresponding video examples, with an average of 3552 training videos per entity.
When the raters are identifying the entities in the videos, they are given a guideline to assess how specific and visually recognizable each entity is, on a discrete scale from 1 to 5 where 1 can be easily identified by a layperson, as shown in Figure 2.
Figure 2. A screenshot of the question and guideline displayed to human raters when identifying the entities in the YouTube videos to assess how visually recognizable each entity is.
Using the online dataset explorer provided by YouTube-8M, we’ll see the list of entities with the number of videos that belong to each entity on the right hand side of its entity name as seen in Figure 3.
Figure 3. A screenshot of the dataset explorer provided by YouTube-8M website where the entities are ordered by the number of videos in each entity.
Note that in the dataset explorer, the entities are ordered by the number of videos in each entity. We’ll see that the three most popular entities are Game, Video Game, and Vehicle, respectively, ranging from 415,890 to 788,288 training examples. The least frequent are Cylinder and Mortar, with 123 and 127 training videos, respectively.
With this dataset, we’d like to train a machine learning model to tag the main themes of new YouTube videos that the model hasn’t seen before. This may be a trivial task for a simpler dataset and machine learning model but it’s certainly not the case for YouTube-8M dataset. This dataset comes with precomputed audio-visual features from billions of frames and audio segments so that we don’t have to calculate and obtain those on our own which often takes a long time and requires a large amount of computational resources.
Even though it is possible to train a strong baseline model on this dataset in less than a day on a single GPU, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train. Is there any solution to train this potentially very large model efficiently?
Let’s first take a look at some of the entities using the data explorer on YouTube-8M website and see if there are any relationships among the entities. For example, are these entities unrelated to each other or do they have some level of overlap in the content? After some exploration, we will then make necessary adjustments to the model to take those relationships into account.
Figure 4 shows a list of YouTube videos that belong to the Pet entity. For example, there’s a kid playing with a dog in the third video of the first row.
Figure 4. Example videos that belong to the Pet entity.
Let’s take a look at a similar entity. Figure 5 shows a list of YouTube videos that belong to the Animal entity. We can see animals such as fishes, horses, and pandas in this entity. Interestingly, there’s a cat getting cleaned by a vacuum in the third video of the fifth row. One might guess that this video is probably in the Pet entity as well since a cat can also be a pet if it’s adopted by human-beings.
If we’d like to build machine learning models on this, we may do some additional feature engineering before fitting the model with the dataset directly. For example, we may combine the audio-visual features of these two entities into a derived feature since they provide similar information and have overlap, which can boost the model’s performance depending on the specific machine learning model we selected. If we continue exploring the combinations of the existing audio-visual features in the entities or perform a huge number of feature engineering steps, we may no longer be able to train a machine learning model on this dataset in less than a day on a single GPU.
Figure 5. Example videos that belong to the Animal entity.
If we are using a deep learning model instead of traditional machine learning models that require a lot of feature engineering and exploration of the dataset, the model itself would learn the underlying relationships among features, e.g. audio-visual features among similar entities. Each neural network layer in the model architecture consists of vectors of weights and biases that represent a trained neural network layer that will get updated over training iterations as it gathers more knowledge from the dataset.
For example, if we only use 10 out of the 3,862 entities and build a LeNet5 model such as shown in Figure 6 that could classify new YouTube videos into one of the 10 selected entities. At a high level, LeNet consists of a convolutional encoder consisting of two convolutional layers and a dense block consisting of three fully-connected layers. For simplicity, in this case we assume that each individual frame from the videos is a 28×28 image and it will be processed by various convolution and pooling layers that learns the underlying feature mapping between the audio-visual features and the entities.
Figure 6. LeNet model architecture that could be used to classify new YouTube videos into one of the 10 selected entities.
In fact, those learned feature maps contain parameters that are related to the model. These parameters are in the forms of numeric vectors that are used as weights and biases for this layer of model representation. For each training iteration, the model will take every frame in the YouTube videos as features, calculate the loss, and then update those model parameters to optimize the model’s objective so that the relationship between features and the entities can be modeled more closely.
Unfortunately this training process is slow as it involves updating all the parameters in different layers. Here we have two potential solutions to speed up the training process.
Now let’s take a look at the first approach. We do want to make an assumption here and we’ll remove it later once we discuss a better approach. Let’s assume that the model is not too large and we can fit the entire model using existing resources without any possibility of out of memory/disk.
In this case, we can use one dedicated server to store all the LeNet model parameters and use multiple worker machines to split the computational workloads. An architecture diagram is shown in Figure 7.
Figure 7. A machine learning training component with a single parameter server.
Each worker node takes a particular part of the dataset to calculate the gradients and then sends the results to the dedicated server to update the LeNet model parameters. Since the worker nodes use isolated computational resources, they can perform the heavy computations asynchronous without having to communicate with each other. Therefore we’ve achieved around three times speed-up by simply introducing additional worker nodes if other costs like message passing among nodes are neglected.
This dedicated single server responsible for storing and updating the model parameters is called a parameter-server and we’ve actually just designed a more efficient distributed machine learning training system by incorporating the parameter-server pattern.
Next comes the real-world challenge. Deep learning models often get really complex and additional layers with custom structure can be added on top of a baseline model. Those complex models usually take a lot of disk space due to the large number of model parameters in those additional layers and require a lot of computational resources in order to meet the memory footprint requirement that would end up training successfully. What if the model is very large and we cannot fit all of its parameters in a single parameter server?
Let’s talk about the second solution that could address the challenges in this situation. We can introduce additional parameter servers where each of the parameter servers is responsible for storing and updating a particular model partition whereas each different worker node is responsible for taking a particular part of the dataset to update the model parameters in a model partition.
An architecture diagram of this pattern using multiple parameter servers can be found in Figure 8, which is different from Figure 7 where a single server is used to store all the LeNet model parameters and use multiple worker machines to split the computational workloads in Figure 7. Each worker node takes a subset of the dataset, performs calculations required in each of the neural network layers, and then sends the calculated gradients to update one model partition that’s stored in one of the parameter servers. Note that since all workers perform calculations in an asynchronous fashion, the model partitions that each worker node uses to calculate the gradients may not be up-to-date. To guarantee that that model partitions each worker node is using or each parameter server is storing is the most recent, we will have to constantly pull and push the updates of the model in between.
Figure 8. A machine learning training component with multiple parameter servers.
With the help of parameter servers, we could effectively resolve the challenges we have when building a machine learning model to tag the main themes of new YouTube videos that the model hasn’t seen before. For example, Figure 9 shows a list of YouTube videos that are not used for model training and they have been tagged with the Aircraft theme successfully by the trained machine learning model. Even when the model might be too large to fit in a single machine, we could still successfully train the model efficiently.
Figure 9. List of new YouTube videos not used for model training and tagged with Aircraft theme.
In the previous section, we introduced the parameter server pattern and how it can be used to address the potential challenges for the YouTube-8M video identification application. Even though the parameter server pattern is useful in situations where the model is too large to fit in a single machine and this seems like a straightforward approach to address the challenge, however, in real-world applications, there are still decisions that have to be made in order to make the distributed training system efficient.
Machine learning researchers and DevOps engineers often find it hard to figure out a good ratio between the number of parameter servers and the number of workers for different machine learning applications. There are non-trivial communication costs to send the calculated gradients from workers to parameter servers, as well as costs to pull and push the updates of the most recent model partitions. For example, if we find the model is getting larger and we add too many parameter servers to the system, the system will end up spending a lot of time communicating between nodes whereas we actually spent a very small amount of time on the computations among neural network layers..
- If we’d like to train a model with multiple CPUs or GPUs on a single laptop, is this considered distributed training?
- What’s the result of increasing the number of workers or parameter servers?
- What types of computational resources (e.g. CPUs/GPUs/memory/disk) should we allocate to parameter servers and how much of those different types of resources should we allocate?
That’s all for this article. If you want to see more, check out the book on Manning’s liveBook platform here.