From Machine Learning for Business by Doug Hudgeon and Richard Nichol
In this article, you’ll see how SageMaker and the Random Cut Forest algorithm can be used to create a model that will highlight the invoice lines that Brett should query with the law firm. The result will be a repeatable process that Brett can apply to every invoice that will keep the lawyers working for his bank on their toes and will save the bank hundreds of thousands of dollars per year. Off we go!
Brett works as a lawyer for a large bank. He is responsible for checking that the law firms hired by the bank bill the bank correctly. How tough can this be you ask? Pretty tough is the answer. Last year, Brett’s bank used hundreds of different firms across thousands of different legal matters and each invoice submitted by the firm contains dozens or hundreds of lines. Tracking this using spreadsheets is a nightmare.
What is AWS Sagemaker?
Sagemaker is Amazon’s environment for building and deploying machine learning models. Let’s look at the functionality it provides. Sagemaker is revolutionary because it:
- Serves as your development environment in the cloud so you don’t have to set up a development environment on your computer.
- Use a preconfigured machine learning model on your data.
- Use inbuilt tools to validate the results from machine learning model.
- Host your machine learning model.
- Automatically set up an endpoint that take in new data and return predictions.
The diagram below shows how the pieces in Sagemaker fit together.
- Upload the orders dataset to AWS (the AWS system we’ll use to store the data is called S3),
- Create a Jupyter notebook on Sagemaker,
- Train a machine learning model on the dataset,
- Host the machine learning model,
- Create an endpoint that you can use to get predictions from the model, and
- Run the notebook and generate predictions for whether 50 orders require an approver or not.
Figure 1. How Sagemaker works
For the purposes of this article, let’s imagine that Brett has set up a Jupyter notebook on Sagemaker to train and host the machine learning model on the data set represented in the two samples that will be discussed.
What is the Random Cut Forest algorithm?
The machine learning algorithm you’ll use in this article is called Random Cut Forest. It’s a wonderfully descriptive name because the algorithm takes a bunch of random data points (Random), cuts them to the same number of points and creates trees (Cut). It then looks at all of the trees together (Forest) to determine whether a particular data point is an anomaly: Random Cut Forest.
A tree is an ordered way of storing numerical data. The simplest type of tree is called a binary tree. It’s a great way to store data because it’s easy and fast for a computer to use. To create a tree, you randomly subdivide the data points until you isolate the point you’re testing to determine whether it’s an anomaly. Each time you subdivide the data points, it creates a new level of the tree.
The fewer times you need to subdivide the data points before you isolate the target data point the more likely it is that the data point is an anomaly for that sample of data.
In the two sections that follow, you’ll look at two examples of trees with a target data point injected. In the first sample, the target data point appears to be an anomaly. In the second sample, the target data point isn’t an anomaly. When you look at the samples together as a forest, you’ll see that this point isn’t likely to be an anomaly.
Figure 2 shows six grey dots which represent six data points pulled at random from the dataset. These data points are drawn from any of the columns in the dataset. The white dot represents the target data point that you’re testing to determine if it’s an anomaly. Visually, you can see that this white dot sits somewhat apart from the other values in this sample of data; it may be an anomaly. But how do you determine this algorithmically? This is where the tree representation comes in.
Figure 2. Sample 1: Level 1 data points
Figure 3 shows the top level of the tree. The top level is a single node that represents all of the data points in the sample (including the target data point you’re testing). If the node contains any data points other than the target point you are testing for, the color of the node is shown as grey. The top level node is always grey because it represents all of the data points in the sample.
Figure 3. Sample 1: Level 1 tree
Figure 4 shows the data points after the first subdivision. The subdivided line is inserted at random through the data points. Each side of the subdivision represents a node in the tree.
Figure 4. Sample 1: Level 2 data points
Figure 5 shows the next level of the tree. The left side of figure 4 becomes Node B on the left of the tree. The right side of figure 3 becomes Node C on the right of the tree.
Both nodes in the tree are shown as grey because both sides of the subdivided diagram in figure 4 contain at least one grey dot.
Figure 5. Sample 1: Level 2 tree
The next step is to further subdivide the part of the diagram that contains the target data point. This is shown in figure 6. You can see that Node C on the right is untouched whereas the left side is subdivided into Nodes D and E.
Node E contains only the target data point; no further subdivision is required.
Figure 6. Sample 1: Level 3 data points
Figure 7 shows the final tree. Node E is shown in white because it contains the target data point. The tree has three levels. The smaller the tree, the greater the likelihood that the point is an anomaly. A three-level tree is a pretty small tree indicating that the target data point may be anomaly.
Figure 7. Sample 1: Level 3 tree
Now, take a look at another sample of six data points which are clustered more closely around the target data point.
In the second data sample, the randomly selected data points are clustered more closely around the target data point. It’s important to note that our target data point is the same data point that was used in sample 1. The only difference is that a different sample of data points was drawn from the dataset. You can see in figure 8 that the data points in the sample (grey dots) are more closely clustered around the target data point than they were in sample 1.
Note that in figure 8 and the following figures in this section, the tree is displayed below the diagram of the data points.
Figure 8. Sample 2: Level 1 data points and tree
As in sample 1, figure 9 splits the diagram into two sections which we labelled B and C. Because both sections contain grey dots, Level 2 of the tree diagram is shown as grey.
Figure 9. Sample 2: Level 2 data points and tree
Next, the section containing the target data point is split again. Figure 10 shows that section B has been split into two sections, labelled D and E, and a new level has been added to the tree diagram. Both of these sections contain one or more grey dots and level 3 of the tree diagram is shown as grey.
Figure 10. Sample 2: Level 3 data points and tree
The target data point is in section E and that section is split into two sections labelled F and G as shown in figure 11.
Figure 11. Sample 2: Level 4 data points and tree
The target data point is in section F and that section is split into two sections labelled H and J as shown in figure 12. Section J contains only the target data point and it’s shown as white. No further splitting is required. The resulting diagram has five levels which indicates that the target data point isn’t likely to be an anomaly.
Figure 12. Sample 2: Level five data points and tree
The final step performed by the Random Cut Forest algorithm is to combine the trees into a forest. If lots of the samples have small trees then the target data point is likely to be an anomaly. If only a few of the samples have small trees then it’s unlikely to be an anomaly.