From Data Science Bookcamp by Leonard Apeltsin. This 3-part article series covers clustering data points into conceptually meaningful groups using centrality-based techniques.
Take 35% off Data Science Bookcamp by entering fccapeltsin into the discount code box at checkout at manning.com.
Clustering is the process of organizing data points into conceptually meaningful groups. What makes a given group “conceptually meaningful”? There is no easy answer to that question. The usefulness of any clustered output is dependent on the task we’ve been assigned.
Imagine that we’re asked to cluster a collection of pet photos. Do we cluster fish and lizards in one group and fluffy pets (such as hamsters, cats, and dogs) in another? Or should hamsters, cats, and dogs each be assigned a separate cluster of their own? If so, perhaps we should consider clustering pets by breed, so that Chihuahuas and Great Danes fall into diverging clusters. Differentiating between every dog breed will not be easy, although we can easily distinguish Chihuahuas from Great Danes based on size. Maybe we should compromise: we’ll cluster on both fluffiness and size, thus bypassing the distinction between the Cairn Terrier and the similar-looking Norwich Terrier.
Is the compromise worth it? It depends on our data science task. Suppose we work for a pet food company, and our aim is to estimate demand for dog food, cat food, and lizard food. Under these conditions, we must distinguish between fluffy dogs, fluffy cats, and scaly lizards. However, we won’t need to resolve differences between separate dog breeds. Alternatively, imagine an analyst at a vet’s office who’s trying to group pet patients by their breed. This second task requires a much more granular level of group resolution.
Different situations call for different clustering techniques. As data scientists, we must choose the clustering solution appropriate to the task at hand. Over the course of our careers, we will cluster thousands (if not tens of thousands) of datasets using a variety of clustering techniques. The most commonly used algorithms rely on some notion of centrality to distinguish between clusters.
Using centrality to discover clusters
The centrality of data can be represented using the mean. Imagine that we have computed the mean length of two groups of fish and compared the groups by analyzing the difference between their means. We can utilize that difference to determine whether all the fish belong to the same group. Intuitively, all data points in a single group should cluster around one central value, while the measurements in two divergent groups should cluster around two different means. Thus, we can utilize centrality to distinguish between two divergent groups. Let’s explore this notion in concrete detail.
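As a quick illustration, here is a minimal sketch of this comparison, using hypothetical fish-length measurements (the numbers are invented for demonstration):

import numpy as np

# Hypothetical length measurements (in centimeters) for two groups of fish
group_a = np.array([24.1, 25.3, 23.8, 24.9])
group_b = np.array([31.2, 30.4, 32.0, 31.5])
difference = np.mean(group_b) - np.mean(group_a)
print(f"The difference between the group means is {difference:.2f} cm")

The large gap between the two means suggests that the measurements cluster around two distinct central values rather than one.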
Suppose we take a field trip to a lively local pub and see two dartboards hanging side by side. Each of the dartboards is covered in darts, and darts also protrude from the walls. The tipsy players in the pub aim for the bull’s-eye of one board or the other. Frequently, they miss, which leads to the observed scattering of darts centered around the two bull’s-eyes.
Let’s simulate the scattering numerically. We’ll treat each bull’s-eye location as a 2D coordinate. Darts are randomly flung at that coordinate. Consequently, the 2D positions of the darts are randomly distributed. The most appropriate distribution for modeling dart positions is the normal distribution, for the following reasons:
- A typical dart thrower aims at the bull’s-eye, not at the edge of the dartboard. Thus, each dart is more likely to strike close to the center of the board. This behavior is consistent with random normal samples, in which values closer to the mean occur more frequently than other, more distant values.
- We expect the darts to strike the board symmetrically relative to the center. Darts will strike 3 inches left of center and 3 inches right of center with equal frequency. This symmetry is captured by the bell-shaped normal curve.
Suppose the first bull’s-eye is located at coordinate [0, 0]. A dart is thrown at that coordinate. We’ll model the x- and y-positions of the dart using two normal distributions. These distributions share a mean of 0, and we also assume that they share a variance of 2. The following code generates the random coordinates of the dart.
Listing 1. Modeling dart coordinates using two normal distributions
import numpy as np

np.random.seed(0)
mean = 0
variance = 2
# np.random.normal expects a standard deviation, so we take the square root of the variance
x = np.random.normal(mean, variance ** 0.5)
y = np.random.normal(mean, variance ** 0.5)
print(f"The x coordinate of a randomly thrown dart is {x:.2f}")
print(f"The y coordinate of a randomly thrown dart is {y:.2f}")

The x coordinate of a randomly thrown dart is 2.49
The y coordinate of a randomly thrown dart is 0.57
We can more efficiently model dart positions using the np.random.multivariate_normal method. This method selects a single random point from a multivariate normal distribution. The multivariate normal curve is simply a normal curve that is extended to more than one dimension. Our 2D multivariate normal distribution will resemble a round hill whose summit is positioned at [0, 0].
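For instance, here is a minimal sketch of sampling a single dart with np.random.multivariate_normal; the diagonal covariance matrix encodes our assumption that the x- and y-positions are independent, each with a variance of 2:

np.random.seed(0)
mean = [0, 0]                 # the bull's-eye position
covariance = np.diag([2, 2])  # independent x and y, each with variance 2
x, y = np.random.multivariate_normal(mean, covariance)
print(f"A randomly thrown dart lands at ({x:.2f}, {y:.2f})")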
Let’s simulate 5,000 random darts tossed at the bull’s-eye positioned at [0, 0]. We also simulate 5,000 random darts tossed at a second bull’s-eye, positioned at [6, 0]. Then we generate a scatter plot of all the random dart coordinates (figure 1).
Listing 2. Simulating randomly thrown darts
import matplotlib.pyplot as plt

np.random.seed(1)
bulls_eye1 = [0, 0]
bulls_eye2 = [6, 0]
bulls_eyes = [bulls_eye1, bulls_eye2]
x_coordinates, y_coordinates = [], []
for bulls_eye in bulls_eyes:
    for _ in range(5000):
        x = np.random.normal(bulls_eye[0], variance ** 0.5)
        y = np.random.normal(bulls_eye[1], variance ** 0.5)
        x_coordinates.append(x)
        y_coordinates.append(y)

plt.scatter(x_coordinates, y_coordinates)
plt.show()
Listing 2 includes a nested five-line for loop beginning with for _ in range(5000). It’s possible to use NumPy to execute this loop in just one line of code: running x_coordinates, y_coordinates = np.random.multivariate_normal(bulls_eye, np.diag(2 * [variance]), 5000).T returns 5,000 x- and y-coordinates sampled from the multivariate normal distribution.
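To illustrate, here is a sketch of Listing 2 rewritten around that vectorized call. The overall scatter is statistically equivalent, though the individual sample values differ from the loop-based draws:

np.random.seed(1)
x_coordinates, y_coordinates = [], []
for bulls_eye in bulls_eyes:
    # Draw 5,000 [x, y] samples at once, then transpose into coordinate rows
    xs, ys = np.random.multivariate_normal(bulls_eye, np.diag(2 * [variance]), 5000).T
    x_coordinates.extend(xs)
    y_coordinates.extend(ys)

plt.scatter(x_coordinates, y_coordinates)
plt.show()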
Figure 1. A simulation of darts randomly scattered around two bull’s-eye targets
Two overlapping dart groups appear in the plot. The two groups represent 10,000 darts. Half the darts were aimed at the bull’s-eye on the left, and the rest were aimed at the bull’s-eye on the right. Each dart has an intended target, which we can estimate by looking at the plot. Darts closer to [0, 0] were probably aimed at the bull’s-eye on the left. We’ll incorporate this assumption into our dart plot.
Let’s assign each dart to its nearest bull’s-eye. We start by defining a nearest_bulls_eye function that takes as input a dart list holding a dart’s x- and y-positions. The function returns the index of the bull’s-eye that is most proximate to dart. We measure dart proximity using Euclidean distance, which is the standard straight-line distance between two points.
Euclidean distance arises from the Pythagorean theorem. Suppose we examine a dart at position [x_dart, y_dart] relative to a bull’s-eye at position [x_bull, y_bull]. According to the Pythagorean theorem, distance² = (x_dart - x_bull)² + (y_dart - y_bull)². We can solve for the distance using a custom Euclidean function. Alternatively, we can use the scipy.spatial.distance.euclidean function provided by SciPy.
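For reference, here is a minimal sketch of such a custom function (euclidean_distance is a hypothetical name) side by side with SciPy’s implementation:

from scipy.spatial.distance import euclidean

# Hypothetical helper: straight-line distance via the Pythagorean theorem
def euclidean_distance(point_a, point_b):
    squared_diffs = [(a - b) ** 2 for a, b in zip(point_a, point_b)]
    return sum(squared_diffs) ** 0.5

print(euclidean_distance([0, 1], [0, 0]))  # 1.0
print(euclidean([0, 1], [0, 0]))           # 1.0; SciPy agrees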
The following code defines nearest_bulls_eye and applies it to darts [0, 1] and [6, 1].
Listing 3. Assigning darts to the nearest bull’s-eye
from scipy.spatial.distance import euclidean

def nearest_bulls_eye(dart):
    distances = [euclidean(dart, bulls_e) for bulls_e in bulls_eyes]  ❶
    return np.argmin(distances)  ❷

darts = [[0, 1], [6, 1]]
for dart in darts:
    index = nearest_bulls_eye(dart)
    print(f"The dart at position {dart} is closest to bulls-eye {index}")
❶ Obtains the Euclidean distance between the dart and each bull’s-eye using the euclidean function imported from SciPy
❷ Returns the index matching the shortest bull’s-eye distance in distances
Now we apply nearest_bulls_eye to all our computed dart coordinates. Each dart point is plotted using one of two colors to distinguish between the two bull’s-eye assignments (figure 2).
Listing 4. Coloring darts based on the nearest bull’s-eye
def color_by_cluster(darts):  ❶
    nearest_bulls_eyes = [nearest_bulls_eye(dart) for dart in darts]
    for bs_index in range(len(bulls_eyes)):
        selected_darts = [darts[i] for i in range(len(darts))
                          if bs_index == nearest_bulls_eyes[i]]  ❷
        x_coordinates, y_coordinates = np.array(selected_darts).T  ❸
        plt.scatter(x_coordinates, y_coordinates, color=['g', 'k'][bs_index])
    plt.show()

darts = [[x_coordinates[i], y_coordinates[i]] for i in range(len(x_coordinates))]  ❹
color_by_cluster(darts)
❶ Helper function that plots the colored elements of an inputted darts list. Each dart in darts serves as input for nearest_bulls_eye.
❷ Selects the darts most proximate to bulls_eyes[bs_index]
❸ Separates the x- and y-coordinates of each dart by transposing an array of selected darts. The transpose swaps the row and column positions of a 2D data structure.
❹ Combines the separate x- and y-coordinate lists into a single list of [x, y] dart positions.
Figure 2. Darts colored based on proximity to the nearest bull’s-eye. Cluster A represents all points closest to the left bull’s-eye, and cluster B represents all points closest to the right bull’s-eye.
The colored darts split sensibly into two even clusters. How would we identify such clusters if no central coordinates were provided? Well, one primitive strategy is to simply guess the locations of the bull’s-eyes. We can pick two random darts and hope they happen to lie relatively close to each of the bull’s-eyes, although the likelihood of that happening is incredibly low. In most cases, coloring darts based on two randomly chosen centers will not yield good results (figure 3).
Listing 5. Assigning darts to randomly chosen centers
bulls_eyes = np.array(darts[:2]) ❶
color_by_cluster(darts)
❶ Selects the first two darts (which were themselves randomly generated) to serve as our representative bull’s-eyes
Figure 3. Darts colored based on proximity to randomly selected centers. Cluster B is stretched too far to the left.
Our indiscriminately chosen centers feel wrong qualitatively. For instance, cluster B on the right seems to stretch way too far to the left. The arbitrary center we’ve assigned doesn’t appear to match its actual bull’s-eye point. But there’s a way to remedy our error: we can compute the mean coordinates of all the points in the stretched right cluster and then utilize these coordinates to adjust our estimate of the group’s center. After assigning the cluster’s mean coordinates to the bull’s-eye, we can reapply our distance-based grouping technique to adjust the rightmost cluster’s boundaries. In fact, for maximum effectiveness, we will also reset the leftmost cluster’s center to its mean prior to rerunning our centrality-based clustering (figure 4).
When we compute the mean of a 1D array, we return a single value. We now extend that definition to multiple dimensions: when we compute the mean of a 2D array of points, we return the mean of all x-coordinates along with the mean of all y-coordinates. The final output is a two-element array containing the x-axis mean and the y-axis mean.
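Here is a minimal sketch of this multidimensional mean, using three made-up points:

import numpy as np

points = np.array([[0.0, 0.0],
                   [2.0, 4.0],
                   [4.0, 2.0]])
center = np.mean(points, axis=0)  # column-wise mean: [x-mean, y-mean]
print(center)                     # prints [2. 2.]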
Listing 6. Assigning darts to centers based on means
def update_bulls_eyes(darts):
    updated_bulls_eyes = []
    nearest_bulls_eyes = [nearest_bulls_eye(dart) for dart in darts]
    for bs_index in range(len(bulls_eyes)):
        selected_darts = [darts[i] for i in range(len(darts))
                          if bs_index == nearest_bulls_eyes[i]]
        x_coordinates, y_coordinates = np.array(selected_darts).T
        mean_center = [np.mean(x_coordinates), np.mean(y_coordinates)]  ❶
        updated_bulls_eyes.append(mean_center)
    return updated_bulls_eyes

bulls_eyes = update_bulls_eyes(darts)
color_by_cluster(darts)
❶ Takes the mean of the x- and y-coordinates for all the darts assigned to a given bull’s-eye. These average coordinates are then used to update our estimated bull’s-eye position. We can run this calculation more efficiently by executing np.mean(selected_darts, axis=0).
Figure 4. Darts colored based on proximity to recomputed centers. The two clusters now appear to be more even.
The results are already looking better, although they’re not quite as effective as they could be. The clusters’ centers still appear a little off. Let’s remedy the results by repeating the mean-based centrality adjustment over 10 additional iterations (figure 5).
Listing 7. Adjusting bull’s-eye positions over 10 iterations
for _ in range(10):
    bulls_eyes = update_bulls_eyes(darts)

color_by_cluster(darts)
Figure 5. Darts colored based on proximity to iteratively recomputed centers
The two sets of darts are now perfectly clustered! We have essentially replicated the K-means clustering algorithm, which organizes data using centrality.
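For comparison, here is a minimal sketch of the same clustering performed with scikit-learn’s KMeans class (this assumes scikit-learn is installed; the simulated darts list from Listing 4 serves as input):

from sklearn.cluster import KMeans

# Fit K-means with K = 2 on the simulated dart coordinates
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(darts)
print(kmeans.cluster_centers_)  # the two estimated bull's-eye positions
print(kmeans.labels_[:10])      # cluster assignments of the first 10 darts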
Check out part 2 here. If you want to learn more about the book, check it out on Manning’s liveBook platform here.