From Data Science Bookcamp by Leonard Apeltsin

Take 35% off Data Science Bookcamp by entering fccapeltsin into the discount code box at checkout at manning.com.
Clustering is the process of organizing data points into conceptually meaningful groups. What makes a given group “conceptually meaningful”? There is no easy answer to that question. The usefulness of any clustered output is dependent on the task we’ve been assigned.
Imagine that we’re asked to cluster a collection of pet photos. Do we cluster fish and lizards in one group and fluffy pets (such as hamsters, cats, and dogs) in another? Or should hamsters, cats, and dogs each be assigned a separate cluster of their own? If so, perhaps we should consider clustering pets by breed, so that Chihuahuas and Great Danes fall into diverging clusters. Differentiating between most dog breeds will not be easy, although we can easily distinguish between Chihuahuas and Great Danes based on size. Maybe we should compromise: we’ll cluster on both fluffiness and size, thus bypassing the distinction between the Cairn Terrier and the similar-looking Norwich Terrier.
Is the compromise worth it? It depends on our data science task. Suppose we work for a pet food company, and our aim is to estimate demand for dog food, cat food, and lizard food. Under these conditions, we must distinguish between fluffy dogs, fluffy cats, and scaly lizards. However, we won’t need to resolve differences between separate dog breeds. Alternatively, imagine an analyst at a vet’s office who’s trying to group pet patients by their breed. This second task requires a much more granular level of group resolution.
Different situations call for different clustering techniques. As data scientists, we must choose the correct clustering solution. Over the course of our careers, we will cluster thousands (if not tens of thousands) of datasets using a variety of clustering techniques. The most commonly used algorithms rely on some notion of centrality to distinguish between clusters.
Using centrality to discover clusters
The centrality of data can be represented using the mean. Imagine that we have computed the mean lengths of two groups of fish and compared the groups by analyzing the difference between their means. We can utilize that difference to determine whether all the fish belong to the same group. Intuitively, all data points in a single group should cluster around one central value. Meanwhile, the measurements in two divergent groups should cluster around two different means. Thus, we can utilize centrality to distinguish between two divergent groups. Let’s explore this notion in concrete detail.
Suppose we take a field trip to a lively local pub and see two dartboards hanging side by side. Each of the dartboards is covered in darts, and darts also protrude from the walls. The tipsy players in the pub aim for the bull’s-eye of one board or the other. Frequently, they miss, which leads to the observed scattering of darts centered around the two bull’s-eyes.
Let’s simulate the scattering numerically. We’ll treat each bull’seye location as a 2D coordinate. Darts are randomly flung at that coordinate. Consequently, the 2D positions of the darts are randomly distributed. The most appropriate distribution for modeling dart positions is the normal distribution, for the following reasons:
- A typical dart thrower aims at the bull’s-eye, not at the edge of the dartboard. Thus, each dart is more likely to strike close to the center of the board. This behavior is consistent with random normal samples, in which values closer to the mean occur more frequently than other, more distant values.
- We expect the darts to strike the board symmetrically relative to the center. Darts will strike 3 inches left of center and 3 inches right of center with equal frequency. This symmetry is captured by the bell-shaped normal curve.
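As a quick sanity check on these two properties, we can draw a large normal sample and confirm that most values fall near the mean and that values land on either side of the mean about equally often. This snippet is an illustrative aside, not part of the dart simulation:

```python
import numpy as np

np.random.seed(0)
samples = np.random.normal(0, 1, 100000)

# Fraction of samples within one standard deviation of the mean:
# roughly 68% for a normal distribution
fraction_near_mean = np.mean(np.abs(samples) < 1)

# Samples fall left of the mean about half the time, by symmetry
fraction_left = np.mean(samples < 0)

print(f"{fraction_near_mean:.2f}, {fraction_left:.2f}")
```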
Suppose the first bull’s-eye is located at coordinate [0, 0]. A dart is thrown at that coordinate. We’ll model the x- and y-positions of the dart using two normal distributions. These distributions share a mean of 0, and we also assume that they share a variance of 2. The following code generates the random coordinates of the dart.
Listing 1. Modeling dart coordinates using two normal distributions
```python
import numpy as np

np.random.seed(0)
mean = 0
variance = 2
x = np.random.normal(mean, variance ** 0.5)
y = np.random.normal(mean, variance ** 0.5)
print(f"The x coordinate of a randomly thrown dart is {x:.2f}")
print(f"The y coordinate of a randomly thrown dart is {y:.2f}")
```

```
The x coordinate of a randomly thrown dart is 2.49
The y coordinate of a randomly thrown dart is 0.57
```
We can more efficiently model dart positions using the np.random.multivariate_normal method. This method selects a single random point from a multivariate normal distribution. The multivariate normal curve is simply a normal curve that is extended to more than one dimension. Our 2D multivariate normal distribution will resemble a round hill whose summit is positioned at [0, 0].
Let’s simulate 5,000 random darts tossed at the bull’s-eye positioned at [0, 0]. We also simulate 5,000 random darts tossed at a second bull’s-eye, positioned at [6, 0]. Then we generate a scatter plot of all the random dart coordinates (figure 1).
Listing 2. Simulating randomly thrown darts
```python
import matplotlib.pyplot as plt

np.random.seed(1)
bulls_eye1 = [0, 0]
bulls_eye2 = [6, 0]
bulls_eyes = [bulls_eye1, bulls_eye2]
x_coordinates, y_coordinates = [], []
for bulls_eye in bulls_eyes:
    for _ in range(5000):
        x = np.random.normal(bulls_eye[0], variance ** 0.5)
        y = np.random.normal(bulls_eye[1], variance ** 0.5)
        x_coordinates.append(x)
        y_coordinates.append(y)

plt.scatter(x_coordinates, y_coordinates)
plt.show()
```
Listing 2 includes a nested five-line for loop beginning with for _ in range(5000). It’s possible to use NumPy to execute this loop in just one line of code: running x_coordinates, y_coordinates = np.random.multivariate_normal(bulls_eye, np.diag(2 * [variance]), 5000).T returns 5,000 x- and y-coordinates sampled from the multivariate normal distribution.
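To see that one-liner in context, here is a sketch of listing 2 rewritten around np.random.multivariate_normal. The variable names follow the listing, although the individual random draws will differ from the loop-based version:

```python
import numpy as np

np.random.seed(1)
variance = 2
bulls_eyes = [[0, 0], [6, 0]]
x_coordinates, y_coordinates = [], []
for bulls_eye in bulls_eyes:
    # np.diag(2 * [variance]) builds a 2x2 covariance matrix with a
    # variance of 2 along each axis and no covariance between axes
    x, y = np.random.multivariate_normal(bulls_eye,
                                         np.diag(2 * [variance]), 5000).T
    x_coordinates.extend(x)
    y_coordinates.extend(y)

print(len(x_coordinates))  # 10000 sampled x-coordinates in total
```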
Figure 1. A simulation of darts randomly scattered around two bull’s-eye targets
Two overlapping dart groups appear in the plot. The two groups represent 10,000 darts. Half the darts were aimed at the bull’s-eye on the left, and the rest were aimed at the bull’s-eye on the right. Each dart has an intended target, which we can estimate by looking at the plot: darts closer to [0, 0] were probably aimed at the bull’s-eye on the left. We’ll incorporate this assumption into our dart plot.
Let’s assign each dart to its nearest bull’s-eye. We start by defining a nearest_bulls_eye function that takes as input a dart list holding a dart’s x- and y-positions. The function returns the index of the bull’s-eye that is most proximate to dart. We measure dart proximity using Euclidean distance, which is the standard straight-line distance between two points.
Euclidean distance arises from the Pythagorean theorem. Suppose we examine a dart at position [x_dart, y_dart] relative to a bull’s-eye at position [x_bull, y_bull]. According to the Pythagorean theorem, distance^2 = (x_dart - x_bull)^2 + (y_dart - y_bull)^2. We can solve for distance using a custom Euclidean function. Alternatively, we can use the scipy.spatial.distance.euclidean function provided by SciPy.
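As a quick check, a hand-rolled Euclidean function should agree with SciPy’s. The 3-4-5 right triangle below is a hypothetical example, and euclidean_by_hand is our own illustrative helper, not a function from the text:

```python
from scipy.spatial.distance import euclidean

def euclidean_by_hand(point_a, point_b):
    # Pythagorean theorem: square root of the summed squared
    # differences along each coordinate axis
    return sum((a - b) ** 2 for a, b in zip(point_a, point_b)) ** 0.5

dart, bulls_eye = [3.0, 4.0], [0.0, 0.0]
print(euclidean_by_hand(dart, bulls_eye))  # 5.0
print(euclidean(dart, bulls_eye))          # 5.0
```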
The following code defines nearest_bulls_eye and applies it to the darts [0, 1] and [6, 1].
Listing 3. Assigning darts to the nearest bull’seye
```python
from scipy.spatial.distance import euclidean

def nearest_bulls_eye(dart):
    distances = [euclidean(dart, bulls_e) for bulls_e in bulls_eyes] ❶
    return np.argmin(distances) ❷

darts = [[0, 1], [6, 1]]
for dart in darts:
    index = nearest_bulls_eye(dart)
    print(f"The dart at position {dart} is closest to bulls-eye {index}")
```
❶ Obtains the Euclidean distance between the dart and each bull’s-eye using the euclidean function imported from SciPy
❷ Returns the index matching the shortest bull’s-eye distance in distances
Now we apply nearest_bulls_eye to all our computed dart coordinates. Each dart point is plotted using one of two colors to distinguish between the two bull’s-eye assignments (figure 2).
Listing 4. Coloring darts based on the nearest bull’seye
```python
def color_by_cluster(darts): ❶
    nearest_bulls_eyes = [nearest_bulls_eye(dart) for dart in darts]
    for bs_index in range(len(bulls_eyes)):
        selected_darts = [darts[i] for i in range(len(darts))
                          if bs_index == nearest_bulls_eyes[i]] ❷
        x_coordinates, y_coordinates = np.array(selected_darts).T ❸
        plt.scatter(x_coordinates, y_coordinates,
                    color=['g', 'k'][bs_index])
    plt.show()

darts = [[x_coordinates[i], y_coordinates[i]]
         for i in range(len(x_coordinates))] ❹
color_by_cluster(darts)
```
❶ Helper function that plots the colored elements of an inputted darts list. Each dart in darts serves as input for nearest_bulls_eye.
❷ Selects the darts most proximate to bulls_eyes[bs_index]
❸ Separates the x- and y-coordinates of each dart by transposing an array of selected darts. The transpose swaps the row and column positions within a 2D data structure.
❹ Combines the separate x- and y-coordinates into a single list of [x, y] dart positions.
Figure 2. Darts colored based on proximity to the nearest bull’s-eye. Cluster A represents all points closest to the left bull’s-eye, and cluster B represents all points closest to the right bull’s-eye.
The colored darts split sensibly into two even clusters. How would we identify these clusters if no central coordinates were provided? One primitive strategy is to simply guess the locations of the bull’s-eyes: we can pick two random darts and hope they happen to lie relatively close to the two bull’s-eyes, although the likelihood of that happening is incredibly low. In most cases, coloring darts based on two randomly chosen centers will not yield good results (figure 3).
Listing 5. Assigning darts to randomly chosen centers
```python
bulls_eyes = np.array(darts[:2]) ❶
color_by_cluster(darts)
```
❶ Randomly selects the first two darts to be our representative bull’s-eyes
Figure 3. Darts colored based on proximity to randomly selected centers. Cluster B is stretched too far to the left.
Our indiscriminately chosen centers feel wrong qualitatively. For instance, cluster B on the right seems to be stretching way too far to the left. The arbitrary center we’ve assigned doesn’t appear to match its actual bull’s-eye point. But there’s a way to remedy our error: we can compute the mean coordinates of all the points in the stretched cluster on the right and then use those coordinates to adjust our estimate of the group’s center. After assigning the cluster’s mean coordinates to the bull’s-eye, we can reapply our distance-based grouping technique to adjust the rightmost cluster’s boundaries. In fact, for maximum effectiveness, we will also reset the leftmost cluster’s center to its mean prior to rerunning our centrality-based clustering (figure 4).
When we compute the mean of a 1D array, we return a single value. We are now extending that definition to multiple dimensions. When we compute the mean of a collection of 2D points, we pair the mean of all the x-coordinates with the mean of all the y-coordinates. The final output is itself a 2D point whose coordinates are these two means.
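The multidimensional mean can be sketched with NumPy’s axis parameter; the three sample points below are hypothetical:

```python
import numpy as np

points = np.array([[1.0, 5.0],
                   [3.0, 7.0],
                   [5.0, 9.0]])
# axis=0 averages down each column: the x-column and the y-column
center = np.mean(points, axis=0)
print(center)  # [3. 7.]
```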
Listing 6. Assigning darts to centers based on means
```python
def update_bulls_eyes(darts):
    updated_bulls_eyes = []
    nearest_bulls_eyes = [nearest_bulls_eye(dart) for dart in darts]
    for bs_index in range(len(bulls_eyes)):
        selected_darts = [darts[i] for i in range(len(darts))
                          if bs_index == nearest_bulls_eyes[i]]
        x_coordinates, y_coordinates = np.array(selected_darts).T
        mean_center = [np.mean(x_coordinates), np.mean(y_coordinates)] ❶
        updated_bulls_eyes.append(mean_center)
    return updated_bulls_eyes

bulls_eyes = update_bulls_eyes(darts)
color_by_cluster(darts)
```
❶ Takes the mean of the x- and y-coordinates of all the darts assigned to a given bull’s-eye. These average coordinates are then used to update our estimated bull’s-eye position. We can run this calculation more efficiently by executing np.mean(selected_darts, axis=0).
Figure 4. Darts colored based on proximity to recomputed centers. The two clusters now appear to be more even.
The results are already looking better, although they’re not quite as good as they could be: the clusters’ centers still appear a little off. Let’s remedy this by repeating the mean-based centrality adjustment over 10 additional iterations (figure 5).
Listing 7. Adjusting bull’seye positions over 10 iterations
```python
for i in range(10):
    bulls_eyes = update_bulls_eyes(darts)

color_by_cluster(darts)
```
Figure 5. Darts colored based on proximity to iteratively recomputed centers
The two sets of darts are now perfectly clustered! We have essentially replicated the K-means clustering algorithm, which organizes data using centrality.
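The iterative procedure we just walked through, repeatedly assigning points to their nearest center and moving each center to the mean of its assigned points, can be condensed into a single function. The sketch below is a minimal pure-NumPy version of that procedure; the function name k_means and its parameters are our own, not from the text:

```python
import numpy as np

def k_means(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initial guess: pick k random data points as the starting centers
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Assignment step: match each point to its nearest center,
        # measured by Euclidean distance
        distances = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        centers = np.array([points[labels == i].mean(axis=0)
                            for i in range(k)])
    return centers, labels
```

Running k_means with k=2 on darts simulated around [0, 0] and [6, 0] should recover centers close to the two bull’s-eyes.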
Check out part 2 here. If you want to learn more about the book, check it out on Manning’s liveBook platform here.