From Data Science Bookcamp by Leonard Apeltsin
K-means: A clustering algorithm for grouping data into K central groups
The K-means algorithm assumes that inputted data points swirl around K different centers. Each central coordinate is like a hidden bull’s-eye surrounded by scattered data points. The purpose of the algorithm is to uncover these hidden central coordinates.
We initialize K-means by first selecting K, which is the number of central coordinates we will search for. In our dartboard analysis, K was set to 2, although generally K can equal any whole number. The algorithm chooses K data points at random. These data points are treated as though they are true centers. Then the algorithm iterates by updating the chosen central locations, which data scientists call centroids. During a single iteration, every data point is assigned to its closest center, leading to the formation of K groups. Next, the center of each group is updated. The new center equals the mean of the group’s coordinates. If we repeat the process long enough, the group means will converge to K representative centers (figure 6). The convergence is mathematically guaranteed. However, we cannot know in advance the number of iterations required for the convergence to take place. A common trick is to halt the iterations when none of the newly computed centers deviate significantly from their predecessors.
Figure 6. The K-means algorithm iteratively converging from two randomly selected centroids to the actual bull’s-eye centroids
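To make the iteration concrete, here is a minimal from-scratch sketch of the procedure, assuming the data points are stored in a NumPy array called points (our own illustration, not the implementation we rely on later):

import numpy as np

def k_means(points, k, iterations=10):
    # Pick k data points at random to serve as the initial centroids.
    centroids = points[np.random.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        distances = np.linalg.norm(points[:, np.newaxis] - centroids, axis=2)
        assignments = distances.argmin(axis=1)
        # Update each centroid to equal the mean of its assigned group.
        # For simplicity, this sketch assumes no group ends up empty and stops
        # after a fixed number of iterations rather than checking whether the
        # centroids have stopped moving.
        centroids = np.array([points[assignments == i].mean(axis=0)
                              for i in range(k)])
    return centroids, assignments

Production implementations typically repeat the random initialization several times and keep the best result, which is one reason to prefer a well-tested library over our own code.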
K-means is not without its limitations. The algorithm is predicated on our knowledge of K: the number of clusters to look for. Frequently, such knowledge is not available. Also, while K-means commonly finds reasonable centers, it’s not mathematically guaranteed to find the best possible centers in the data. Occasionally, K-means returns nonintuitive or suboptimal groups due to poor selection of random centroids at the initialization step of the algorithm. Finally, K-means presupposes that the clusters in the data actually swirl around K central locations. But as we learn later in the section, this supposition does not always hold.
K-means clustering using scikit-learn
The K-means algorithm runs in a reasonable time if it is implemented efficiently. A speedy implementation of the algorithm is available through the external scikit-learn library. Scikit-learn is an extremely popular machine learning toolkit built on top of NumPy and SciPy. It features a variety of core classification, regression, and clustering algorithms, including, of course, K-means. Let’s install the library. Then we import scikit-learn’s KMeans clustering class.
Call pip install scikit-learn from the command-line terminal to install the scikit-learn library.
Listing 8. Importing KMeans from scikit-learn
from sklearn.cluster import KMeans
Applying KMeans to our darts data is easy. First we need to run KMeans(n_clusters=2), which will create a cluster_model object capable of finding two bull’s-eye centers. Then we can execute K-means by running cluster_model.fit_predict(darts). That method call will return an assigned_bulls_eyes array that stores the bull’s-eye index of each dart.
Listing 9. K-means clustering using scikit-learn
cluster_model = KMeans(n_clusters=2) ❶
assigned_bulls_eyes = cluster_model.fit_predict(darts) ❷
print("Bull's-eye assignments:")
print(assigned_bulls_eyes)

Bull's-eye assignments:
[0 0 0 ... 1 1 1]
❶ Creates a cluster_model object in which the number of centers is set to 2
❷ Optimizes two centers using the K-means algorithm and returns the assigned cluster for each dart
Let’s color our darts based on their clustering assignments to verify the results (figure 7).
Listing 10. Plotting K-means cluster assignments
for bs_index in range(len(bulls_eyes)):
    selected_darts = [darts[i] for i in range(len(darts))
                      if bs_index == assigned_bulls_eyes[i]]
    x_coordinates, y_coordinates = np.array(selected_darts).T
    plt.scatter(x_coordinates, y_coordinates, color=['g', 'k'][bs_index])

plt.show()
Figure 7. The K-means clustering results returned by scikit-learn are consistent with our expectations.
Our clustering model has located the centroids in the data. Now we can reuse these centroids to analyze new data points that the model has not seen before. Executing cluster_model.predict([[x, y]]) assigns a centroid to a data point defined by x and y. We use the predict method to cluster two new data points.
Listing 11. Using cluster_model to cluster new data
new_darts = [[500, 500], [-500, -500]]
new_bulls_eye_assignments = cluster_model.predict(new_darts)
for i, dart in enumerate(new_darts):
    bulls_eye_index = new_bulls_eye_assignments[i]
    print(f"Dart at {dart} is closest to bull's-eye {bulls_eye_index}")

Dart at [500, 500] is closest to bull's-eye 0
Dart at [-500, -500] is closest to bull's-eye 1
Selecting the optimal K using the elbow method
K-means relies on an inputted K. This can be a serious hindrance when the number of authentic clusters in the data isn’t known in advance. We can, however, estimate an appropriate value for K using a technique known as the elbow method.
The elbow method depends on a calculated value called inertia, which is the sum of the squared distances between each point and its closest K-means center. If K is 1, then the inertia equals the sum of all squared distances to the dataset’s mean. This value, as discussed in section 5, is directly proportional to the variance. Variance, in turn, is a measure of dispersion. Thus, if K is 1, the inertia is an estimate of dispersion. This property holds true even if K is greater than 1. Basically, inertia estimates total dispersion around our K computed means.
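As a concrete illustration, the following sketch (our own, assuming the darts data from the earlier listings and scikit-learn’s KMeans class, whose fitted models expose cluster_centers_, labels_, and inertia_ attributes) computes inertia by hand and compares it to the value scikit-learn stores:

import numpy as np
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=2).fit(darts)
centers = cluster_model.cluster_centers_
labels = cluster_model.labels_

# Inertia: the sum of squared distances between each dart and its assigned center
manual_inertia = sum(np.sum((np.array(dart) - centers[label]) ** 2)
                     for dart, label in zip(darts, labels))
print(manual_inertia, cluster_model.inertia_)  # the two values should match closely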
By estimating dispersion, we can determine whether our K value is too high or too low. For example, imagine that we set K to 1. Potentially, many of our data points will be positioned too far from one center. Our dispersion will be large, and our inertia will be large. As we increase K toward a more sensible number, the additional centers will cause the inertia to decrease. Eventually, if we go overboard and set K equal to the total number of points, each data point will fall into its own private cluster. Dispersion will be eliminated, and inertia will drop to zero (figure 8).
Figure 8. Six points, numbered 1 through 6, are plotted in 2D space. The centers, marked by stars, are computed across various values of K. A line is drawn from every point to its nearest center. Inertia is computed by summing the squared lengths of the six lines. (A) K = 1. All six lines stretch out from a single center. The inertia is quite large. (B) K = 2. Points 5 and 6 are very close to a second center. The inertia is reduced. (C) K = 3. Points 1 and 3 are substantially closer to a newly formed center. Points 2 and 4 are also substantially closer to a newly formed center. The inertia has radically decreased. (D) K = 4. Points 1 and 3 now overlap with their centers. Their contribution to the inertia has shifted from a very low value to zero. The distances between the remaining four points and their associated centers remain unchanged. Thus, increasing K from 3 to 4 caused a very small decrease in inertia.
Some inertia values are too large. Others are too low. Somewhere in between might lie a value that’s just right. How do we find it?
Let’s work out a solution. We begin by plotting the inertia of our dartboard dataset over a large range of K values (figure 9). Inertia is automatically computed for each scikit-learn KMeans object. We can access this stored value through the model’s inertia_ attribute.
Listing 12. Plotting the K-means inertia
k_values = range(1, 10)
inertia_values = [KMeans(k).fit(darts).inertia_ for k in k_values]
plt.plot(k_values, inertia_values)
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()
Figure 9. An inertia plot for a dartboard simulation containing two bull’s-eye targets. The plot resembles an arm bent at the elbow. The elbow points directly to a K of 2.
The generated plot resembles an arm bent at the elbow, and the elbow points at a K value of 2. As we already know, this K accurately captures the two centers we have preprogrammed into the dataset.
Will the approach still hold if the number of present centers is increased? We can find out by adding an additional bull’s-eye to our dart-throwing simulation. After we increase the cluster count to three, we regenerate our inertia plot (figure 10).
Listing 13. Plotting inertia for a 3-dartboard simulation
new_bulls_eye = [12, 0]
for _ in range(5000):
    x = np.random.normal(new_bulls_eye[0], variance ** 0.5)
    y = np.random.normal(new_bulls_eye[1], variance ** 0.5)
    darts.append([x, y])

inertia_values = [KMeans(k).fit(darts).inertia_ for k in k_values]
plt.plot(k_values, inertia_values)
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()
Figure 10. An inertia plot for a dartboard simulation containing three bull’s-eye targets. The plot resembles an arm bent at the elbow. The lowermost portion of the elbow points to a K of 3.
Adding a third center leads to a new elbow whose lowermost inclination points to a K value of 3. Essentially, our elbow plot traces the dispersion captured by each incremental K. A rapid decrease in inertia between consecutive K values implies that scattered data points have been assigned to a tighter cluster. The reduction in inertia incrementally loses its impact as the inertia curve flattens out. This transition from a vertical drop to a gentler angle leads to the presence of an elbow shape in our plot. We can use the position of the elbow to select a proper K in the K-means algorithm.
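If we prefer not to eyeball the plot, we can approximate the elbow programmatically. The following sketch is our own rough heuristic, not a method from the book or from scikit-learn: it walks through consecutive K values and stops once the fractional drop in inertia becomes small, where the 0.5 cutoff is an arbitrary assumption.

def approximate_elbow(k_values, inertia_values, cutoff=0.5):
    # Return the first K whose successor reduces inertia by less than the cutoff fraction.
    for k, current, following in zip(k_values, inertia_values, inertia_values[1:]):
        if (current - following) / current < cutoff:
            return k
    return k_values[-1]

print(approximate_elbow(k_values, inertia_values))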
The elbow method selection criterion is a useful heuristic, but it is not guaranteed to work in every case. Under certain conditions, the elbow levels off slowly over multiple K values, making it difficult to choose a single valid cluster count.
There exist more powerful K-selection methodologies, such as the silhouette score, which captures the distance of each point to neighboring clusters. A thorough discussion of the silhouette score is beyond the scope of this book. However, you’re encouraged to explore the score on your own, using the sklearn.metrics.silhouette_score function.
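As a starting point for that exploration, here is a minimal sketch (our own, assuming the darts data from earlier) that scores a range of candidate K values; higher silhouette scores generally indicate better-separated clusters:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# The silhouette score is only defined for 2 or more clusters, so K starts at 2.
for k in range(2, 10):
    labels = KMeans(n_clusters=k).fit_predict(darts)
    score = silhouette_score(darts, labels)
    print(f"K = {k}: silhouette score of {score:.3f}")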
K-means clustering methods

- k_means_model = KMeans(n_clusters=K): Creates a K-means model to search for K different centroids. We need to fit these centroids to inputted data.
- clusters = k_means_model.fit_predict(data): Executes K-means on inputted data using an initialized KMeans object. The returned clusters array contains cluster IDs ranging from 0 to K - 1. The cluster ID of data[i] is equal to clusters[i].
- clusters = KMeans(n_clusters=K).fit_predict(data): Executes K-means in a single line of code and returns the resulting clusters.
- new_clusters = k_means_model.predict(new_data): Finds the nearest centroids to previously unseen data using the existing centroids in a data-optimized KMeans object.
- inertia = k_means_model.inertia_: Returns the inertia associated with a data-optimized KMeans object.
- inertia = KMeans(n_clusters=K).fit(data).inertia_: Executes K-means in a single line of code and returns the resulting inertia.
The elbow method isn’t perfect, but it performs reasonably well if the data is centered on K distinct means. Of course, this assumes that our data clusters differ due to centrality. However, in many instances, data clusters differ due to the density of the data points in space. Let’s explore the concept of density-driven clusters, which are not dependent on centrality.
Using density to discover clusters
Suppose an astronomer discovers a new planet at the far-flung edge of the solar system. The planet, much like Saturn, has multiple rings spinning in constant orbits around its center. Each ring is formed from thousands of rocks. We’ll model these rocks as individual points defined by x- and y-coordinates. Let’s generate three rock rings composed of many rocks, using scikit-learn’s make_circles function (figure 11).
Listing 14. Simulating rings around a planet
from sklearn.datasets import make_circles

x_coordinates = []
y_coordinates = []
for factor in [.3, .6, 0.99]:
    rock_ring, _ = make_circles(n_samples=800, factor=factor, ❶
                                noise=.03, random_state=1)
    for rock in rock_ring:
        x_coordinates.append(rock[0])
        y_coordinates.append(rock[1])

plt.scatter(x_coordinates, y_coordinates)
plt.show()
❶ The make_circles function creates two concentric circles in 2D. The scale of the smaller circle’s radius relative to the larger circle is determined by the factor parameter.
Figure 11. A simulation of three rock rings positioned around a central point
Three ring groups are clearly present in the plot. Let’s search for these three clusters using K-means by setting K to 3 (figure 12).
Listing 15. Using K-means to cluster rings
rocks = [[x_coordinates[i], y_coordinates[i]]
         for i in range(len(x_coordinates))]
rock_clusters = KMeans(3).fit_predict(rocks)
colors = [['g', 'y', 'k'][cluster] for cluster in rock_clusters]
plt.scatter(x_coordinates, y_coordinates, color=colors)
plt.show()
Figure 12. K-means clustering fails to properly identify the three distinct rock rings.
The output is an utter failure! K-means dissects the data into three symmetric segments, and each segment spans multiple rings. The solution doesn’t align with our intuitive expectation that each ring should fall into its own distinct group. What went wrong? Well, K-means assumed that the three clusters were defined by three unique centers, but the actual rings spin around a single central point. The difference between clusters is driven not by centrality, but by density. Each ring is constructed from a dense collection of points, with empty areas of sparsely populated space serving as the boundaries between rings.
We need to design an algorithm that clusters data in dense regions of space. Doing so requires that we define whether a given region is dense or sparse. One simple definition of density is as follows: a point is in a dense region only if it’s located within a distance X of Y other points. We’ll refer to X and Y as epsilon and min_points, respectively. The following code sets epsilon to 0.1 and min_points to 10. Thus, our rocks are present in a dense region of space if they’re within a 0.1 radius of at least 10 other rocks.
Listing 16. Specifying density parameters
epsilon = 0.1
min_points = 10
Let’s analyze the density of the first rock in our rocks list. We begin by searching for all other rocks within epsilon units of rocks[0]. We store the indices of these neighboring rocks in a neighbor_indices list.
Listing 17. Finding the neighbors of rocks[0]
from scipy.spatial.distance import euclidean  # Euclidean distance between two points

neighbor_indices = [i for i, rock in enumerate(rocks[1:], 1)  # start at 1 so indices match rocks
                    if euclidean(rocks[0], rock) <= epsilon]
Now we compare the number of neighbors to min_points to determine whether rocks[0] lies in a dense region of space.
Listing 18. Checking the density of rocks[0]
num_neighbors = len(neighbor_indices)
print(f"The rock at index 0 has {num_neighbors} neighbors.")
if num_neighbors >= min_points:
    print("It lies in a dense region.")
else:
    print("It does not lie in a dense region.")

The rock at index 0 has 40 neighbors.
It lies in a dense region.
The rock at index 0 lies in a dense region of space. Do the neighbors of rocks[0] also share that dense region of space? This is a tricky question to answer. After all, it’s possible that every neighbor has fewer than min_points neighbors of its own. Under our rigorous density definition, we wouldn’t consider these neighbors to be dense points. However, this would lead to a ludicrous situation in which the dense region is composed of just a single point: rocks[0]. We can avoid such absurd outcomes by updating our density definition. Let’s formally define density as follows:
- If a point is located within epsilon distance of min_points neighbors, then that point is in a dense region of space.
- Every neighbor of a point in a dense region of space also clusters in that space.
Based on our updated definition, we can combine rocks[0] and its neighbors into a single dense cluster.
Listing 19. Creating a dense cluster
dense_region_indices = [0] + neighbor_indices
dense_region_cluster = [rocks[i] for i in dense_region_indices]
dense_cluster_size = len(dense_region_cluster)
print(f"We found a dense cluster containing {dense_cluster_size} rocks")

We found a dense cluster containing 41 rocks
The rock at index 0 and its neighbors form a single 41-element dense cluster. Do any neighbors of the neighbors belong to a dense region of space? If so, then by our updated definition, these rocks also belong to the dense cluster. Thus, by analyzing additional neighboring points, we can expand the size of dense_region_cluster.
Listing 20. Expanding a dense cluster
dense_region_indices = set(dense_region_indices) ❶
for index in neighbor_indices:
    point = rocks[index]
    neighbors_of_neighbors = [i for i, rock in enumerate(rocks)
                              if euclidean(point, rock) <= epsilon]
    if len(neighbors_of_neighbors) >= min_points:
        dense_region_indices.update(neighbors_of_neighbors)

dense_region_cluster = [rocks[i] for i in dense_region_indices]
dense_cluster_size = len(dense_region_cluster)
print(f"We expanded our cluster to include {dense_cluster_size} rocks")

We expanded our cluster to include 781 rocks
❶ Converts dense_region_indices into a set. This allows us to update the set with additional indices without worrying about duplicates.
We’ve iterated over neighbors of neighbors and expanded our dense cluster by nearly 20-fold. Why stop there? We can expand our cluster even further by analyzing the density of newly encountered neighbors. Iteratively repeating our analysis will increase the breadth of our cluster boundary. Eventually, the boundary will spread to completely encompass one of our rock rings. Then, with no new neighbors to absorb, we can repeat the iterative analysis on a rocks element that has not been analyzed thus far. The repetition will lead to the clustering of additional dense rings.
The procedure just described is known as DBSCAN. The DBSCAN algorithm organizes data based on its spatial distribution.
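Scikit-learn ships a DBSCAN implementation in its sklearn.cluster module. As a brief preview, here is a sketch that assumes the rocks, epsilon, and min_points variables defined in the listings above and hands the density parameters to that class directly:

from sklearn.cluster import DBSCAN

# eps and min_samples play the roles of our epsilon and min_points parameters.
# Note: sklearn's min_samples counts the point itself, so it differs from our
# min_points definition by one.
dbscan_model = DBSCAN(eps=epsilon, min_samples=min_points)
ring_clusters = dbscan_model.fit_predict(rocks)

# DBSCAN labels sparse "noise" points as -1; every other label marks a dense cluster.
num_clusters = len(set(ring_clusters) - {-1})
print(f"DBSCAN found {num_clusters} dense clusters")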