
From Data Science Bookcamp by Leonard Apeltsin

This 3-part article series covers:

  • Clustering data by centrality
  • Clustering data by density
  • Trade-offs between clustering algorithms
  • Executing clustering using the scikit-learn library
  • Iterating over clusters using Pandas


Take 35% off Data Science Bookcamp by entering fccapeltsin into the discount code box at checkout at manning.com.


In case you missed them, check out parts 1 and 2.

DBSCAN: A clustering algorithm for grouping data based on spatial density

DBSCAN is an acronym that stands for density-based spatial clustering of applications with noise. This is a ridiculously long name for what essentially is a very simple technique:

  1. Select a random point coordinate from a data list.
  2. Obtain all neighbors within epsilon distance of that point.
  3. If fewer than min_points neighbors are discovered, repeat step 1 using a different random point. Otherwise, group the point and its neighbors into a single cluster.
  4. Iteratively repeat steps 2 and 3 across all newly discovered neighbors. All neighboring dense points are merged into the cluster. Iterations terminate after the cluster stops expanding.
  5. After extracting the entire cluster, repeat steps 1-4 on all data points whose density hasn’t yet been analyzed. A minimal code sketch of these steps follows.
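
To make these steps concrete, here is a minimal, lightly commented sketch of the procedure. The naive_dbscan function below is an illustrative implementation rather than the book’s code; it loops over the points, grows a cluster outward from every dense point it encounters, and leaves unclustered points marked with -1.

 
 import numpy as np
 
 def naive_dbscan(points, epsilon, min_points):
     points = np.asarray(points, dtype=float)
     labels = np.full(len(points), -1)    # -1 marks points that aren't yet clustered
     cluster_id = 0
     for i in range(len(points)):
         if labels[i] != -1:
             continue                     # Point already belongs to a cluster
         # Step 2: obtain all neighbors within epsilon distance of the point
         neighbors = np.flatnonzero(np.linalg.norm(points - points[i], axis=1) <= epsilon)
         if len(neighbors) < min_points:
             continue                     # Step 3: not a dense point, so try another point
         labels[neighbors] = cluster_id   # Step 3: group the point and its neighbors
         to_visit = list(neighbors)
         while to_visit:                  # Step 4: expand across newly discovered neighbors
             j = to_visit.pop()
             j_neighbors = np.flatnonzero(np.linalg.norm(points - points[j], axis=1) <= epsilon)
             if len(j_neighbors) < min_points:
                 continue                 # j lies on the cluster's border; don't expand from it
             for k in j_neighbors:
                 if labels[k] == -1:
                     labels[k] = cluster_id
                     to_visit.append(k)
         cluster_id += 1                  # Step 5: the cluster stopped expanding; seek the next one
     return labels
  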

The DBSCAN procedure can be programmed in less than 20 lines of code. However, any basic implementation will run very slowly on our rocks list. Programming a fast implementation requires some very nuanced optimizations that improve neighbor traversal speed and are beyond the scope of this book. Fortunately, there’s no need for us to rebuild the algorithm from scratch: scikit-learn provides a speedy DBSCAN class, which we can import from sklearn.cluster. Let’s import and initialize the class by assigning epsilon and min_points using the eps and min_samples parameters. Then we utilize DBSCAN to cluster our three rings (figure 13).

Listing 21. Using DBSCAN to cluster rings

 
 from sklearn.cluster import DBSCAN
 cluster_model = DBSCAN(eps=epsilon, min_samples=min_points) (1)
 rock_clusters = cluster_model.fit_predict(rocks) (2)
 colors = [['g', 'y', 'k'][cluster] for cluster in rock_clusters]
 plt.scatter(x_coordinates, y_coordinates, color=colors)
 plt.show()
  

Creates a cluster_model object to carry out density clustering. An epsilon value of 0.1 is passed in using the eps parameter. A min_points value of 10 is passed in using the min_samples parameter.

Clusters the rock rings based on density, and returns the assigned cluster for each rock


Figure 13. DBSCAN clustering accurately identifies the three distinct rock rings.


DBSCAN has successfully identified the three rock rings. The algorithm succeeded where K-means failed.

Comparing DBSCAN and K-means

DBSCAN is an advantageous algorithm for clustering data composed of curving and dense shapes. Also, unlike K-means, the algorithm doesn’t require an approximation of the cluster count before execution. Additionally, DBSCAN can filter random outliers located in sparse regions of space. For example, if we add an outlier located beyond the boundary of the rings, DBSCAN will assign it a cluster ID of -1. The negative value indicates that the outlier cannot be clustered with the rest of the dataset.

Unlike K-means, a fitted DBSCAN model cannot be reapplied to brand-new data. Instead, we need to combine the new and old data and execute the clustering from scratch. Computed K-means centers can easily be compared against additional data points, but additional points could alter the density distribution of the previously seen data, forcing DBSCAN to recompute all of its clusters.
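
As a quick illustration (reusing the rocks list and the epsilon and min_points values from earlier in this series, and assuming three clusters for K-means), scikit-learn’s KMeans class exposes a predict method for assigning brand-new points to existing centroids, whereas its DBSCAN class provides no such method.

 
 from sklearn.cluster import KMeans, DBSCAN
 
 kmeans_model = KMeans(n_clusters=3).fit(rocks)
 print(kmeans_model.predict([[0.1, -0.05]]))   # A new point maps to its nearest existing centroid
 
 dbscan_model = DBSCAN(eps=epsilon, min_samples=min_points).fit(rocks)
 assert not hasattr(dbscan_model, 'predict')   # New points require re-clustering old and new data together
  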

Listing 22. Finding outliers using DBSCAN

 
 noisy_data = rocks + [[1000, -1000]]
 clusters = DBSCAN(eps=epsilon,
                   min_samples=min_points).fit_predict(noisy_data)
 assert clusters[-1] == -1
  

Another advantage of the DBSCAN technique is that it does not depend on the mean. Meanwhile, the K-means algorithm requires us to compute the mean coordinates of grouped points. As we discussed in section 5, these mean coordinates minimize the sum of squared distances to the center. The minimization property holds only if the squared distances are Euclidean. Thus, if our coordinates are not Euclidean, the mean is not very useful, and the K-means algorithm should not be applied. However, Euclidean distance is not the only metric for gauging the separation between points; infinitely many metrics exist for defining distance. We explore a few of them in the subsequent subsection. In the process, we learn how to integrate these metrics into our DBSCAN clustering output.
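
To see the minimization property in action, here is a tiny sketch with made-up points. It confirms that a group’s mean produces a smaller sum of squared Euclidean distances than a shifted center does.

 
 import numpy as np
 
 group = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]])   # Made-up clustered points
 center = group.mean(axis=0)
 
 def sum_of_squared_distances(candidate_center):
     return np.sum((group - candidate_center) ** 2)
 
 # The mean minimizes the sum of squared Euclidean distances to the center
 assert sum_of_squared_distances(center) < sum_of_squared_distances(center + [0.5, -0.5])
  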

Clustering based on non-Euclidean distance

Suppose that we are visiting Manhattan and wish to know the walking distance from the Empire State Building to Columbus Circle. The Empire State Building is located at the intersection of 34th Street and Fifth Avenue. Meanwhile, Columbus Circle is located at the intersection of 57th Street and Eighth Avenue. The streets and avenues in Manhattan are always perpendicular to each other. This lets us represent Manhattan as a 2D coordinate system, where streets are positioned on the x-axis and avenues are positioned on the y-axis. Under this representation, the Empire State Building is located at coordinate (34, 5), and Columbus Circle is located at coordinate (57, 8). We can easily calculate a straight-line Euclidean distance between the two coordinate points. However, that straight-line path would be impassable because towering steel buildings occupy the area outlined by every city block. A more realistic solution is limited to a path along the perpendicular sidewalks that form the city’s grid. Such a route requires us to walk three blocks between Fifth Avenue and Eighth Avenue and then 23 blocks between 34th Street and 57th Street, for a total distance of 26 blocks. Manhattan’s average block length is 0.17 miles, so we can estimate the walking distance as 4.42 miles. Let’s compute that walking distance directly using a generalized manhattan_distance function.

Listing 23. Computing the Manhattan distance

 
 def manhattan_distance(point_a, point_b):
     num_blocks = np.sum(np.absolute(point_a - point_b))
     return 0.17 * num_blocks
  
 x = np.array([34, 5])
 y = np.array([57, 8])
 distance = manhattan_distance(x, y) (1)
  
 print(f"Manhattan distance is {distance} miles")
  

We can also generate this output by importing cityblock from scipy.spatial.distance and then running .17 * cityblock(x, y).

 
 Manhattan distance is 4.42 miles
  

Now, suppose we wish to cluster more than two Manhattan locations. We’ll assume each cluster holds a point that is within a one-mile walk of three other clustered points. This assumption lets us apply DBSCAN clustering using scikit-learn’s DBSCAN class. We set eps to 1 and min_samples to 3 during DBSCAN’s initialization. Furthermore, we pass metric=manhattan_distance into the initialization method. The metric parameter swaps Euclidean distance for our custom distance metric, so the clustering distance correctly reflects the grid-based constraints within the city. The following code clusters Manhattan coordinates and plots them on a grid along with their cluster designations (figure 14).

Listing 24. Clustering using Manhattan distance

 
 points = [[35, 5], [33, 6], [37, 4], [40, 7], [45, 5]]
 clusters = DBSCAN(eps=1, min_samples=3,
                   metric=manhattan_distance).fit_predict(points) 
  
 for i, cluster in enumerate(clusters):
     point = points[i]
     if cluster == -1:
         print(f"Point at index {i} is an outlier")
         plt.scatter(point[0], point[1], marker='x', color='k') 
     else:
         print(f"Point at index {i} is in cluster {cluster}")
         plt.scatter(point[0], point[1], color='g')
  
 plt.grid(True, which='both', alpha=0.5) 
 plt.minorticks_on()
  
  
 plt.show()
  

The manhattan_distance function is passed into DBSCAN through the metric parameter.

Outliers are plotted using x-shaped markers.

The grid method displays the rectangular grid across which we compute Manhattan distance.

 
 Point at index 0 is in cluster 0
 Point at index 1 is in cluster 0
 Point at index 2 is in cluster 0
 Point at index 3 is an outlier
 Point at index 4 is an outlier
  

Figure 14. Five points in a rectangular grid have been clustered using the Manhattan distance. The three points in the lower-left corner of the grid fall within a single cluster. The remaining two points are outliers, marked by an x.


The first three locations fall within a single cluster, and the remaining points are outliers. Could we have detected that cluster using the K-means algorithm? Perhaps. After all, our Manhattan block coordinates can be averaged out, making them compatible with a K-means implementation. What if we swap Manhattan distance for a different metric where average coordinates are not so easily obtained? Let’s define a nonlinear distance metric with the following properties: two points are 0 units apart if all their elements are negative, 2 units apart if all their elements are non-negative, and 10 units apart otherwise. Given this ridiculous measure of distance, can we compute the mean of any two arbitrary points? We can’t, and K-means cannot be applied. A weakness of the algorithm is that it depends on the existence of meaningful average coordinates. Unlike K-means, the DBSCAN algorithm places no such requirement on our distance function. Thus, we can easily run DBSCAN clustering using our ridiculous distance metric.

Listing 25. Clustering using a ridiculous measure of distance

 
 def ridiculous_measure(point_a, point_b):
     is_negative_a = np.array(point_a) < 0 
     is_negative_b = np.array(point_b) < 0
     if is_negative_a.all() and is_negative_b.all(): 
         return 0
     elif is_negative_a.any() or is_negative_b.any(): 
         return 10
     else: 
         return 2
  
 points = [[-1, -1], [-10, -10], [-1000, -13435], [3,5], [5,-7]]
  
 clusters = DBSCAN(eps=.1, min_samples=2,
                   metric=ridiculous_measure).fit_predict(points)
  
 for i, cluster in enumerate(clusters):
     point = points[i]
     if cluster == -1:
         print(f"{point} is an outlier")
     else:
         print(f"{point} falls in cluster {cluster}")
  

Returns a Boolean array where is_negative_a[i] is True if point_a[i] < 0

All elements of point_a and point_b are negative.

A negative element exists, but not all elements are negative.

All elements are non-negative.

 
 [-1, -1] falls in cluster 0
 [-10, -10] falls in cluster 0
 [-1000, -13435] falls in cluster 0
 [3, 5] is an outlier
 [5, -7] is an outlier
  

Running DBSCAN with our ridiculous_measure metric leads to the clustering of negative coordinates into a single group. All other coordinates are treated as outliers. These results serve no practical purpose, but the flexibility of metric selection is much appreciated. We are not constrained in our metric choice! We could, for instance, set the metric to compute traversal distance based on the curvature of the Earth. Such a metric would be particularly useful for clustering geographic locations.

DBSCAN clustering methods

  • dbscan_model = DBSCAN(eps=epsilon, min_samples=min_points) — Creates a DBSCAN model to cluster by density. A dense point is defined as having at least min_points neighbors within a distance of epsilon. The neighbors are considered to be part of the same cluster as the point.
  • clusters = dbscan_model.fit_predict(data) — Executes DBSCAN on inputted data using an initialized DBSCAN object. The clusters array contains cluster IDs. The cluster ID of data[i] is equal to clusters[i]. Unclustered outlier points are assigned an ID of -1.
  • clusters = DBSCAN(eps=epsilon, min_samples=min_points).fit_predict(data) — Executes DBSCAN in a single line of code, and returns the resulting clusters.
  • dbscan_model = DBSCAN(eps=epsilon, min_samples=min_points, metric=metric_function) — Creates a DBSCAN model where the distance metric is defined by a custom metric function. The metric_function distance metric does not need to be Euclidean.

DBSCAN does have certain drawbacks. The algorithm is intended to detect clusters with similar point-density distributions. However, real-world data varies in density. For instance, pizza shops in Manhattan are distributed more densely than pizza shops in Orange County, California. Thus, we might have trouble choosing density parameters that will let us cluster shops in both locations. This highlights another limitation of the algorithm: DBSCAN requires meaningful values for the eps and min_samples parameters. In particular, varying eps inputs will greatly impact the quality of clustering. Unfortunately, there is no one reliable procedure for estimating the appropriate eps. While certain heuristics are occasionally mentioned in the literature, their benefit is minimal. Most of the time, we must rely on our gut-level understanding of the problem to assign practical inputs to the two DBSCAN parameters. For example, if we were to cluster a set of geographic locations, our eps and min_samples values would depend on whether the locations are spread out across the entire globe or whether they are constrained to a single geographic region. In each instance, our understanding of density and distance would vary. Generally speaking, if we are clustering random cities spread out across the Earth, we can set the min_samples and eps parameters to equal three cities and 250 miles, respectively. This assumes each cluster holds a city within 250 miles of at least three other clustered cities. For a more regional location distribution, a lower eps value is required.
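
Here is a sketch of that global-cities scenario. The city coordinates are made up, and the great_circle_distance function is an illustrative Earth-curvature metric rather than code from the book; eps is set to 250 miles and min_samples to three cities, as suggested above.

 
 import numpy as np
 from sklearn.cluster import DBSCAN
 
 def great_circle_distance(point_a, point_b):
     # Haversine formula: miles between two (latitude, longitude) pairs on the Earth's surface
     lat_a, lon_a, lat_b, lon_b = np.radians(np.concatenate([point_a, point_b]))
     a = (np.sin((lat_b - lat_a) / 2) ** 2
          + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
     return 2 * 3959 * np.arcsin(np.sqrt(a))   # 3959 miles is roughly the Earth's radius
 
 cities = [[40.71, -74.01], [40.73, -73.99], [41.31, -72.92],   # Three made-up nearby cities
           [51.51, -0.13]]                                      # One faraway city
 clusters = DBSCAN(eps=250, min_samples=3,
                   metric=great_circle_distance).fit_predict(cities)
 print(clusters)   # Expect the three nearby cities to share a cluster and the faraway city to be -1
  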

Analyzing clusters using Pandas

So far, we have kept our data inputs and clustering outputs separate. For instance, in our rock ring analysis, the input data is in the rocks list and the clustering output is in a rock_clusters array. Tracking both the coordinates and the clusters requires us to map indices between the input list and the output array. Thus, if we wish to extract all the rocks in cluster 0, we must obtain all instances of rocks[i] where rock_clusters[i] == 0. This index analysis is convoluted. We can more intuitively analyze clustered rocks by combining the coordinates and the clusters together in a single Pandas table.
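
For comparison, here is what that convoluted index-based extraction looks like without a table (a one-line sketch that reuses the rocks list and rock_clusters array from our earlier analysis).

 
 cluster_0_rocks = [rocks[i] for i in range(len(rocks))
                    if rock_clusters[i] == 0]
  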

The following code creates a Pandas table with three columns: X, Y, and Cluster. Each ith row in the table holds the x-coordinate, the y-coordinate, and the cluster of the rock located at rocks[i].

Listing 26. Storing clustered coordinates in a table

 
 import pandas as pd
 x_coordinates, y_coordinates = np.array(rocks).T
 df = pd.DataFrame({'X': x_coordinates, 'Y': y_coordinates,
                    'Cluster': rock_clusters})
  

Our Pandas table lets us easily access the rocks in any cluster. Let’s plot the rocks that fall into cluster 0, using techniques described in section 8 (figure 15).

Listing 27. Plotting a single cluster using Pandas

 
 df_cluster = df[df.Cluster == 0] 
 plt.scatter(df_cluster.X, df_cluster.Y) 
 plt.show()
  

Selects just those rows where the Cluster column equals 0

Plots the X and Y columns of the selected rows. Note that we could also generate this scatter plot by running df_cluster.plot.scatter(x='X', y='Y').


Figure 15. Rocks that fall into cluster 0


Pandas allows us to obtain a table containing elements from any single cluster. Alternatively, we might want to obtain multiple tables, where each table maps to a cluster ID. In Pandas, this is done by calling df.groupby('Cluster'). The groupby method will create three tables: one for each cluster. It will return an iterable over the mappings between cluster IDs and tables. Let’s use the groupby method to iterate over our three clusters. We’ll subsequently plot the rocks in cluster 1 and cluster 2, but not the rocks in cluster 0 (figure 16).

Calling df.groupby('Cluster') returns more than just an iterable: it returns a DataFrameGroupBy object, which provides additional methods for cluster filtering and analysis.  
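
For instance, here is a quick sketch (not taken from the book) of two such methods, size and get_group, applied to our clustered rock table.

 
 cluster_groups = df.groupby('Cluster')
 print(cluster_groups.size())         # Number of rocks assigned to each cluster ID
 print(cluster_groups.get_group(2))   # Table containing only the rows of cluster 2
  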

Listing 28. Iterating over clusters using Pandas

 
 for cluster_id, df_cluster in df.groupby('Cluster'): 
     if cluster_id == 0:
         print(f"Skipping over cluster {cluster_id}")
         continue
  
     print(f"Plotting cluster {cluster_id}")
     plt.scatter(df_cluster.X, df_cluster.Y)
  
  
 plt.show()
  

 
 Skipping over cluster 0
 Plotting cluster 1
 Plotting cluster 2
  

Each element of the iterable returned by df.groupby('Cluster') is a tuple. The first element of the tuple is the cluster ID obtained from df.Cluster. The second element is a table composed of all rows where df.Cluster equals the cluster ID.


Figure 16. Rocks that fall into clusters 1 and 2


The Pandas groupby method lets us iteratively examine different clusters. This could prove useful in our case study 3 analysis.

Summary

  • The K-means algorithm clusters inputted data by searching for K centroids. These centroids represent the mean coordinates of the discovered data groups. K-means is initialized by selecting K random centroids. Each data point is then clustered based on its nearest centroid, and the centroids are iteratively recomputed until they converge on stable locations.
  • K-means is guaranteed to converge to a solution. However, that solution may not be optimal.
  • K-means requires Euclidean distances to distinguish between points. The algorithm is not intended to cluster non-Euclidean coordinates.
  • After executing K-means clustering, we can compute the inertia of the result. Inertia equals the sum of the squared distances between each data point and its closest center.
  • Plotting the inertia across a range of K values generates an elbow plot. The elbow of the plotted curve points toward a reasonable K value. Using the elbow plot, we can heuristically select a meaningful K input for K-means.
  • The DBSCAN algorithm clusters data based on density. Density is defined using the epsilon and min_points parameters. If a point is located within epsilon distance of min_points neighbors, then that point is in a dense region of space. Every neighbor of a point in a dense region of space is added to that point’s cluster. DBSCAN iteratively expands the boundaries of a dense region of space until a complete cluster is detected.
  • Points in non-dense regions are not clustered by the DBSCAN algorithm. They are treated as outliers.
  • DBSCAN is an advantageous algorithm for clustering data composed of curving and dense shapes.
  • DBSCAN can cluster using arbitrary, non-Euclidean distances.
  • There is no reliable heuristic for choosing appropriate epsilon and min_points parameters. However, if we wish to cluster global cities, we can set the two parameters to 250 miles and three cities, respectively.
  • Storing clustered data in a Pandas table allows us to intuitively iterate over clusters with the groupby method.

That’s all for this series. If you want to learn more about the book, check it out on Manning’s liveBook platform here.