From Data Science Bookcamp by Leonard Apeltsin

This three-part article series covers:

- DBSCAN: a clustering algorithm for grouping data based on spatial density

Take 35% off Data Science Bookcamp by entering fccapeltsin into the discount code box at checkout at manning.com.

In case you missed them, check out parts 1 and 2.
DBSCAN is an acronym that stands for density-based spatial clustering of applications with noise. This is a ridiculously long name for what essentially is a very simple technique:

1. Select a random point coordinate from a data list.
2. Obtain all neighbors within epsilon distance of that point.
3. If fewer than min_points neighbors are discovered, repeat step 1 using a different random point. Otherwise, group point and its neighbors into a single cluster.
4. Iteratively repeat steps 2 and 3 across all newly discovered neighbors. All neighboring dense points are merged into the cluster. Iterations terminate after the cluster stops expanding.
5. After extracting the entire cluster, repeat steps 1-4 on all data points whose density hasn't yet been analyzed.
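The steps above can be sketched directly in code. The following is a minimal illustrative implementation (the dbscan function and its helper are our own, not the book's); it compares every pair of points, which is exactly the kind of basic implementation that runs slowly on large lists.

```python
import numpy as np

def dbscan(data, epsilon, min_points):
    """Toy DBSCAN sketch: returns one cluster ID per point; -1 marks outliers."""
    data = np.asarray(data, dtype=float)
    clusters = np.full(len(data), -1)  # -1 means "not yet clustered"

    def neighbors_of(i):
        # Step 2: all points within epsilon distance of point i (including i)
        return [j for j in range(len(data))
                if np.linalg.norm(data[i] - data[j]) <= epsilon]

    current_cluster = 0
    for i in range(len(data)):  # Steps 1 and 5: visit each unanalyzed point
        if clusters[i] != -1:
            continue
        neighbors = neighbors_of(i)
        if len(neighbors) < min_points:
            continue  # Step 3: point is not dense; move on to another point
        clusters[i] = current_cluster  # Step 3: group point with its neighbors
        to_visit = neighbors
        while to_visit:  # Step 4: expand until the cluster stops growing
            j = to_visit.pop()
            if clusters[j] != -1:
                continue
            clusters[j] = current_cluster
            js_neighbors = neighbors_of(j)
            if len(js_neighbors) >= min_points:
                to_visit.extend(js_neighbors)  # dense neighbor: keep expanding
        current_cluster += 1
    return clusters
```

Scanning every pair of points makes each cluster expansion quadratic in the data size; this is the neighbor-traversal cost that optimized implementations avoid with spatial index structures.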
The DBSCAN procedure can be programmed in less than 20 lines of code. However, any basic implementation will run very slowly on our rocks list. Programming a fast implementation requires some very nuanced optimizations that improve neighbor-traversal speed and are beyond the scope of this book. Fortunately, there's no need for us to rebuild the algorithm from scratch: scikit-learn provides a speedy DBSCAN class, which we can import from sklearn.cluster. Let's import and initialize the class by assigning epsilon and min_points using the eps and min_samples parameters. Then we utilize DBSCAN to cluster our three rings (figure 13).
Listing 21. Using DBSCAN to cluster rings

from sklearn.cluster import DBSCAN
cluster_model = DBSCAN(eps=epsilon, min_samples=min_points) ❶
rock_clusters = cluster_model.fit_predict(rocks) ❷
colors = [['g', 'y', 'k'][cluster] for cluster in rock_clusters]
plt.scatter(x_coordinates, y_coordinates, color=colors)
plt.show()
❶ Creates a cluster_model object to carry out density clustering. An epsilon value of 0.1 is passed in using the eps parameter. A min_points value of 10 is passed in using the min_samples parameter.
❷ Clusters the rock rings based on density, and returns the assigned cluster for each rock
Figure 13. DBSCAN clustering accurately identifies the three distinct rock rings.
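The rocks list was constructed earlier in the series, so it isn't reproduced here. As a self-contained stand-in, we can generate three synthetic rings and confirm the same behavior (the data, eps, and min_samples values below are our own, chosen for this toy dataset rather than taken from the book):

```python
import numpy as np
from sklearn.cluster import DBSCAN

np.random.seed(0)
# Stand-in for the book's rocks list: three concentric rings of noisy points
angles = np.linspace(0, 2 * np.pi, 500, endpoint=False)
rings = []
for radius in [1, 3, 5]:
    ring = np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])
    ring += np.random.normal(scale=0.03, size=ring.shape)  # jitter each ring
    rings.append(ring)
rocks = np.vstack(rings)

rock_clusters = DBSCAN(eps=0.3, min_samples=5).fit_predict(rocks)
num_clusters = len(set(rock_clusters) - {-1})  # -1 would mark outliers
```

Each ring forms its own dense region, so num_clusters comes out to 3 even though the rings share a common center: precisely the geometry that defeats K-means.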
DBSCAN has successfully identified the three rock rings. The algorithm succeeded where K-means failed.
Comparing DBSCAN and K-means
DBSCAN is an advantageous algorithm for clustering data composed of curving and dense shapes. Also, unlike K-means, the algorithm doesn't require an approximation of the cluster count before execution. Additionally, DBSCAN can filter random outliers located in sparse regions of space. For example, if we add an outlier located beyond the boundary of the rings, DBSCAN will assign it a cluster ID of -1. The negative value indicates that the outlier cannot be clustered with the rest of the dataset.
Unlike K-means, a fitted DBSCAN model cannot be reapplied to brand-new data. Instead, we need to combine new and old data and execute the clustering from scratch. This is because computed K-means centers can easily be compared to additional data points. However, the additional data points could influence the density distribution of previously seen data, which forces DBSCAN to recompute all clusters.
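One way to sketch this contrast with scikit-learn (the toy coordinates and parameter values here are our own): a fitted KMeans model exposes a predict method for assigning brand-new points to its existing centers, whereas DBSCAN offers no predict method at all, so we must merge new points with the old data and rerun fit_predict.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

old_data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
new_data = np.array([[0.5, 0.5], [10.5, 10.5]])

# K-means: fit once, then reuse the computed centers on brand-new points
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(old_data)
new_kmeans_clusters = kmeans.predict(new_data)  # no refitting required

# DBSCAN: no predict method exists, so we re-cluster old and new data together
combined = np.vstack([old_data, new_data])
all_clusters = DBSCAN(eps=2, min_samples=3).fit_predict(combined)
new_dbscan_clusters = all_clusters[-len(new_data):]  # labels for the new points
```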
Listing 22. Finding outliers using DBSCAN

noisy_data = rocks + [[1000, -1000]]
clusters = DBSCAN(eps=epsilon, min_samples=min_points).fit_predict(noisy_data)
assert clusters[-1] == -1
Another advantage of the DBSCAN technique is that it does not depend on the mean. Meanwhile, the K-means algorithm requires us to compute the mean coordinates of grouped points. As we discussed in section 5, these mean coordinates minimize the sum of squared distances to the center. The minimization property holds only if the squared distances are Euclidean. Thus, if our coordinates are not Euclidean, the mean is not very useful, and the K-means algorithm should not be applied. However, the Euclidean distance is not the only metric for gauging separation between points — infinitely many metrics exist for defining distance. We explore a few of them in the subsequent subsection. In the process, we learn how to integrate these metrics into our DBSCAN clustering output.
Clustering based on non-Euclidean distance
Suppose that we are visiting Manhattan and wish to know the walking distance from the Empire State Building to Columbus Circle. The Empire State Building is located at the intersection of 34th Street and Fifth Avenue. Meanwhile, Columbus Circle is located at the intersection of 57th Street and Eighth Avenue. The streets and avenues in Manhattan are always perpendicular to each other. This lets us represent Manhattan as a 2D coordinate system, where streets are positioned on the x-axis and avenues are positioned on the y-axis. Under this representation, the Empire State Building is located at coordinate (34, 5), and Columbus Circle is located at coordinate (57, 8). We can easily calculate a straight-line Euclidean distance between the two coordinate points. However, that straight-line path would be impassable because towering steel buildings occupy the area outlined by every city block. A more correct solution is limited to a path across the perpendicular sidewalks that form the city's grid. Such a route requires us to walk three blocks between Fifth Avenue and Eighth Avenue and then 23 blocks between 34th Street and 57th Street, for a total distance of 26 blocks. Manhattan's average block length is 0.17 miles, so we can estimate the walking distance as 26 * 0.17 = 4.42 miles. Let's compute that walking distance directly using a generalized manhattan_distance function.
Listing 23. Computing the Manhattan distance

def manhattan_distance(point_a, point_b):
    num_blocks = np.sum(np.absolute(point_a - point_b))
    return 0.17 * num_blocks

x = np.array([34, 5])
y = np.array([57, 8])
distance = manhattan_distance(x, y) ❶
print(f"Manhattan distance is {distance} miles")
❶ We can also generate this output by importing cityblock from scipy.spatial.distance and then running .17 * cityblock(x, y).
Manhattan distance is 4.42 miles
Now, suppose we wish to cluster more than two Manhattan locations. We'll assume each cluster holds a point that is within a one-mile walk of three other clustered points. This assumption lets us apply DBSCAN clustering using scikit-learn's DBSCAN class. We set eps to 1 and min_samples to 3 during DBSCAN's initialization. Furthermore, we pass metric=manhattan_distance into the initialization method. The metric parameter swaps Euclidean distance for our custom distance metric, so the clustering distance correctly reflects the grid-based constraints within the city. The following code clusters Manhattan coordinates and plots them on a grid along with their cluster designations (figure 14).
Listing 24. Clustering using Manhattan distance

points = [[35, 5], [33, 6], [37, 4], [40, 7], [45, 5]]
clusters = DBSCAN(eps=1, min_samples=3,
                  metric=manhattan_distance).fit_predict(points) ❶
for i, cluster in enumerate(clusters):
    point = points[i]
    if cluster == -1:
        print(f"Point at index {i} is an outlier")
        plt.scatter(point[0], point[1], marker='x', color='k') ❷
    else:
        print(f"Point at index {i} is in cluster {cluster}")
        plt.scatter(point[0], point[1], color='g')
plt.grid(True, which='both', alpha=0.5) ❸
plt.minorticks_on()
plt.show()
❶ The manhattan_distance function is passed into DBSCAN through the metric parameter.
❷ Outliers are plotted using xshaped markers.
❸ The grid method displays the rectangular grid across which we compute Manhattan distance.
Point at index 0 is in cluster 0
Point at index 1 is in cluster 0
Point at index 2 is in cluster 0
Point at index 3 is an outlier
Point at index 4 is an outlier
Figure 14. Five points in a rectangular grid have been clustered using the Manhattan distance. The three points in the lower-left corner of the grid fall within a single cluster. The remaining two points are outliers, marked by an x.
The first three locations fall within a single cluster, and the remaining points are outliers. Could we have detected that cluster using the K-means algorithm? Perhaps. After all, our Manhattan block coordinates can be averaged out, making them compatible with a K-means implementation. What if we swap Manhattan distance for a different metric where average coordinates are not so easily obtained? Let's define a nonlinear distance metric with the following properties: two points are 0 units apart if all their elements are negative, 2 units apart if all their elements are nonnegative, and 10 units apart otherwise. Given this ridiculous measure of distance, can we compute the mean of any two arbitrary points? We can't, and K-means cannot be applied. A weakness of the algorithm is that it depends on the existence of an average distance. Unlike K-means, the DBSCAN algorithm does not require our distance function to be linearly divisible. Thus, we can easily run DBSCAN clustering using our ridiculous distance metric.
Listing 25. Clustering using a ridiculous measure of distance

def ridiculous_measure(point_a, point_b):
    is_negative_a = np.array(point_a) < 0 ❶
    is_negative_b = np.array(point_b) < 0
    if is_negative_a.all() and is_negative_b.all(): ❷
        return 0
    elif is_negative_a.any() or is_negative_b.any(): ❸
        return 10
    else: ❹
        return 2

points = [[-1, -1], [-10, -10], [-1000, -13435], [3, 5], [5, -7]]
clusters = DBSCAN(eps=.1, min_samples=2,
                  metric=ridiculous_measure).fit_predict(points)
for i, cluster in enumerate(clusters):
    point = points[i]
    if cluster == -1:
        print(f"{point} is an outlier")
    else:
        print(f"{point} falls in cluster {cluster}")
❶ Returns a Boolean array where is_negative_a[i] is True if point_a[i] < 0
❷ All elements of point_a and point_b are negative.
❸ A negative element exists, but not all elements are negative.
❹ All elements are nonnegative.
[-1, -1] falls in cluster 0
[-10, -10] falls in cluster 0
[-1000, -13435] falls in cluster 0
[3, 5] is an outlier
[5, -7] is an outlier
Running DBSCAN with our ridiculous_measure metric leads to the clustering of negative coordinates into a single group. All other coordinates are treated as outliers. These results are not conceptually practical, but the flexibility with regard to metric selection is much appreciated. We are not constrained in our metric choice! We could, for instance, set the metric to compute traversal distance based on the curvature of the Earth. Such a metric would be particularly useful for clustering geographic locations.
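A sketch of such a metric, assuming (latitude, longitude) coordinates in degrees (the haversine_distance function below is our own, based on the standard haversine formula, and is not taken from the book):

```python
import numpy as np

def haversine_distance(point_a, point_b):
    """Great-circle distance in miles between two (latitude, longitude)
    points, following the curvature of the Earth."""
    earth_radius_miles = 3956
    lat_a, lon_a, lat_b, lon_b = np.radians([*point_a, *point_b])
    h = (np.sin((lat_b - lat_a) / 2) ** 2
         + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
    return 2 * earth_radius_miles * np.arcsin(np.sqrt(h))

# New York City to Los Angeles: roughly 2,450 miles along the Earth's surface
nyc, la = (40.71, -74.01), (34.05, -118.24)
distance = haversine_distance(nyc, la)
```

Passing metric=haversine_distance into DBSCAN would then cluster geographic locations by actual surface distance rather than by raw latitude-longitude differences.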
DBSCAN clustering methods

- dbscan_model = DBSCAN(eps=epsilon, min_samples=min_points) — Creates a DBSCAN model to cluster by density. A dense point is defined as having at least min_points neighbors within a distance of epsilon. The neighbors are considered to be part of the same cluster as the point.
- clusters = dbscan_model.fit_predict(data) — Executes DBSCAN on inputted data using an initialized DBSCAN object. The clusters array contains cluster IDs. The cluster ID of data[i] is equal to clusters[i]. Unclustered outlier points are assigned an ID of -1.
- clusters = DBSCAN(eps=epsilon, min_samples=min_points).fit_predict(data) — Executes DBSCAN in a single line of code and returns the resulting clusters.
- dbscan_model = DBSCAN(eps=epsilon, min_samples=min_points, metric=metric_function) — Creates a DBSCAN model whose distance metric is defined by a custom metric function. The metric_function distance metric does not need to be Euclidean.
DBSCAN does have certain drawbacks. The algorithm is intended to detect clusters with similar point-density distributions. However, real-world data varies in density. For instance, pizza shops in Manhattan are distributed more densely than pizza shops in Orange County, California. Thus, we might have trouble choosing density parameters that will let us cluster shops in both locations. This highlights another limitation of the algorithm: DBSCAN requires meaningful values for the eps and min_samples parameters. In particular, varying eps inputs will greatly impact the quality of clustering. Unfortunately, there is no one reliable procedure for estimating the appropriate eps. While certain heuristics are occasionally mentioned in the literature, their benefit is minimal. Most of the time, we must rely on our gut-level understanding of the problem to assign practical inputs to the two DBSCAN parameters. For example, if we were to cluster a set of geographic locations, our eps and min_samples values would depend on whether the locations are spread out across the entire globe or whether they are constrained to a single geographic region. In each instance, our understanding of density and distance would vary. Generally speaking, if we are clustering random cities spread out across the Earth, we can set the min_samples and eps parameters to equal three cities and 250 miles, respectively. This assumes each cluster holds a city within 250 miles of at least three other clustered cities. For a more regional location distribution, a lower eps value is required.
Analyzing clusters using Pandas
So far, we have kept our data inputs and clustering outputs separate. For instance, in our rock ring analysis, the input data is in the rocks list and the clustering output is in a rock_clusters array. Tracking both the coordinates and the clusters requires us to map indices between the input list and the output array. Thus, if we wish to extract all the rocks in cluster 0, we must obtain all instances of rocks[i] where rock_clusters[i] == 0. This index analysis is convoluted. We can more intuitively analyze clustered rocks by combining the coordinates and the clusters together in a single Pandas table.
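For comparison, the index-based mapping looks like this (a minimal sketch with stand-in coordinates and cluster IDs, since the full rocks data isn't reproduced here):

```python
# Stand-in data: three rock coordinates and their assigned cluster IDs
rocks = [[0.0, 1.0], [0.2, 0.9], [5.0, 5.0]]
rock_clusters = [0, 0, -1]

# Extract all rocks in cluster 0 by mapping indices between list and array
cluster_0_rocks = [rocks[i] for i in range(len(rocks))
                   if rock_clusters[i] == 0]
```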
The following code creates a Pandas table with three columns: X, Y, and Cluster. Each ith row in the table holds the x-coordinate, the y-coordinate, and the cluster of the rock located at rocks[i].
Listing 26. Storing clustered coordinates in a table

import pandas as pd
x_coordinates, y_coordinates = np.array(rocks).T
df = pd.DataFrame({'X': x_coordinates, 'Y': y_coordinates,
                   'Cluster': rock_clusters})
Our Pandas table lets us easily access the rocks in any cluster. Let’s plot the rocks that fall into cluster 0, using techniques described in section 8 (figure 15).
Listing 27. Plotting a single cluster using Pandas

df_cluster = df[df.Cluster == 0] ❶
plt.scatter(df_cluster.X, df_cluster.Y) ❷
plt.show()
❶ Selects just those rows where the Cluster column equals 0
❷ Plots the X and Y columns of the selected rows. Note that we can also generate this scatter plot by running df_cluster.plot.scatter(x='X', y='Y').
Figure 15. Rocks that fall into cluster 0
Pandas allows us to obtain a table containing elements from any single cluster. Alternatively, we might want to obtain multiple tables, where each table maps to a cluster ID. In Pandas, this is done by calling df.groupby('Cluster'). The groupby method creates three tables, one for each cluster, and returns an iterable over the mappings between cluster IDs and tables. Let's use the groupby method to iterate over our three clusters. We'll subsequently plot the rocks in cluster 1 and cluster 2, but not the rocks in cluster 0 (figure 16).
Calling df.groupby('Cluster') returns more than just an iterable: it returns a DataFrameGroupBy object, which provides additional methods for cluster filtering and analysis.
Listing 28. Iterating over clusters using Pandas
for cluster_id, df_cluster in df.groupby('Cluster'): ❶
    if cluster_id == 0:
        print(f"Skipping over cluster {cluster_id}")
        continue
    print(f"Plotting cluster {cluster_id}")
    plt.scatter(df_cluster.X, df_cluster.Y)
    plt.show()
Skipping over cluster 0
Plotting cluster 1
Plotting cluster 2
❶ Each element of the iterable returned by df.groupby('Cluster') is a tuple. The first element of the tuple is the cluster ID obtained from df.Cluster. The second element is a table composed of all rows where df.Cluster equals the cluster ID.
Figure 16. Rocks that fall into clusters 1 and 2
The Pandas groupby method lets us iteratively examine different clusters. This could prove useful in our case study 3 analysis.
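As a brief illustration of those additional methods (a sketch with stand-in data; get_group and size are standard Pandas DataFrameGroupBy methods):

```python
import pandas as pd

# Stand-in table of clustered coordinates
df = pd.DataFrame({'X': [0.0, 0.1, 5.0],
                   'Y': [1.0, 0.9, 5.0],
                   'Cluster': [0, 0, 1]})
grouped = df.groupby('Cluster')

df_cluster_0 = grouped.get_group(0)  # fetch a single cluster's table directly
cluster_sizes = grouped.size()       # number of rows per cluster
```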
Summary
- The K-means algorithm clusters inputted data by searching for K centroids. These centroids represent the mean coordinates of the discovered data groups. K-means is initialized by selecting K random centroids. Each data point is then clustered based on its nearest centroid, and the centroids are iteratively recomputed until they converge on stable locations.
- K-means is guaranteed to converge to a solution. However, that solution may not be optimal.
- K-means requires Euclidean distances to distinguish between points. The algorithm is not intended to cluster non-Euclidean coordinates.
- After executing K-means clustering, we can compute the inertia of the result. Inertia equals the sum of the squared distances between each data point and its closest center.
- Plotting the inertia across a range of K values generates an elbow plot. The elbow component in the elbow-shaped plot should point downward to a reasonable K value. Using the elbow plot, we can heuristically select a meaningful K input for K-means.
- The DBSCAN algorithm clusters data based on density. Density is defined using the epsilon and min_points parameters. If a point is located within epsilon distance of min_points neighbors, then that point is in a dense region of space. Every neighbor of a point in a dense region of space also clusters in that space. DBSCAN iteratively expands the boundaries of a dense region of space until a complete cluster is detected.
- Points in non-dense regions are not clustered by the DBSCAN algorithm. They are treated as outliers.
- DBSCAN is an advantageous algorithm for clustering data composed of curving and dense shapes.
- DBSCAN can cluster using arbitrary, non-Euclidean distances.
- There is no reliable heuristic for choosing appropriate epsilon and min_points parameters. However, if we wish to cluster global cities, we can set the two parameters to 250 miles and three cities, respectively.
- Storing clustered data in a Pandas table allows us to intuitively iterate over clusters with the groupby method.
That’s all for this series. If you want to learn more about the book, check it out on Manning’s liveBook platform here.