From Data Science Bookcamp by Leonard Apeltsin

This 3-part article series covers:

• Clustering data by centrality
• Clustering data by density
• Trade-offs between clustering algorithms
• Executing clustering using the scikit-learn library
• Iterating over clusters using Pandas

Take 35% off Data Science Bookcamp by entering fccapeltsin into the discount code box at checkout at manning.com.

In case you missed them, check out parts 1 and 2.

DBSCAN: A clustering algorithm for grouping data based on spatial density

DBSCAN is an acronym that stands for density-based spatial clustering of applications with noise. This is a ridiculously long name for what is essentially a very simple technique:

1. Select a random `point` coordinate from a `data` list.
2. Obtain all neighbors within `epsilon` distance of that `point`.
3. If fewer than `min_points` neighbors are discovered, repeat step 1 using a different random point. Otherwise, group `point` and its neighbors into a single cluster.
4. Iteratively repeat steps 2 and 3 across all newly discovered neighbors. All neighboring dense points are merged into the cluster. Iterations terminate after the cluster stops expanding.
5. After extracting the entire cluster, repeat steps 1-4 on all data points whose density hasn’t yet been analyzed.
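The five steps above can be sketched as a naive NumPy implementation. This is an illustrative, unoptimized version that assumes Euclidean distance; the function and variable names are our own:

```python
import numpy as np

def naive_dbscan(data, epsilon, min_points):
    """Returns a cluster ID for each point; -1 marks unclustered outliers."""
    data = np.asarray(data, dtype=float)
    labels = np.full(len(data), -1)
    visited = np.zeros(len(data), dtype=bool)
    cluster_id = 0
    for i in range(len(data)):
        if visited[i]:
            continue
        visited[i] = True
        # Step 2: obtain all neighbors within epsilon of the chosen point
        neighbors = list(np.where(np.linalg.norm(data - data[i],
                                                 axis=1) <= epsilon)[0])
        if len(neighbors) < min_points:
            continue  # Step 3: point is not dense; try another point
        labels[i] = cluster_id
        # Step 4: iteratively absorb the neighbors of every dense neighbor
        while neighbors:
            j = neighbors.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
            if visited[j]:
                continue
            visited[j] = True
            expansion = np.where(np.linalg.norm(data - data[j],
                                                axis=1) <= epsilon)[0]
            if len(expansion) >= min_points:
                neighbors.extend(expansion)
        cluster_id += 1  # Step 5: cluster complete; search for the next one
    return labels
```

The quadratic-time neighbor search in this sketch is exactly what makes basic implementations slow on larger datasets.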

The DBSCAN procedure can be programmed in less than 20 lines of code. However, any basic implementation will run very slowly on our `rocks` list. Programming a fast implementation requires some very nuanced optimizations that improve neighbor traversal speed and are beyond the scope of this book. Fortunately, there’s no need for us to rebuild the algorithm from scratch: scikit-learn provides a speedy `DBSCAN` class, which we can import from `sklearn.cluster`. Let’s import and initialize the class by assigning `epsilon` and `min_points` using the `eps` and `min_samples` parameters. Then we utilize `DBSCAN` to cluster our three rings (figure 13).

Listing 21. Using `DBSCAN` to cluster rings

```
from sklearn.cluster import DBSCAN
cluster_model = DBSCAN(eps=epsilon, min_samples=min_points) ❶
rock_clusters = cluster_model.fit_predict(rocks) ❷
colors = [['g', 'y', 'k'][cluster] for cluster in rock_clusters]
plt.scatter(x_coordinates, y_coordinates, color=colors)
plt.show()

```

Creates a cluster_model object to carry out density clustering. An epsilon value of 0.1 is passed in using the eps parameter. A min_points value of 10 is passed in using the min_samples parameter.

Clusters the rock rings based on density and returns the assigned cluster for each rock

Figure 13. DBSCAN clustering accurately identifies the three distinct rock rings.

DBSCAN has successfully identified the three rock rings. The algorithm succeeded where K-means failed.

Comparing DBSCAN and K-means

DBSCAN is an advantageous algorithm for clustering data composed of curving and dense shapes. Also, unlike K-means, the algorithm doesn’t require an approximation of the cluster count before execution. Additionally, DBSCAN can filter random outliers located in sparse regions of space. For example, if we add an outlier located beyond the boundary of the rings, DBSCAN will assign it a cluster ID of -1. The negative value indicates that the outlier cannot be clustered with the rest of the dataset.

Unlike K-means, a fitted DBSCAN model cannot be reapplied to brand-new data. Instead, we need to combine new and old data and execute the clustering from scratch. This is because computed K-means centers can easily be compared to additional data points. However, the additional data points could influence the density distribution of previously seen data, which forces DBSCAN to recompute all clusters.

Listing 22. Finding outliers using DBSCAN

```
noisy_data = rocks + [[1000, -1000]]
clusters = DBSCAN(eps=epsilon,
                  min_samples=min_points).fit_predict(noisy_data)
assert clusters[-1] == -1

```
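The difference in reusability can be sketched directly. In this hypothetical comparison, the synthetic blob data and the eps and min_samples values are our own illustrative choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

np.random.seed(0)
# Two synthetic blobs stand in for previously clustered data
old_data = np.concatenate([np.random.randn(50, 2),
                           np.random.randn(50, 2) + 5])
new_point = np.array([[5.1, 5.0]])

# K-means stores its centers, so new points are assigned without refitting
kmeans = KMeans(n_clusters=2, n_init=10).fit(old_data)
new_cluster = kmeans.predict(new_point)

# DBSCAN offers no predict method; we must recluster old and new data together
combined = np.concatenate([old_data, new_point])
all_clusters = DBSCAN(eps=1.5, min_samples=5).fit_predict(combined)
```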

Another advantage of the DBSCAN technique is that it does not depend on the mean. Meanwhile, the K-means algorithm requires us to compute the mean coordinates of grouped points. As we discussed in section 5, these mean coordinates minimize the sum of squared distances to the center. The minimization property holds only if the squared distances are Euclidean. Thus, if our coordinates are not Euclidean, the mean is not very useful, and the K-means algorithm should not be applied. However, the Euclidean distance is not the only metric for gauging the separation between points; infinitely many metrics exist for defining distance. We explore a few of them in the subsequent subsection. In the process, we learn how to integrate these metrics into our DBSCAN clustering output.

Clustering based on non-Euclidean distance

Suppose that we are visiting Manhattan and wish to know the walking distance from the Empire State Building to Columbus Circle. The Empire State Building is located at the intersection of 34th Street and Fifth Avenue. Meanwhile, Columbus Circle is located at the intersection of 57th Street and Eighth Avenue. The streets and avenues in Manhattan are always perpendicular to each other. This lets us represent Manhattan as a 2D coordinate system, where streets are positioned on the x-axis and avenues are positioned on the y-axis. Under this representation, the Empire State Building is located at coordinate (34, 5), and Columbus Circle is located at coordinate (57, 8). We could easily calculate a straight-line Euclidean distance between the two coordinate points, but that route would be impassable: towering steel buildings occupy the area within every city block. A more correct solution is limited to a path along the perpendicular sidewalks that form the city's grid. Such a route requires us to walk three blocks between Fifth Avenue and Eighth Avenue and then 23 blocks between 34th Street and 57th Street, for a total distance of 26 blocks. Manhattan's average block length is 0.17 miles, so we can estimate the walking distance as 4.42 miles. Let's compute that walking distance directly using a generalized `manhattan_distance` function.

Listing 23. Computing the Manhattan distance

```
import numpy as np

def manhattan_distance(point_a, point_b):
    num_blocks = np.sum(np.absolute(point_a - point_b))
    return 0.17 * num_blocks

x = np.array([34, 5])
y = np.array([57, 8])
distance = manhattan_distance(x, y) ❶

print(f"Manhattan distance is {distance} miles")

```

We can also generate this output by importing cityblock from scipy.spatial.distance and then running 0.17 * cityblock(x, y).

```
Manhattan distance is 4.42 miles

```
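As the annotation notes, SciPy ships an equivalent function. Assuming SciPy is installed, the equivalence is easy to verify:

```python
import numpy as np
from scipy.spatial.distance import cityblock

x = np.array([34, 5])
y = np.array([57, 8])
# cityblock returns the raw block count; scaling by 0.17 converts to miles
assert cityblock(x, y) == 26
assert np.isclose(0.17 * cityblock(x, y), 4.42)
```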

Now, suppose we wish to cluster more than two Manhattan locations. We’ll assume each cluster holds a point that is within a one-mile walk of three other clustered points. This assumption lets us apply DBSCAN clustering using scikit-learn’s `DBSCAN` class. We set `eps` to 1 and `min_samples` to 3 during DBSCAN’s initialization. Furthermore, we pass `metric= manhattan_distance` into the initialization method. The `metric` parameter swaps Euclidean distance for our custom distance metric, so the clustering distance correctly reflects the grid-based constraints within the city. The following code clusters Manhattan coordinates and plots them on a grid along with their cluster designations (figure 14).

Listing 24. Clustering using Manhattan distance

```
points = [[35, 5], [33, 6], [37, 4], [40, 7], [45, 5]]
clusters = DBSCAN(eps=1, min_samples=3,
                  metric=manhattan_distance).fit_predict(points) ❶

for i, cluster in enumerate(clusters):
    point = points[i]
    if cluster == -1:
        print(f"Point at index {i} is an outlier")
        plt.scatter(point[0], point[1], marker='x', color='k') ❷
    else:
        print(f"Point at index {i} is in cluster {cluster}")
        plt.scatter(point[0], point[1], color='g')

plt.grid(True, which='both', alpha=0.5) ❸
plt.minorticks_on()

plt.show()

```

The manhattan_distance function is passed into DBSCAN through the metric parameter.

Outliers are plotted using x-shaped markers.

The grid method displays the rectangular grid across which we compute Manhattan distance.

```
Point at index 0 is in cluster 0
Point at index 1 is in cluster 0
Point at index 2 is in cluster 0
Point at index 3 is an outlier
Point at index 4 is an outlier

```

Figure 14. Five points in a rectangular grid have been clustered using the Manhattan distance. The three points in the lower-left corner of the grid fall within a single cluster. The remaining two points are outliers, marked by an x.

The first three locations fall within a single cluster, and the remaining points are outliers. Could we have detected that cluster using the K-means algorithm? Perhaps. After all, our Manhattan block coordinates can be averaged out, making them compatible with a K-means implementation. What if we swap Manhattan distance for a different metric where average coordinates are not so easily obtained? Let's define a nonlinear distance metric with the following properties: two points are 0 units apart if all their elements are negative, 2 units apart if all their elements are non-negative, and 10 units apart otherwise. Given this ridiculous measure of distance, can we compute the mean of any two arbitrary points? We can't, and K-means cannot be applied. A weakness of K-means is that it depends on the existence of meaningful mean coordinates. DBSCAN makes no such demand: it only ever compares pairwise distances, so our distance function does not need to support averaging. Thus, we can easily run DBSCAN clustering using our ridiculous distance metric.

Listing 25. Clustering using a ridiculous measure of distance

```
def ridiculous_measure(point_a, point_b):
    is_negative_a = np.array(point_a) < 0 ❶
    is_negative_b = np.array(point_b) < 0
    if is_negative_a.all() and is_negative_b.all(): ❷
        return 0
    elif is_negative_a.any() or is_negative_b.any(): ❸
        return 10
    else: ❹
        return 2

points = [[-1, -1], [-10, -10], [-1000, -13435], [3, 5], [5, -7]]

clusters = DBSCAN(eps=0.1, min_samples=2,
                  metric=ridiculous_measure).fit_predict(points)

for i, cluster in enumerate(clusters):
    point = points[i]
    if cluster == -1:
        print(f"{point} is an outlier")
    else:
        print(f"{point} falls in cluster {cluster}")

```

Returns a Boolean array where is_negative_a[i] is True if point_a[i] < 0

All elements of point_a and point_b are negative.

A negative element exists, but not all elements are negative.

All elements are non-negative.

```
[-1, -1] falls in cluster 0
[-10, -10] falls in cluster 0
[-1000, -13435] falls in cluster 0
[3, 5] is an outlier
[5, -7] is an outlier

```

Running DBSCAN with our `ridiculous_measure` metric leads to the clustering of negative coordinates into a single group. All other coordinates are treated as outliers. These results have no real practical use, but the flexibility of metric selection is much appreciated. We are not constrained in our metric choice! We could, for instance, set the metric to compute traversal distance based on the curvature of the Earth. Such a metric would be particularly useful for clustering geographic locations.
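As a sketch of that idea, a great-circle metric based on the haversine formula can be passed straight into DBSCAN. The city coordinates below are illustrative (latitude, longitude) pairs, and 3956 miles approximates Earth's radius:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def great_circle_distance(point_a, point_b):
    """Approximate distance in miles between two (latitude, longitude) points."""
    lat_a, lon_a, lat_b, lon_b = np.radians([*point_a, *point_b])
    h = (np.sin((lat_b - lat_a) / 2) ** 2
         + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
    return 2 * 3956 * np.arcsin(np.sqrt(h))  # 3956 miles ~ Earth's radius

# New York City, Philadelphia, Washington DC, and Paris
cities = [[40.71, -74.01], [39.95, -75.17], [38.90, -77.04], [48.86, 2.35]]
clusters = DBSCAN(eps=250, min_samples=3,
                  metric=great_circle_distance).fit_predict(cities)
```

Each of the three northeastern US cities lies within 250 miles of the other two, so they form a single cluster, while Paris is flagged as an outlier.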

DBSCAN clustering methods

• `dbscan_model = DBSCAN(eps=epsilon, min_samples=min_points)` — Creates a DBSCAN model to cluster by density. A dense point is defined as having at least `min_points` neighbors within a distance of `epsilon`. The neighbors are considered to be part of the same cluster as the point.
• `clusters = dbscan_model.fit_predict(data)` — Executes DBSCAN on inputted data using an initialized `DBSCAN` object. The `clusters` array contains cluster IDs. The cluster ID of `data[i]` is equal to `clusters[i]`. Unclustered outlier points are assigned an ID of -1.
• `clusters = DBSCAN(eps=epsilon, min_samples=min_points).fit_predict(data)` — Executes DBSCAN in a single line of code, and returns the resulting clusters.
• `dbscan_model = DBSCAN(eps=epsilon, min_samples=min_points, metric=metric_function)` — Creates a DBSCAN model where the distance metric is defined by a custom metric function. The `metric_function` distance metric does not need to be Euclidean.

DBSCAN does have certain drawbacks. The algorithm is intended to detect clusters with similar point-density distributions. However, real-world data varies in density. For instance, pizza shops in Manhattan are distributed more densely than pizza shops in Orange County, California. Thus, we might have trouble choosing density parameters that will let us cluster shops in both locations. This highlights another limitation of the algorithm: DBSCAN requires meaningful values for the `eps` and `min_samples` parameters. In particular, varying `eps` inputs will greatly impact the quality of clustering. Unfortunately, there is no one reliable procedure for estimating the appropriate `eps`. While certain heuristics are occasionally mentioned in the literature, their benefit is minimal. Most of the time, we must rely on our gut-level understanding of the problem to assign practical inputs to the two DBSCAN parameters. For example, if we were to cluster a set of geographic locations, our `eps` and `min_samples` values would depend on whether the locations are spread out across the entire globe or whether they are constrained to a single geographic region. In each instance, our understanding of density and distance would vary. Generally speaking, if we are clustering random cities spread out across the Earth, we can set the `min_samples` and `eps` parameters to equal three cities and 250 miles, respectively. This assumes each cluster holds a city within 250 miles of at least three other clustered cities. For a more regional location distribution, a lower `eps` value is required.
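One of the heuristics occasionally mentioned in the literature is the k-distance curve: sort every point's distance to its kth-nearest neighbor and look for a sharp bend, which is read as a candidate eps. The sketch below uses scikit-learn's NearestNeighbors class; as noted above, the guidance such a curve provides is rough at best:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(data, k):
    """Returns sorted distances from each point to its kth-nearest neighbor."""
    # n_neighbors=k + 1 because each point counts itself as neighbor zero
    neighbor_model = NearestNeighbors(n_neighbors=k + 1).fit(data)
    distances, _ = neighbor_model.kneighbors(data)
    return np.sort(distances[:, -1])
```

Plotting the curve (for example, `plt.plot(k_distance_curve(rocks, min_points))`) and eyeballing the bend would then suggest an eps; in practice, domain knowledge remains the better guide.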

Analyzing clusters using Pandas

So far, we have kept our data inputs and clustering outputs separate. For instance, in our rock ring analysis, the input data is in the `rocks` list and the clustering output is in a `rock_clusters` array. Tracking both the coordinates and the clusters requires us to map indices between the input list and the output array. Thus, if we wish to extract all the rocks in cluster 0, we must obtain all instances of `rocks[i]` where `rock_clusters[i] == 0`. This index analysis is convoluted. We can more intuitively analyze clustered rocks by combining the coordinates and the clusters together in a single Pandas table.

The following code creates a Pandas table with three columns: `X`, `Y`, and `Cluster`. The ith row of the table holds the x-coordinate, the y-coordinate, and the cluster of the rock located at `rocks[i]`.

Listing 26. Storing clustered coordinates in a table

```
import pandas as pd
x_coordinates, y_coordinates = np.array(rocks).T
df = pd.DataFrame({'X': x_coordinates, 'Y': y_coordinates,
                   'Cluster': rock_clusters})

```

Our Pandas table lets us easily access the rocks in any cluster. Let’s plot the rocks that fall into cluster 0, using techniques described in section 8 (figure 15).

Listing 27. Plotting a single cluster using Pandas

```
df_cluster = df[df.Cluster == 0] ❶
plt.scatter(df_cluster.X, df_cluster.Y) ❷
plt.show()

```

Selects just the rows where the Cluster column equals 0

Plots the X and Y columns of the selected rows. Note that we could also produce the scatter plot by running df_cluster.plot.scatter(x='X', y='Y').

Figure 15. Rocks that fall into cluster 0

Pandas allows us to obtain a table containing elements from any single cluster. Alternatively, we might want to obtain multiple tables, where each table maps to a cluster ID. In Pandas, this is done by calling `df.groupby('Cluster')`. The `groupby` method will create three tables: one for each cluster. It will return an iterable over the mappings between cluster IDs and tables. Let’s use the `groupby` method to iterate over our three clusters. We’ll subsequently plot the rocks in cluster 1 and cluster 2, but not the rocks in cluster 0 (figure 16).

Calling `df.groupby('Cluster')` returns more than just an iterable: it returns a `DataFrameGroupBy` object, which provides additional methods for cluster filtering and analysis.

Listing 28. Iterating over clusters using Pandas

```
for cluster_id, df_cluster in df.groupby('Cluster'): ❶
    if cluster_id == 0:
        print(f"Skipping over cluster {cluster_id}")
        continue

    print(f"Plotting cluster {cluster_id}")
    plt.scatter(df_cluster.X, df_cluster.Y)

plt.show()
```

```
Skipping over cluster 0
Plotting cluster 1
Plotting cluster 2
```

Each element of the iterable returned by df.groupby('Cluster') is a tuple. The first element of the tuple is the cluster ID obtained from df.Cluster. The second element is a table composed of all rows where df.Cluster equals the cluster ID.

Figure 16. Rocks that fall into clusters 1 and 2

The Pandas `groupby` method lets us iteratively examine different clusters. This could prove useful in our case study 3 analysis.
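Beyond iteration, the same groupby call supports direct per-cluster aggregation. In the sketch below, the coordinates and cluster IDs are small illustrative stand-ins for the rocks data:

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the rocks list and DBSCAN output
rocks = [[0.0, 1.0], [0.1, 1.1], [2.0, 2.0], [2.1, 2.2]]
rock_clusters = [0, 0, 1, 1]

x_coordinates, y_coordinates = np.array(rocks).T
df = pd.DataFrame({'X': x_coordinates, 'Y': y_coordinates,
                   'Cluster': rock_clusters})

sizes = df.groupby('Cluster').size()                # rock count per cluster
centers = df.groupby('Cluster')[['X', 'Y']].mean()  # mean coordinates per cluster
```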

Summary

• The K-means algorithm clusters inputted data by searching for K centroids. These centroids represent the mean coordinates of the discovered data groups. K-means is initialized by selecting K random centroids. Each data point is then clustered based on its nearest centroid, and the centroids are iteratively recomputed until they converge on stable locations.
• K-means is guaranteed to converge to a solution. However, that solution may not be optimal.
• K-means requires Euclidean distances to distinguish between points. The algorithm is not intended to cluster non-Euclidean coordinates.
• After executing K-means clustering, we can compute the inertia of the result. Inertia equals the sum of the squared distances between each data point and its closest center.
• Plotting the inertia across a range of K values generates an elbow plot. The bend in the elbow-shaped plot points to a reasonable K value. Using the elbow plot, we can heuristically select a meaningful K input for K-means.
• The DBSCAN algorithm clusters data based on density. Density is defined using the `epsilon` and `min_points` parameters. If a point is located within `epsilon` distance of `min_points` neighbors, then that point is in a dense region of space. Every neighbor of a point in a dense region of space belongs to that point's cluster. DBSCAN iteratively expands the boundaries of a dense region until a complete cluster is detected.
• Points in non-dense regions are not clustered by the DBSCAN algorithm. They are treated as outliers.
• DBSCAN is an advantageous algorithm for clustering data composed of curving and dense shapes.
• DBSCAN can cluster using arbitrary, non-Euclidean distances.
• There is no reliable heuristic for choosing appropriate `epsilon` and `min_points` parameters. However, if we wish to cluster global cities, we can set the two parameters to 250 miles and three cities, respectively.
• Storing clustered data in a Pandas table allows us to intuitively iterate over clusters with the `groupby` method.

That’s all for this series. If you want to learn more about the book, check it out on Manning’s liveBook platform here.