## K-Means Clustering

To cluster Texas’ fastest-growing cities by their venue categories, the first algorithm of choice was K-Means. This unsupervised learning algorithm takes a straightforward approach, using the distance between data points as a measure of similarity.

After running the algorithm on the cities and their respective venue frequencies, it returned a set of labeled data points indicating the cluster each belonged to. However, the validity of these results depended largely on how well the algorithm was able to distinguish clusters. To test this, the elbow method was employed.

The elbow method works by iterating the K-Means algorithm through a set of K-values (the number of desired clusters) until increasing K has a negligible effect on distortion. Distortion can be interpreted as the average squared distance from sample points to their cluster centers; in other words, a measure of how incohesive the clusters are. At the point where distortion stops falling appreciably, an ‘elbow’ becomes visually apparent in the graph. This point is the optimal K-value: the fewest clusters achievable without a significant loss in cohesion. Although the algorithm may still return labeled data, the lack of an elbow means K-Means is unable to identify **distinctive** clusters. In this situation, an elbow was not apparent.
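As a sketch of this procedure, assuming a scikit-learn setup and a hypothetical feature matrix `X` of per-city venue-category frequencies (the real data is not reproduced here), the elbow method amounts to fitting K-Means across a range of K-values and recording the distortion each time:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in: one row per city, one column per venue category.
rng = np.random.default_rng(0)
X = rng.random((30, 10))

# Fit K-Means for each candidate K and record the distortion (inertia_):
# the sum of squared distances from each point to its cluster center.
distortions = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(model.inertia_)

# Distortion always decreases as K grows; the "elbow" is where that
# decrease levels off. Plotting k_values against distortions (e.g. with
# matplotlib) makes the elbow, or its absence, visible.
```

If the resulting curve declines smoothly with no sharp bend, as happened here, no K-value stands out as optimal.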

## DBSCAN Algorithm + PCA

Given that K-Means was unable to locate any defined clusters, it was likely that the nature of any potential clusters was not distance-related. Clusters may also be density-related, such as groups of data points that form concentric circles.

In a scenario like this, DBSCAN, or density-based spatial clustering of applications with noise, may be more appropriate. DBSCAN works by clustering areas of high density that are separated from other clusters by areas of low density.

Before making use of the algorithm, it was noted that the dimensionality of the dataset was very high: there were multiple venue category columns, each holding a different frequency value for each city’s row. It was entirely possible that this impacted K-Means clustering, since distances become less meaningful as the number of dimensions grows, a phenomenon often called the curse of dimensionality. The dimensionality was therefore reduced using PCA, or principal component analysis. This method represents as much of the variance in the data as possible with a small number of features transformed from the original dataset. A more in-depth explanation of PCA can be found here. Upon applying PCA, the original data was represented as a two-dimensional data frame.
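A minimal sketch of this reduction step, again assuming scikit-learn and an illustrative matrix `X` in place of the actual venue-frequency data, might look like:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data: one row per city, one column
# per venue category (shape and values are illustrative only).
rng = np.random.default_rng(0)
X = rng.random((30, 50))

# Project onto the two principal components that capture the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                      # two columns remain
print(pca.explained_variance_ratio_)   # variance captured by each component
```

The `explained_variance_ratio_` attribute is worth checking after fitting, since it shows how much of the original variance the two retained components actually preserve.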

In order to run the DBSCAN algorithm, two parameters are required: ε (epsilon) and minimum samples. Minimum samples refers to the smallest number of data points necessary to form a cluster. It is important to note that DBSCAN does not cluster all data; points that fit no cluster are identified as outliers. As a consequence, a low minimum-samples value, such as two, may result in the algorithm assigning a few outliers to their own cluster. When working with two-dimensional data, a value of four is a common rule of thumb.

The epsilon parameter, on the other hand, refers to the maximum distance allowed between two points that belong to the same cluster; in other words, the maximum distance at which two points are classified as neighbors. An optimal epsilon value is unique to its respective dataset. While there are known methods to assist in finding the optimal value, in this scenario it was found through trial and error.
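One common heuristic worth sketching (not the approach used here, where trial and error sufficed) is the k-distance graph: sort every point's distance to its k-th nearest neighbor and look for the knee of the resulting curve. Assuming scikit-learn and illustrative 2-D data standing in for the PCA output:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D data in place of the PCA-reduced city features.
rng = np.random.default_rng(0)
X = rng.random((30, 2))

# For min_samples = 4, compute each point's distance to its 4th nearest
# neighbor (k + 1 neighbors requested because each point's nearest
# neighbor is itself, at distance zero).
k = 4
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)
k_distances = np.sort(distances[:, k])

# Plotted in ascending order, these distances form a curve whose
# "knee" is a reasonable starting value for epsilon.
```

The knee roughly marks where points stop having dense neighborhoods, which is why it approximates a good epsilon.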

Upon running the DBSCAN algorithm with an epsilon of 0.02377 and 4 minimum samples, each data point was assigned a label: 0, 1, 2, or -1. A value of -1 indicated that the point was an outlier, while the other values indicated which of the three clusters the point was assigned to.
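In scikit-learn terms, this step reduces to a single `fit_predict` call. The sketch below reuses the parameter values described above, but on random placeholder data rather than the actual PCA output, so most points here may simply come back as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D data standing in for the PCA-reduced city features.
rng = np.random.default_rng(0)
X_2d = rng.random((30, 2))

# eps and min_samples mirror the values described in the text.
labels = DBSCAN(eps=0.02377, min_samples=4).fit_predict(X_2d)

# Each entry is a cluster label; -1 marks an outlier (noise).
print(set(labels))
```

Because this epsilon was tuned to the real dataset, the same value will not produce meaningful clusters on arbitrary data; the point is only the shape of the call and of its output.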

Aside from the high dimensionality of the original data, a visualization of the results revealed how K-Means may have struggled to cluster points in such close proximity. This is precisely why DBSCAN was the better candidate for the situation.

## Cluster Characterization

With cluster labels assigned to each city, there was still no information on the nature of these clusters. What made one different from another? Why were the cities put into these clusters? Fortunately, venue category frequencies were able to answer these questions. A data frame was created containing each city alongside its cluster label and venue category frequencies, and the city column was then dropped. After grouping by cluster label and aggregating the frequencies, the venue category occurrences in each cluster were easy to see. These results were visualized as a graph, and with a little filtering, a table displaying the most common venue categories by cluster was produced. Finally, each cluster was titled based on its respective characteristics.
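The drop-group-aggregate step can be sketched in pandas. The city names, category columns, and the choice of the mean as the aggregation are all illustrative assumptions, not the actual data or necessarily the exact aggregation used:

```python
import pandas as pd

# Hypothetical frame: one row per city with its cluster label and
# venue-category frequencies (all values are made up for illustration).
df = pd.DataFrame({
    "City": ["Frisco", "McKinney", "Conroe", "Georgetown"],
    "Cluster": [0, 0, 1, 1],
    "Coffee Shop": [0.20, 0.10, 0.05, 0.00],
    "Park": [0.05, 0.15, 0.30, 0.25],
})

# Drop the city column, then aggregate frequencies per cluster
# (here using the mean as one possible aggregation).
grouped = df.drop(columns="City").groupby("Cluster").mean()

# The most frequent category in each cluster hints at its character.
top = grouped.idxmax(axis=1)
print(top)
```

Sorting each cluster's row of `grouped` in descending order would likewise yield the "most common venue categories by cluster" table described above.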

(Note: the y-axis represents category frequency)