## The number of clusters as one of the input parameters

When we discussed the k-means algorithm, we saw that we had to give the number of clusters as one of the input parameters. In the real world, we won’t have this information available. We can sweep the parameter space and use the silhouette coefficient score to find the optimal number of clusters, but this is an expensive process. A method that estimates the number of clusters directly from the data would be an excellent solution to the problem, and DBSCAN does just that for us.

We will perform a DBSCAN analysis using the **sklearn.cluster.DBSCAN** class. We will use the same data that we used in the previous section, Evaluating the performance of clustering algorithms (`data_perf.txt`), which can be downloaded from https://github.com/appyavi/Dataset
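Before working with the real data, it may help to see what DBSCAN actually returns. The following is a minimal sketch on synthetic data (the blobs, the isolated point, and the names `X_demo`, `n_clusters`, and `n_noise` are assumptions made for illustration, not part of the recipe). It shows that the fitted model exposes one label per point in `labels_`, that noise points get the label -1, and that the number of clusters can be read from the distinct labels:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (an assumption for illustration, not data_perf.txt):
# three tight blobs plus one isolated point far from all of them
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
blobs = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in centers])
X_demo = np.vstack([blobs, [[50.0, 50.0]]])

# eps is the neighborhood radius; min_samples is the number of points
# (including the point itself) needed within eps to form a dense region
model = DBSCAN(eps=1.0, min_samples=5).fit(X_demo)
labels = model.labels_

# DBSCAN marks noise with the label -1, which is not a cluster
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if n_noise > 0 else 0)

print("Estimated number of clusters:", n_clusters)
print("Points labeled as noise:", n_noise)
```

Note that we never told the model how many clusters to look for. The result depends instead on `eps` and `min_samples`, which is why the recipe searches for a good epsilon.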

Let’s see how to automatically estimate the number of clusters using the DBSCAN algorithm:

1. Import the necessary packages:

    from itertools import cycle
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn import metrics
    import matplotlib.pyplot as plt

2. Load the input data from the data_perf.txt file.

    # Load data: one comma-separated datapoint per line
    input_file = 'data_perf.txt'
    x = []
    with open(input_file, 'r') as f:
        for line in f:
            data = [float(i) for i in line.split(',')]
            x.append(data)
    X = np.array(x)

3. We need to find the best value of the epsilon parameter, so let’s initialize a few variables:

    # Find the best epsilon
    eps_grid = np.linspace(0.3, 1.2, num=10)
    silhouette_scores = []
    eps_best = eps_grid[0]
    silhouette_score_max = -1
    model_best = None
    labels_best = None

4. Let’s sweep the parameter space:

    for eps in eps_grid:
        # Train DBSCAN clustering model
        model = DBSCAN(eps=eps, min_samples=5).fit(X)
        # Extract labels
        labels = model.labels_

5. For each iteration, we need to extract the performance metric:

        # Extract performance metric
        silhouette_score = round(metrics.silhouette_score(X, labels), 4)
        silhouette_scores.append(silhouette_score)
        print("Epsilon:", eps, " --> silhouette score:", silhouette_score)

6. We need to store the best score and its associated epsilon value:

        if silhouette_score > silhouette_score_max:
            silhouette_score_max = silhouette_score
            eps_best = eps
            model_best = model
            labels_best = labels

Output:

    Epsilon: 0.3 --> silhouette score: 0.1287
    Epsilon: 0.39999999999999997 --> silhouette score: 0.3594
    Epsilon: 0.5 --> silhouette score: 0.5134
    Epsilon: 0.6 --> silhouette score: 0.6165
    Epsilon: 0.7 --> silhouette score: 0.6322
    Epsilon: 0.7999999999999999 --> silhouette score: 0.6366
    Epsilon: 0.8999999999999999 --> silhouette score: 0.5142
    Epsilon: 1.0 --> silhouette score: 0.5629
    Epsilon: 1.0999999999999999 --> silhouette score: 0.5629
    Epsilon: 1.2 --> silhouette score: 0.5629
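The long decimals in this output, such as 0.39999999999999997, are ordinary floating-point artifacts of `np.linspace`, not distinct parameter values. As a small sketch that uses only the scores printed above (the names `best_index` and `eps_best` here are local to the example), we can read off the best epsilon with `np.argmax` and format it to one decimal place so that the report is easier to read:

```python
import numpy as np

# Epsilon grid from the recipe and the silhouette scores copied from the output
eps_grid = np.linspace(0.3, 1.2, num=10)
silhouette_scores = [0.1287, 0.3594, 0.5134, 0.6165, 0.6322,
                     0.6366, 0.5142, 0.5629, 0.5629, 0.5629]

# Index of the highest score; ties resolve to the first occurrence
best_index = int(np.argmax(silhouette_scores))
eps_best = eps_grid[best_index]

# Format the float for display instead of printing its full representation
print(f"Best epsilon = {eps_best:.1f} (silhouette score {silhouette_scores[best_index]})")
```

The scores rise steadily up to an epsilon of about 0.8 and then drop, so the sweep settles on that value.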

7. Let’s now plot the bar graph, as follows:

    # Plot silhouette scores vs epsilon
    plt.figure()
    plt.bar(eps_grid, silhouette_scores, width=0.05, color='k', align='center')
    plt.title('Silhouette score vs epsilon')
    plt.show()
    # Best params
    print("Best epsilon =", eps_best)
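The sweep in the preceding steps can also be gathered into a single function. The following is a sketch under stated assumptions rather than the recipe's own code: `sweep_eps` is a hypothetical helper (it is not part of scikit-learn), and the synthetic `X_demo` stands in for `data_perf.txt`. It also adds one guard that the step-by-step version does not have. `metrics.silhouette_score` raises a `ValueError` when a labeling contains fewer than two distinct labels, which can happen when an epsilon is so small that every point is noise or so large that everything merges into one cluster, so such values are skipped:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics


def sweep_eps(X, eps_grid, min_samples=5):
    """Return (best_eps, best_score, best_labels) over eps_grid.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    best_eps, best_score, best_labels = None, -1.0, None
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
        # silhouette_score needs at least 2 distinct labels; skip otherwise
        if len(set(labels)) < 2:
            continue
        score = metrics.silhouette_score(X, labels)
        if score > best_score:
            best_eps, best_score, best_labels = eps, score, labels
    return best_eps, best_score, best_labels


# Synthetic stand-in for data_perf.txt: three tight, well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in centers])

eps_best, score_best, labels_best = sweep_eps(X_demo, np.linspace(0.3, 1.2, num=10))
print("Best epsilon =", eps_best)
print("Best silhouette score =", round(score_best, 4))
```

As in the recipe, noise points (label -1) are passed to the silhouette computation as if they formed a cluster of their own. This keeps the sketch close to the original steps, but it is a simplification to keep in mind when comparing scores.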