K-Means Clustering is one of the most well-known and commonly used clustering algorithms in Machine Learning. Specifically, it is an unsupervised Machine Learning algorithm, meaning that it is trained without the need for ground-truth labels. Indeed, all you have to do to use it is set the number of desired clusters K, initialize the K centroids, and run the algorithm to obtain the cluster assignments.
The beauty of K-Means lies in its simplicity: all it really does is compute the distances between points and group centers, giving it linear complexity in the number of data points, O(n), per iteration. That's perfectly fine for most datasets, as long as you aren't processing millions of data points.
But that’s where we run into a problem: K-Means is slow when it comes to bigger datasets as there are just so many data points to compare. What’s worse is that the most popular implementation of K-Means clustering, that of Scikit-Learn, isn’t very well optimized.
But not to worry. That’s where our new friend Faiss comes in!
Faiss is a library for fast similarity search and clustering. It was created by Facebook AI Research (FAIR) and makes intelligent use of vectorized (SIMD) instructions, as well as concurrency across CPU cores, to speed up the computations. What's more, it has a GPU component for even more speed on bigger datasets.
With all of that magic happening behind the scenes, Faiss still comes with a Python interface, which makes for very easy coding. In fact, once you get into it, you’ll find that the layout of the code looks incredibly similar to that of Scikit-Learn.
The easiest way to install Faiss is by using conda. First, create your conda environment and activate it:
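For example (the environment name `faiss_env` and the Python version here are just placeholders; use whatever you prefer):

```shell
# Create a fresh conda environment and activate it
conda create -n faiss_env python=3.9
conda activate faiss_env
```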
Then you can install the Faiss library within the environment. We'll also install scikit-learn, since we'll be comparing the speeds of the two, as well as TensorFlow (which bundles Keras), since that's where we'll load our dataset from.
Below I installed the CPU version of Faiss, but they also have a GPU version if you’re interested. If you want to do a very fancy installation from source, you can always check out the install documentation.
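A CPU-only install might look like the following (assuming the `pytorch` conda channel, which hosts the official Faiss packages):

```shell
# Install the CPU build of Faiss (use faiss-gpu for the GPU build)
conda install -c pytorch faiss-cpu

# scikit-learn for the speed comparison, TensorFlow for the dataset
pip install scikit-learn tensorflow
```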
Great! Now we’re ready to get going and make our K-Means super fast!
The first thing we’re going to do is run the Scikit-Learn implementation. That’s to establish the common benchmark to compare against. For your reference, the machine I’m running these tests on has the following specs:
- Intel i7-8700K CPU
- 32 GB of DDR4 3000MHz RAM
The first thing we have to do is handle our imports and load the data.
We’re going to be using the classical MNIST data. MNIST is a dataset consisting of 60,000 training images and 10,000 test images. Each image is 28×28 pixels and contains a digit (one of 0 to 9) in white pixels on a black background. It is commonly used as a quick benchmark in Machine Learning — challenging enough to be meaningful but small enough to not need a ton of computing power.
Check out the code below to see how we’ve done it. We grab the data directly from Keras, reshape it, and normalize the input to be floating-point type and have values between 0 and 1.
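A minimal sketch of that loading step (assuming TensorFlow's bundled Keras datasets):

```python
import numpy as np
from tensorflow.keras.datasets import mnist

# Load the 60,000 training and 10,000 test images
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each 28x28 image into a 784-dimensional vector,
# convert to float32, and scale pixel values to [0, 1]
x_train = x_train.reshape(60000, 784).astype(np.float32) / 255.0
x_test = x_test.reshape(10000, 784).astype(np.float32) / 255.0
```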
The next thing we’re going to do is run our scikit-learn benchmark for K-Means. Check out the code below to see how we’ve done it.
We’ve set K-Means to have 10 clusters because that’s how many class labels (the digits) our dataset has. Then we ran the `fit()` function, which actually creates the clusters; we measure how long this takes as the “Training time”. Next come the predictions, which we execute on the test set using the `predict()` function; here we measure the total time it takes to predict on all samples as the “Prediction time”.
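A sketch of that benchmark, assuming the data has already been loaded and flattened as above (the exact script isn't shown here):

```python
import time

import numpy as np
from sklearn.cluster import KMeans
from tensorflow.keras.datasets import mnist

# Load and preprocess MNIST as before
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype(np.float32) / 255.0
x_test = x_test.reshape(10000, 784).astype(np.float32) / 255.0

# One cluster per digit class
kmeans = KMeans(n_clusters=10)

# Training time: fitting the clusters on the 60,000 training images
start = time.time()
kmeans.fit(x_train)
print("Training time: {:.2f}s".format(time.time() - start))

# Prediction time: assigning all 10,000 test images to a cluster
start = time.time()
preds = kmeans.predict(x_test)
print("Prediction time: {:.4f}s".format(time.time() - start))
```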
Nice!
In the end, the training on those 60,000 images took 21.51 seconds while the total prediction time on 10,000 images was 0.0203 seconds.
Now we’re going to see how we can do the same clustering with Faiss, but of course much faster on the CPU. The code is similar to that of scikit-learn with just a few different variable names.
Check out the code below to see how we’ve done it. Once again, we’ve set the number of clusters to 10. We also had to set two variables: `niter`, which is equivalent to `max_iter` from scikit-learn, and `nredo`, which is equivalent to `n_init` from scikit-learn. Also, instead of `fit()` the function for training is called `train()`, and instead of `predict()` the function for predicting is called `search()`.
Once again, we measure the time for both training and predictions. In this case with the Faiss library, the training on those 60,000 images took 2.33 seconds while the total prediction time was 0.0112 seconds. That’s a speedup of almost 10X on training and almost 2X on prediction!
No, not always.
In all of the tests I did on smaller datasets such as Boston, iris, and wine, there was little to no improvement at all. I reckon it’s because Faiss takes its time to set up and vectorize all of the data, so the improvements are only realized on the bigger datasets.
The Faiss team also has further recommendations if you want to run on GPU, and they again point to larger datasets. This certainly makes sense, since there is always a time cost to transferring data from CPU to GPU. Thus, generally speaking, leveraging the GPU only really gets you speedups on very large datasets (i.e., millions of samples).
I’d highly recommend checking out the Faiss Wiki to learn more about how to use the library. The library is incredibly well optimized for similarity search and retrieval (i.e., k-nearest-neighbor search), and there are loads of examples to get you started.