We all know that GPUs are faster than CPUs when it comes to machine learning. Over the last few years, industry giants such as Nvidia and ARM have been developing new chips to optimize machine learning tensor (matrix) operations, which brings us to Tensor Processing Units (TPUs). You have probably come across them while doing a Kaggle competition or when using an online GPU provider such as Google Colab.
In this article, I want to explain why TPUs are faster than GPUs: what enables them to perform quicker tensor operations, making them the fastest ML engines.
To put things into perspective, we have to realize that each chip is optimized for what it was built for. It is probably more of a coincidence that GPUs are quicker than CPUs at tasks like machine learning and cryptocurrency mining, given that GPUs were originally built for graphics rendering (hence Graphics Processing Units). Before we start diving into TPUs, it is worth explaining why GPUs are quicker than CPUs.
GPUs vs CPUs
Even though GPUs typically have smaller cores compared to CPUs, they have many more of them. These cores contain arithmetic logic units (ALUs), control units, and memory caches, which allow GPUs to perform large amounts of mathematical operations in parallel. Those ALUs were included to allow quick geometric calculations, which is what gives games a high number of frames per second.
Bear in mind that CPUs do of course have some advantages over GPUs, just not when it comes to machine learning. One example is access to memory: a GPU typically has access to only 8 GB, maybe 16 GB, of memory, while a CPU can easily address more than that (depending on your RAM). Transfers between the CPU and RAM are also much quicker than transfers to/from the GPU (though this matters mostly for frequent small operations, not long-running ones like training a model).
Now back to our original comparison: CPUs were built to handle several different tasks at once, like running operating system and kernel operations, rather than one complex task. GPUs, however, were built to do mathematical operations as quickly as possible, since rendering graphics is built entirely on those simple mathematical operations.
The good news is that all of those geometric 3D rendering operations are tensor operations: texture calculations, RGB rendering, and the like are essentially matrix arithmetic. Moreover, all of those operations are done in floating-point formats, which makes this hardware ideal for machine learning. In fact, the de facto measure of ML hardware performance is floating-point operations per second (FLOPS).
Okay enough about GPUs vs CPUs, let’s dive into TPUs.
TPUs vs GPUs
Although both TPUs and GPUs perform tensor operations, TPUs are more oriented toward the large tensor operations that are frequent in neural network training than toward 3D graphics rendering. If you are a systems enthusiast like me, that won't be enough for you; you still want to find out more about the details!
A TPU v2 core is made up of two units: a Matrix Multiply Unit (MXU), which runs the matrix multiplications, and a Vector Processing Unit (VPU), which handles all other tasks such as activations, softmax, etc. The VPU computes in float32 and int32; the MXU, on the other hand, operates in a mixed-precision 16/32-bit floating-point format. On the software side, the framework switches between bfloat16 and float32 operations (where 16 and 32 are the number of bits) so that developers don't need to change their code. Naturally, bfloat16 uses half the memory of float32 but is less precise.
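To make the precision trade-off concrete, here is a minimal sketch in plain Python. It relies on the fact that bfloat16 is essentially float32 with the lower 16 bits of the mantissa dropped (real hardware rounds rather than truncates, so this is an approximation):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate bfloat16 by truncating a float32 to its top 16 bits.

    bfloat16 keeps float32's 8 exponent bits (same dynamic range) but
    only 7 mantissa bits, so values lose precision, not range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159265))  # 3.140625 -- only ~3 significant digits survive
```

Note how pi collapses to 3.140625: that loss is usually harmless for gradients, which is why the MXU can afford bfloat16 inputs.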
Another interesting concept used to speed up tensor operations is the so-called "systolic array" ("systolic" refers to the rhythmic contraction and release of a stream, a term borrowed from medicine). A systolic array performs the dot products of a matrix multiplication on a single TPU core, instead of spreading them in parallel across multiple GPU cores, using a grid of multiply-accumulators that take bfloat16 inputs and accumulate the results in float32.
The end result is that the TPU systolic array architecture has a significant density and power advantage, as well as a non-negligible speed advantage, over a GPU when computing matrix multiplications.
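As a rough mental model (a toy sketch, not how the MXU is actually wired), an output-stationary systolic array keeps one accumulator per output element and streams the operands past it, one multiply-accumulate per beat:

```python
def systolic_matmul(A, B):
    """Toy simulation of an output-stationary systolic array.

    Each cell (i, j) holds an accumulator; on beat t the streamed
    operands A[i][t] and B[t][j] meet at the cell and are
    multiply-accumulated in place. A real MXU performs every cell's
    MAC simultaneously in hardware; this loop just replays the schedule.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for t in range(k):                        # one "beat" of the array
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j]  # multiply-accumulate (MAC)
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

The key property is that operands flow through the grid and are reused by neighboring cells, so no intermediate value ever has to round-trip through memory.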
Another good point to note here is that when you use Colab's or Kaggle's TPU, you aren't using only one TPU core; you are actually using quite a few. The gradients are usually exchanged between TPU cores using the "all-reduce" algorithm.
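Here is a minimal sketch of what "all-reduce" means, semantics only; real TPU pods exchange gradients over a dedicated interconnect rather than in one Python loop:

```python
def all_reduce(per_core_grads):
    """Every core ends up with the elementwise sum of all cores' gradients.

    per_core_grads: one gradient list per core. After the all-reduce,
    each core applies the same summed gradient, which keeps the model
    replicas in sync during data-parallel training.
    """
    summed = [sum(g) for g in zip(*per_core_grads)]
    return [summed[:] for _ in per_core_grads]

# 4 cores, each holding its own gradient for 2 parameters:
print(all_reduce([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]))
```

Every core receives the identical summed gradient, so the eight (or more) model replicas never drift apart.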
The final bit I want to talk about that makes TPUs perform better than GPUs is quantization. Quantization is the process of approximating a continuous value using a fixed set of discrete levels between two limits. It is heavily used to compress floating-point calculations by converting continuous numbers to discrete ones. This is quite interesting, and you can find more about it here.
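For a concrete picture, here is a hedged sketch of affine 8-bit quantization, one common scheme; the function names and the 8-bit choice are illustrative, not a specific TPU API:

```python
def quantize(values, bits=8):
    """Map continuous floats onto 2**bits evenly spaced discrete levels.

    Returns the integer codes plus the (scale, offset) needed to recover
    approximate floats. The error is at most half a step (scale / 2).
    """
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0   # avoid div-by-zero if all values equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Invert quantize(): recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]

codes, scale, lo = quantize([0.0, 0.37, 1.0])
print(codes)  # [0, 94, 255] -- each float replaced by an 8-bit integer
print(dequantize(codes, scale, lo))
```

Doing arithmetic on 8-bit integers instead of 32-bit floats is what lets quantized hardware pack far more multiply-accumulators into the same silicon area.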
Final thoughts and takeaway
I hope this article gave some context about why chips like Apple's Neural Engine and Colab's TPUs perform so well on ML model training. Even more innovation goes into ML chips when those models have to run on mobile devices, which I want to write an article about. We are used to seeing innovation happen on the model side of things, but we forget that those models need these chips to run on.