Use quantization aware training from TensorFlow’s Model Optimization Toolkit to create models that are four times smaller and do not suffer a drop in results.
I recently wrote an article on different TensorFlow libraries, and one of them was TensorFlow’s Model Optimization Toolkit.
The Model Optimization Toolkit provides pruning, quantization, and weight clustering techniques to reduce the size and latency of models. Quantization can be performed both during and after training and converts the models to use 8-bit integers instead of 32-bit floating-point values. However, quantization is a lossy process: TFLite models are quantized, due to which they are not as accurate as the original models. Quantization aware training solves this issue. During training it simulates converting the weights to int8 and then back to 32-bit float, so the quantization error acts like noise that forces the model to learn to cope with it.
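To build some intuition for that "noise", here is a rough illustration of simulated ("fake") quantization, not the toolkit's actual implementation: the float weights are mapped onto 256 integer levels and straight back, and the rounding error that remains is exactly what the model learns to tolerate during quantization aware training.

import numpy as np

def fake_quantize(weights, num_bits=8):
    # Map the float range onto 2**num_bits integer levels, then back to floats
    levels = 2 ** num_bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / levels
    return np.round((weights - w_min) / scale) * scale + w_min

weights = np.random.uniform(-1, 1, size=(3, 3)).astype(np.float32)
print(fake_quantize(weights) - weights)  # the residual rounding "noise"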
So in the rest of the article, this is what we will do: generate TFLite models both with and without quantization aware training, and then compare them based on their sizes and accuracies.
- Requirements
- Creating quantization aware models
- Converting them to TfLite
- Results
The TensorFlow Model Optimization Toolkit needs to be installed alongside the regular TensorFlow distribution. Both can be pip installed using the following statements:
pip install tensorflow
pip install -q tensorflow-model-optimization
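To verify that both packages are available, a quick import check (nothing here is specific to this article) can be run:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

print('TensorFlow version:', tf.__version__)
print('Model optimization toolkit loaded from:', tfmot.__file__)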
To use quantization aware training, the model needs to be wrapped with tfmot.quantization.keras.quantize_model. The whole model can be wrapped, or only the specific layers you want. It is suggested to train the model first and then fine-tune the wrapped model; otherwise, the model does not perform very well. I will only discuss the minimum required in this article, but this post can be referred to for a detailed readthrough.
Create a simple model using Keras in TensorFlow with either the Sequential API or the functional Model API. Below, I have given an example of a straightforward model created for the MNIST dataset using the Model API and trained it for 20 epochs.
import tensorflow as tf

# Load MNIST and scale the images to the [0, 1] range
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

inp = tf.keras.layers.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(64, kernel_size=(3, 3), padding='same', activation='relu')(inp)
x = tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu')(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Conv2D(16, kernel_size=(3, 3), padding='same', activation='relu')(x)
x = tf.keras.layers.Dropout(0.25)(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(10)(x)

model = tf.keras.models.Model(inputs=inp, outputs=x)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=20, validation_split=0.1, batch_size=500)
To convert this model to use quantization aware training:
import tensorflow_model_optimization as tfmot
quantize_model = tfmot.quantization.keras.quantize_model
q_aware_model = quantize_model(model)

# Fine-tune on a slice of the training data (the subset size is a free choice)
train_images_subset = train_images[0:1000]
train_labels_subset = train_labels[0:1000]

q_aware_model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])
q_history = q_aware_model.fit(train_images_subset, train_labels_subset,
                              batch_size=500, epochs=20, validation_split=0.1)
Here’s how their histories look:
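The two fit() calls above return History objects (captured as history and q_history), so if you want to reproduce the comparison, the validation curves can be plotted with matplotlib along these lines:

import matplotlib.pyplot as plt

plt.plot(history.history['val_accuracy'], label='baseline')
plt.plot(q_history.history['val_accuracy'], label='quantization aware')
plt.xlabel('epoch')
plt.ylabel('validation accuracy')
plt.legend()
plt.show()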
Also, comparing their accuracies on the test set, this is what the results look like.
Without Quantization: 98.930%
With Quantization: 99.000%
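These numbers come from evaluating both Keras models on the MNIST test set, which can be reproduced along these lines:

_, baseline_acc = model.evaluate(test_images, test_labels, verbose=0)
_, q_aware_acc = q_aware_model.evaluate(test_images, test_labels, verbose=0)

print('Without Quantization: %.3f%%' % (baseline_acc * 100))
print('With Quantization: %.3f%%' % (q_aware_acc * 100))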
We get a performance improvement after quantization! Although this won’t always be the case, the important thing to note is that we did not suffer a drop in performance. However, there is an issue with this technique: many layers are not supported, including even some basic ones like batch normalization, add, concatenate, global average pooling, etc. Instead of converting the whole model, only some layers can be quantized, which also means that you can skip quantizing the layers that reduce the accuracy the most.
It is also suggested in that TensorFlow article that it is better to quantize the later layers instead of the first layers and to avoid quantizing critical layers like attention mechanisms. Let’s see how we might approach the code if we only wanted to quantize certain layers rather than the whole model. (The code section below is copied straight from this TensorFlow article; a Dense-only variant is sketched right after it.)
import tensorflow_model_optimization as tfmot
quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer

model = tf.keras.Sequential([
...
# Only annotated layers will be quantized.
quantize_annotate_layer(Conv2D()),
quantize_annotate_layer(ReLU()),
Dense(),
...
])
# Quantize the model.
quantized_model = tfmot.quantization.keras.quantize_apply(model)
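If, like here, you already have a built and trained model and want to annotate only a particular kind of layer, say only the Dense layers, the same guide shows a pattern based on tf.keras.models.clone_model. A rough sketch of that approach (variable and function names are my own):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

def annotate_dense(layer):
    # Annotate only Dense layers; every other layer is returned unchanged
    if isinstance(layer, tf.keras.layers.Dense):
        return tfmot.quantization.keras.quantize_annotate_layer(layer)
    return layer

# clone_model runs annotate_dense over every layer of the trained model
annotated_model = tf.keras.models.clone_model(model, clone_function=annotate_dense)

# quantize_apply then makes only the annotated layers quantization aware
dense_q_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)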
For this article, the completely quantized model will be used.
To convert the models to TFLite, a TFLiteConverter needs to be created from each model. For the quantized model, the optimizations flag needs to be set to tell the converter to use int8 values instead of floating-point ones.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()

# The baseline float model is converted without the optimization flag
converter = tf.lite.TFLiteConverter.from_keras_model(model)
baseline_tflite_model = converter.convert()
Saving these models and comparing their sizes shows that the quantized model is four times smaller than its float counterpart. If the optimization flag were also applied to the float model, its size would be comparable to the quantization aware model, but it would suffer a further loss in results compared to the float model.
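As a rough sketch of how that size comparison can be done (the file names here are just illustrative):

import os

with open('quantized_model.tflite', 'wb') as f:
    f.write(quantized_tflite_model)
with open('baseline_model.tflite', 'wb') as f:
    f.write(baseline_tflite_model)

print('Quantized model: %.2f KB' % (os.path.getsize('quantized_model.tflite') / 1024))
print('Baseline model: %.2f KB' % (os.path.getsize('baseline_model.tflite') / 1024))

On accuracy, the three models compare as follows: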
Quant TFLite test_accuracy: 0.9901
Baseline TFLite test_accuracy: 0.9886
Quant TF test accuracy: 0.9900000095367432
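The TFLite accuracies take a little more work to measure, since a tf.lite.Interpreter has to be fed one image at a time. A minimal sketch of the kind of helper that produces numbers like the ones above (the function name is my own):

import numpy as np
import tensorflow as tf

def tflite_accuracy(tflite_model, images, labels):
    # Run the TFLite model over the test set one image at a time
    interpreter = tf.lite.Interpreter(model_content=tflite_model)
    interpreter.allocate_tensors()
    input_index = interpreter.get_input_details()[0]['index']
    output_index = interpreter.get_output_details()[0]['index']
    correct = 0
    for image, label in zip(images, labels):
        # Add batch and channel dimensions and cast to float32
        interpreter.set_tensor(input_index, np.expand_dims(image, axis=(0, -1)).astype(np.float32))
        interpreter.invoke()
        correct += int(np.argmax(interpreter.get_tensor(output_index)) == label)
    return correct / len(images)

print('Quant TFLite test_accuracy:', tflite_accuracy(quantized_tflite_model, test_images, test_labels))
print('Baseline TFLite test_accuracy:', tflite_accuracy(baseline_tflite_model, test_images, test_labels))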
TFLite models are extremely useful for edge applications, and if you are training a model from scratch with the intention of converting it to a TFLite model, then quantization aware training is the way to go. Not only will it resolve the issue of ending up with a noticeably lower accuracy than the base model, but the model will also be smaller.