Optimizing Neural Networks: Weight Clustering Explained
Recently at work, I was introduced to an optimization technique for neural networks called clustering. It works by clustering the weights of a neural network layer using the k-means algorithm. In doing so, it reduces the storage needed for the model and makes it far more compressible, which in turn shortens transfer times [1]. For my own learning, I have summarized this optimization method and included a small example for reference.
Clustering is best explained with an example. Let's take a classic dense layer from a feed-forward network with 16 weights, each currently stored as a 32-bit float. First, we apply the k-means algorithm to the weights with a user-chosen number of clusters; let us use 4 here. K-means finds 4 distinct clusters and assigns each weight to the nearest one. Each cluster has a centroid (the mean of the values assigned to it), which is itself a 32-bit floating-point number.
Next, each weight is mapped to its cluster: instead of its original 32-bit value, we store only the 2-bit index of the cluster it belongs to, together with the small lookup table of centroids. The complete process can be seen in the diagram below.
As seen, we started with 16 32-bit floats and have reduced that to 4 32-bit floats plus 16 2-bit integers. Storage drops from 512 bits to 160 bits, roughly a 3x reduction! Keeping the number of clusters fixed, the savings grow even more dramatically as we scale up to wider layers. This makes it feasible to store models on small TinyML devices.
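To make the arithmetic concrete, here is a minimal sketch of the idea using scikit-learn's KMeans on a made-up 16-weight layer (the worked example later uses tfmot; the library choice and the values here are only for illustration):
import numpy as np
from sklearn.cluster import KMeans

# A toy "layer" of 16 weights stored as 32-bit floats (values made up for illustration).
weights = np.random.randn(16).astype(np.float32)

# Step 1: run k-means with 4 clusters over the weight values.
kmeans = KMeans(n_clusters=4, n_init=10).fit(weights.reshape(-1, 1))
centroids = kmeans.cluster_centers_.flatten()   # 4 x 32-bit centroids (the lookup table)
indices = kmeans.labels_                        # 16 cluster indices, each fits in 2 bits

# Step 2: the layer is now just the centroid table plus one small index per weight.
original_bits = 16 * 32                 # 512 bits
clustered_bits = 4 * 32 + 16 * 2        # 160 bits
print(f"{original_bits} bits -> {clustered_bits} bits")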
In addition, specialized hardware can use clustering to improve inference speed and memory footprint [1]. Clustering also makes the model more compressible, e.g. when compressing a tflite model as a zip file. This is because the stored index values are heavily repeated, which compression tools exploit far more effectively than raw floats [2]. The result is shorter transfer times when moving the model around.
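As a rough, self-contained illustration of this effect (separate from the tfmot example later, and using a simple nearest-centroid assignment rather than full k-means), compare how well zlib compresses raw random floats versus a clustered representation of the same weights:
import zlib
import numpy as np

rng = np.random.default_rng(0)
raw = rng.standard_normal(10_000).astype(np.float32)   # stand-in for a layer's weights

# Assign each weight to the nearest of 16 evenly spaced centroids.
centroids = np.linspace(raw.min(), raw.max(), 16).astype(np.float32)
indices = np.abs(raw[:, None] - centroids[None, :]).argmin(axis=1).astype(np.uint8)

print(len(zlib.compress(raw.tobytes())))                             # random floats barely compress
print(len(zlib.compress(centroids.tobytes() + indices.tobytes())))   # repeated index values compress well
On my understanding of the arithmetic, the clustered representation ends up several times smaller once compressed, mirroring the zip results in the worked example below.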
One question you might have is how backpropagation changes for a clustered model. For each cluster centroid, we sum the gradients of all the weights assigned to that cluster; this is shown in the equation below. There is also an additional step during the forward pass, where the weight value must be looked up from the centroid table using the stored index [3].
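Following the weight-sharing formulation in [3], with W_{ij} denoting a weight, I_{ij} its stored cluster index, and c_k the k-th centroid (notation mine), the centroid gradient is:
\frac{\partial L}{\partial c_k} \;=\; \sum_{i,j} \frac{\partial L}{\partial W_{ij}} \, \mathbb{1}\left(I_{ij} = k\right)
where \mathbb{1}(\cdot) is the indicator function, so each centroid simply accumulates the gradients of the weights assigned to it.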
Worked Example (adapted from [4])
We can cluster a model using the TensorFlow Model Optimization library (tfmot). First, let us define our model.
We will use a simple model for the MNIST dataset: a single convolutional layer followed by max pooling, a flatten layer, and a dense output layer.
import os
import zipfile

import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_model_optimization as tfmot

(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
train_images = train_images / 255.0
test_images = test_images / 255.0
model = keras.Sequential(
    [
        keras.layers.Reshape(target_shape=(28, 28, 1), input_shape=(28, 28)),
        keras.layers.Conv2D(filters=16, kernel_size=3, activation="relu"),
        keras.layers.MaxPooling2D(pool_size=2),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation="softmax")
    ]
)
We first train this model to establish a baseline accuracy.
model.compile(
    optimizer='sgd',
    loss="sparse_categorical_crossentropy",
    metrics=['accuracy']
)
model.summary()
model.fit(
    train_images,
    train_labels,
    validation_split=0.2,
    epochs=10
)
test_loss, test_accuracy = model.evaluate(
    test_images,
    test_labels
)
print(f'Baseline Test Loss: {test_loss}')
print(f'Baseline Test Accuracy: {test_accuracy}')
keras.models.save_model(model, 'baseline_model.h5', include_optimizer=False)
Then we cluster the model by invoking the cluster_weights function from tfmot and passing the required parameters.
We must provide the number of clusters (in this case 16) and the k-means centroid initialization scheme. I used KMEANS_PLUS_PLUS, which initializes the centroids with the k-means++ algorithm.
clustered_model = tfmot.clustering.keras.cluster_weights(
    model,
    number_of_clusters=16,
    cluster_centroids_init=tfmot.clustering.keras.CentroidInitialization.KMEANS_PLUS_PLUS
)
As an aside, an additional parameter, cluster_per_channel, can also be passed. This clusters a convolutional kernel channel by channel, rather than treating the whole kernel as one set of weights; a sketch is shown below.
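As a sketch of how that might look (hedged: cluster_per_channel is only available in newer tfmot releases, and I have not verified it against every version), the parameters can be collected into a dict and passed through:
clustering_params = {
    'number_of_clusters': 16,
    'cluster_centroids_init': tfmot.clustering.keras.CentroidInitialization.KMEANS_PLUS_PLUS,
    'cluster_per_channel': True,  # assumed available: clusters each Conv2D channel separately
}
per_channel_clustered_model = tfmot.clustering.keras.cluster_weights(model, **clustering_params)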
Returning to our clustered model, we compile and fine-tune it as before, this time with a much lower learning rate so the centroids are only nudged slightly. The test results show that clustering causes no meaningful reduction in accuracy or loss.
clustered_model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=1e-5),
    loss="sparse_categorical_crossentropy",
    metrics=['accuracy']
)
clustered_model.summary()
clustered_model.fit(
    train_images,
    train_labels,
    validation_split=0.2,
    epochs=10
)
clustered_model_test_loss, clustered_model_test_accuracy = clustered_model.evaluate(
    test_images,
    test_labels
)
print(f'Clustered Model Test Loss: {clustered_model_test_loss}')
print(f'Clustered Model Test Accuracy: {clustered_model_test_accuracy}')
Baseline Test Loss: 0.1412425935268402
Baseline Test Accuracy: 0.9602000117301941
...
Clustered Model Test Loss: 0.13942870497703552
Clustered Model Test Accuracy: 0.9599999785423279
Before we use the model, we must call the strip_clustering method on it. This removes all the wrappers and variables that were only needed for clustering during training, restoring the original model; the only difference is that the weights are now clustered. We then save this stripped model for the size comparison below.
final_model = tfmot.clustering.keras.strip_clustering(clustered_model)
keras.models.save_model(final_model, 'clustered_model.h5', include_optimizer=False)
A small snippet can be written which prints the number of unique values in each kernel. Only two layers (the Conv2D and the Dense layer) have kernels, so we expect two lines of output. We can see 16 unique values per kernel, indicating that clustering has been applied successfully.
for layer in final_model.layers:
    for weight in layer.weights:
        if 'kernel:0' in weight.name:
            print(f"Number of clusters in weight {layer.name}/{weight.name} is {len(np.unique(weight))}")
Number of clusters in weight conv2d/conv2d/kernel:0 is 16
Number of clusters in weight dense/kernel:0 is 16
Finally, we convert the model to the TensorFlow Lite format and zip each saved model. We can see that clustering gives roughly a 5x reduction in compressed model size!
final_model_tflite = tf.lite.TFLiteConverter.from_keras_model(final_model).convert()
with open('clustered_model.tflite', 'wb') as f:
    f.write(final_model_tflite)

def get_compressed_model_size(file, zipped_file):
    with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
        f.write(file)
    return os.path.getsize(zipped_file)
print(f"Size of zipped baseline keras model: {get_compressed_model_size('baseline_model.h5', 'baseline_model_keras_zipped'):.2f}")
print(f"Size of zipped clustered keras model: {get_compressed_model_size('clustered_model.h5', 'clustered_model_keras_zipped'):.2f}")
print(f"Size of zipped clustered tflite model: {get_compressed_model_size('clustered_model.tflite', 'clustered_model_tflite_zipped'):.2f}")
Size of zipped baseline keras model: 103336.00
Size of zipped clustered keras model: 20173.00
Size of zipped clustered tflite model: 19989.00
The full code can be found on GitHub.
In conclusion, we can see that clustering is an effective method for reducing the model size. It can be combined with other techniques such as pruning to enable greater reductions in size.
[1] https://blog.tensorflow.org/2020/08/tensorflow-model-optimization-toolkit-weight-clustering-api.html
[2] https://www.researchgate.net/publication/357618410_The_Effect_of_Model_Compression_on_Fairness_in_Facial_Expression_Recognition/fulltext/61d65d2cda5d105e551fdcad/The-Effect-of-Model-Compression-on-Fairness-in-Facial-Expression-Recognition.pdf
[3] https://arxiv.org/abs/1510.00149
[4] https://www.tensorflow.org/model_optimization/guide/clustering/clustering_example