Deploying Convolutional Neural Networks on Microcontrollers — a TinyML Blog

Nathan Bailey
9 min read · Apr 17, 2024

Most recently, I have been following the TinyML Cookbook by Gian Marco Iodice. It is a great book, teaching TinyML through small practical exercises rather than heavy theory. I’ve always found theory-heavy textbooks hard to follow, but this one strikes just the right balance of theory and practice to keep you engaged.

Inspired by my journey into TinyML, I embarked on a project to train a neural network on the CIFAR10 dataset and deploy it on an Arduino BLE Sense, which contains a Cortex-M4 microcontroller. (The full code for this project can be found on GitHub.)

There are two main considerations when deploying neural networks on microcontrollers: the model's size and its RAM usage. Generally speaking, microcontrollers have two types of memory: program memory and data memory. Program memory is read-only (ROM) and stores the program and constant data. Data memory (RAM) is volatile and is used to store and read temporary data.

In inference mode, the model's weights and biases are constant, so they can be stored in ROM. However, the model's inputs and outputs, as well as the intermediate tensors of the hidden layers, are not known ahead of time and have to be stored in RAM. We must therefore be mindful of the size of the model as well as the intermediate tensors it produces. For example, a single quantized 32×32×3 int8 input image already takes up 3,072 bytes of RAM.

Training the Network

We start by creating a simple Keras model to train. I opted to implement a linear bottleneck block from the EtinyNet network. For more details on this network, I have a separate blog on the network’s architecture and a blog on its implementation in PyTorch. The custom code for the block can be found below. It contains a depthwise convolutional layer, followed by a pointwise convolutional layer and another depthwise convolutional layer.

import tensorflow as tf
from tensorflow import keras

class LinearBottleneckBlock(keras.layers.Layer):
    """Custom Linear Bottleneck Layer from EtinyNet."""
    def __init__(self, out_channels: int, kernel_size: int, padding: str = 'same', strides: int = 1, bias: bool = True) -> None:
        super().__init__()
        self.depthwise_conv_layer_a = keras.layers.DepthwiseConv2D(kernel_size=kernel_size, padding=padding, strides=strides, use_bias=bias)
        self.depthwise_a_batch_norm_layer = keras.layers.BatchNormalization()

        self.pointwise_layer = keras.layers.Conv2D(out_channels, kernel_size=1, padding='same', strides=1, use_bias=bias)
        self.pointwise_batch_norm = keras.layers.BatchNormalization()

        self.depthwise_conv_layer_b = keras.layers.DepthwiseConv2D(kernel_size=kernel_size, padding="same", strides=1, use_bias=bias)
        self.depthwise_b_batch_norm_layer = keras.layers.BatchNormalization()

        self.activation = keras.layers.Activation('relu')

    def call(self, input_tensor: tf.Tensor, training: bool = True) -> tf.Tensor:
        """Forward pass for the Linear Bottleneck Layer."""
        # First depthwise conv has no activation: the "linear" part of the bottleneck
        depthwise_result = self.depthwise_a_batch_norm_layer(self.depthwise_conv_layer_a(input_tensor), training=training)
        # Pointwise conv followed by a second depthwise conv, both with ReLU
        pointwise_result = self.activation(self.pointwise_batch_norm(self.pointwise_layer(depthwise_result), training=training))
        output = self.activation(self.depthwise_b_batch_norm_layer(self.depthwise_conv_layer_b(pointwise_result), training=training))
        return output

A network is created using this building block as seen below.

model = keras.Sequential([
    keras.layers.Conv2D(filters=16, kernel_size=5, padding='same', strides=1, input_shape=(32, 32, 3)),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    LinearBottleneckBlock(out_channels=32, kernel_size=3),
    LinearBottleneckBlock(out_channels=64, kernel_size=3),
    keras.layers.MaxPooling2D(pool_size=2),
    LinearBottleneckBlock(out_channels=64, kernel_size=3),
    LinearBottleneckBlock(out_channels=128, kernel_size=3),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(units=10, kernel_regularizer=keras.regularizers.L2(1e-3))
])

Conveniently, the CIFAR10 dataset is built into the Keras library, so we can easily use it by calling the appropriate function.

(train_images, train_labels), (val_images, val_labels) = keras.datasets.cifar10.load_data()
train_images = train_images / 255.0
val_images = val_images / 255.0

We train the network using the Adam optimizer with sparse categorical cross-entropy loss. We set the initial learning rate to 0.01 and use a learning rate scheduler to reduce it by a factor of 10 when the validation accuracy plateaus. We also use early stopping to end training once the validation accuracy has not improved for 8 epochs, which helps reduce overfitting.

loss_function = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
LEARNING_RATE = 0.01
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer, loss=loss_function, metrics=['accuracy'])

lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=5, verbose=1, min_lr=0, min_delta=0.001)
early_stopping = keras.callbacks.EarlyStopping(monitor='val_accuracy', mode='max', verbose=1, patience=8, min_delta=0.001)

# Log training metrics for TensorBoard (referenced by the callback list below)
tensorboard_cb = keras.callbacks.TensorBoard(log_dir='logs')

model.fit(
    train_images,
    train_labels,
    epochs=100,
    batch_size=32,
    verbose=2,
    validation_data=(val_images, val_labels),
    callbacks=[lr_scheduler, early_stopping, tensorboard_cb]
)
model.save('cifar_classifier')

The training graph below shows that we reach around 80% validation accuracy. There is a slight amount of overfitting. However, this project was more about the process of deploying the network to a microcontroller than about creating a perfect network, so this issue was set aside for now.

Accuracy vs Epoch
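For reference, a plot like the one above can be generated from the Keras fit history. This is a minimal sketch, assuming the return value of model.fit above was kept in a variable named history (not shown in the training code):

import matplotlib.pyplot as plt

# history = model.fit(...) captured from the training call above (assumption)
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()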

Preparing the Model for the Microcontroller

The next step is to prepare the model for deployment on the Arduino. For microcontrollers, we use the TensorFlow Lite Micro library, so we must convert our model to the TFLite format. Unlike some targets, such as the Google Coral Edge TPU, we are not required to quantize the model. However, doing so means the weights and intermediate tensors are stored as 8-bit integers rather than 32-bit floats, which considerably reduces the amount of ROM and RAM used.

We convert the model to TensorFlow Lite using the code below. As we are quantizing the model, we must provide a representative dataset so that the quantization parameters can be estimated accurately. We can do this by feeding in a subset of the training images.

from typing import Generator

cifar_ds = tf.data.Dataset.from_tensor_slices(train_images)

def representative_dataset_function() -> Generator[list, None, None]:
    """Create a representative dataset for TFLite conversion."""
    for input_value in cifar_ds.batch(1).take(100):
        i_value_fp32 = tf.cast(input_value, tf.float32)
        yield [i_value_fp32]

converter = tf.lite.TFLiteConverter.from_saved_model('cifar_classifier')
converter.representative_dataset = tf.lite.RepresentativeDataset(representative_dataset_function)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("cifar_classifier.tflite", "wb") as f:
    f.write(tflite_model)

We must also check the size of our model. We can do this by getting the size of the saved TFLite file and printing the result in KB. We got a model size of 61 KB, which will easily fit in the 1 MB of ROM on the Arduino.

import os

tflite_model_kb_size = os.path.getsize("cifar_classifier.tflite") / 1024
print(tflite_model_kb_size)

One important element to note here is that we have not used a softmax activation function in the final dense layer of our model. Instead, we use the raw outputs (logits) together with the from_logits=True option in our loss function. From experimentation, using a softmax activation with quantization breaks the model accuracy when we attempt to dequantize the output to find the final model prediction. This is most likely because the softmax function restricts the output values to the range (0, 1), which leaves very little resolution once the outputs are quantized to 8 bits.
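If probabilities are still wanted, the softmax can simply be applied on the host after dequantizing the output; the argmax, and therefore the predicted class, is unchanged. A minimal sketch, assuming output_fp32 holds the dequantized logits returned by the model:

# Softmax applied off-device, after dequantization (output_fp32 is assumed
# to be the dequantized logit tensor)
probabilities = tf.nn.softmax(output_fp32)
predicted_class = int(tf.argmax(probabilities))  # same argmax as the raw logits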

We can test the accuracy of our converted model by writing a simple function. It takes in data from our validation set, along with the quantization parameters, quantizes the input for the model, and dequantizes the output before checking whether the prediction matches the label. We then output an accuracy metric.

import numpy as np

def classify_sample_tflite(interpreter: tf.lite.Interpreter, input_d: dict, output_d: dict, i_scale: np.float32, o_scale: np.float32, i_zero_point: np.int32, o_zero_point: np.int32, input_data: np.ndarray) -> tf.Tensor:
    """Classify a single example with the TFLite interpreter."""
    input_data = input_data.reshape((1, 32, 32, 3))
    input_fp32 = tf.cast(input_data, tf.float32)
    # Quantize the input, run inference, then dequantize the output
    input_int8 = tf.cast((input_fp32 / i_scale) + i_zero_point, tf.int8)
    interpreter.set_tensor(input_d["index"], input_int8)
    interpreter.invoke()
    output_int8 = interpreter.get_tensor(output_d["index"])[0]
    output_fp32 = tf.convert_to_tensor((output_int8 - o_zero_point) * o_scale, dtype=tf.float32)
    return output_fp32

tflite_interpreter = tf.lite.Interpreter(model_content=tflite_model)
tflite_interpreter.allocate_tensors()

input_details = tflite_interpreter.get_input_details()[0]
output_details = tflite_interpreter.get_output_details()[0]

input_quantization_details = input_details["quantization_parameters"]
output_quantization_details = output_details["quantization_parameters"]
input_quant_scale = input_quantization_details['scales'][0]
output_quant_scale = output_quantization_details['scales'][0]
input_quant_zero_point = input_quantization_details['zero_points'][0]
output_quant_zero_point = output_quantization_details['zero_points'][0]

num_correct_examples = 0
for i_value, o_value in zip(val_images, val_labels):
    output = classify_sample_tflite(tflite_interpreter, input_details, output_details, input_quant_scale, output_quant_scale, input_quant_zero_point, output_quant_zero_point, i_value)
    if np.argmax(output) == o_value:
        num_correct_examples += 1

print(f'Accuracy: {num_correct_examples/len(list(val_images))}')

Accuracy: 0.7647

We achieved 76% accuracy, only 4 percentage points lower than our non-quantized model!

To make the model usable on the Arduino, we use the following terminal commands. They convert our TFLite model into a C header file, mark the data as const so that it will be placed in ROM, and align it on an 8-byte boundary.

xxd -i cifar_classifier.tflite > model.h 
sed -i 's/unsigned char/const unsigned char/g' model.h
sed -i 's/const/alignas(8) const/g' model.h
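If xxd and sed are not available (for example on Windows), an equivalent header can also be generated from Python. The sketch below is illustrative only; the write_model_header helper and its arguments are not part of the original project:

def write_model_header(tflite_path: str, header_path: str, array_name: str) -> None:
    """Write a TFLite flatbuffer as an 8-byte-aligned const C byte array."""
    with open(tflite_path, "rb") as f:
        model_bytes = f.read()
    lines = [f"alignas(8) const unsigned char {array_name}[] = {{"]
    for offset in range(0, len(model_bytes), 12):
        chunk = model_bytes[offset:offset + 12]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {array_name}_len = {len(model_bytes)};")
    with open(header_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

write_model_header("cifar_classifier.tflite", "model.h", "cifar_classifier_tflite")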

To test this model on the Arduino, we will also write one of the images to a header file so that it can be loaded onto the microcontroller. To do this, we need two extra functions: the first takes an array of int8 values (an input image) and outputs it as a string; the second takes this string and converts it into a C header file.

def array_to_str(data: np.ndarray) -> str:
    """Convert a numpy array of int8 values to comma-separated int values."""
    num_cols = 10
    val_string = ''
    for i, val in enumerate(data):
        val_string += str(val)
        if (i + 1) < len(data):
            val_string += ','
        if (i + 1) % num_cols == 0:
            val_string += '\n'
    return val_string

def generate_h_file(size: int, data: str, label: str) -> str:
    """Generate a C header containing the stringified numpy data."""
    str_out = 'int8_t g_test[] = '
    str_out += '\n{\n'
    str_out += f'{data}'
    str_out += '};\n'
    str_out += f'const int g_test_len = {size};\n'
    str_out += f'const int g_test_label = {label};\n'
    return str_out

To use these functions, we first load our validation images into a pandas DataFrame and select the images with label 6, which corresponds to the frog class (see below for an example).

import pandas as pd

imgs = list(zip(val_images, val_labels))
cols = ["Image", "Label"]
df = pd.DataFrame(imgs, columns=cols)
frog_samples = df[df['Label'] == 6]

(a very blurry) CIFAR10 Frog Image

We then loop through these examples, classifying each image and saving the first one that the model correctly predicts.

c_code = ""
for index, row in frog_samples.iterrows():
    i_value = np.asarray(row['Image'].tolist(), dtype=np.float32)
    o_value = np.asarray(row['Label'].tolist(), dtype=np.float32)
    o_pred_fp32 = classify_sample_tflite(tflite_interpreter, input_details, output_details, input_quant_scale, output_quant_scale, input_quant_zero_point, output_quant_zero_point, i_value)

    if np.argmax(o_pred_fp32) == o_value:
        # Quantize the image and stop at the first correctly classified frog
        i_value_int8 = ((i_value / input_quant_scale) + input_quant_zero_point).astype(np.int8)
        i_value_int8 = i_value_int8.ravel()
        val_str = array_to_str(i_value_int8)
        c_code = generate_h_file(i_value_int8.size, val_str, "6")
        break

with open('input_linear_blocks.h', 'w', encoding='utf-8') as file:
    file.write(c_code)

Deploying the Network on the Arduino BLE Sense

Now that we have successfully prepared our model for the Arduino, we can start to deploy it on the board.

We first must import the TensorFlow Lite Micro library. For the BLE Sense, this has been prepared by the author of the TinyML Cookbook and needs to be imported into the Arduino IDE: https://github.com/PacktPublishing/TinyML-Cookbook_2E/blob/main/ArduinoLibs/Arduino_TensorFlowLite.zip

A full guide on how this was prepared can be found here: https://github.com/PacktPublishing/TinyML-Cookbook_2E/blob/main/Docs/build_arduino_tflitemicro_lib.md

Once the library has been imported, we can include the necessary files.

#include <TensorFlowLite.h>
#include <tensorflow/lite/micro/all_ops_resolver.h>
#include <tensorflow/lite/micro/micro_interpreter.h>
#include <tensorflow/lite/micro/micro_log.h>
#include <tensorflow/lite/micro/system_setup.h>
#include <tensorflow/lite/schema/schema_generated.h>
#include "model_linear_blocks.h"
#include "input_linear_blocks.h"

Next, we create global variables for the model, TensorFlow interpreter, input and output tensors and the output quantization parameters.

const tflite::Model *model = nullptr;
tflite::MicroInterpreter *interpreter = nullptr;
TfLiteTensor *input = nullptr;
TfLiteTensor *output = nullptr;

float o_scale = 0.0f;
int32_t o_zero_point = 0;

We declare an area of memory (RAM) for the tensor arena. The tensor arena stores the input and output tensors, as well as the intermediate tensors from the hidden layers of the model. We can estimate its size from our model, but the best approach is to allocate a generously sized area, read the actual tensor arena usage back from the interpreter, and then adjust the arena size accordingly. For this model, we need 44 KB of RAM.

constexpr int tensor_arena_size = 44000;
uint8_t* tensor_arena;

This is allocated on the heap in the setup function and aligned on a 16-byte boundary for efficient memory access.

tensor_arena = new __attribute__((aligned(16))) uint8_t[tensor_arena_size];

In the setup function, we first load the model from the header file by calling the GetModel function.

model = tflite::GetModel(cifar_classifier_tflite);
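It is also common in TFLite Micro sketches (though not shown in the original code) to check that the model's flatbuffer schema version matches the version the library was built against. A minimal sketch of that check:

// Optional sanity check: bail out of setup() if the schema versions differ
if (model->version() != TFLITE_SCHEMA_VERSION) {
  MicroPrintf("Model schema version %d does not match supported version %d",
              static_cast<int>(model->version()), TFLITE_SCHEMA_VERSION);
  return;
}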

We next declare the operations that the model uses. Instead of declaring them individually, we could use the AllOpsResolver, which registers all the DNN operations supported by tflite-micro. However, by declaring only the operations the model actually uses via the MicroMutableOpResolver, we save program memory (ROM).

static tflite::MicroMutableOpResolver<7> resolver;
resolver.AddConv2D();
resolver.AddRelu();
resolver.AddDepthwiseConv2D();
resolver.AddMaxPool2D();
resolver.AddReshape();
resolver.AddFullyConnected();
resolver.AddMean();
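For comparison, the AllOpsResolver route mentioned above would replace the block above with a single declaration, at the cost of linking every supported kernel into program memory:

// Alternative: registers every built-in operator (simpler, but uses more ROM)
static tflite::AllOpsResolver resolver;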

We can then create the interpreter. From this, we can retrieve the input and output tensors as well as the output quantization parameters that we will need during model prediction.

static tflite::MicroInterpreter static_interpreter(
model,
resolver,
tensor_arena,
tensor_arena_size
);
interpreter = &static_interpreter;
interpreter->AllocateTensors();
input = interpreter->input(0);
output = interpreter->output(0);
const auto *o_quant = reinterpret_cast<TfLiteAffineQuantization*>(output->quantization.params);
o_scale = o_quant->scale->data[0];
o_zero_point = o_quant->zero_point->data[0];

As mentioned above, we can read out the actual tensor arena size used by the model. This is achieved by calling arena_used_bytes() and printing the result to the serial console.

Serial.println(interpreter->arena_used_bytes());
07:40:34.375 -> 43212

Lastly, in the loop function, we copy the input image into the input tensor, invoke the interpreter and read the output tensor.

std::memcpy(tflite::GetTensorData<int8_t>(input), g_test, g_test_len);
interpreter->Invoke();
int32_t ix_max = 0;
float pb_max = 0;
int8_t* out_val = tflite::GetTensorData<int8_t>(output);

We find the maximum dequantized value output by the model, which is our prediction, and print the winning class index to the serial console.

for (int32_t ix = 0; ix < 10; ix++) {
  int8_t o_val = out_val[ix];
  // Dequantize the logit before comparing
  float pb = ((float) o_val - o_zero_point) * o_scale;
  if (pb > pb_max) {
    ix_max = ix;
    pb_max = pb;
  }
}
Serial.println(ix_max);

As can be seen, the board outputs 6, which is the correct and expected label, showing that our model runs correctly on the Arduino!

07:40:34.757 -> 6
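To make the serial output more readable, the index could also be mapped to the standard CIFAR10 class names (index 6 is frog). A small sketch, using a kCifar10Labels lookup table that is not part of the original code:

// Standard CIFAR10 class ordering; index 6 corresponds to "frog"
static const char* kCifar10Labels[10] = {
  "airplane", "automobile", "bird", "cat", "deer",
  "dog", "frog", "horse", "ship", "truck"
};
Serial.println(kCifar10Labels[ix_max]);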

Conclusion

In this blog, we demonstrated how to create a custom Keras network to classify samples from the CIFAR10 dataset. We then converted it to TFLite and successfully deployed it on an Arduino BLE Sense Rev 2.

The full code for this project can be found on GitHub.
