Finding the Limits of TinyML — Deploying EtinyNet on a Cortex-M4

Nathan Bailey
Apr 18, 2024


In my previous blog, I implemented a CNN trained on the CIFAR10 dataset and deployed it on an Arduino BLE Sense, a microcontroller board built around a Cortex-M4 CPU. The board has 1MB of flash memory, but the CNN I deployed used only 61KB of it. I wanted to see how far I could push the boundaries of this device, so I decided to implement EtinyNet-0.75 [1] on it.

The full code for this project can be found on GitHub.

Creating the Network

I have already implemented EtinyNet in PyTorch. I could easily have reused this implementation and converted it to the TFLite format with a third-party converter. However, I wanted to improve my Keras knowledge, so I took up the challenge of implementing the 0.75 version of EtinyNet in Keras.

We first must create the building blocks for EtinyNet: the linear bottleneck and dense linear bottleneck blocks. Creating these blocks is very similar to PyTorch, as shown below. We simply inherit from the Keras Layer class and implement the call method.

import tensorflow as tf
from tensorflow import keras

class LinearBottleneckBlock(keras.layers.Layer):
    """Custom Linear Bottleneck Layer from EtinyNet."""
    def __init__(self, out_channels: int, kernel_size: int, padding: str = 'same', strides: int = 1, bias: bool = True) -> None:
        super().__init__()
        self.depthwise_conv_layer_a = keras.layers.DepthwiseConv2D(kernel_size=kernel_size, padding=padding, strides=strides, use_bias=bias)
        self.depthwise_a_batch_norm_layer = keras.layers.BatchNormalization()

        self.pointwise_layer = keras.layers.Conv2D(out_channels, kernel_size=1, padding='same', strides=1, use_bias=bias)
        self.pointwise_batch_norm = keras.layers.BatchNormalization()

        self.depthwise_conv_layer_b = keras.layers.DepthwiseConv2D(kernel_size=kernel_size, padding="same", strides=1, use_bias=bias)
        self.depthwise_b_batch_norm_layer = keras.layers.BatchNormalization()

        self.activation = keras.layers.Activation('relu')

    def call(self, input_tensor: tf.Tensor, training: bool = True) -> tf.Tensor:
        """Forward Pass for the Linear Bottleneck Layer."""
        depthwise_result = self.depthwise_a_batch_norm_layer(self.depthwise_conv_layer_a(input_tensor), training=training)
        pointwise_result = self.activation(self.pointwise_batch_norm(self.pointwise_layer(depthwise_result), training=training))
        output = self.activation(self.depthwise_b_batch_norm_layer(self.depthwise_conv_layer_b(pointwise_result), training=training))
        return output


class DenseLinearBottleneckBlock(keras.layers.Layer):
    """Custom Dense Linear Bottleneck Layer from EtinyNet."""
    def __init__(self, out_channels: int, kernel_size: int, padding: str = 'same', strides: int = 1, downsample: bool = False, bias: bool = True) -> None:
        super().__init__()
        self.depthwise_conv_layer_a = keras.layers.DepthwiseConv2D(kernel_size=kernel_size, padding=padding, strides=strides, use_bias=bias)
        self.depthwise_a_batch_norm_layer = keras.layers.BatchNormalization()

        self.pointwise_layer = keras.layers.Conv2D(out_channels, kernel_size=1, padding='same', strides=1, use_bias=bias)
        self.pointwise_batch_norm = keras.layers.BatchNormalization()

        self.depthwise_conv_layer_b = keras.layers.DepthwiseConv2D(kernel_size=kernel_size, padding="same", strides=1, use_bias=bias)
        self.depthwise_b_batch_norm_layer = keras.layers.BatchNormalization()

        self.activation = keras.layers.Activation('relu')

        # Project the residual with a 1x1 convolution when the spatial size or
        # channel count changes, so it can be added to the block output.
        self.downsample_layers = None
        if downsample:
            self.downsample_layers = keras.Sequential(
                [
                    keras.layers.Conv2D(out_channels, kernel_size=1, padding='same', strides=strides, use_bias=True),
                    keras.layers.BatchNormalization()
                ]
            )

    def call(self, input_tensor: tf.Tensor, training: bool = True) -> tf.Tensor:
        """Forward Pass for the Dense Linear Bottleneck Layer."""
        residual = input_tensor
        depthwise_a_result = self.depthwise_a_batch_norm_layer(self.depthwise_conv_layer_a(input_tensor), training=training)
        pointwise_result = self.activation(self.pointwise_batch_norm(self.pointwise_layer(depthwise_a_result), training=training))
        depthwise_b_result = self.depthwise_b_batch_norm_layer(self.depthwise_conv_layer_b(pointwise_result), training=training)
        if self.downsample_layers:
            residual = self.downsample_layers(input_tensor, training=training)
        output = self.activation(residual + depthwise_b_result)
        return output

I originally wanted to implement the network in a manner analogous to PyTorch, that is, subclassing the Keras Model class to create a custom network. This is a valid implementation and the code for it can be found in the GitHub project. However, looking at the Keras implementations of MobileNet-v2 and ResNet, a simple function-style approach seems to be common for implementing deep neural networks, so I settled on this.

We first create a simple function to stack the linear or dense linear bottleneck blocks together. This function takes in a list of block configurations, creates the blocks and passes data through them. Taking in a list of block configurations allows the structure of the network to be changed easily, which is very useful if we need to reduce the size of the network (spoiler alert). The stride is selected in the same way as in the original paper: for the first block in the stack we select a stride of 2 to downsample the data; otherwise, we select a stride of 1 to keep the same spatial dimensions. When a dense linear bottleneck block is used, we insert downsample layers where needed to ensure that the input to the block can be correctly added to its output. Downsample layers are needed when the block has a stride of 2 or produces a different number of feature maps from its input.

from typing import Type, Union

def stack_linearwise_block(data: tf.Tensor, block_type: Type[Union[LinearBottleneckBlock, DenseLinearBottleneckBlock]], block_config: list[dict], initial_in_channels: int) -> tf.Tensor:
    """Stack Linear or Dense blocks together, pass through an input and return output."""
    in_channels = initial_in_channels
    for idx, config in enumerate(block_config):
        out_channels = config.get("out_channels")
        if not out_channels:
            raise KeyError('key out_channels not found in block config')

        padding = 'same'
        # The first block in a stack downsamples with a stride of 2.
        stride = 2 if idx == 0 else 1

        extra_args = {}
        if block_type == DenseLinearBottleneckBlock:
            # The residual needs projecting when the spatial size or channel count changes.
            extra_args = {'downsample': (stride == 2 or out_channels != in_channels)}

        block = block_type(out_channels=out_channels, kernel_size=3, strides=stride, padding=padding, **extra_args)
        in_channels = out_channels
        data = block(data)

    return data

A second function stacks these stacks of blocks together. This is a simple function that creates the stacks of linear blocks and then passes data through them.

def create_stack(data: tf.Tensor, block_info: list[dict], initial_in_channels: int) -> tf.Tensor:
    """Stack blocks of blocks together, pass through an input and return output."""
    in_channels = initial_in_channels
    for block in block_info:
        block_type = block.get('block_type')
        if not block_type:
            block_type = "lb"

        layer_values = block.get('layer_values')
        if not layer_values:
            raise KeyError('key layer_values not found in block config')

        data = stack_linearwise_block(data, block_type=LinearBottleneckBlock if block_type == 'lb' else DenseLinearBottleneckBlock, block_config=layer_values, initial_in_channels=in_channels)

        in_channels = layer_values[-1].get('out_channels')
        if not in_channels:
            raise KeyError('key out_channels not found in block config')
    return data

We create the EtinyNet network with a final function. This creates the starting and final layers, and invokes the previous function to create the stacked blocks. We return an instance of a Keras functional model.

def create_etinynet_model(i_shape: tuple, block_info: list[dict], initial_in_channels: int, output_units: int) -> keras.Model:
    """Create an EtinyNet model and return it."""
    input_data = keras.layers.Input(shape=i_shape)

    out = keras.layers.Conv2D(filters=initial_in_channels, kernel_size=3, strides=2)(input_data)
    out = keras.layers.BatchNormalization()(out)
    out = keras.layers.Activation('relu')(out)

    out = create_stack(out, block_info=block_info, initial_in_channels=initial_in_channels)

    out = keras.layers.GlobalAveragePooling2D()(out)
    out = keras.layers.Dropout(rate=0.4)(out)
    out = keras.layers.Dense(units=output_units)(out)

    model = keras.Model(inputs=input_data, outputs=out)
    return model

Using a list of dictionaries, we can specify the config for our EtinyNet-0.75 network. This is passed to the function to create the network.

etinynet_block_info = [
    {
        "block_type": "lb",
        "layer_values": [{"out_channels": 24} for _ in range(4)]
    },
    {
        "block_type": "lb",
        "layer_values": [{"out_channels": 96} for _ in range(4)]
    },
    {
        "block_type": "dlb",
        "layer_values": [{"out_channels": 168} for _ in range(3)]
    },
    {
        "block_type": "dlb",
        "layer_values": [{"out_channels": 192} for _ in range(2)] + [{"out_channels": 384}]
    }
]

i_shape = (224, 224, 3)
model = create_etinynet_model(i_shape, block_info=etinynet_block_info, initial_in_channels=24, output_units=num_train_classes)
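As a quick sanity check, we can print a summary and confirm the parameter count; the EtinyNet paper reports roughly 680K parameters for the 0.75-width variant, which lines up with the ~681KB INT8 model we obtain later. A minimal check, assuming num_train_classes is 200 for Tiny-ImageNet:

# Confirm the parameter count is in the right ballpark for EtinyNet-0.75
# (roughly 680K parameters with 200 output classes).
model.summary(expand_nested=True)
print(f'Total parameters: {model.count_params():,}')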

Training the Network

We train the network on the Tiny-ImageNet dataset, which can be downloaded from: http://cs231n.stanford.edu/tiny-imagenet-200.zip. We use the Keras image_dataset_from_directory to build the training and validation datasets.

BATCH_SIZE = 128
train_dataset = keras.preprocessing.image_dataset_from_directory(
    'tiny-imagenet-200/train',
    labels='inferred',
    color_mode='rgb',
    batch_size=BATCH_SIZE,
    image_size=(224, 224),
    interpolation="bilinear",
    shuffle=True,
    seed=123,
)
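The validation dataset is built the same way. A minimal sketch, assuming the val folder has been reorganized into one sub-directory per class (the raw download keeps all validation images in val/images, with labels in val_annotations.txt):

# Assumes 'tiny-imagenet-200/val' has been restructured into per-class
# sub-folders before loading.
valid_dataset = keras.preprocessing.image_dataset_from_directory(
    'tiny-imagenet-200/val',
    labels='inferred',
    color_mode='rgb',
    batch_size=BATCH_SIZE,
    image_size=(224, 224),
    interpolation="bilinear",
    shuffle=False,
)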

To ensure that our data is in the same format as in our PyTorch implementation, we build the following preprocessing pipeline.

First, we randomly flip the images horizontally, then scale the data to the range (0, 1). Finally, we normalize the data, passing the training dataset through a Normalization layer to calculate its mean and standard deviation.

rescale_layer = keras.layers.Rescaling(1./255)
rescaled_train_dataset = train_dataset.map(lambda x, y: (rescale_layer(x), y))
rescaled_train_dataset_data = rescaled_train_dataset.map(lambda data, _ : data)

augment_layer = keras.layers.RandomFlip('horizontal')
norm_layer = keras.layers.Normalization()
norm_layer.adapt(rescaled_train_dataset_data)

train_dataset = train_dataset.map(lambda x, y: (augment_layer(x), y))
train_dataset = train_dataset.map(lambda x, y: (rescale_layer(x), y))
train_dataset = train_dataset.map(lambda x, y: (norm_layer(x), y))
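The validation dataset receives the same rescaling and normalization, but no augmentation. A minimal sketch following the same pattern:

# Validation data is rescaled and normalized with the statistics adapted on
# the training set, but not augmented.
valid_dataset = valid_dataset.map(lambda x, y: (rescale_layer(x), y))
valid_dataset = valid_dataset.map(lambda x, y: (norm_layer(x), y))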

We train the network in the same way as the PyTorch implementation: an SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. We use a learning rate scheduler to drop the learning rate by a factor of 0.1 when the validation loss plateaus, and employ early stopping to reduce overfitting.

loss_function = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
LEARNING_RATE = 0.1
optimizer = keras.optimizers.SGD(learning_rate=LEARNING_RATE, momentum=0.9, weight_decay=1e-4)
model.compile(optimizer=optimizer, loss=loss_function, metrics=[keras.metrics.SparseCategoricalAccuracy(name='Top-1 Accuracy'), keras.metrics.SparseTopKCategoricalAccuracy(k=5, name='Top-5 Accuracy')])

lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1, min_lr=1e-7, min_delta=1e-4)
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', verbose=1, patience=10)
tensorboard_cb = keras.callbacks.TensorBoard(log_dir='logs')  # assumed definition, used below to record training curves

model.fit(
    train_dataset,
    epochs=1000,
    validation_data=valid_dataset,
    callbacks=[lr_scheduler, early_stopping, tensorboard_cb],
    verbose=2
)

As seen in the graphs below, we achieve similar results to the EtinyNet-1.0 PyTorch implementation, an encouraging sign that the network is implemented correctly in Keras.

Figure: PyTorch implementation of EtinyNet-1.0
Figure: Keras implementation of EtinyNet-0.75

Preparing the Network for the Arduino

To prepare the network for deployment on the Arduino, we first convert it to TensorFlow Lite and quantize it to INT8 using the code below. To quantize the model, we must provide a representative dataset so that the quantization parameters can be correctly estimated. We take 1000 samples of the training dataset as our representative dataset.

from typing import Generator

def representative_dataset_function() -> Generator[list, None, None]:
    """Create a representative dataset for TFLite Conversion."""
    # normalized_train_dataset_data is the preprocessed training data with
    # the labels stripped off.
    for input_value in normalized_train_dataset_data.rebatch(1).take(1000):
        i_value_fp32 = tf.cast(input_value, tf.float32)
        yield [i_value_fp32]

converter = tf.lite.TFLiteConverter.from_saved_model('etinynet')
converter.representative_dataset = tf.lite.RepresentativeDataset(representative_dataset_function)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("etinynet_int8.tflite", "wb") as f:
    f.write(tflite_model)

We also output the size of the model using the following code. Our model is 681KB, over ten times the size of our CIFAR10 classifier!

import os

tflite_model_kb_size = os.path.getsize("etinynet_int8.tflite") / 1024
print(tflite_model_kb_size)
681.1171875

To evaluate this quantized model, we build the following function. It takes a sample and the quantization parameters from our converted model, quantizes the sample, runs it through the model and returns the dequantized prediction. Looping over the validation dataset, we can then calculate top-1 and top-5 accuracy scores. We achieve a top-1 score of 49% and a top-5 score of 76%, showing no noticeable loss from converting and quantizing the model.

import numpy as np

def classify_sample_tflite(interpreter: tf.lite.Interpreter, input_d: dict, output_d: dict, i_scale: np.float32, o_scale: np.float32, i_zero_point: np.int32, o_zero_point: np.int32, input_data: tf.Tensor) -> tf.Tensor:
    """Classify an example in TFLite."""
    input_data = tf.reshape(input_data, (1, 48, 48, 3))
    input_fp32 = tf.cast(input_data, tf.float32)
    # Quantize: q = x / scale + zero_point
    input_int8 = tf.cast((input_fp32 / i_scale) + i_zero_point, tf.int8)
    interpreter.set_tensor(input_d["index"], input_int8)
    interpreter.invoke()
    output_int8 = interpreter.get_tensor(output_d["index"])[0]
    # Dequantize: x = (q - zero_point) * scale
    output_fp32 = tf.convert_to_tensor((output_int8 - o_zero_point) * o_scale, dtype=tf.float32)
    return output_fp32

tflite_interpreter = tf.lite.Interpreter(model_content=tflite_model)
tflite_interpreter.allocate_tensors()

input_details = tflite_interpreter.get_input_details()[0]
output_details = tflite_interpreter.get_output_details()[0]

input_quantization_details = input_details["quantization_parameters"]
output_quantization_details = output_details["quantization_parameters"]
input_quant_scale = input_quantization_details['scales'][0]
output_quant_scale = output_quantization_details['scales'][0]
input_quant_zero_point = input_quantization_details['zero_points'][0]
output_quant_zero_point = output_quantization_details['zero_points'][0]

num_correct_examples = 0
num_examples = 0
num_correct_examples_top_5 = 0
for i_value, o_value in valid_dataset.unbatch():
    output = classify_sample_tflite(tflite_interpreter, input_details, output_details, input_quant_scale, output_quant_scale, input_quant_zero_point, output_quant_zero_point, i_value)
    if tf.cast(tf.math.argmax(output), tf.int32) == o_value:
        num_correct_examples += 1
    if tf.math.in_top_k(tf.expand_dims(o_value, axis=0), tf.expand_dims(output, axis=0), 5).numpy()[0]:
        num_correct_examples_top_5 += 1
    num_examples += 1

print(f'Top-1 Accuracy: {num_correct_examples/num_examples}')
print(f'Top-5 Accuracy: {num_correct_examples_top_5/num_examples}')
Top-1 Accuracy: 0.4891
Top-5 Accuracy: 0.7607

To deploy the model on the Arduino, we need to save it in a format the device can understand. To do this, we run the following terminal commands, which take our TensorFlow Lite model and convert it to a C header file. We mark the array as const so that it will be placed in ROM, and align it on an 8-byte boundary.

xxd -i etinynet_int8.tflite > model.h 
sed -i 's/unsigned char/const unsigned char/g' model.h
sed -i 's/const/alignas(8) const/g' model.h
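The top of the generated model.h then looks something like this (the byte values shown are illustrative; the array length matches the 681KB file size reported earlier):

alignas(8) const unsigned char etinynet_int8_tflite[] = {
  0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, /* ...truncated... */
};
unsigned int etinynet_int8_tflite_len = 697464;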

We also need a test image for the Arduino. We create this using two functions: the first takes an input image (an array of 8-bit values) and outputs it as a string; the second wraps this string in a C header format.

def array_to_str(data: np.ndarray) -> str:
    """Convert numpy array of int8 values to comma separated int values."""
    num_cols = 10
    val_string = ''
    for i, val in enumerate(data):
        val_string += str(val)
        if (i + 1) < len(data):
            val_string += ','
        if (i + 1) % num_cols == 0:
            val_string += '\n'
    return val_string

def generate_h_file(size: int, data: str, label: str) -> str:
    """Generate a c header with the string numpy data."""
    str_out = 'int8_t g_test[] = '
    str_out += '\n{\n'
    str_out += f'{data}'
    str_out += '};\n'
    str_out += f'const int g_test_len = {size};\n'
    str_out += f'const int g_test_label = {label};\n'
    return str_out

We select a subset of validation images that have the label 115, pass them through the model, and save the first one that is correctly predicted as a C header file.

filtered_valid_dataset = valid_dataset.unbatch().filter(lambda _, y: y == 115)
c_code = ""
for i_value, o_value in filtered_valid_dataset:
    o_pred_fp32 = classify_sample_tflite(tflite_interpreter, input_details, output_details, input_quant_scale, output_quant_scale, input_quant_zero_point, output_quant_zero_point, i_value)
    if tf.cast(tf.math.argmax(o_pred_fp32), tf.int32) == o_value:
        i_value_int8 = tf.cast(((i_value / input_quant_scale) + input_quant_zero_point), tf.int8).numpy()
        i_value_int8 = i_value_int8.ravel()
        val_str = array_to_str(i_value_int8)
        c_code = generate_h_file(i_value_int8.size, val_str, "115")
        break  # keep the first correctly predicted sample

with open('input_imagenet.h', 'w', encoding='utf-8') as file:
    file.write(c_code)

Deploying the Network on the Arduino

As before, to deploy the network on the Arduino we first must import the TensorFlow Lite Micro libraries. For the BLE Sense, these have been prepared by Gian Marco Iodice, author of the TinyML Cookbook, and need to be imported into the Arduino IDE: https://github.com/PacktPublishing/TinyML-Cookbook_2E/blob/main/ArduinoLibs/Arduino_TensorFlowLite.zip

Once the libraries have been imported, we can include the required files.

#include <TensorFlowLite.h>
#include <tensorflow/lite/micro/all_ops_resolver.h>
#include <tensorflow/lite/micro/micro_interpreter.h>
#include <tensorflow/lite/micro/micro_log.h>
#include <tensorflow/lite/micro/system_setup.h>
#include <tensorflow/lite/schema/schema_generated.h>
#include "model.h"
#include "input_imagenet.h"

Next, we declare global variables for the model, interpreter, quantization parameters and the input and output tensors.

const tflite::Model *model = nullptr;
tflite::MicroInterpreter *interpreter = nullptr;
TfLiteTensor *input = nullptr;
TfLiteTensor *output = nullptr;
float o_scale = 0.0f;
int32_t o_zero_point = 0;

We must specify an amount of RAM called the tensor arena. This space stores the input and output tensors, as well as the intermediate tensors from the hidden layers of the model. We can estimate the required size by inspecting the model, but the most reliable approach is to start with a value large enough for the program to run, and then query the interpreter for how much arena space was actually used.

constexpr int tensor_arena_size = 63000;
uint8_t* tensor_arena;

We allocate this memory in the setup function, using the new keyword so that it is allocated on the heap. We also align it on a 16-byte boundary for efficient memory access.

tensor_arena = new __attribute__((aligned(16))) uint8_t[tensor_arena_size];

The next step in the setup function is to load the model. We do this by calling the GetModel function, passing in the name of the model array from the header file.

model = tflite::GetModel(etinynet_int8_tflite);

We then declare a MicroMutableOpResolver object, which the interpreter uses to register and access the operations used by the model. We could have used an AllOpsResolver instead, which registers all the DNN operations supported by tflite-micro. However, it is more memory-efficient (it saves program memory) to register only the operations we actually use.

static tflite::MicroMutableOpResolver<7> resolver;
resolver.AddConv2D();
resolver.AddRelu();
resolver.AddDepthwiseConv2D();
resolver.AddMaxPool2D();
resolver.AddReshape();
resolver.AddFullyConnected();
resolver.AddMean();

Next, we declare our TFLite interpreter, passing in the model, resolver and tensor arena. After allocating the tensors, we can retrieve the input and output tensors, as well as the quantization parameters for the output of our model.

static tflite::MicroInterpreter static_interpreter(
    model,
    resolver,
    tensor_arena,
    tensor_arena_size
);
interpreter = &static_interpreter;

// Allocate memory from the tensor arena for the model's tensors before
// accessing the input and output tensors.
interpreter->AllocateTensors();

input = interpreter->input(0);
output = interpreter->output(0);

const auto *o_quant = reinterpret_cast<TfLiteAffineQuantization*>(output->quantization.params);
o_scale = o_quant->scale->data[0];
o_zero_point = o_quant->zero_point->data[0];

As stated above, once we have instantiated the interpreter, we can find out how much tensor arena space is actually used and print the value to the serial monitor.

Serial.println(interpreter->arena_used_bytes());

In our loop function, we copy the input image into the input tensor, invoke the interpreter and retrieve the output. We then loop through the output values to find the predicted class.

std::memcpy(tflite::GetTensorData<int8_t>(input), g_test, g_test_len);
interpreter->Invoke();

int32_t ix_max = 0;
float pb_max = 0;
int8_t* out_val = tflite::GetTensorData<int8_t>(output);
// Tiny-ImageNet has 200 classes, so the output tensor holds 200 values.
for (int32_t ix = 0; ix < 200; ix++) {
    int8_t o_val = out_val[ix];
    // Dequantize the logit before comparing.
    float pb = ((float) o_val - o_zero_point) * o_scale;
    if (pb > pb_max) {
        ix_max = ix;
        pb_max = pb;
    }
}
Serial.println(ix_max);

Issues Encountered

As I expected, the program did not work the first time.

The first issue encountered was that the model did not fit in program memory: we used 102% of the available flash storage. To deal with this, I removed one of the final dense linear blocks, which enabled the model to fit in flash, using 96% of it.

Sketch uses 1007376 bytes (102%) of program storage space. Maximum is 983040 bytes.
Global variables use 197080 bytes (75%) of dynamic memory, leaving 65064 bytes for local variables. Maximum is 262144 bytes.

The next issue was that the program still would not run; it would get stuck in the setup function. I narrowed this down to two issues. First, the RAM utilization of the model was far too high. Edge Impulse, which can be used to estimate the RAM usage of models on various devices, estimated the RAM usage to be over 500KB, and we only have 256KB of total RAM.

As mentioned above, part of the RAM is used to store the values of the intermediate tensors of the model. Compared to our CIFAR10 model, we use input images of size 224x224 rather than 32x32, and we produce a greater number of feature maps. Both of these characteristics lead to much larger intermediate tensors, far too large for this device.
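Some back-of-the-envelope arithmetic illustrates the problem; a rough sketch that ignores arena bookkeeping and TFLite-Micro's buffer reuse:

# Rough size of a single INT8 activation tensor (one byte per element).
# With a 224x224 input, the stem convolution (stride 2) produces feature
# maps of roughly 112x112 with 24 channels.
def int8_tensor_kb(height: int, width: int, channels: int) -> float:
    return height * width * channels / 1024

print(int8_tensor_kb(112, 112, 24))  # ~294KB, already larger than the 256KB of total RAM
print(int8_tensor_kb(24, 24, 24))    # ~13.5KB for the same tensor with a 48x48 input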

It was very hard to debug what was going on in the Arduino program. The first indication of a RAM issue was that the reported tensor arena usage was 6KB, far too small for a model of this size. Secondly, the program got stuck around the point where it was allocating tensors and copying data into the input tensor, again pointing to a RAM issue.

To solve this problem, I reduced the input size of our images to 48x48. This reduced the RAM requirements of the model enough for it to fit within the constraints of the device.

Still, the model could not run. I traced this problem to the use of the dense linear blocks. After removing these and replacing them with linear blocks, the model ran and output the correct prediction.

20:01:59.143 -> 115
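For reference, the configuration that finally ran on the device would look something like the following sketch; the exact configuration is in the GitHub project:

# Final on-device configuration (illustrative): dense linear blocks swapped
# for linear blocks, and the final 384-channel block removed to fit in flash.
etinynet_block_info = [
    {"block_type": "lb", "layer_values": [{"out_channels": 24} for _ in range(4)]},
    {"block_type": "lb", "layer_values": [{"out_channels": 96} for _ in range(4)]},
    {"block_type": "lb", "layer_values": [{"out_channels": 168} for _ in range(3)]},
    {"block_type": "lb", "layer_values": [{"out_channels": 192} for _ in range(2)]},
]

model = create_etinynet_model((48, 48, 3), block_info=etinynet_block_info, initial_in_channels=24, output_units=200)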

The main issue with these changes was that reducing the input image size to 48x48 resulted in a huge drop in validation accuracy, as can be seen by comparing the training graphs below. To combat this, I employed transfer learning: I initially trained the model with 224x224 inputs and then slowly reduced the spatial dimensions of the input images, retraining the model after each reduction. The code segment below shows this in action. This was an original piece of work that I had not seen in any research; it builds on top of commonly used transfer learning approaches.

def train_model_input_size(spatial_image_size: tuple[int], input_model: keras.Model | None = None) -> keras.Model:
    """Train a model with different input size after reading in previous weights."""
    train_dataset, valid_dataset, _ = create_dataset(batch_size=BATCH_SIZE, image_size=spatial_image_size)
    i_shape = tuple(list(spatial_image_size) + [3])

    new_model = create_etinynet_model(i_shape, block_info=etinynet_block_info, initial_in_channels=24, output_units=200)
    # The architecture is fully convolutional up to the classifier, so the
    # weights are independent of the input resolution and can be copied over.
    if input_model:
        new_model.set_weights(input_model.get_weights())
    new_model.summary(expand_nested=True)

    new_model = train_model(keras_model=new_model, learning_rate=0.1, t_dataset=train_dataset, v_dataset=valid_dataset, epochs=100)
    new_model.save('etinynet_' + str(spatial_image_size[0]))
    return new_model

input_tuples = [(224, 224), (112, 112), (96, 96), (64, 64), (48, 48)]
model = None
for input_tuple in input_tuples:
    model = train_model_input_size(input_tuple, input_model=model)

This resulted in a much less drastic drop in validation accuracy. If we train on 48x48 inputs from the start, we get a top-5 accuracy of 52%. With this transfer learning approach, we instead get a top-5 accuracy of 65%, a 13 percentage point increase!

Figure: accuracy graphs for 224x224 input size
Figure: accuracy graphs for 48x48 input size
Figure: accuracy graphs when slowly reducing the input size to 48x48

Conclusions

In this blog, I talked through the process of creating and deploying a modified EtinyNet-0.75 network on an Arduino BLE Sense microcontroller. It was a challenging project, with its ups and downs, but it was very rewarding to work through them and successfully deploy the model.

The full code for this project can be found on GitHub.
