TinyML: Deployment | TinyML Tutorial

Running a model on the device comes down to about five lines of inference code wrapped in a fair amount of setup. Get the setup right once and the rest follows.

The Firmware Structure

A TFLite Micro inference sketch has four responsibilities:

Load the model from flash
Allocate the tensor arena and run AllocateTensors()
On each inference: fill the input tensor, call Invoke(), read the output tensor
Act on the result (serial print, GPIO, BLE notification, etc.)

The gesture classifier sketch below covers all four.

Complete Gesture Classifier Sketch

// gesture_infer/gesture_infer.ino
#include <Arduino_LSM9DS1.h>   // Nano 33 BLE Sense Rev1 IMU; on Rev2 boards use <Arduino_BMI270_BMM150.h>
#include <TensorFlowLite.h>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "model_data.h"

// --- Configuration ---
const char* LABELS[]     = {"punch", "flex", "idle"};
const int   NUM_CLASSES  = 3;
const int   SAMPLE_RATE  = 100;   // Hz
const int   WINDOW_SIZE  = 100;   // samples
const int   NUM_FEATURES = WINDOW_SIZE * 3;  // aX, aY, aZ per sample

// Normalization parameters (must match training/mean.npy, training/std.npy)
// Replace these with your actual values from numpy output
const float FEATURE_MEAN[NUM_FEATURES] = { /* ... 300 floats ... */ };
const float FEATURE_STD[NUM_FEATURES]  = { /* ... 300 floats ... */ };

// --- TFLite setup ---
constexpr int kTensorArenaSize = 8 * 1024;
alignas(16) uint8_t tensor_arena[kTensorArenaSize];

const tflite::Model*       tfl_model       = nullptr;
tflite::MicroInterpreter*  interpreter     = nullptr;
TfLiteTensor*              input_tensor    = nullptr;
TfLiteTensor*              output_tensor   = nullptr;

// Quantization parameters (read at setup)
float in_scale  = 1.0f;
int   in_zp     = 0;
float out_scale = 1.0f;
int   out_zp    = 0;

// --- Sample buffer ---
float sample_buffer[NUM_FEATURES];
int   buffer_pos = 0;
bool  buffer_ready = false;

void setup() {
  Serial.begin(115200);
  while (!Serial);

  // Init IMU
  if (!IMU.begin()) {
    Serial.println("IMU init failed");
    while (true);
  }

  // Load model
  tfl_model = tflite::GetModel(g_gesture_model_data);
  if (tfl_model->version() != TFLITE_SCHEMA_VERSION) {
    Serial.println("Model schema version mismatch");
    while (true);
  }

  // Create interpreter
  static tflite::AllOpsResolver resolver;
  static tflite::MicroInterpreter static_interpreter(
      tfl_model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;

  if (interpreter->AllocateTensors() != kTfLiteOk) {
    Serial.println("AllocateTensors failed: increase kTensorArenaSize");
    while (true);
  }

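  input_tensor  = interpreter->input(0);
  output_tensor = interpreter->output(0);

  // loop() reads and writes int8 data; stop here if the model is not int8-quantized
  if (input_tensor->type != kTfLiteInt8 || output_tensor->type != kTfLiteInt8) {
    Serial.println("Model is not int8-quantized");
    while (true);
  }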

  // Read quantization params
  in_scale  = input_tensor->params.scale;
  in_zp     = input_tensor->params.zero_point;
  out_scale = output_tensor->params.scale;
  out_zp    = output_tensor->params.zero_point;

  Serial.println("Ready");
}

void loop() {
  if (!IMU.accelerationAvailable()) return;

  float ax, ay, az;
  IMU.readAcceleration(ax, ay, az);

  // Fill the sample buffer one sample at a time, interleaved (aX0, aY0, aZ0, aX1, aY1, aZ1, ...)
  int base = buffer_pos * 3;
  sample_buffer[base + 0] = ax;
  sample_buffer[base + 1] = ay;
  sample_buffer[base + 2] = az;
  buffer_pos++;

  if (buffer_pos < WINDOW_SIZE) return;

  // Buffer is full; run inference
  buffer_pos   = 0;
  buffer_ready = true;

  // Normalize and quantize each feature, write to input tensor
  for (int i = 0; i < NUM_FEATURES; i++) {
    float norm = (sample_buffer[i] - FEATURE_MEAN[i]) / FEATURE_STD[i];
    // Compute the value before clamping: constrain() is a macro and would
    // evaluate an expression passed to it more than once.
    long q = lroundf(norm / in_scale) + in_zp;
    input_tensor->data.int8[i] = (int8_t)constrain(q, -128L, 127L);
  }

  if (interpreter->Invoke() != kTfLiteOk) {
    Serial.println("Invoke failed");
    return;
  }

  // Find the predicted class
  float max_score = -1e9f;
  int   max_class = -1;
  for (int i = 0; i < NUM_CLASSES; i++) {
    float score = (output_tensor->data.int8[i] - out_zp) * out_scale;
    if (score > max_score) {
      max_score = score;
      max_class = i;
    }
  }

  Serial.print("Prediction: ");
  Serial.print(LABELS[max_class]);
  Serial.print("  (");
  Serial.print(max_score, 3);
  Serial.println(")");
}
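
The normalize-and-quantize step is the part of loop() most likely to drift from the training pipeline, so it can help to isolate it as a pure function and check it with a desktop compiler before flashing. The following is a minimal sketch of that idea and is not part of the sketch above: quantize_feature is a hypothetical helper name, and the firmware would pass in_scale and in_zp as read from the input tensor.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: standardize one raw reading, then map it to int8
// using the affine quantization rule q = round(x / scale) + zero_point.
int8_t quantize_feature(float raw, float mean, float stddev,
                        float scale, int zero_point) {
  float norm = (raw - mean) / stddev;
  long q = std::lround(norm / scale) + zero_point;
  // Saturate instead of wrapping when the value falls outside int8 range.
  if (q < -128) q = -128;
  if (q > 127) q = 127;
  return static_cast<int8_t>(q);
}
```

With an illustrative scale of 0.5 and a zero point of 0, a standardized value of 1.0 maps to 2, and very large readings saturate at 127 rather than wrapping around.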

Embedding the Normalization Parameters

The FEATURE_MEAN and FEATURE_STD arrays need the actual values from your training run. A small Python script generates the Arduino initializer syntax:

# gen_norm_header.py
import numpy as np

mean = np.load("training/mean.npy")
std  = np.load("training/std.npy")

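def fmt_array(name, arr):
    # Flatten so a (WINDOW_SIZE, 3) array is emitted in the same
    # interleaved order the sketch uses (aX0, aY0, aZ0, aX1, ...)
    flat = np.asarray(arr, dtype=np.float32).ravel()
    vals = ", ".join(f"{v:.6f}f" for v in flat)
    return f"const float {name}[{flat.size}] = {{{vals}}};"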

with open("firmware/gesture_infer/norm_params.h", "w") as f:
    f.write("#pragma once\n\n")
    f.write(fmt_array("FEATURE_MEAN", mean) + "\n")
    f.write(fmt_array("FEATURE_STD",  std)  + "\n")

Replace the placeholder arrays in the sketch with #include "norm_params.h" and remove the manual declarations.
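
Because the sketch divides by FEATURE_STD, a wrong array length or a zero standard deviation in the generated header corrupts every input without any visible error. The guard below is one way to catch both. It is a sketch under assumptions: first_bad_std is a hypothetical helper, and short stand-in arrays with made-up values take the place of the real 300-value header.

```cpp
// Stand-ins for the generated header (illustrative values only;
// the real arrays hold WINDOW_SIZE * 3 entries).
constexpr int kNumFeatures = 4;
const float kFeatureMean[] = {0.01f, -0.02f, 0.98f, 0.00f};
const float kFeatureStd[]  = {0.50f, 0.45f, 0.30f, 0.52f};

// Compile-time check: the array lengths must match the feature count.
static_assert(sizeof(kFeatureMean) / sizeof(kFeatureMean[0]) == kNumFeatures,
              "FEATURE_MEAN length does not match NUM_FEATURES");
static_assert(sizeof(kFeatureStd) / sizeof(kFeatureStd[0]) == kNumFeatures,
              "FEATURE_STD length does not match NUM_FEATURES");

// Hypothetical run-time check: every standard deviation must be positive.
// Returns the index of the first unusable entry, or -1 if all are usable.
int first_bad_std(const float* stddev, int count) {
  for (int i = 0; i < count; i++) {
    if (!(stddev[i] > 0.0f)) return i;  // also rejects NaN
  }
  return -1;
}
```

In the firmware, the same checks could be applied to FEATURE_MEAN, FEATURE_STD, and NUM_FEATURES, with the run-time check called once from setup().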

Sizing the Tensor Arena

kTensorArenaSize = 8 * 1024 is a guess. To find the actual minimum, TFLite Micro provides a helper (available in recent versions):

// After AllocateTensors():
size_t used = interpreter->arena_used_bytes();
Serial.print("Arena used: ");
Serial.print(used);
Serial.println(" bytes");

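Set kTensorArenaSize to used + 512 as a safety margin. Oversizing wastes RAM; undersizing makes AllocateTensors() return an error, which the sketch above reports before halting. Because arena_used_bytes() is only meaningful after a successful allocation, measure with a generously sized arena first and then shrink it.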

On the Nano 33 BLE Sense, 8 KB is enough for the gesture model from chapter 5. An image model such as MobileNetV1 can need 200-300 KB, which is at or beyond the Nano's 256 KB of SRAM.
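
The margin arithmetic can be written down once so the arena size is derived from the measured figure rather than retyped. This is a minimal sketch: arena_size_with_margin is a hypothetical helper, the 5000-byte measurement is a made-up example, and rounding up to a multiple of 16 bytes is a tidiness choice to match alignas(16), not a documented TFLite Micro requirement.

```cpp
#include <cstddef>

// Hypothetical helper: add a safety margin to the measured arena usage and
// round the result up to the next multiple of 16 bytes.
constexpr std::size_t arena_size_with_margin(std::size_t used_bytes,
                                             std::size_t margin_bytes = 512) {
  return (used_bytes + margin_bytes + 15) / 16 * 16;
}

// Example: a measured usage of 5000 bytes gives a 5520-byte arena.
constexpr std::size_t kExampleArenaSize = arena_size_with_margin(5000);
static_assert(kExampleArenaSize % 16 == 0, "arena size should be a multiple of 16");
```

Because the function is constexpr, its result can be used directly as the array bound for tensor_arena.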

Compiling and Flashing

# From the repo root
arduino-cli compile \
    --fqbn arduino:mbed_nano:nano33ble \
    firmware/gesture_infer/

arduino-cli upload \
    --fqbn arduino:mbed_nano:nano33ble \
    -p /dev/ttyACM0 \
    firmware/gesture_infer/

arduino-cli monitor -p /dev/ttyACM0 --config baudrate=115200

Once running, perform a punch gesture and watch the Serial Monitor. The output should look like:

Ready
Prediction: idle  (0.874)
Prediction: idle  (0.912)
Prediction: punch  (0.967)
Prediction: idle  (0.843)

ESP32-S3 Notes

On the ESP32-S3, the FQBN and sensor library differ:

arduino-cli compile \
    --fqbn esp32:esp32:esp32s3 \
    firmware/gesture_infer/

The tensor arena can be larger (the ESP32-S3 has 512 KB of on-chip SRAM), so you can afford a bigger model. The ESP32-S3's Xtensa LX7 cores also have vector instructions that accelerate dot products (the core operation in dense layers). TFLite Micro uses them only when it is built with Espressif's ESP-NN optimized kernels, as in Espressif's esp-tflite-micro port; the portable reference kernels do not use them.

For sensors on ESP32-S3, you'll wire an external MPU6050 or LSM6DS3 over I2C:

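#include <Wire.h>
#include <Adafruit_MPU6050.h>
#include <Adafruit_Sensor.h>

Adafruit_MPU6050 mpu;

void setup() {
  Serial.begin(115200);
  Wire.begin(SDA_PIN, SCL_PIN);   // define SDA_PIN and SCL_PIN for your wiring
  if (!mpu.begin()) {
    Serial.println("MPU6050 init failed");
    while (true);
  }
  mpu.setAccelerometerRange(MPU6050_RANGE_8_G);
}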
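
One difference is easy to miss when porting: Adafruit's unified sensor events report acceleration in m/s², whereas readAcceleration() in the Arduino IMU libraries returns values in g. If the model was trained on readings in g, the MPU6050 values need converting before they enter the sample buffer. A minimal sketch, assuming training data in g (ms2_to_g is a hypothetical helper name):

```cpp
// Standard gravity in m/s^2, used to convert between m/s^2 and g.
constexpr float kStandardGravity = 9.80665f;

// Hypothetical helper: convert an acceleration from m/s^2 to g.
constexpr float ms2_to_g(float accel_ms2) {
  return accel_ms2 / kStandardGravity;
}
```

On the ESP32-S3 this would be applied to each axis of the accelerometer event filled in by mpu.getEvent() before the value is written to sample_buffer.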

Debugging Inference

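Model outputs all the same value. Check the input quantization. If in_scale is 0.0 (which happens when a float model is loaded as if it were int8), norm / in_scale is infinite or NaN, so the quantized inputs saturate or turn into garbage and the model sees the same meaningless input on every window. Confirm that input_tensor->type is kTfLiteInt8 and that in_scale is nonzero.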

AllocateTensors() returns kTfLiteError. The most common cause is a tensor arena that is too small. Double kTensorArenaSize until allocation succeeds, then print arena_used_bytes() to find the actual requirement and shrink the arena back down.

Predictions are consistent but mislabeled (for example, every punch is reported as flex). Your LABELS array order doesn't match the output neuron order used in training. Fix the LABELS array, not the model.

Serial output is garbled. Baud rate mismatch. The monitor must match the Serial.begin() argument.
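
For both the constant-output and the label-order problems, it helps to look at every class score rather than only the winner. The helper below repeats the dequantization the sketch already performs, written as a pure function so it can be checked off-device. dequantize_scores is a hypothetical name and the raw values are made up; the scale of 1/256 with zero point -128 used in the comment is typical for int8 softmax outputs, but the real values should be read from output_tensor->params rather than assumed.

```cpp
#include <cstdint>

// Hypothetical helper: convert raw int8 outputs to real-valued scores using
// score = (q - zero_point) * scale, and return the index of the largest one.
int dequantize_scores(const int8_t* raw, int count, float scale,
                      int zero_point, float* scores_out) {
  int best = 0;
  for (int i = 0; i < count; i++) {
    scores_out[i] = (raw[i] - zero_point) * scale;
    if (scores_out[i] > scores_out[best]) best = i;
  }
  return best;
}

// Made-up raw outputs in the order {"punch", "flex", "idle"}. With scale
// 1/256 and zero point -128 they decode to 0.109375, 0.03125, 0.859375.
const int8_t kExampleRaw[3] = {-100, -120, 92};
```

Printing all three decoded scores next to their labels for a known gesture shows whether the scores are stuck at one value and whether the largest score lands on the label you expect.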

Confidence Thresholding

Rejecting low-confidence predictions avoids acting on noise:

const float CONFIDENCE_THRESHOLD = 0.70f;

if (max_score < CONFIDENCE_THRESHOLD) {
  Serial.println("Uncertain: ignoring");
  return;
}

Set the threshold by testing with real sensor data and examining the score distribution. Idle periods near boundaries between gestures often produce scores around 0.5. A threshold of 0.7 usually filters them out without rejecting clear positives.
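
One way to examine the score distribution is to log the winning score for a batch of clear gestures and a batch of boundary windows, then see what fraction of each a candidate threshold would reject. The following is a sketch of that bookkeeping for a desktop build, not firmware code; rejection_rate is a hypothetical helper and the logged scores are made-up examples.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: fraction of logged top scores that fall below the
// threshold and would therefore be ignored by the firmware.
float rejection_rate(const std::vector<float>& scores, float threshold) {
  if (scores.empty()) return 0.0f;
  std::size_t rejected = 0;
  for (float s : scores) {
    if (s < threshold) rejected++;
  }
  return static_cast<float>(rejected) / static_cast<float>(scores.size());
}

// Made-up example logs: clear gestures and boundary windows.
const std::vector<float> kGestureScores  = {0.97f, 0.91f, 0.88f, 0.74f};
const std::vector<float> kBoundaryScores = {0.52f, 0.48f, 0.61f, 0.72f};
```

With these example logs, a threshold of 0.70 rejects none of the clear gestures and three of the four boundary windows. Repeating the calculation for a few candidate thresholds on your own logs shows where the trade-off sits.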

Next Steps

Continue to 08-inference-engine.md for a deeper look at how TFLite Micro allocates memory, resolves operations, and handles the interpreter lifecycle.