TinyML: Optimization | TinyML Tutorial

A model that works on your laptop in float32 but barely fits on the device in int8 needs a second pass. Optimization is not one technique; it's a sequence of tradeoffs.

The Optimization Budget

Before changing anything, establish baselines:

# measure_model.py
import tensorflow as tf
import numpy as np
import time

model = tf.keras.models.load_model("training/gesture_model.keras")

X_test = np.load("data/processed/X.npy")[:100].astype(np.float32)
mean   = np.load("training/mean.npy").astype(np.float32)
std    = np.load("training/std.npy").astype(np.float32)
X_norm = (X_test - mean) / std

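# Latency: one warm-up call first so one-time graph tracing is not counted
model.predict(X_norm[:1], verbose=0)
t_start = time.perf_counter()
for _ in range(100):
    model.predict(X_norm[:1], verbose=0)
t_end = time.perf_counter()
print(f"Float32 latency (CPU): {(t_end - t_start) / 100 * 1000:.2f} ms")

# Size: count all weights, including non-trainable ones (e.g. BatchNorm statistics)
total_bytes = sum(w.numpy().nbytes for w in model.weights)
print(f"Float32 size: {total_bytes / 1024:.1f} KB")
# Rough estimate: 1 byte per weight instead of 4 (ignores file metadata)
print(f"Int8 estimate: {total_bytes / 4 / 1024:.1f} KB")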

Then measure on the device using micros() as shown in chapter 8, and record both the desktop and on-device numbers. They will diverge: the desktop figure comes from a much faster CPU and includes Keras call overhead, so it is only a relative baseline. The on-device figure is what ships.
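One way to make the budget explicit is to write the limits down next to the measurements and compare them in a script. The sketch below is only illustrative: the `Budget` fields, the `over_budget` helper, and every number in it are hypothetical placeholders, not values from this tutorial's hardware.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    flash_kb: float
    ram_kb: float
    latency_ms: float


def over_budget(measured: Budget, limit: Budget) -> list[str]:
    """Return the names of the resources where `measured` exceeds `limit`."""
    exceeded = []
    for name in ("flash_kb", "ram_kb", "latency_ms"):
        if getattr(measured, name) > getattr(limit, name):
            exceeded.append(name)
    return exceeded


# Hypothetical numbers; replace them with your own on-device measurements.
limit = Budget(flash_kb=64.0, ram_kb=16.0, latency_ms=5.0)
measured = Budget(flash_kb=40.0, ram_kb=20.0, latency_ms=1.2)
print(over_budget(measured, limit))
```

Rerunning a check like this after each technique shows which resource is still the binding constraint and when to stop optimizing.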

Technique 1: Choose a Smaller Architecture

The most effective optimization is picking the right model size from the start. For the same layer types, a model with 10,000 parameters will be faster and smaller than one with 100,000. If accuracy is acceptable at 10K, stop there.

For gesture classification with 3 classes and 60 training samples, the dense model from chapter 5 (10K params) is already close to optimal. The tools below are for when you've hit accuracy limits and need more capacity than the budget allows.
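Parameter counts for a dense model are easy to estimate before training anything. The sketch below counts weights and biases for a stack of fully connected layers. The layer widths are hypothetical (they are not taken from chapter 5) and were picked only to land near 10K parameters.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical widths: 150 inputs -> 64 -> 16 -> 3 classes.
sizes = [150, 64, 16, 3]
params = dense_param_count(sizes)
print(f"{params} params, {params * 4 / 1024:.1f} KB float32, "
      f"{params / 1024:.1f} KB int8 (approx.)")
```

The int8 figure is approximate because it assumes one byte per parameter and ignores file metadata.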

Technique 2: Post-Training Quantization (PTQ)

Covered in chapter 6. The expected result:

Float32:  40 KB, 1.2 ms on Cortex-M4F
Int8 PTQ: 10 KB, 0.4 ms on Cortex-M4F (hardware multiply)

PTQ should always be the first optimization pass. For most sensor models it costs little accuracy (typically less than 1-2%) and gives roughly a 4x reduction in weight storage.
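The small accuracy cost comes from rounding each value to one of 256 levels. The following is a minimal sketch of affine int8 quantization for a single value range. The function names are made up for illustration, and the converter's actual scheme (calibrated activation ranges, per-channel weight scales) is more involved.

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Scale and zero point that map [x_min, x_max] onto [qmin, qmax]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0
    scale = (x_max - x_min) / (qmax - qmin) or 1.0    # avoid a zero scale
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Round to the nearest level and clamp to the int8 range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


scale, zp = quant_params(-1.0, 1.0)
for x in (0.0, 0.3, -0.7, 1.0):
    q = quantize(x, scale, zp)
    print(f"{x:+.3f} -> {q:4d} -> {dequantize(q, scale, zp):+.4f}")
```

For values inside the calibrated range, the round-trip error stays within about half a quantization step (`scale / 2`). Values outside the range are clamped, which is why a representative calibration set matters.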

Technique 3: Quantization-Aware Training (QAT)

When PTQ drops accuracy more than you can accept, QAT recovers it by simulating quantization noise during training.

# qat_training.py
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Load the already-trained float model
base_model = tf.keras.models.load_model("training/gesture_model.keras")

# Apply QAT annotations
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

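# Fine-tune for a small number of epochs, not from scratch.
# X_train, y_train, X_val, y_val are the normalized splits (one-hot labels)
# that were used to train the float model; load them before this point.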
qat_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=30,
    batch_size=16,
    verbose=2,
)

# Convert to int8 TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations                = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops   = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type        = tf.int8
converter.inference_output_type       = tf.int8
tflite_qat = converter.convert()

with open("training/model_qat.tflite", "wb") as f:
    f.write(tflite_qat)

print(f"QAT model: {len(tflite_qat):,} bytes")

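Install tensorflow-model-optimization before running the script. The package is built against the Keras 2 API, so on TensorFlow versions that default to Keras 3 it may also need the legacy Keras setup (the tf_keras package):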

pip install tensorflow-model-optimization

QAT typically recovers 1-3 percentage points of accuracy relative to PTQ, at the cost of an extra training run.

Technique 4: Reducing Op Resolver Size

Switching from AllOpsResolver to MicroMutableOpResolver reduces the firmware binary size (not the model size):

// Only register ops the model actually uses
static tflite::MicroMutableOpResolver<5> resolver;
resolver.AddFullyConnected();
resolver.AddRelu();
resolver.AddSoftmax();
resolver.AddReshape();
resolver.AddQuantize();

To know which ops to register, inspect the model:

# List ops using the TFLite schema (Python; needs the flatbuffers and tflite packages)
python3 - <<'EOF'
import flatbuffers
from tflite.Model import Model as TFLModel

data   = open("training/model_int8.tflite", "rb").read()
model  = TFLModel.GetRootAsModel(data, 0)
sg     = model.Subgraphs(0)
codes  = {model.OperatorCodes(sg.Operators(i).OpcodeIndex()).BuiltinCode()
          for i in range(sg.OperatorsLength())}
print("Op codes:", codes)
EOF

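Cross-reference the code numbers against the BuiltinOperator enum in tensorflow/lite/schema/schema_generated.h to get the names. For a dense model: typically codes 9 (FULLY_CONNECTED), 19 (RELU), 22 (RESHAPE), 25 (SOFTMAX), plus 114 (QUANTIZE) if the converter inserted one. The converter often fuses ReLU into the preceding FULLY_CONNECTED op, in which case code 19 does not appear.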

Expected binary savings: 20-40 KB depending on what AllOpsResolver included.
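The lookup from code numbers to resolver calls can be scripted. The sketch below covers only a small, hand-copied subset of the BuiltinOperator enum, paired with the matching MicroMutableOpResolver method names. The table and the `resolver_snippet` helper are illustrative, so verify anything outside this subset against the schema header.

```python
# Subset of BuiltinOperator values from the TFLite schema, paired with the
# MicroMutableOpResolver method that registers each op.
OP_TABLE = {
    3: ("CONV_2D", "AddConv2D"),
    4: ("DEPTHWISE_CONV_2D", "AddDepthwiseConv2D"),
    6: ("DEQUANTIZE", "AddDequantize"),
    9: ("FULLY_CONNECTED", "AddFullyConnected"),
    19: ("RELU", "AddRelu"),
    22: ("RESHAPE", "AddReshape"),
    25: ("SOFTMAX", "AddSoftmax"),
    114: ("QUANTIZE", "AddQuantize"),
}


def resolver_snippet(codes):
    """Emit C++ registration lines for the given set of op codes."""
    known = sorted(c for c in codes if c in OP_TABLE)
    unknown = sorted(c for c in codes if c not in OP_TABLE)
    lines = [f"static tflite::MicroMutableOpResolver<{len(known)}> resolver;"]
    lines += [f"resolver.{OP_TABLE[c][1]}();  // {OP_TABLE[c][0]}" for c in known]
    lines += [f"// TODO: look up op code {c} in the schema" for c in unknown]
    return "\n".join(lines)


print(resolver_snippet({9, 22, 25}))
```

The template argument is the number of registered ops, so it has to grow whenever an op is added by hand for one of the TODO lines.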

Technique 5: Reducing Inference Latency

For latency-sensitive applications (for example, 16 kHz audio processed in 10 ms frames needs inference to finish in under 10 ms per frame), the options are:

Optimize the architecture. Fewer layers, smaller layers, depthwise-separable convolutions instead of standard Conv2D.

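Enable CMSIS-NN. On Cortex-M4F and above, TFLite Micro can use ARM's CMSIS-NN library, which provides SIMD-optimized implementations of dense layers and convolutions. The optimized kernels are selected at build time, not at run time; in the upstream TFLite Micro Makefile build this is done with OPTIMIZED_KERNEL_DIR=cmsis_nn. For an Arduino build, the target defines below are passed to the compiler. Confirm that the optimized kernels are actually in use by comparing inference time before and after:

# Arduino build: Cortex-M4 target defines for the CMSIS headers (add to platform.local.txt)
compiler.c.extra_flags=-DARM_MATH_CM4 -DARM_MATH_MATRIX_CHECK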

CMSIS-NN typically provides a 2-4x speedup for int8 dense layers on Cortex-M4F compared to the generic reference kernels.

Reduce the input feature size. If MFCC uses 40 mel bins, try 20. If the window is 100 samples, try 50 with a smaller hop. For a dense first layer the operation count is inputs × units, so halving the input size halves that layer's work and its weight storage.
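For a dense first layer the multiply-accumulate (MAC) count is inputs × units, so the effect of shrinking the input can be estimated before retraining. The frame count and layer width below are hypothetical values chosen only to make the arithmetic concrete.

```python
def dense_macs(n_inputs, n_units):
    """Multiply-accumulate operations for one fully connected layer."""
    return n_inputs * n_units


units = 64    # hypothetical first-layer width
frames = 49   # hypothetical number of MFCC frames per window
for mel_bins in (40, 20):
    n_inputs = frames * mel_bins
    print(f"{mel_bins} bins -> {n_inputs} inputs -> "
          f"{dense_macs(n_inputs, units)} MACs in the first dense layer")
```

The same count is also the number of weights in that layer, so the saving applies to flash as well as latency.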

Technique 6: Pruning

Pruning sets weights that are close to zero to exactly zero, then relies on a compressed storage format to save space. TFLite Micro does not yet support sparse execution, so pruning only helps when it is followed by a compression step (zip or a custom encoding in flash).

For most embedded targets in 2026, pruning is not worth the complexity. PTQ + architecture reduction gives better results with less effort. Revisit this when your target supports sparse inference.
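To see why pruning only pays off together with compression, the sketch below applies simple magnitude pruning to stand-in int8 weights and compares zlib-compressed sizes. The weights are uniform random numbers rather than values from a trained model, and the helper names are illustrative, so treat the ratio as a demonstration of the mechanism and not as an expected result.

```python
import random
import zlib


def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of smallest-magnitude weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    # Ties at the threshold are pruned too, so the result can be slightly sparser.
    return [0 if abs(w) <= threshold else w for w in weights]


def compressed_size(int8_weights):
    """Size in bytes after zlib compression of the two's-complement bytes."""
    raw = bytes(w & 0xFF for w in int8_weights)
    return len(zlib.compress(raw, 9))


rng = random.Random(0)
dense = [rng.randint(-127, 127) for _ in range(10_000)]  # stand-in int8 weights
pruned = magnitude_prune(dense, sparsity=0.8)

print("zeros:", pruned.count(0), "of", len(pruned))
print("compressed dense: ", compressed_size(dense), "bytes")
print("compressed pruned:", compressed_size(pruned), "bytes")
```

Uncompressed, both arrays occupy the same 10,000 bytes. A compressed blob also has to be unpacked into RAM before the interpreter can read it, which is part of the complexity mentioned above.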

Profiling On-Device

To see where the time goes on the device, timestamp each stage of the loop:

// profile.ino
unsigned long t[5];

void loop() {
  t[0] = micros();

  // Read sensor
  float ax, ay, az;
  IMU.readAcceleration(ax, ay, az);
  t[1] = micros();

  // Fill input tensor (normalize + quantize)
  fill_input_tensor(ax, ay, az);
  t[2] = micros();

  // Inference
  interpreter->Invoke();
  t[3] = micros();

  // Read output
  int pred = read_prediction();
  t[4] = micros();

  Serial.print("sensor:");     Serial.print(t[1] - t[0]);
  Serial.print(" preprocess:"); Serial.print(t[2] - t[1]);
  Serial.print(" inference:");  Serial.print(t[3] - t[2]);
  Serial.print(" output:");     Serial.print(t[4] - t[3]);
  Serial.println(" µs");
}

Typical breakdown for the gesture classifier on the Nano 33 BLE Sense:

sensor:      180 µs
preprocess:  210 µs
inference:  1200 µs
output:        5 µs

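In this breakdown preprocessing costs 210 µs against 1200 µs of inference, and together with the sensor read it accounts for about a quarter of the loop. As inference gets faster (for example after int8 quantization), preprocessing becomes a comparable share. Vectorizing the normalize-and-quantize loop (or using CMSIS-DSP) can halve it.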
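On the host side, the serial lines printed by the profiling sketch can be turned into per-stage shares with a few lines of Python. This is a minimal sketch that assumes the exact `name:value` format printed above. The function names are illustrative, and reading from the serial port is left out.

```python
import re


def parse_profile_line(line):
    """Turn 'sensor:180 preprocess:210 ...' into {'sensor': 180, ...}."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}


def stage_shares(stages):
    """Fraction of the total loop time spent in each stage."""
    total = sum(stages.values())
    return {name: us / total for name, us in stages.items()}


line = "sensor:180 preprocess:210 inference:1200 output:5 µs"
stages = parse_profile_line(line)
for name, share in stage_shares(stages).items():
    print(f"{name:<11}{stages[name]:>6} µs  {share:6.1%}")
```

Averaging the parsed values over many loop iterations gives steadier numbers than reading a single line.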

Power Optimization

For battery-powered devices, the biggest savings come from duty cycling, not from model size:

// Wake on accelerometer interrupt, run inference, sleep
#include "nrf_soc.h"  // declares sd_app_evt_wait(); nRF52840 (Nano 33 BLE Sense)

void loop() {
  // Check if motion threshold exceeded (hardware interrupt)
  if (!motion_detected()) {
    // Go to low-power wait. This call needs the Nordic SoftDevice to be
    // enabled; without it, use __WFE() or an RTOS sleep instead.
    sd_app_evt_wait();
    return;
  }

  collect_window();
  run_inference();
  act_on_result();
}

With the Nano waking to run inference at 5 Hz (200 ms intervals) and sleeping in between, average current drops from ~22 mA (continuous) to ~4 mA. For audio keyword spotting, running a cheap energy detector first to gate the MFCC and model stages can reduce average current by 10-20x.
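The effect of duty cycling follows from a weighted average of active and sleep current. The sketch below shows the arithmetic. The 22 mA active figure is the continuous number quoted above, while the sleep current, the active time per wake-up, and the battery capacity are hypothetical placeholders to be replaced with measurements from your own board.

```python
def average_current_ma(i_active_ma, i_sleep_ma, active_ms, period_ms):
    """Time-weighted average current over one wake/sleep period."""
    duty = active_ms / period_ms
    return duty * i_active_ma + (1.0 - duty) * i_sleep_ma


def battery_life_hours(capacity_mah, avg_ma):
    """Ideal runtime; ignores self-discharge and regulator losses."""
    return capacity_mah / avg_ma


# Hypothetical sleep current (1 mA) and active time (30 ms per 200 ms period).
avg = average_current_ma(i_active_ma=22.0, i_sleep_ma=1.0,
                         active_ms=30.0, period_ms=200.0)
print(f"average current: {avg:.2f} mA")
print(f"200 mAh cell: {battery_life_hours(200.0, avg):.0f} h (ideal)")
```

Because the sleep term dominates at low duty cycles, lowering the sleep current usually saves more than shaving milliseconds off inference.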

Common Pitfalls

Optimizing before profiling. You don't know where the time goes until you measure it. For small models, preprocessing can cost as much as inference, so inference is not always the bottleneck.

Comparing latency across hardware. "2 ms" means nothing without the CPU speed. Always include the hardware specs when reporting numbers.

Applying QAT to an architecture that changes between float and QAT training. Layers such as Dropout can behave differently once the model is wrapped for QAT. Evaluate the QAT model on the validation set and check that its accuracy matches expectations before doing the full conversion.
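For the cross-hardware pitfall above, one rough way to make latency figures comparable is to report them as CPU cycles alongside the clock speed. The helper below is an illustrative sketch. The 64 MHz value is the nRF52840 clock on the Nano 33 BLE Sense, and cycle counts still ignore differences in instruction set, memory wait states, and kernel implementations.

```python
def latency_to_cycles(latency_ms, clock_mhz):
    """Approximate CPU cycles consumed by a measured latency."""
    return latency_ms * 1e-3 * clock_mhz * 1e6


# 1.2 ms of inference on a 64 MHz core.
cycles = latency_to_cycles(1.2, 64.0)
print(f"1.2 ms at 64 MHz is about {cycles:,.0f} cycles")
```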

Next Steps

Continue to 12-best-practices.md for patterns, anti-patterns, and what to do when the model works in the lab but fails in the field.