TinyML: Best Practices

Most TinyML failures are not ML failures. The model predicts correctly; something else is wrong.

The Most Common Failure Modes

Training-Deployment Mismatch

The model sees different data at inference time than it saw during training. This kills accuracy faster than any model architecture choice.

Sources of mismatch:

Source | What goes wrong | Fix
Different sensor | Noise profile, axis orientation differ | Collect data from the deployment sensor
Wrong normalization | Training mean/std not saved or not applied | Generate norm_params.h from mean.npy, std.npy
Different sample rate | 100 Hz in training, 50 Hz at inference | Match exactly; document it in the code
Sensor mounting | Board rotated 90° between collection and deployment | Fix the mount or re-collect with the correct orientation
Environmental conditions | Indoor vs outdoor noise, temperature affecting sensor | Collect under deployment conditions
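
The normalization fix can be automated so the firmware constants never drift from the training statistics. The sketch below renders the text of a norm_params.h header from per-channel mean and std values. The function name render_norm_header and the identifiers NORM_CHANNELS, kNormMean, and kNormStd are assumptions made for illustration, not names taken from this tutorial's firmware.

```python
def render_norm_header(mean, std, eps=1e-8):
    """Return the text of a C header holding per-channel normalization constants.

    mean and std are sequences of floats, one entry per sensor channel.
    The header layout and identifier names are hypothetical.
    """
    if len(mean) != len(std):
        raise ValueError("mean and std must have the same number of channels")

    # Clamp tiny std values so the firmware never divides by zero.
    safe_std = [max(float(s), eps) for s in std]

    def c_array(name, values):
        # Scientific notation with an 'f' suffix is a valid C float literal.
        body = ", ".join(f"{float(v):.8e}f" for v in values)
        return f"static const float {name}[NORM_CHANNELS] = {{ {body} }};"

    lines = [
        "// Auto-generated from training statistics. Do not edit by hand.",
        "#pragma once",
        f"#define NORM_CHANNELS {len(mean)}",
        c_array("kNormMean", mean),
        c_array("kNormStd", safe_std),
    ]
    return "\n".join(lines) + "\n"
```

With NumPy installed, the inputs could come from np.load("training/mean.npy").tolist() and np.load("training/std.npy").tolist(), with the returned text written to norm_params.h as part of the training script.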

Model Overconfidence

A softmax output of [0.98, 0.01, 0.01] looks authoritative but is meaningless if the input is outside the training distribution. The model has never seen "the device is sitting on a table motionless" and will confidently classify it as one of its three gestures.

Mitigate with a confidence threshold (chapter 7) and an explicit "unknown" class:

# Add an "idle" or "unknown" class to training
# Collect samples of the device doing nothing, held still, resting flat
# Include them as a class in the dataset
LABELS = ["punch", "flex", "idle"]  # idle = "none of the above"

A properly trained idle class soaks up borderline inputs. Set the confidence threshold at 0.70 or higher: when the top softmax probability falls below it, report "uncertain" rather than the top prediction.
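
The thresholding rule is small enough to prototype off-device before porting it to firmware. This is a minimal sketch, assuming probs is one softmax output vector in the same order as LABELS. The function name decide is hypothetical.

```python
LABELS = ["punch", "flex", "idle"]


def decide(probs, labels=LABELS, threshold=0.70):
    """Return the top label, or "uncertain" if its probability is below threshold."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")

    # Index of the highest-probability class.
    best = max(range(len(probs)), key=lambda i: probs[i])

    if probs[best] < threshold:
        return "uncertain"
    return labels[best]
```

Running recorded edge-case windows through a function like this at several thresholds is one way to choose the value before fixing it in firmware.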

Quantization Surprise

Post-training quantization works well for most sensor models but occasionally drops accuracy by 5-10%. This is usually fixable, not a dead end.

Systematic diagnosis:

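# Assumes keras_model, X_test_norm, y_test and the helpers run_tflite(),
# run_tflite_int8(), and accuracy() are defined elsewhere in the pipeline.

# 1. Check float model accuracy (should be your target)
keras_preds = keras_model.predict(X_test_norm, verbose=0)

# 2. Check float TFLite accuracy (should match Keras within 0.5%)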
float_tflite_preds = run_tflite("training/model_float.tflite", X_test_norm)

# 3. Check int8 TFLite accuracy (acceptable delta: 0-2%)
int8_tflite_preds = run_tflite_int8("training/model_int8.tflite", X_test_norm)

print(f"Keras:       {accuracy(keras_preds, y_test):.2%}")
print(f"Float TFLite:{accuracy(float_tflite_preds, y_test):.2%}")
print(f"Int8 TFLite: {accuracy(int8_tflite_preds, y_test):.2%}")

If float TFLite matches Keras but int8 is much worse, the problem is quantization. Fix with QAT (chapter 11). If float TFLite also drops, the problem is the conversion, not quantization.
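
The diagnosis above depends on an accuracy() helper and, for the int8 model, on converting float inputs to int8 before inference. The sketch below shows one plausible shape for both, assuming labels are integer class indices and that scale and zero_point are read from the int8 model's input tensor (in the TFLite Python interpreter, the "quantization" entry of get_input_details()). The names quantize_int8 and dequantize_int8 are hypothetical, and pure Python is used only to keep the arithmetic visible.

```python
def accuracy(preds, labels):
    """Fraction of rows whose arg-max class equals the integer label."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")

    correct = 0
    for row, label in zip(preds, labels):
        predicted = max(range(len(row)), key=lambda i: row[i])
        if predicted == label:
            correct += 1
    return correct / len(labels)


def quantize_int8(x, scale, zero_point):
    """Affine quantization: q = round(x / scale) + zero_point, clamped to int8."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize_int8(q, scale, zero_point):
    """Inverse mapping from an int8 value back to an approximate float."""
    return (q - zero_point) * scale
```

If the int8 accuracy is far below the float TFLite accuracy, it is worth confirming that the runner applies this input mapping before blaming the quantized weights: feeding unquantized floats into an int8 input produces the same symptom.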

Firmware Patterns

Keep Inference Separate from I/O

Do not mix inference with Serial printing, BLE notifications, or display updates. I/O can block, and blocking during the sample-collection window causes missed samples.

// WRONG: Serial inside the inference window
void loop() {
  collect_sample();
  if (window_full()) {
    run_inference();
    Serial.println(result);   // can take 1-2 ms; misses next sample
  }
}

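// RIGHT: defer the print, and only write when it cannot block
volatile int pending_class = -1;
const int kMaxLineBytes = 16;   // longest label plus CR/LF, with headroom

void loop() {
  collect_sample();
  if (window_full()) {
    run_inference();
    pending_class = result;   // record the result; do not print here
  }
  // Assumes this core implements Serial.availableForWrite(); println()
  // then only copies into the TX buffer and returns without waiting.
  if (pending_class >= 0 && Serial.availableForWrite() >= kMaxLineBytes) {
    Serial.println(LABELS[pending_class]);
    pending_class = -1;
  }
}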

Version Your Models

Every firmware build should embed the model version. When you retrain and redeploy, you need to know which model is on which device, especially when debugging field failures.

// model_data.h
#pragma once
#define GESTURE_MODEL_VERSION "v1.2.0"
#define GESTURE_MODEL_DATE    "2026-05-10"

extern const unsigned char g_gesture_model_data[];
extern const unsigned int  g_gesture_model_data_len;

Log the version on boot:

void setup() {
  Serial.print("Model: ");
  Serial.print(GESTURE_MODEL_VERSION);
  Serial.print(" (");
  Serial.print(GESTURE_MODEL_DATE);
  Serial.println(")");
}

Fail Visibly, Not Silently

Every error path should be observable. A while (true) halt is better than silently wrong predictions.

// Halt with a diagnostic blink code
void fatal(int blinks) {
  while (true) {
    for (int i = 0; i < blinks; i++) {
      digitalWrite(LED_BUILTIN, HIGH); delay(200);
      digitalWrite(LED_BUILTIN, LOW);  delay(200);
    }
    delay(1000);
  }
}

// In setup():
if (interpreter->AllocateTensors() != kTfLiteOk) fatal(3);  // 3 blinks = arena error
if (input_tensor->type != kTfLiteInt8)            fatal(4);  // 4 blinks = type error

Three blinks means "look at AllocateTensors". Four blinks means "type mismatch". You can decode it without a serial monitor, which matters when debugging in the field.

Data Collection Patterns

The 80/20 Split Must Be Temporal, Not Random

If you collect 100 gesture samples and split 80/20 randomly, the validation set contains samples from the same session, same person, same conditions as training. Validation accuracy will be inflated.

Split by collection session instead:

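import os

# Temporal split: validate on the last four sessions, train on all earlier ones.
# Assumes session names sort chronologically (e.g. they start with a date).
sessions = sorted(os.listdir("data/raw/punch"))
train_sessions = sessions[:-4]
val_sessions   = sessions[-4:]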

This gives a more honest estimate of how the model generalizes to new recordings.
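
The same idea can be written as a reusable helper that takes the validation fraction instead of a hard-coded count. This is a sketch under the assumption that session names sort chronologically (for example, names that begin with an ISO date). The function name split_by_session is hypothetical.

```python
def split_by_session(session_names, val_fraction=0.2):
    """Split sessions so the most recent ones are held out for validation.

    Returns (train_sessions, val_sessions). No session appears in both lists.
    """
    ordered = sorted(session_names)
    if len(ordered) < 2:
        raise ValueError("need at least two sessions to split")

    # At least one validation session, and at least one left for training.
    n_val = max(1, round(len(ordered) * val_fraction))
    n_val = min(n_val, len(ordered) - 1)

    return ordered[:-n_val], ordered[-n_val:]
```

Applying one split to every class directory keeps a single recording session from ending up on both sides of the train/validation boundary.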

Label During Collection, Not After

It is tempting to collect raw data and label it later. The result is a labeling session where you're guessing which of 50 CSV files contains a punch vs a flex. Label immediately by passing the label as an argument to the collection script, as in chapter 4.

Collect Edge Cases

Models fail at boundaries. After your initial round of data collection:

  1. Collect slow, deliberate gestures (the model may miss them)
  2. Collect fast, jerky gestures (the model may misclassify them)
  3. Collect the device at unusual orientations
  4. Have a second person collect data if the model needs to generalize

Add these to the training set and retrain. One iteration of boundary-case collection often raises production accuracy more than a week of architecture tuning.

Deployment Checklist

Before shipping firmware to a device that will be in the field:

□ Model version string in firmware matches convert.py output
□ Normalization parameters match training/mean.npy and training/std.npy
□ kTensorArenaSize set to the measured interpreter->arena_used_bytes() + 256, not a guess
□ AllOpsResolver replaced with MicroMutableOpResolver (for production)
□ Confidence threshold tested against real edge-case data
□ Serial logging disabled or guarded by a DEBUG flag
□ Fatal error blink codes documented in the hardware spec
□ Firmware tested at 3.3V (not just USB 5V)
□ Model accuracy on held-out session matches validation accuracy

When the Model Is Right but the Product Is Wrong

Accuracy is a model metric. Usefulness is a product metric. A 95%-accurate gesture classifier is still annoying if it triggers false positives twice per minute during idle use.

Measure your product experience, not just your model:

False positive rate:  how often does the model activate when it shouldn't?
False negative rate:  how often does the model miss a real gesture?
Latency:              how long from gesture to response?
Power consumption:    how long does the battery last?

Set targets for these before you start collecting data. The targets shape every decision downstream.
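
Two of these metrics can be computed from a labeled log of inference windows. The sketch below assumes each log entry is a (true_label, predicted_label) pair for one non-overlapping window, that "idle" marks windows with no gesture, and that an "uncertain" prediction does not trigger any action. The function name product_metrics and the log format are assumptions for illustration.

```python
def product_metrics(events, window_seconds,
                    idle_label="idle", uncertain_label="uncertain"):
    """Compute false positives per idle minute and the missed-gesture rate."""
    # Predictions that do not trigger any action on the device.
    quiet = {idle_label, uncertain_label}

    idle_preds = [pred for truth, pred in events if truth == idle_label]
    gesture_preds = [pred for truth, pred in events if truth != idle_label]

    # A false positive is any activation while the device was actually idle.
    false_pos = sum(1 for pred in idle_preds if pred not in quiet)
    # A miss is a real gesture that produced no activation.
    misses = sum(1 for pred in gesture_preds if pred in quiet)

    idle_minutes = len(idle_preds) * window_seconds / 60.0
    return {
        "false_positives_per_minute":
            false_pos / idle_minutes if idle_minutes else 0.0,
        "false_negative_rate":
            misses / len(gesture_preds) if gesture_preds else 0.0,
    }
```

Reporting false positives per minute of idle time, rather than as a fraction of windows, matches how a user experiences the device.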

Staying Current

The TinyML field moves fast. As of 2026, the key areas in flux:

Hardware. Dedicated ML accelerators (like the Syntiant NDP120 used in the Arduino Nicla Voice) are appearing in the microcontroller price range. They run inference at much lower power than a Cortex-M with TFLite Micro.

Frameworks. MicroAI, ExecuTorch, and ONNX Runtime Micro are challenging TFLite Micro for embedded inference. The workflow is similar; the deployment target and performance differ.

Model architectures. Temporal convolutional networks and small transformers are replacing LSTMs for audio classification at the edge. They fit better in the op-set that TFLite Micro implements efficiently.

Track Pete Warden's blog, the TFLite Micro GitHub issues, and the Edge Impulse blog for current state. The abstractions in this tutorial are stable; the specific library versions and hardware recommendations will shift.

Where to Go From Here

Projects to build first:

  1. Gesture-controlled LED or servo (accelerometer classifier, 3 classes)
  2. Keyword spotter ("on"/"off" or custom word) using the microphone
  3. Anomaly detector on an industrial motor (vibration baseline + outlier detection)
  4. Wake-word plus action (chain two models: cheap energy detector + classifier)

Related tutorials in this wiki:

  • content/esp32/ for hardware design and connectivity context
  • content/raspberry-pi/ for more powerful edge inference when a microcontroller isn't enough
  • content/python/ for filling gaps in the Python side of the pipeline

Books worth reading:

  • TinyML by Pete Warden and Daniel Situnayake (O'Reilly, 2019). The canonical reference, though some API details are dated.
  • AI at the Edge by Daniel Situnayake and Jenny Plunkett (O'Reilly, 2023). More current coverage of Edge Impulse and production deployment.

Community:

  • TensorFlow Lite Micro GitHub: github.com/tensorflow/tflite-micro
  • Edge Impulse forum: forum.edgeimpulse.com
  • Hackster TinyML projects for inspiration and real-world code