TinyML: Model Conversion | TinyML Tutorial

The trained Keras model holds about 40 KB of float32 weights. After conversion and int8 quantization, it becomes a flat buffer of roughly 10 KB that TFLite Micro can read directly from flash.
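
The roughly 4x reduction follows from the storage width of each weight. Here is a rough sketch of the arithmetic, assuming a hypothetical model with 10,000 parameters (an illustrative count, not a figure measured from this tutorial's model) and ignoring the small overhead that the .tflite container adds:

```python
# Back-of-the-envelope size estimate. N_PARAMS is a hypothetical count chosen
# so the float32 figure lands near 40 KB; substitute model.count_params().
N_PARAMS = 10_000

BYTES_PER_FLOAT32 = 4
BYTES_PER_INT8 = 1


def weight_bytes(n_params: int, bytes_per_weight: int) -> int:
    """Raw storage for the weights alone (no file-format overhead)."""
    return n_params * bytes_per_weight


float_size = weight_bytes(N_PARAMS, BYTES_PER_FLOAT32)
int8_size = weight_bytes(N_PARAMS, BYTES_PER_INT8)

print(f"float32 weights: {float_size:,} bytes")
print(f"int8 weights:    {int8_size:,} bytes")
print(f"ratio:           {float_size / int8_size:.1f}x")
```

The real files will be somewhat larger than these numbers because they also store the graph structure, biases, and quantization parameters.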

The Conversion Pipeline

gesture_model.keras
        │
        ▼
tf.lite.TFLiteConverter           (remove training ops, optimize graph)
        │
        ▼
quantization                      (float32 weights → int8, activations → int8)
        │
        ▼
model.tflite                      (flat binary buffer, ~10 KB)
        │
        ▼
xxd -i model.tflite               (convert to C array)
        │
        ▼
model_data.cc / model_data.h      (embed in Arduino sketch)

Each step is small. The trickiest part is the quantization calibration.

Float32 Baseline Conversion

Start with a simple float conversion to confirm the pipeline works before attempting quantization.

# convert.py
import tensorflow as tf
import numpy as np

model = tf.keras.models.load_model("training/gesture_model.keras")

# Step 1: float32 conversion (no quantization)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_float = converter.convert()

with open("training/model_float.tflite", "wb") as f:
    f.write(tflite_float)

print(f"Float model: {len(tflite_float):,} bytes")

Run the converted model through the TFLite interpreter in Python to verify it produces the same outputs as the Keras model:

# verify_float_conversion.py
import numpy as np
import tensorflow as tf

X_sample = np.load("data/processed/X.npy")[:5].astype(np.float32)
mean      = np.load("training/mean.npy").astype(np.float32)
std       = np.load("training/std.npy").astype(np.float32)
X_norm    = (X_sample - mean) / std

# Keras predictions
keras_model = tf.keras.models.load_model("training/gesture_model.keras")
keras_preds = keras_model.predict(X_norm, verbose=0)

# TFLite predictions
interpreter = tf.lite.Interpreter(model_path="training/model_float.tflite")
interpreter.allocate_tensors()
input_idx  = interpreter.get_input_details()[0]["index"]
output_idx = interpreter.get_output_details()[0]["index"]

tflite_preds = []
for x in X_norm:
    interpreter.set_tensor(input_idx, x[np.newaxis, :])
    interpreter.invoke()
    tflite_preds.append(interpreter.get_tensor(output_idx)[0])

tflite_preds = np.array(tflite_preds)

max_diff = np.abs(keras_preds - tflite_preds).max()
print(f"Max prediction difference (float): {max_diff:.6f}")
# Should be < 0.001 for a simple dense model

If the max difference is below 0.001, the float conversion is clean.
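
A single maximum can hide whether the two models still agree on the predicted class, so it can help to check both. The sketch below uses plain Python lists and made-up probabilities so that it runs without TensorFlow; the helper names are hypothetical. With the NumPy arrays from the script above, np.allclose(keras_preds, tflite_preds, atol=1e-3) performs the same tolerance test.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2-D lists."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def compare_predictions(ref, test, atol=1e-3):
    """Return (max difference, whether it is within atol, whether top classes agree)."""
    diff = max_abs_diff(ref, test)
    same_class = all(argmax(r) == argmax(t) for r, t in zip(ref, test))
    return diff, diff < atol, same_class


# Made-up probabilities for two samples and three classes.
keras_like = [[0.90, 0.07, 0.03], [0.10, 0.85, 0.05]]
tflite_like = [[0.9002, 0.0699, 0.0299], [0.1001, 0.8498, 0.0501]]

diff, within_tol, same_class = compare_predictions(keras_like, tflite_like)
print(f"max diff={diff:.6f} within_tol={within_tol} same_class={same_class}")
```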

Int8 Quantization (Post-Training)

Int8 quantization requires a calibration dataset: a representative sample of real inputs that the converter uses to measure activation ranges. Aim for 100 to 300 samples covering the full input distribution.

# continuation of convert.py

X_all  = np.load("data/processed/X.npy").astype(np.float32)
mean   = np.load("training/mean.npy").astype(np.float32)
std    = np.load("training/std.npy").astype(np.float32)
X_norm = (X_all - mean) / std

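# Draw calibration samples at random so every class is represented,
# even if X.npy is stored sorted by label.
rng       = np.random.default_rng(0)
calib_idx = rng.choice(len(X_norm), size=min(100, len(X_norm)), replace=False)

def representative_data_gen():
    for i in calib_idx:
        yield [X_norm[i:i+1]]   # shape: (1, 300)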

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations                = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset       = representative_data_gen
converter.target_spec.supported_ops   = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type        = tf.int8
converter.inference_output_type       = tf.int8

tflite_int8 = converter.convert()

with open("training/model_int8.tflite", "wb") as f:
    f.write(tflite_int8)

print(f"Int8 model: {len(tflite_int8):,} bytes")
print(f"Compression: {len(tflite_float) / len(tflite_int8):.1f}x")
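
If X.npy is stored sorted by class, a contiguous slice can miss whole classes, and even a uniform random draw can under-sample rare gestures. One option is to choose the calibration indices per class. The following is a minimal sketch in plain Python; stratified_indices is a hypothetical helper, and the label list is invented for the demonstration.

```python
import random
from collections import defaultdict


def stratified_indices(labels, n_total, seed=0):
    """Pick up to n_total indices, spread as evenly as possible across classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)

    picked = []
    # Round-robin over the classes until the budget or the data runs out.
    while len(picked) < n_total and any(by_class.values()):
        for label in sorted(by_class):
            if by_class[label] and len(picked) < n_total:
                picked.append(by_class[label].pop())
    return picked


# Hypothetical label array sorted by class, as a recording script might save it.
labels = [0] * 50 + [1] * 50 + [2] * 50
calib_idx = stratified_indices(labels, n_total=30)
counts = {c: sum(1 for i in calib_idx if labels[i] == c) for c in set(labels)}
print(counts)
```

The generator passed to the converter can then loop over calib_idx and yield [X_norm[i:i+1]] for each index.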

Verifying Int8 Accuracy

The int8 model's inputs and outputs are now scaled integers, not raw floats. You need the quantization parameters (scale and zero point) to convert between the two domains.

# verify_int8.py
import numpy as np
import tensorflow as tf

X_val  = np.load("data/processed/X.npy")[:20].astype(np.float32)
y_val  = np.load("data/processed/y.npy")[:20]
mean   = np.load("training/mean.npy").astype(np.float32)
std    = np.load("training/std.npy").astype(np.float32)
X_norm = (X_val - mean) / std

interpreter = tf.lite.Interpreter(model_path="training/model_int8.tflite")
interpreter.allocate_tensors()

input_details  = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

in_scale, in_zp   = input_details["quantization"]
out_scale, out_zp = output_details["quantization"]

correct = 0
for x, label in zip(X_norm, y_val):
    # Quantize the float input to int8
    x_q = np.round(x / in_scale + in_zp).clip(-128, 127).astype(np.int8)

    interpreter.set_tensor(input_details["index"], x_q[np.newaxis, :])
    interpreter.invoke()

    out_q = interpreter.get_tensor(output_details["index"])[0]
    # Dequantize to float for readability
    out_f = (out_q.astype(np.float32) - out_zp) * out_scale

    pred = np.argmax(out_f)
    if pred == label:
        correct += 1

print(f"Int8 accuracy: {correct / len(y_val):.2%}")

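For well-behaved models on sensor data, expect an accuracy drop of less than 2 percentage points after int8 quantization. Note that 20 samples resolve accuracy only in 5-point steps, so run the same loop over your full held-out set before drawing conclusions. Larger drops (5 points or more) usually indicate that the float model has activations with wide ranges that quantize poorly. Quantization-aware training (QAT, chapter 11) is the usual remedy.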
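
To see why wide activation ranges hurt, consider how a single outlier stretches the int8 grid. The sketch below is a simplified min/max calibration written for illustration. It is not the converter's exact algorithm, and the activation values are invented.

```python
def quant_params(values, qmin=-128, qmax=127):
    """Asymmetric int8 parameters covering min(values)..max(values) and zero."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def roundtrip_error(values, scale, zero_point, qmin=-128, qmax=127):
    """Mean absolute error after quantizing and dequantizing each value."""
    total = 0.0
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        total += abs((q - zero_point) * scale - v)
    return total / len(values)


# Made-up activations: values within [-1, 1], plus a copy with one large outlier.
typical = [i / 50.0 - 1.0 for i in range(101)]
with_outlier = typical + [40.0]

results = {}
for name, acts in [("typical", typical), ("with outlier", with_outlier)]:
    scale, zp = quant_params(acts)
    # Always measure the error on the typical values, which carry the signal.
    results[name] = roundtrip_error(typical, scale, zp)
    print(f"{name:13s} scale={scale:.5f} mean abs error={results[name]:.5f}")
```

The outlier forces a much coarser scale, so the values that matter share far fewer of the 256 available levels.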

Understanding the Scale and Zero Point

Each tensor in the int8 model has a scale factor and a zero point:

float_value = (int8_value - zero_point) × scale
int8_value  = round(float_value / scale) + zero_point

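These are embedded in the .tflite file. In Python you retrieve them via interpreter.get_input_details()[0]["quantization"], which returns a (scale, zero_point) tuple. TFLite Micro applies them automatically to the model's internal tensors; the firmware chapter shows how the input and output are handled on the device.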
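
The two formulas can be tried out directly. This is a minimal sketch in which SCALE and ZERO_POINT are hypothetical values chosen for readability; the real ones come from the interpreter's tensor details. It also adds the clamp to the int8 range that the verification script above applies with clip().

```python
def quantize(value, scale, zero_point, qmin=-128, qmax=127):
    """int8_value = round(float_value / scale) + zero_point, clamped to int8."""
    q = round(value / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """float_value = (int8_value - zero_point) * scale"""
    return (q - zero_point) * scale


# Hypothetical parameters for illustration only.
SCALE = 0.05
ZERO_POINT = -3

for x in [0.0, 0.12, -1.0, 10.0]:
    q = quantize(x, SCALE, ZERO_POINT)
    back = dequantize(q, SCALE, ZERO_POINT)
    print(f"float {x:6.2f} -> int8 {q:4d} -> float {back:6.2f}")
```

With these parameters, 0.12 comes back as 0.10 because it falls between two grid points, and 10.0 saturates at 127 and comes back as 6.50, the largest representable value.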

Generating the C Array

# In the training/ directory
xxd -i model_int8.tflite > ../firmware/gesture_infer/model_data.cc

Then edit model_data.cc to:

1. Change the variable name from model_int8_tflite to g_gesture_model_data
2. Add const qualifiers so the array lives in flash, not RAM
3. Create the corresponding header

// model_data.cc (after editing)
#include "model_data.h"

const unsigned char g_gesture_model_data[] = {
  0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,
  // ... rest of the bytes ...
};
const unsigned int g_gesture_model_data_len = 4216;  // xxd writes your file's actual byte count here

// model_data.h
#pragma once
extern const unsigned char g_gesture_model_data[];
extern const unsigned int  g_gesture_model_data_len;

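The const qualifier is critical. On Cortex-M boards, an array without it is placed in the .data section and copied into RAM at startup, consuming part of your scarce 256 KB. With const, the linker places it in read-only data in flash. On AVR, const alone is not enough: the array stays in SRAM unless it is also declared with PROGMEM.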
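
As an alternative to editing the xxd output by hand, the same two files can be generated from Python. The sketch below is one possible approach, not part of the original toolchain. The helper names to_c_array and to_c_header are hypothetical, and the demo runs on the eight header bytes shown above rather than on a real model file.

```python
def to_c_array(data: bytes, name: str = "g_gesture_model_data", per_line: int = 12) -> str:
    """Render bytes as a const C array plus a length constant."""
    lines = ['#include "model_data.h"', "", f"const unsigned char {name}[] = {{"]
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {name}_len = {len(data)};")
    return "\n".join(lines) + "\n"


def to_c_header(name: str = "g_gesture_model_data") -> str:
    """Matching extern declarations for model_data.h."""
    return (
        "#pragma once\n"
        f"extern const unsigned char {name}[];\n"
        f"extern const unsigned int  {name}_len;\n"
    )


# Demo on the first eight bytes shown above. In practice, read the whole file:
#   data = open("training/model_int8.tflite", "rb").read()
sample = bytes([0x1C, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4C, 0x33])
source_text = to_c_array(sample)
print(source_text)
print(to_c_header())
```

Generating the file this way keeps the variable name and the const qualifiers in place every time the model is rebuilt.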

Automating the Full Pipeline

Once the steps are confirmed to work, wire them together in a single script:

# build.py: run this after collecting new data
import subprocess
import sys

steps = [
    ["python3", "preprocess.py"],
    ["python3", "train.py"],
    ["python3", "convert.py"],
    ["python3", "verify_int8.py"],
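    # Regenerate the C array and reapply the rename and const edits from the
    # manual step; plain xxd output would overwrite them.
    ["bash", "-c",
     "(echo '#include \"model_data.h\"'; cd training && "
     "xxd -i model_int8.tflite | "
     "sed -e 's/model_int8_tflite/g_gesture_model_data/g' -e 's/^unsigned/const unsigned/') "
     "> firmware/gesture_infer/model_data.cc"],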
]

for step in steps:
    print(f"\n{'='*60}")
    print(f"Running: {' '.join(step)}")
    result = subprocess.run(step, check=False)
    if result.returncode != 0:
        print(f"Step failed with exit code {result.returncode}")
        sys.exit(1)

print("\nBuild complete. Flash firmware/gesture_infer/ to the device.")

Common Pitfalls

Forgetting const on the C array. Typical symptom: the sketch compiles, but the device crashes or resets immediately after AllocateTensors(). The linker placed the 10 KB model in SRAM instead of flash, leaving too little RAM for the tensor arena and the stack.

Using a non-representative calibration dataset. If your calibration set only covers one class, the quantization ranges are wrong and accuracy crashes. Use samples from all classes, as close to the real distribution as possible.

Not matching input quantization in firmware. If the model expects int8 inputs but you feed it raw float readings (or vice versa), inference produces garbage without any error. The input->type field in TFLite Micro tells you what the model expects.

Changing the normalization parameters after saving them. The mean and std saved during training must be the same values you embed in firmware. If you retrain with different data and forget to update the saved parameters, every input is wrong.
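
One way to guard against stale normalization parameters is to generate the firmware constants from the saved files on every build instead of copying them by hand. The sketch below assumes mean and std are 1-D per-feature arrays. The names kInputMean and kInputStd and the idea of a generated header are hypothetical, and the demo values are invented.

```python
def format_float_array(name, values):
    """Render a list of floats as a const C array definition."""
    body = ", ".join(f"{v:.8e}f" for v in values)
    return f"const float {name}[{len(values)}] = {{ {body} }};"


def norm_params_header(mean, std):
    """Build the text of a header holding the training-time mean and std."""
    if len(mean) != len(std):
        raise ValueError("mean and std must have the same length")
    return "\n".join([
        "#pragma once",
        "// Generated from the saved training statistics. Do not edit by hand.",
        format_float_array("kInputMean", mean),
        format_float_array("kInputStd", std),
        "",
    ])


# Invented values. With the saved files, something like
# np.load("training/mean.npy").ravel().tolist() would supply the real ones.
mean = [0.125, -0.5, 9.81]
std = [1.0, 0.25, 2.0]
header_text = norm_params_header(mean, std)
print(header_text)
```

Adding such a step to build.py after train.py would regenerate the header whenever the model is retrained.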

Next Steps

Continue to 07-deployment.md to embed the model in an Arduino sketch and run live inference on the device.