TinyML: The Inference Engine

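TFLite Micro is not a simplified build of TFLite. It is a separate runtime written for microcontrollers that shares the .tflite model format: no dynamic allocation, no OS dependencies, and only a small, freestanding-friendly slice of the standard library (headers such as <stdint.h> and <string.h>). Understanding its design explains most of the constraints you'll encounter.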

The Architecture

TFLite Micro has three components you interact with:

Model           The flat binary (.tflite), stored in flash
                Immutable after flashing. Accessed via a pointer.

Op Resolver     A table that maps operator codes (e.g. ADD, CONV_2D)
                to their implementations. You choose what to register.

Interpreter     Reads the model graph, uses the resolver to find ops,
                allocates tensors in the arena, runs the graph.

The interpreter is not a runtime VM. It's a static graph runner: it reads the execution order from the model, iterates through the operations in that order, and calls each op through the function pointers supplied by the resolver. No bytecode, no JIT, no garbage collection.
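The loop described above can be sketched in a few lines of plain C++. This is only an illustration of the idea: Node, OpFn, Relu, Double, and RunGraph are hypothetical names invented for this example and are not TFLite Micro types or functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration; these are not TFLite Micro types.
struct Node {
  int opcode;        // index into the op table
  const float* in;   // input buffer
  float* out;        // output buffer
  std::size_t len;   // element count
};

using OpFn = bool (*)(const Node&);  // returns false on failure

bool Relu(const Node& n) {
  for (std::size_t i = 0; i < n.len; ++i) {
    n.out[i] = n.in[i] > 0.0f ? n.in[i] : 0.0f;
  }
  return true;
}

bool Double(const Node& n) {
  for (std::size_t i = 0; i < n.len; ++i) {
    n.out[i] = 2.0f * n.in[i];
  }
  return true;
}

// Walk the nodes in their stored order and call each op through the table.
bool RunGraph(const Node* nodes, std::size_t count,
              const OpFn* table, std::size_t table_size) {
  for (std::size_t i = 0; i < count; ++i) {
    assert(nodes[i].opcode >= 0);
    const std::size_t code = static_cast<std::size_t>(nodes[i].opcode);
    if (code >= table_size || table[code] == nullptr) {
      return false;  // op not registered
    }
    if (!table[code](nodes[i])) {
      return false;  // op reported an error
    }
  }
  return true;
}
```

The order of execution is fixed by the node array, so the runner needs no scheduler and no heap. A missing table entry is the toy equivalent of forgetting to register an op.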

The Tensor Arena in Detail

The tensor arena is a flat block of RAM that the interpreter carves up at AllocateTensors() time. It holds:

Input tensors       The values you write before Invoke()
Output tensors      The values you read after Invoke()
Intermediate        Activations from hidden layers (reused across ops
                    when the execution schedule allows it)
Internal scratch    Temporary buffers for specific ops (e.g., im2col for Conv)

The allocator plans memory once, on the device, when you call AllocateTensors(): it analyzes the graph, determines which tensors are live at the same time, and greedily packs them so that tensors with non-overlapping lifetimes can share the same bytes. This means the arena required is often less than the naive sum of all tensor sizes.
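To see why lifetime analysis shrinks the arena, here is a minimal first-fit sketch. It is not TFLite Micro's planner: the Buffer struct, the PlanArena function, and the largest-first ordering are assumptions made for illustration, and alignment is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one buffer: its size plus the first and
// last op index that touches it.
struct Buffer {
  std::size_t size;
  int first_use;
  int last_use;
  std::size_t offset;  // filled in by PlanArena
};

// Returns the arena bytes needed. Buffers whose lifetimes do not overlap
// are allowed to occupy the same bytes.
std::size_t PlanArena(std::vector<Buffer>& bufs) {
  // Place large buffers first; they are the hardest to fit.
  std::vector<std::size_t> order(bufs.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return bufs[a].size > bufs[b].size;
                   });

  std::vector<std::size_t> placed;
  std::size_t arena = 0;
  for (std::size_t idx : order) {
    Buffer& b = bufs[idx];
    assert(b.first_use <= b.last_use);
    std::size_t offset = 0;
    bool moved = true;
    while (moved) {  // bump the offset until no live neighbour overlaps
      moved = false;
      for (std::size_t p : placed) {
        const Buffer& o = bufs[p];
        const bool live_together =
            b.first_use <= o.last_use && o.first_use <= b.last_use;
        const bool bytes_overlap =
            offset < o.offset + o.size && o.offset < offset + b.size;
        if (live_together && bytes_overlap) {
          offset = o.offset + o.size;
          moved = true;
        }
      }
    }
    b.offset = offset;
    placed.push_back(idx);
    arena = std::max(arena, offset + b.size);
  }
  return arena;
}
```

For a three-op chain with buffers of 1200, 128, 64, and 12 bytes, the naive sum is 1404 bytes. In this sketch the last two buffers reuse the space of the 1200-byte input once it is dead, so fewer bytes are needed.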

Finding the Right Arena Size

// Print arena usage after a successful AllocateTensors()
interpreter->AllocateTensors();
Serial.print("Arena used: ");
Serial.println(interpreter->arena_used_bytes());

Use this number plus a 256-byte safety margin as your kTensorArenaSize. The margin covers alignment padding that varies by platform.
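One way to turn the measured figure into a constant is a small compile-time helper. This is a sketch under assumptions: ArenaSizeFor is a hypothetical helper, the 1,990-byte measurement is a made-up figure, and rounding up to 16 bytes is a choice made here to match a 16-byte-aligned arena rather than something the library requires.

```cpp
#include <cassert>
#include <cstddef>

// Round the measured usage plus a safety margin up to a multiple of `align`.
constexpr std::size_t ArenaSizeFor(std::size_t used_bytes,
                                   std::size_t margin = 256,
                                   std::size_t align = 16) {
  return (used_bytes + margin + align - 1) / align * align;
}

// Example: paste in the value printed by arena_used_bytes().
// 1990 is a made-up measurement for illustration.
constexpr std::size_t kTensorArenaSize = ArenaSizeFor(1990);
static_assert(kTensorArenaSize % 16 == 0,
              "arena size should stay a multiple of 16");
```

Re-measure whenever the model or the library version changes; the helper only removes the manual arithmetic.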

For the gesture classifier from chapter 5:

Model:           ~10 KB (in flash)
Arena (dense):   ~2 KB
Arena (conv1d):  ~4 KB

For more complex models:

MobileNetV1 0.25 (96×96 grayscale):  ~50 KB arena
MobileNetV1 0.25 (96×96 RGB):        ~80 KB arena
Keyword spotter (LSTM, MFCC input):  ~20 KB arena

Op Resolvers

AllOpsResolver includes every supported operator. It's convenient during development but adds ~30 KB to the binary. For production, use MicroMutableOpResolver and register only the ops your model uses.

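// Find which ops your model needs (run this once in Python)
// python3 -c "
// from tensorflow.lite.python import schema_py_generated as schema
// data = open('model_int8.tflite', 'rb').read()
// model = schema.Model.GetRootAsModel(data, 0)
// subgraph = model.Subgraphs(0)
// # map numeric builtin codes back to names such as FULLY_CONNECTED
// names = {v: k for k, v in vars(schema.BuiltinOperator).items() if isinstance(v, int)}
// ops = set()
// for i in range(subgraph.OperatorsLength()):
//     code = model.OperatorCodes(subgraph.Operators(i).OpcodeIndex())
//     # older converters only fill the deprecated field, so take the larger value
//     ops.add(names[max(code.BuiltinCode(), code.DeprecatedBuiltinCode())])
// print(sorted(ops))
// "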

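For a gesture classifier built from Dense layers with ReLU activations and a softmax output (the template argument is the maximum number of ops you can register):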

static tflite::MicroMutableOpResolver<4> resolver;
resolver.AddFullyConnected();
resolver.AddRelu();
resolver.AddSoftmax();
resolver.AddReshape();  // if the model flattens an input

This reduces binary size by about 20-30 KB compared to AllOpsResolver. Worth doing before shipping.
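The fixed capacity is worth understanding, because registering more ops than the resolver has room for fails at runtime. The following is a hypothetical miniature of a fixed-capacity op table, not the TFLite Micro class: TinyResolver, KernelFn, and the empty kernel functions are invented for illustration. The real resolver reports comparable failures through the status returned by its Add...() calls.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature of a fixed-capacity op table.
using KernelFn = void (*)();

template <std::size_t kMaxOps>
class TinyResolver {
 public:
  // Returns false when the table is full or the opcode is already present.
  bool Add(int opcode, KernelFn fn) {
    assert(fn != nullptr);
    if (count_ == kMaxOps || Find(opcode) != nullptr) return false;
    codes_[count_] = opcode;
    fns_[count_] = fn;
    ++count_;
    return true;
  }

  // Linear search is fine for a handful of ops.
  KernelFn Find(int opcode) const {
    for (std::size_t i = 0; i < count_; ++i) {
      if (codes_[i] == opcode) return fns_[i];
    }
    return nullptr;  // op was never registered
  }

 private:
  int codes_[kMaxOps] = {};
  KernelFn fns_[kMaxOps] = {};
  std::size_t count_ = 0;
};

// Placeholder kernels so the example links; real kernels do the math.
void FullyConnectedKernel() {}
void SoftmaxKernel() {}
void ReshapeKernel() {}
```

Only the kernels you pass to Add() are referenced, which is what lets the linker drop the rest and produces the size saving. Checking the result of every registration catches a capacity that is too small before the first inference.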

The Interpreter Lifecycle

The interpreter has two phases: allocation and inference. Allocation happens once in setup(). Inference can happen repeatedly in loop().

Phase 1: Allocation (once):
  GetModel()             Read the flatbuffer header. ~10 µs.
  MicroInterpreter()     Build the op dispatch table. ~50 µs.
  AllocateTensors()      Plan memory, set up tensor views. ~1-5 ms.

Phase 2: Inference (every sample):
  Fill input_tensor      Write normalized features. ~0.1 ms for 300 floats.
  Invoke()               Run all ops in graph order. ~1-10 ms for small models.
  Read output_tensor     Copy or inspect the scores. ~0.01 ms.

Do not call AllocateTensors() inside loop(). It re-plans memory from scratch on every call, which is both slow and pointless for a static model.

Static Allocation Pattern

The interpreter and resolver must outlive any code that uses them. Use static local variables or file-scope globals. Putting them on the stack in a function causes silent memory corruption when the function returns.

// WRONG: interpreter is destroyed when setup() returns
void setup() {
  tflite::AllOpsResolver resolver;                    // stack; freed when setup() returns
  tflite::MicroInterpreter interpreter(...);          // stack; freed when setup() returns
  interpreter.AllocateTensors();
  // any pointer to interpreter kept for loop() is now dangling
}

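// CORRECT: static storage, survives for the lifetime of the program
tflite::MicroInterpreter* interpreter = nullptr;  // file scope, shared with loop()

void setup() {
  // Function-local statics are constructed on first use, so model must
  // already be set by the time execution reaches these lines.
  static tflite::AllOpsResolver resolver;
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;
}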

Reading Tensor Metadata

You can inspect any tensor's type, shape, and quantization parameters. For the first input tensor:

TfLiteTensor* t = interpreter->input(0);

Serial.print("Type: ");
Serial.println(t->type);   // kTfLiteFloat32 = 1, kTfLiteInt8 = 9

Serial.print("Dims: ");
for (int i = 0; i < t->dims->size; i++) {
  Serial.print(t->dims->data[i]);
  Serial.print(" ");
}
Serial.println();

Serial.print("Scale: ");
Serial.println(t->params.scale, 6);

Serial.print("Zero point: ");
Serial.println(t->params.zero_point);

Logging this in setup() during development catches type mismatches before they turn into mysterious wrong predictions.
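The scale and zero point printed above define the affine mapping real = scale * (quantized - zero_point). A minimal sketch of converting in both directions for an int8 tensor follows; QuantizeInt8 and DequantizeInt8 are hypothetical helper names, not library functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Convert a real value to int8 using real = scale * (q - zero_point).
inline std::int8_t QuantizeInt8(float x, float scale, int zero_point) {
  assert(scale > 0.0f);
  long q = std::lround(x / scale) + zero_point;
  if (q < -128) q = -128;  // saturate instead of wrapping
  if (q > 127) q = 127;
  return static_cast<std::int8_t>(q);
}

// Convert an int8 value back to the real number it represents.
inline float DequantizeInt8(std::int8_t q, float scale, int zero_point) {
  return scale * static_cast<float>(static_cast<int>(q) - zero_point);
}
```

With an int8 model you would typically quantize each feature with the input tensor's parameters before writing it into the tensor's int8 data, and dequantize the output scores with the output tensor's parameters. Writing floats into an int8 tensor is exactly the kind of mismatch the logging above is meant to catch.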

Error Handling

TFLite Micro functions return a TfLiteStatus: kTfLiteOk on success, otherwise an error code (usually kTfLiteError). Check every return value during development:

TfLiteStatus status = interpreter->AllocateTensors();
if (status != kTfLiteOk) {
  Serial.println("AllocateTensors failed");
  while (true);  // halt; no point continuing without a working model
}

In production firmware, you might want a watchdog reset instead of while (true), but halting is correct for debugging.

Invoke() can also fail, typically when an op encounters an unexpected input shape or type. Log the error and add the input tensor inspection code above to diagnose.

Profiling Inference Time

Measure inference duration on the device itself. On Arduino-compatible Cortex-M boards, micros() is accurate enough for this:

unsigned long t_start = micros();
interpreter->Invoke();
unsigned long t_end   = micros();

Serial.print("Inference: ");
Serial.print(t_end - t_start);
Serial.println(" µs");

For the gesture classifier on the Nano 33 BLE Sense (Cortex-M4F @ 64 MHz):

Dense model (300 → 32 → 16 → 3):   ~1.2 ms
Conv1D model (100×3 input):         ~3.5 ms

At 100 Hz the budget is 10 ms per sample, so both models finish with headroom and no samples are missed. For audio at 16 kHz with a spectrogram model, inference time matters more; chapter 11 covers optimizations.
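The budget check can be made explicit. This sketch assumes a 32-bit microsecond counter like the one micros() returns on these boards; ElapsedMicros and FitsBudget are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed time between two readings of a 32-bit microsecond counter.
// Unsigned subtraction stays correct across a single counter wraparound.
constexpr std::uint32_t ElapsedMicros(std::uint32_t start, std::uint32_t end) {
  return end - start;
}

// True if one inference fits inside the period of the given sample rate.
constexpr bool FitsBudget(std::uint32_t inference_us, std::uint32_t rate_hz) {
  return rate_hz > 0 && inference_us < 1000000u / rate_hz;
}
```

For example, FitsBudget(3500, 100) holds because 3,500 µs is under the 10,000 µs period. In practice, leave room for sensor reads and preprocessing, which share the same budget.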

Multi-Model Setups

Some applications run two models: a cheap "wakeup" model that detects activity and a more expensive "classifier" model that runs only when activity is detected. TFLite Micro supports this with two interpreters, which either share one arena (this needs care) or each use a separate arena.

The simpler approach is two separate tensor arenas:

constexpr int kWakeArenaSize = 2 * 1024;
constexpr int kClassArenaSize = 8 * 1024;
alignas(16) uint8_t wake_arena[kWakeArenaSize];
alignas(16) uint8_t class_arena[kClassArenaSize];

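Each interpreter gets its own arena. The RAM cost is kWakeArenaSize + kClassArenaSize, plus a small amount for the interpreter objects themselves. The model binaries stay in flash, so they add to flash usage rather than RAM. Plan accordingly.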
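The gating logic between the two models is small. The sketch below assumes each model is wrapped in a plain function that fills its input tensor, calls Invoke(), and returns a result; WakeFn, ClassifyFn, CascadeStep, and the -1 sentinel are hypothetical choices made for illustration.

```cpp
#include <cassert>

// Hypothetical wrappers: each function runs one interpreter and returns
// its result. Plain function pointers avoid any heap allocation.
using WakeFn = float (*)();    // cheap model: activity score
using ClassifyFn = int (*)();  // expensive model: class index

// Runs the wake model on every call and the classifier only when the
// wake score reaches the threshold. Returns -1 when nothing was detected.
int CascadeStep(WakeFn wake_score, ClassifyFn classify, float threshold) {
  assert(wake_score != nullptr && classify != nullptr);
  if (wake_score() < threshold) {
    return -1;  // classifier skipped, saving its inference time
  }
  return classify();
}
```

Calling CascadeStep() once per sample window keeps the average inference cost close to that of the wake model alone, as long as activity is rare.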

Next Steps

Continue to 09-sensor-integration.md to connect real sensor inputs to the inference pipeline and handle the preprocessing needed for audio and image inputs.