Edge AI · Quantization · ONNX · TensorRT · Mobile

Edge AI: Why I Think On-Device Inference Changes Everything

A practical guide to running AI models on edge devices — covering model compression, quantization, ONNX Runtime, TensorRT, mobile deployment, and the hardware trade-offs you actually need to think about.

Published 2026-01-20 · 10 min read

Why Run Models on the Device?

For years, the default approach to AI inference has been simple: send data to the cloud, let a beefy server run the model, get the result back. It works, and you get virtually unlimited compute. But there's a cost — latency, a hard dependency on network connectivity, and real data-privacy headaches. Edge AI flips this around by running models right where the data lives: on smartphones, embedded boards, IoT sensors, or on-prem servers.

Some use cases simply can't tolerate a cloud round-trip. Autonomous vehicles need to react within milliseconds. Industrial inspection cameras sit in air-gapped factories with no internet. Medical wearables have to work in areas with zero coverage. In these scenarios, on-device inference isn't a nice-to-have — it's a hard requirement.

There's also the business angle. Running inference at the edge cuts your recurring cloud-compute bill and keeps sensitive data — biometric readings, patient records, proprietary sensor streams — on the device itself. That makes life a lot easier when you're dealing with GDPR, HIPAA, or similar regulations.

The best inference call is the one that never leaves the device. Every millisecond you save at the edge adds up to a better user experience and a lighter cloud bill.

Model Compression: Pruning, Distillation, and Quantization

Here's the thing: production neural networks often have hundreds of millions of parameters. That's way too big and too slow for resource-constrained hardware. Model compression is the umbrella term for techniques that shrink a model's memory footprint and compute cost while keeping accuracy as high as possible.

Pruning

Pruning removes weights or entire neurons that don't contribute much to the output. Unstructured pruning zeroes out individual weights, creating sparse matrices — but you need specialized kernels to actually benefit from the sparsity. Structured pruning takes a different path: it removes whole channels or attention heads, giving you a genuinely smaller dense model that runs faster on standard hardware without any special sparse-matrix support.
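To make the distinction concrete, here's a minimal NumPy sketch of both flavors. The functions and thresholds are illustrative, not any framework's real pruning API: unstructured pruning keeps the tensor shape but zeroes small weights, while structured pruning returns a genuinely smaller dense tensor.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured pruning: zero out the smallest-magnitude weights.

    The shape is unchanged, so you only win if your runtime has sparse kernels.
    """
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return weights * (np.abs(weights) > threshold)

def channel_prune(weights: np.ndarray, keep: int) -> np.ndarray:
    """Structured pruning: keep the `keep` output channels with the largest L1 norm.

    The result is a smaller dense tensor that runs faster on any hardware.
    """
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    top = np.sort(np.argsort(norms)[-keep:])
    return weights[top]

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))

sparse_w = magnitude_prune(w, sparsity=0.5)
print("sparsity:", np.mean(sparse_w == 0))   # ~0.5, same shape as before

small_w = channel_prune(w, keep=4)
print("pruned shape:", small_w.shape)        # (4, 16): half the channels gone
```

In a real framework you'd use something like PyTorch's `torch.nn.utils.prune`, which applies the same idea via masks, but the mechanics are exactly this.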

Knowledge Distillation

With knowledge distillation, you train a smaller "student" model to mimic the output distribution of a larger "teacher" model. The student doesn't just learn the hard labels — it also picks up the soft probabilities across all classes, capturing subtle inter-class relationships that raw training data doesn't explicitly teach. In practice, distillation can shrink a model by 4-10x with only a small accuracy hit, and the resulting student is a standard dense network that benefits from all the usual inference optimizations.
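The standard distillation objective blends two terms: a soft-target term computed at an elevated temperature (this is where the inter-class relationships come from) and a hard-label cross-entropy term. A minimal NumPy sketch — the temperature and blend weight are typical illustrative values, not from any specific paper's recipe:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """alpha * soft-target term + (1 - alpha) * hard-label cross-entropy.

    The soft term is cross-entropy against the teacher's temperature-softened
    distribution (gradient-equivalent to KL divergence), scaled by T^2 so its
    gradient magnitude matches the hard term.
    """
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    hard = -np.log(
        softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12
    ).mean()
    return alpha * soft + (1 - alpha) * hard

teacher = np.array([[8.0, 2.0, 1.0]])
student = np.array([[5.0, 1.5, 0.5]])
print("loss:", distillation_loss(student, teacher, labels=np.array([0])))
```

A higher temperature flattens the teacher's distribution, exposing more of the "dark knowledge" in the non-target classes; T between 2 and 8 is a common starting range.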

Quantization

Quantization reduces the numerical precision of weights and activations — typically from 32-bit float (FP32) down to 16-bit (FP16), 8-bit integer (INT8), or even 4-bit (INT4). Post-training quantization (PTQ) converts an already-trained model using a small calibration dataset. Quantization-aware training (QAT) goes further: it simulates low-precision math during training so the model learns to handle the reduced precision gracefully.

  • FP32 to FP16: roughly 2x memory savings with negligible accuracy loss on most architectures.
  • FP32 to INT8 (PTQ): about 4x memory savings, usually less than 1% accuracy drop if you calibrate properly.
  • FP32 to INT4 (QAT): up to 8x memory savings, but accuracy loss varies — you'll get the best results with quantization-aware fine-tuning.
  • Mixed precision: keep critical layers at higher precision while aggressively quantizing the less sensitive ones.
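The INT8 numbers above follow directly from the arithmetic of affine quantization. Here's a self-contained sketch of the map-to-int8-and-back round trip — a simplified version of what PTQ toolchains do per tensor, not any particular library's implementation:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric (affine) quantization: map the observed [min, max] range
    onto the 256 levels of INT8 via a scale and zero point."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    zero_point = int(round(-x_min / scale)) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(42)
w = rng.normal(0, 0.1, size=1024).astype(np.float32)

q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)

print("memory: %d -> %d bytes" % (w.nbytes, q.nbytes))  # 4096 -> 1024: the 4x saving
print("max abs error:", np.abs(w - w_hat).max())        # on the order of scale/2
```

This also shows why calibration matters: the scale is derived from the observed range, so a single outlier activation stretches it and wastes quantization levels on values that rarely occur. That's exactly what calibration datasets (and techniques like percentile clipping) are there to handle.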

ONNX Runtime and TensorRT

Once you've compressed your model, the next question is: how do you run it efficiently on target hardware? Two of the most popular inference engines are ONNX Runtime and NVIDIA TensorRT. Both work with the ONNX (Open Neural Network Exchange) format — a vendor-neutral intermediate representation that PyTorch, TensorFlow, and most major frameworks can export to.

ONNX Runtime, maintained by Microsoft, is a cross-platform engine that supports CPU, CUDA GPU, DirectML, OpenVINO, and several other execution providers. This makes it a solid choice for anything from cloud VMs to ARM-based edge boards. TensorRT, on the other hand, is NVIDIA's specialized engine for their GPUs. It applies aggressive kernel fusion, layer auto-tuning, and precision calibration to squeeze every last bit of performance from the hardware.

python
import torch
import onnx
import onnxruntime as ort
from torchvision.models import mobilenet_v3_small

# 1. Export a PyTorch model to ONNX
model = mobilenet_v3_small(weights="DEFAULT").eval()  # pretrained= is deprecated
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v3_small.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    opset_version=17,
)

# 2. Validate the exported ONNX model
onnx_model = onnx.load("mobilenet_v3_small.onnx")
onnx.checker.check_model(onnx_model)

# 3. Run inference with ONNX Runtime
session = ort.InferenceSession(
    "mobilenet_v3_small.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
result = session.run(None, {"input": dummy_input.numpy()})
print("Predicted class:", result[0].argmax(axis=1))

Exporting MobileNetV3 to ONNX and running a quick inference with ONNX Runtime.

TensorRT takes things a step further by building a serialized engine that's tuned for a specific GPU architecture. The engine file isn't portable across GPU families, but the performance gains — often 2-5x over generic CUDA execution — make the extra build step well worth it for latency-critical deployments.

python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, precision: str = "fp16"):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parsing failed")

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    if precision == "fp16":
        config.set_flag(trt.BuilderFlag.FP16)
    elif precision == "int8":
        config.set_flag(trt.BuilderFlag.INT8)
        # INT8 requires a calibration dataset (omitted for brevity)

    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Engine build failed")
    with open("model.engine", "wb") as f:
        f.write(serialized_engine)
    print(f"TensorRT engine built with {precision} precision.")

build_engine("mobilenet_v3_small.onnx", precision="fp16")

Building a TensorRT engine from an ONNX model with FP16 precision.

Mobile Deployment: Core ML and TensorFlow Lite

Smartphones are the most widespread edge devices on the planet, and two frameworks dominate on-device inference here: Apple's Core ML for iOS and Google's TensorFlow Lite (TFLite) for Android.

Core ML is tightly integrated with Apple's Neural Engine, GPU, and CPU — it automatically picks the best compute unit at runtime. You can convert models from PyTorch or TensorFlow using Apple's coremltools package. It supports quantization down to INT8 and palettization (weight clustering), and Apple's latest chips are seriously impressive — the A17 Pro Neural Engine is rated at 35 TOPS.

TensorFlow Lite targets Android, embedded Linux, and microcontrollers. Its converter applies built-in optimizations like FP16 and INT8 quantization, operator fusion, and buffer sharing. What I find especially useful is the delegate system: you get a GPU delegate for mobile GPUs, an NNAPI delegate for Android's Neural Networks API, and a Hexagon delegate for Qualcomm DSPs. And if you're going really small, TensorFlow Lite for Microcontrollers can run models in as little as 16 KB of RAM.

Tip

If you're targeting both iOS and Android, consider keeping a single ONNX source model and converting to Core ML and TFLite separately. This saves you from maintaining two independent training pipelines and keeps things consistent across platforms.

Hardware Considerations: NPU, GPU, and CPU

Your choice of inference hardware has a huge impact on throughput, latency, and power consumption. Modern edge devices give you three compute targets, each with its own trade-offs.

  1. NPU (Neural Processing Unit): Purpose-built silicon for matrix operations. NPUs deliver the best TOPS-per-watt, but they support a narrower set of operators. When a layer isn't supported, it falls back to the CPU, which can create pipeline stalls. Examples: Apple Neural Engine, Google Edge TPU, Qualcomm Hexagon.
  2. GPU: Mobile and embedded GPUs (Adreno, Mali, Apple GPU) offer strong parallel throughput and broader operator support than NPUs. They're a great fit for models heavy on convolution or attention layers. Power consumption falls between NPU and CPU.
  3. CPU: The universal fallback — every operator runs on CPU, making it the most compatible target. Modern CPUs with NEON (ARM) or AVX-512 (x86) SIMD extensions can deliver reasonable throughput for smaller models. You'll want CPU inference when the model is modest and the device lacks a capable NPU or GPU.

In practice, many edge inference engines use a heterogeneous execution strategy — they split the model graph across multiple compute units. Latency-critical paths run on the NPU, while unsupported ops fall back to the GPU or CPU. Profiling tools like NVIDIA Nsight, Xcode Instruments, and Android GPU Inspector are essential for spotting bottlenecks in these mixed-execution scenarios.
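The fallback strategy above can be sketched as a toy scheduler. To be clear, the supported-op sets, per-op latencies, and transfer penalty below are made-up illustrative numbers, not measurements from any real chip — the point is the structure: prefer the NPU, fall back per-op, and pay a cost every time execution crosses a device boundary:

```python
# Toy heterogeneous scheduler: prefer NPU, fall back to GPU, then CPU.
# Op sets and latencies are illustrative, not real hardware data.
SUPPORTED = {
    "npu": {"conv2d", "matmul", "relu"},
    "gpu": {"conv2d", "matmul", "relu", "softmax", "layernorm"},
    "cpu": None,  # the CPU runs every operator
}
LATENCY_MS = {"npu": 0.2, "gpu": 0.6, "cpu": 2.5}  # per-op cost

def place_ops(graph):
    """Assign each op to the fastest compute unit that supports it."""
    placement = []
    for op in graph:
        for unit in ("npu", "gpu", "cpu"):
            ops = SUPPORTED[unit]
            if ops is None or op in ops:
                placement.append((op, unit))
                break
    return placement

def estimate_latency(placement):
    """Sum per-op costs, plus a penalty each time execution switches units."""
    transfer_ms = 0.1
    total, prev = 0.0, None
    for op, unit in placement:
        total += LATENCY_MS[unit]
        if prev is not None and unit != prev:
            total += transfer_ms  # this is the "pipeline stall" in miniature
        prev = unit
    return total

graph = ["conv2d", "relu", "softmax", "matmul", "custom_nms"]
plan = place_ops(graph)
print(plan)
print("estimated latency: %.1f ms" % estimate_latency(plan))
```

Even in this toy, one unsupported op (`custom_nms`) dominates total latency because of the CPU fallback plus the transfer penalty — which is why operator coverage, not peak TOPS, is often the deciding factor in real deployments.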

Real-World Use Cases and Trade-Offs

Edge AI is already running at scale across many industries. Smartphone cameras use super-resolution and computational photography models in real time. Voice assistants handle keyword spotting and speech recognition entirely on-device. Automotive systems fuse camera, LiDAR, and radar data through on-board neural networks for object detection and path planning.

On the industrial side, predictive maintenance models analyze vibration-sensor data on factory equipment to catch bearing failures before they happen. Agricultural drones run crop-health classifiers on-board to make spraying decisions without any connectivity. Retail stores deploy edge vision models for shelf inventory tracking and autonomous checkout.

That said, edge deployment does come with trade-offs you need to think through carefully. Accuracy often takes a hit after aggressive compression — you should set minimum accuracy thresholds and test quantized models against golden-reference outputs. Rolling out model updates is harder too; you'll need solid over-the-air (OTA) infrastructure. Debugging gets trickier because edge devices offer much less observability than cloud environments. And the hardware landscape is fragmented: a model optimized for one chip might perform poorly on another, so per-target benchmarking is a must.

Key Takeaways

  1. Edge AI cuts cloud-inference latency, saves money, and keeps sensitive data on-device — but you need to invest in careful model optimization.
  2. Pruning, distillation, and quantization can shrink models by 4-8x with minimal accuracy loss when you apply them thoughtfully.
  3. ONNX gives you a portable intermediate format; ONNX Runtime and TensorRT are the go-to engines for cross-platform and NVIDIA-specific deployment.
  4. Core ML and TensorFlow Lite own mobile inference, each with hardware-aware acceleration and built-in quantization support.
  5. Choosing between NPU, GPU, and CPU means weighing operator coverage, throughput, and power consumption. Always profile on your target hardware.
  6. Real-world edge deployments need OTA update infrastructure, accuracy validation after compression, and per-target benchmarking across the fragmented hardware ecosystem.

As NPU silicon gets more powerful and compiler toolchains mature, the gap between cloud and edge inference keeps shrinking. If you invest in edge-deployment expertise now, you'll be well-positioned to deliver faster, more private, and more resilient AI experiences across every device your users carry.